Re: Corpora: MWUs and frequency; try Relative Frequency

Ted E. Dunning (ted@aptex.com)
Fri, 9 Oct 1998 14:15:51 -0700

What Ms. Jones is advocating is an association ratio. The log of this
quantity has (misleadingly) been referred to in the literature as
mutual information. Church and Hanks investigated this measure and
found that using it directly to measure word association was
unsatisfactory. Much of the problem with association ratios is that
rare words which appear in proximity are given anomalously large
scores.
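The anomaly is easy to see with a toy example. The sketch below
computes the association ratio with made-up counts (the corpus size
and word frequencies are hypothetical, purely for illustration): a
pair of hapax words that happen to co-occur once vastly outscores a
well-attested collocation.

```python
def association_ratio(pair_count, count1, count2, n):
    """freq(w1 next to w2) / (freq(w1) * freq(w2)),
    with all frequencies relative to a corpus of n tokens."""
    return (pair_count / n) / ((count1 / n) * (count2 / n))

n = 1_000_000  # hypothetical corpus size

# A genuine, well-attested collocation (invented counts):
common_pair = association_ratio(50, 500, 400, n)   # -> 250.0

# Two hapax words that co-occur once (e.g. a repeated typo):
rare_pair = association_ratio(1, 1, 1, n)          # -> 1000000.0

# The rare pair gets an anomalously large score:
print(rare_pair > common_pair)  # True
```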

I describe in my 1993 CL paper a different (but related) measure which
avoids most of the small-count problems encountered in corpus
analysis. The generalized log-likelihood measure that I advocated
performs more effectively (judged subjectively) at finding multi-word
units.
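For concreteness, here is a small sketch of the log-likelihood ratio
statistic (G^2) for a 2x2 contingency table of bigram counts, in the
spirit of that paper; the counts in the usage lines are invented for
illustration.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio statistic (G^2) for a 2x2 contingency table:
       k11 = count(w1 followed by w2)
       k12 = count(w1 followed by anything but w2)
       k21 = count(anything but w1, followed by w2)
       k22 = count of all remaining bigrams
    """
    n = k11 + k12 + k21 + k22
    r1, r2 = k11 + k12, k21 + k22   # row totals
    c1, c2 = k11 + k21, k12 + k22   # column totals
    total = 0.0
    for obs, exp in ((k11, r1 * c1 / n), (k12, r1 * c2 / n),
                     (k21, r2 * c1 / n), (k22, r2 * c2 / n)):
        if obs > 0:                  # 0 * log(0) is taken as 0
            total += obs * math.log(obs / exp)
    return 2.0 * total

# Counts that exactly match independence give a statistic of zero ...
print(llr(10, 90, 90, 810))         # 0.0
# ... while a strongly associated pair gives a large one (made-up counts):
print(llr(50, 450, 350, 999150))
```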

In unpublished work with Paul Mineiro here at Aptex, I have extended
this measure with results which appear to be better even than the
original log-likelihood measure.

Others have advocated measures such as the Dice coefficient. In a
comparative study, Beatrice Daille compared a variety of association
measures against a hand-annotated list and found that the
log-likelihood ratio test came closest to the human judges'
performance. She evaluated quite a number of different measures in her
work, but exactly which ones I cannot recall.
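For comparison, the Dice coefficient mentioned above is simple to
state; a minimal sketch (the counts are again invented for
illustration):

```python
def dice(pair_count, count1, count2):
    """Dice coefficient for a word pair: 2 * f(w1,w2) / (f(w1) + f(w2)).
    Bounded in [0, 1], unlike the raw association ratio."""
    return 2.0 * pair_count / (count1 + count2)

print(dice(50, 500, 400))  # ~0.111 for a well-attested pair (made-up counts)
print(dice(1, 1, 1))       # 1.0: a hapax pair still scores maximally
```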

The WordSmith tools implement a number of these measures, including my
log-likelihood ratio test. I am also happy to provide source code for
calculating several of them to anyone interested in investigating
further.

Hope this helps.

@Article{ChurchHanks90,
  author  = "Kenneth W. Church and Patrick Hanks",
  title   = "Word Association Norms, Mutual Information, and
             Lexicography",
  journal = "Computational Linguistics",
  volume  = "16",
  number  = "1",
  pages   = "22--29",
  year    = "1990",
}

@Article{dunning93,
  author  = "Ted E. Dunning",
  title   = "Accurate Methods for the Statistics of Surprise
             and Coincidence",
  journal = "Computational Linguistics",
  volume  = "19",
  number  = "1",
  pages   = "61--74",
  year    = "1993",
}

@TechReport{Daille:95,
  author      = "B\'{e}atrice Daille",
  title       = "Combined Approach for Terminology Extraction:
                 lexical statistics and linguistic filtering",
  institution = "{UCREL}, Lancaster University",
  number      = "5",
  year        = "1995",
  summary     = {Daille does some interesting empirical checks to see
                 which test for "termhood" works best against a
                 hand-annotated "gold standard", and log-likelihood
                 comes out best, so there's empirical vindication for
                 the theoretically correct answer, for those of us who
                 need such reassurance},
}

>>>>> "rj" == Rosie Jones <rosie@nl.cs.cmu.edu> writes:

rj> Instead you should look at the ratio of multi-word
rj> co-occurrence frequency, compared to the frequency of the
rj> individual words separately. Thus if you rank multi-word units
rj> by

rj> freq(word1 next to word2)
rj> -------------------------
rj> freq(word1) * freq(word2)

rj> you will get something which will rank "hot dog" above "in
rj> the" without any need for stop-lists. You can extend this to
rj> arbitrary numbers of adjacent words.