I describe in my 1993 CL paper a different (but related) measure which
avoids most of the small-count problems that are encountered in corpus
analysis. The generalized log-likelihood measure that I advocated
performs more effectively (subjectively) for finding multi-word units.
In unpublished work with Paul Mineiro here at Aptex, I have extended
this measure, with results that appear to be even better than the
original log-likelihood measure.
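For the curious, here is a minimal sketch of the log-likelihood ratio
(G^2) computation for a 2x2 contingency table of bigram counts; the
function and variable names are illustrative only, not from any
distributed implementation:

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.
    k11 = count(w1 followed by w2), k12 = count(w1, not w2),
    k21 = count(not w1, w2),        k22 = count(neither)."""
    def h(*counts):
        # Unnormalized "entropy" term: sum of k * ln(k / total),
        # skipping zero cells (lim k->0 of k ln k is 0).
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)
    # G^2 = 2 * (H(cells) - H(row sums) - H(column sums))
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))
```

When the two words are independent (observed counts match the expected
counts), the statistic is zero; strong collocations with large joint
counts give large positive values.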
Others have advocated measures such as the Dice coefficient. In a
comparative study, Beatrice Daille compared a variety of association
measures against a hand-annotated list and found that the
log-likelihood ratio test came closest to human judges' performance.
She evaluated quite a number of different measures in her work, but
exactly which ones I cannot recall.
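For reference, the Dice coefficient for a word pair is simply twice the
joint count over the sum of the individual counts; a one-line sketch
(names here are my own):

```python
def dice(count_xy, count_x, count_y):
    """Dice coefficient for a word pair: 2 * f(x,y) / (f(x) + f(y))."""
    return 2 * count_xy / (count_x + count_y)
```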
The WordSmith tools implement a number of these measures, including my
log-likelihood ratio test. I am also happy to provide source code for
calculating several of these measures to anyone interested in
investigating them.
Hope this helps.
@Article{ChurchHanks90,
author = "Kenneth W. Church and Patrick Hanks",
title = "Word Association Norms, Mutual Information, and
Lexicography",
journal = "Computational Linguistics",
pages = "22--29",
volume = "16",
number = "1",
year = "1990",
}
@Article{dunning93,
  author =  "Ted E. Dunning",
  title =   "Accurate Methods for the Statistics of Surprise and
             Coincidence",
  journal = "Computational Linguistics",
  volume =  "19",
  number =  "1",
  year =    "1993",
  pages =   "61--74",
}
@TechReport{Daille:95,
  author =      "B\'{e}atrice Daille",
  title =       "Combined Approach for Terminology Extraction:
                 Lexical Statistics and Linguistic Filtering",
  institution = "{UCREL}, Lancaster University",
  year =        "1995",
  number =      "5",
  summary =     {Daille does some interesting empirical checks to see
                 which test for "termhood" works best against a
                 hand-annotated "gold standard", and log-likelihood
                 comes out best, so there's empirical vindication for
                 the theoretically correct answer, for those of us who
                 need such reassurance},
}
>>>>> "rj" == Rosie Jones <rosie@nl.cs.cmu.edu> writes:
rj> Instead you should look at the ratio of multi-word
rj> co-occurrence frequency, compared to the frequency of the
rj> individual words separately. Thus if you rank multi-word units
rj> by
rj> freq(word1 next to word2)
rj> -------------------------
rj> freq(word1) * freq(word2)
rj> you will get something which will rank "hot dog" above "in
rj> the" without any need for stop-lists. You can extend this to
rj> arbitrary numbers of adjacent words.
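The ratio Rosie describes can be sketched in a few lines of Python
(this is an illustrative toy, with hypothetical names; real corpus code
would also want frequency cutoffs to avoid small-count artifacts, which
is exactly where the log-likelihood approach helps):

```python
from collections import Counter

def rank_bigrams(tokens):
    """Rank adjacent word pairs by freq(w1 next to w2) divided by
    freq(w1) * freq(w2), highest ratio first."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {
        (w1, w2): count / (unigrams[w1] * unigrams[w2])
        for (w1, w2), count in bigrams.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

On a toy corpus where "in" and "the" are individually frequent, the
pair ("hot", "dog") outranks ("in", "the") without any stop-list,
because the denominator penalizes common words.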