Re: comparisons in text corpora: keywords / CHI square

Ted E. Dunning (ted@aptex.com)
Fri, 30 Aug 1996 10:38:19 -0700

> ... In our study I implemented log-likelihood alongside the
> chi-squared value. For most of the words we were interested in
> (relative frequency above 0.005%) the difference between the
> chi-squared value and the log-likelihood was at most 3%. Possibly
> this problem didn't occur as we were comparing roughly equal size
> subcorpora of the BNC.

that sounds like the correct explanation. i suspected that this might
be the case.

to amplify what paul is saying, with the example i used before:

             word A    other words
           +----------------------
  corpus 1 |      1            999
  corpus 2 |      1         999999

chi^2 should not be applied here since it gives pretty bozoid results.

the situation that paul rayson is talking about, however, is a bit more
like the following:

             word A    other words
           +----------------------
  corpus 1 |    150        1000000
  corpus 2 |   1000       10000000

In this case, Pearson's chi^2 gives a score of 21.74 while the
log-likelihood ratio gives 19.40. Clearly, the two are much more
comparable here. This shows that for certain kinds of corpus
frequency comparison, the traditional chi^2 measure is
perfectly fine.
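
For anyone who wants to check the two scores, here is a minimal Python sketch. It assumes the usual textbook forms for a 2x2 table: Pearson's chi^2 without a continuity correction, and the log-likelihood ratio as 2 * sum(O * ln(O/E)) with expected counts taken from the row and column totals. The function and variable names are made up for illustration.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

def pearson_chi2(table):
    """Pearson's chi^2, no continuity correction (an assumption)."""
    exp = expected_counts(table)
    return sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

def log_likelihood_ratio(table):
    """Log-likelihood ratio: 2 * sum of O * ln(O / E) over non-empty cells."""
    exp = expected_counts(table)
    return 2.0 * sum(table[i][j] * math.log(table[i][j] / exp[i][j])
                     for i in range(2) for j in range(2)
                     if table[i][j] > 0)

# Rows are corpus 1 and corpus 2; columns are word A and other words.
small = [[1, 999], [1, 999999]]
large = [[150, 1000000], [1000, 10000000]]

for name, table in (("small", small), ("large", large)):
    print(name, round(pearson_chi2(table), 2),
          round(log_likelihood_ratio(table), 2))
```

Under these assumptions the second table reproduces the 21.74 and 19.40 quoted above, while for the first table the two statistics diverge sharply.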

It should be noted that even though this level of association is
virtually impossible to have arisen by chance, single cell mutual
information gives a score of only 0.52 (compared to the not terribly
exceptional first example above, where it gave a score of 8.97).
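
The mutual information figures can be checked the same way. This sketch assumes that "single cell mutual information" means the base-2 log of observed over expected count for the upper-left cell (word A in corpus 1), which is a reading that reproduces both numbers quoted above. The function name is made up for illustration.

```python
import math

def single_cell_mi(table):
    """log2(observed / expected) for the upper-left cell of a 2x2 table.

    The expected count comes from the row and column totals under
    independence. Treating this as the intended definition of
    "single cell mutual information" is an assumption.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    return math.log2(a / expected)

# Rows are corpus 1 and corpus 2; columns are word A and other words.
small = [[1, 999], [1, 999999]]
large = [[150, 1000000], [1000, 10000000]]

print(round(single_cell_mi(small), 2))
print(round(single_cell_mi(large), 2))
```

The score depends only on the ratio of observed to expected counts, not on how many observations stand behind that ratio, which is why the one-occurrence table scores so much higher than the well-attested one.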