Re: comparisons in text corpora: keywords / CHI square

Ted E. Dunning (ted@aptex.com)
Thu, 29 Aug 1996 11:14:30 -0700

[how do you determine when word frequencies are different?]

> I have a paper forthcoming in which we used Chi-squared comparison
> on vocabulary in the British National Corpus (spoken part):

unless you have relatively high counts, which is unusual in most
cases where you are looking at word frequencies (your case may be an
exception), chi^2 is a very bad choice for comparing frequencies.
the situation where it is particularly bad is when you are comparing
the frequency of a word in a relatively small corpus with its
frequency in a much larger corpus. in that case chi^2 can easily
overstate the significance of small differences in frequency by
several hundred orders of magnitude.

as an example, suppose you have a test sample of 1000 words in which
a word occurs once, while in the reference sample this word occurs
once in a million words. to analyze this situation, we would
construct a contingency table with the relevant counts:

   1    999
   1 999999

if we use chi^2 to analyze this, we get a score of 499, which
indicates that the situation we have observed is astronomically
unlikely (with one degree of freedom, a chi^2 of 499 corresponds to a
p-value on the order of 10^-110).

if instead we use the log likelihood ratio test (sometimes called
G^2) that i advocated in my computational linguistics article of some
years ago, we get a score of 11.05, which indicates that this
situation would be expected to occur about once in a thousand trials.
this is clearly a much more reasonable estimate of the situation, and
if we were looking at 1000 or more different words in this way, we
would not consider this score at all surprising.
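
as a companion sketch (again, not the suite below), the same table
can be scored with g^2 = 2 * sum over cells of k * ln(k / e), using
the same expected counts; this reproduces the 11.05:

    /* log likelihood ratio statistic g^2 for a 2x2 table */
    #include <math.h>
    #include <stdio.h>

    static double term(double k, double e)
    {
        return k > 0 ? k * log(k / e) : 0.0;  /* 0 * ln 0 taken as 0 */
    }

    double g2_2x2(double k11, double k12, double k21, double k22)
    {
        double n  = k11 + k12 + k21 + k22;
        double r1 = k11 + k12, r2 = k21 + k22;
        double c1 = k11 + k21, c2 = k12 + k22;
        return 2.0 * (term(k11, r1 * c1 / n) + term(k12, r1 * c2 / n)
                    + term(k21, r2 * c1 / n) + term(k22, r2 * c2 / n));
    }

    int main(void)
    {
        printf("g^2 = %.2f\n", g2_2x2(1, 999, 1, 999999));
        return 0;
    }

(compile with -lm. under the null hypothesis g^2 is asymptotically
chi^2 distributed with one degree of freedom, which is where the
once-in-a-thousand figure comes from.)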

another score which has been advocated for this type of purpose is
the single-cell mutual information (as opposed to the average mutual
information, which is a much more common measure in information
theory). ken church is generally credited with popularizing this
measure in computational linguistics circles. unfortunately, this
score has some strong drawbacks, and its odd behavior makes the use
of the raw score untenable in many cases. for instance, in the case
given above, the single-cell mutual information gives a respectably
large score of 8.97. but in a case where the difference is massively
more pronounced,

  10  1
   1 10

the single-cell mutual information test gives a very low score of 0.86.
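
concretely, these numbers come from taking log base 2 of
k11 * n / (row1 * col1) for the upper-left cell (the standard
pointwise form, which reproduces both the 8.97 and the 0.86); a
minimal C sketch, again separate from the suite below:

    /* single-cell ("pointwise") mutual information, upper-left cell */
    #include <math.h>
    #include <stdio.h>

    double mi_cell(double k11, double k12, double k21, double k22)
    {
        double n  = k11 + k12 + k21 + k22;
        double r1 = k11 + k12;   /* first row sum    */
        double c1 = k11 + k21;   /* first column sum */
        return log2(k11 * n / (r1 * c1));
    }

    int main(void)
    {
        printf("mi = %.2f\n", mi_cell(1, 999, 1, 999999));  /* 8.97 */
        printf("mi = %.2f\n", mi_cell(10, 1, 1, 10));       /* 0.86 */
        return 0;
    }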

not only does single-cell mutual information give very surprising and
counter-intuitive results like this, but the scale on which it gives
results is not easily linked to statistical significance. this means
that interpreting the significance of a set of observations must be
done using resampling techniques, which are very computationally
intensive.
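
to illustrate what such resampling involves, here is a rough C
sketch: under the null hypothesis that the word has the pooled rate
in both corpora, it redraws the test-sample count many times and
estimates how often the single-cell score reaches the observed
value. the replicate count and the use of rand() are illustrative
simplifications, not part of the suite mentioned below:

    /* monte carlo tail estimate for the single-cell mi score */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int    n1 = 1000, reps = 100000;    /* test size, replicates */
        const double n2 = 1000000.0, k2 = 1.0;    /* reference size, count */
        const double p  = (1.0 + k2) / (n1 + n2); /* pooled rate (null)    */
        /* single-cell mi of the observed table (test count = 1) */
        const double observed = log2(1.0 * (n1 + n2) / (n1 * (1.0 + k2)));
        int hits = 0;

        srand(12345);
        for (int r = 0; r < reps; r++) {
            int k1 = 0;  /* draw binomial(n1, p) by bernoulli trials */
            for (int i = 0; i < n1; i++)
                if (rand() / (RAND_MAX + 1.0) < p)
                    k1++;
            if (k1 > 0) {
                double mi = log2(k1 * (n1 + n2) / (n1 * (k1 + k2)));
                if (mi >= observed)
                    hits++;
            }
        }
        printf("estimated tail probability: %g\n", (double)hits / reps);
        return 0;
    }

on these numbers the estimate comes out near 0.002, the same order
as the once-in-a-thousand figure that g^2 delivers directly, and far
from anything one could read off the raw mi scale.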

i can make available on request software which computes all of these
scores. it is part of a comprehensive suite of word counting and
comparison programs, all of which are written in C.

a copy of my original paper advocating the use of the log likelihood
ratio test ("Accurate Methods for the Statistics of Surprise and
Coincidence", Computational Linguistics 19(1), 1993) is available via
my old home page, which is at

http://crl.nmsu.edu/users/CRLfolks/dunning.html