Re: comparisons in text corpora: keywords / CHI square

Ted Pedersen (pedersen@seas.smu.edu)
Thu, 29 Aug 1996 17:37:31 -0500 (CDT)

Ted Dunning writes:

> Unless you have relatively high counts, which is unusual in most
> cases where you are looking at word frequencies (your case may be
> an exception), chi^2 is a very bad choice for comparing
> frequencies. The situation where it is particularly bad is when
> you are looking at the frequency of a word in a relatively small
> corpus relative to its frequency in a much larger corpus. In this
> case chi^2 can easily overstate the significance of small
> differences in frequency by several hundred orders of magnitude.
[rest of message deleted]

Ted Dunning's message hit the nail on the head. To add another angle
to what Ted has already written: as an alternative to chi^2 testing,
you may want to look at exact tests such as Fisher's exact test or
the exact conditional test. These tests are designed for very sparse
and skewed samples.
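
To see how badly the chi^2 approximation can break down, here is a
toy sketch (in Python, with counts invented purely for illustration).
Suppose a word occurs once in a 100-word corpus and never in a
1,000,000-word one. The expected count in the rare cell of the 2x2
table is around 10^-4, far below the usual rule of thumb that
expected counts should be at least 5, and the statistic explodes:

    # Toy illustration of chi^2 misbehaving on sparse counts.
    # Table: [[1, 99], [0, 1000000]], i.e. one occurrence in a
    # 100-word corpus vs. none in a 1,000,000-word corpus.
    from math import erfc, sqrt

    def pearson_chi2(a, b, c, d):
        """Pearson chi^2 statistic for the 2x2 table [[a, b], [c, d]]."""
        n = a + b + c + d
        stat = 0.0
        for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                              (c, c + d, a + c), (d, c + d, b + d)):
            expected = row * col / n
            stat += (obs - expected) ** 2 / expected
        return stat

    stat = pearson_chi2(1, 99, 0, 1000000)
    p = erfc(sqrt(stat / 2))  # survival function of chi^2 with 1 df
    print(stat, p)            # stat is ~10^4, so p underflows to 0.0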

We have two papers that address these tests. The first appeared at
AAAI-96 and is called "Significant Lexical Relationships". It focuses
on the exact conditional test and why you would want to use it rather
than chi^2 tests. The second paper is called "Fishing For Exactness"
and it goes into some detail on using Fisher's Exact test. Both papers
are available from my home page.
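
In case a sketch helps here too, below is a minimal from-scratch
version of the one-sided Fisher's exact test (Python again; this is
just the textbook hypergeometric sum, not the code from either
paper), applied to the same invented table as in the chi^2 sketch
above:

    from math import lgamma, exp

    def log_choose(n, k):
        """Log of the binomial coefficient C(n, k)."""
        return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

    def fisher_exact_greater(a, b, c, d):
        """One-sided Fisher's exact test for the table [[a, b], [c, d]]:
        the probability, given the margins, of a count of at least a
        in the upper-left cell."""
        row1 = a + b
        col1 = a + c
        n = a + b + c + d
        p = 0.0
        for k in range(a, min(row1, col1) + 1):
            p += exp(log_choose(row1, k)
                     + log_choose(n - row1, col1 - k)
                     - log_choose(n, col1))
        return p

    # Same invented table as in the chi^2 sketch above.
    print(fisher_exact_greater(1, 99, 0, 1000000))  # roughly 1e-4

Where the chi^2 approximation called the table essentially impossible
under the null hypothesis, the exact test reports roughly a
one-in-ten-thousand event, which is the sort of gap Ted was
describing.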

There is, of course, the related issue of whether you really want to
be doing significance testing at all (both the chi^2 tests and the
exact tests get lumped into the significance-test category). Are you
really taking a random sample from a population? Are you really
trying to infer whether the data let you reject a null hypothesis
about the population you sampled from? That is what the use of a
significance test implies.

I sometimes think that in NLP we are more engaged in a descriptive
enterprise. We are given some chunk of data and we take it apart and
describe it. An analogy would be an elementary school that keeps
statistics on the weights and heights of all the students at the
school. The objective is just to describe the students in that school
- not to make any broader inferences.

But the point of using a significance test is that it lets you make
inferences about a larger population from which you have a random
sample. The real-world analogy is election polling: given a random
sample of voters, you try to determine the preference of the whole
voting population.

So should we be using descriptive or inferential methods in NLP? Both
are used. A recent paper in Computational Linguistics by Frank
Smadja uses the Dice coefficient to find interesting word pairs. A
paper at AAAI-96 by Ellen Riloff uses a relevance rate to find
interesting patterns in text. These are both descriptive approaches.
Our AAAI paper, Ted Dunning's CL paper, and an earlier paper by
Church et al. use significance tests to do the same sort of thing.
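
To make the contrast concrete, here is one more toy sketch in
Python, with invented bigram counts. The Dice coefficient is the
descriptive score Smadja uses; the G^2 statistic follows the recipe
in Ted Dunning's CL paper, though this particular rendering is my
own and is only meant as an illustration:

    from math import log

    def dice(n_xy, n_x, n_y):
        """Dice coefficient for a word pair: purely descriptive."""
        return 2.0 * n_xy / (n_x + n_y)

    def g_squared(n_xy, n_x, n_y, n):
        """Log-likelihood ratio G^2 for the pair's 2x2 contingency
        table: an inferential (significance-test) statistic."""
        observed = ((n_xy, n_x - n_xy),
                    (n_y - n_xy, n - n_x - n_y + n_xy))
        g2 = 0.0
        for i, row_total in enumerate((n_x, n - n_x)):
            for j, col_total in enumerate((n_y, n - n_y)):
                o = observed[i][j]
                e = row_total * col_total / n
                if o > 0:  # 0 * log(0) is taken to be 0
                    g2 += 2.0 * o * log(o / e)
        return g2

    # Invented counts: the pair occurs 8 times; word x occurs 30
    # times and word y 25 times, in a 10,000-word corpus.
    print(dice(8, 30, 25))              # about 0.29
    print(g_squared(8, 30, 25, 10000))  # about 64, far above chance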

So which is more appropriate? Is your favorite corpus of text really
a random sample from some larger population of language? If it
isn't, is it still OK to use an inferential statistic as a
descriptive measure? Do simpler (at least to compute) descriptive
measures do just as well as significance tests at picking out
interesting word combinations?

Well, these are the sorts of questions that I've kicked around in my
head without reaching any resolution. I'd be very interested to hear
what others think about these issues. Some thoughts on the
inferential vs. descriptive question are recorded in a Technical
Report called "What to Infer from a Description", also available
from my web page.

Regards,
Ted

-- 
* Ted Pedersen                     pedersen@seas.smu.edu              * 
*                                  http://www.seas.smu.edu/~pedersen/ *
* Department of Computer Science and Engineering,                     *
* Southern Methodist University, Dallas, TX 75275      (214) 768-3712 *