Re: scaling/norming

Ted Dunning (ted@crl.nmsu.edu)
Tue, 5 Dec 1995 15:42:15 -0700 (MST)

Am I right in thinking, then, that there is no problem where we are
estimating parameters on the basis of more than 5 instances in the
corpus?

no.

there are two problems here. one is the situation where you are
trying to actually estimate parameters based on observations. this
wasn't what i was talking about in the CL paper. the other situation
is where you have observed 5 (or more) instances of some phenomenon
and are trying to determine if this is a surprise. this second
situation is more along the lines of what i was talking about.

in the first situation, using the observations to make some sort of
estimate may be just fine. you won't necessarily be able to put much
faith in your estimated value, but you can definitely find a
reasonable estimated value. using something like the bootstrap will
even let you put some confidence bounds on your estimate.
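to make the bootstrap idea concrete, here is a minimal sketch (the data and function name are hypothetical, not from the discussion above): resample the observed counts with replacement many times, and read confidence bounds off the percentiles of the resampled estimates.

```python
import random

def bootstrap_ci(observations, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of
    `observations` (e.g. per-document counts of some word).
    Hypothetical helper: resamples with replacement, then takes
    percentiles of the resampled means."""
    rng = random.Random(seed)
    n = len(observations)
    means = sorted(
        sum(rng.choice(observations) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# e.g. counts of a word in 10 hypothetical documents
counts = [3, 0, 1, 4, 2, 0, 5, 1, 2, 3]
lo, hi = bootstrap_ci(counts)
```

the interval will be wide when the data are few, which is exactly the point: you get a reasonable estimate plus an honest statement of how little faith to put in it.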

in the second situation, you are probably going to be in trouble. i
say this because if you are even potentially considering 5 instances
an interesting observation, you probably expected far fewer.
(language tends strongly to be a positive phenomenon; if you see
something, you probably see it more than you would have expected).

the key here is the expectation, not the observation.

to make this more concrete, suppose that we have seen the word tiger
524 times in a corpus of 26 million words total (which is the situation
in the 1987 wall street journal).

now if i look at another piece of text and see 7 occurrences of tiger
in a mere 1,000 words, i might say to myself that this seems
surprising. but i still have to be careful because i would only have
expected to see 0.02 instances of tiger in those 1,000 words. this
expected count is much less than 5.
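the arithmetic behind that expected count, using the numbers from the example above:

```python
corpus_count = 524           # occurrences of "tiger" in the corpus
corpus_size = 26_000_000     # total words (1987 Wall Street Journal)
sample_size = 1_000          # size of the new piece of text

rate = corpus_count / corpus_size
expected = rate * sample_size    # expected occurrences in 1,000 words
```

so even though the corpus count (524) is comfortably above 5, the expected count in the sample (about 0.02) is nowhere near 5, and that is the number that matters.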

(So, it is legitimate to use Mutual Info if I'm only going to be
looking at items where the frequency of the collocate is 5 or more.)

no. this falls under situation number 2.

but you *can* use the bootstrap to get confidence limits on the mutual
information. but it is generally easier to screen for significant
collocations and then rescore using whatever measure you care to use.
this will give you a list of items which significantly co-occur, and
also tell you how strongly they appear to co-occur.
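a minimal sketch of that screen-then-rescore pipeline (the function names and threshold choice are mine; the G^2 formula is the standard log-likelihood-ratio statistic for a 2x2 contingency table, and the rescoring measure here is pointwise mutual information):

```python
import math

def g2(k11, k12, k21, k22):
    """Log-likelihood ratio statistic (G^2) for a 2x2 contingency
    table: k11 = both words together, k12 = first word without the
    second, k21 = second without the first, k22 = neither."""
    n = k11 + k12 + k21 + k22
    obs = [k11, k12, k21, k22]
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    exp = [rows[0] * cols[0] / n, rows[0] * cols[1] / n,
           rows[1] * cols[0] / n, rows[1] * cols[1] / n]
    return 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o)

def pmi(k11, k12, k21, k22):
    """Pointwise mutual information for the same table."""
    n = k11 + k12 + k21 + k22
    return math.log2(k11 * n / ((k11 + k12) * (k11 + k21)))

def significant_collocations(tables, threshold=10.83):
    """Screen by G^2 (10.83 is the chi-squared, 1 d.f., 0.001
    cutoff), then rescore the survivors with PMI."""
    return sorted(
        ((pair, pmi(*t)) for pair, t in tables.items()
         if g2(*t) > threshold),
        key=lambda item: -item[1],
    )

tables = {
    ("strong", "tea"): (50, 950, 50, 26_000),   # made-up counts
    ("of", "the"): (10, 10, 10, 10),            # no association
}
survivors = significant_collocations(tables)
```

the screening step keeps only pairs whose co-occurrence is significant; the PMI rescoring then ranks the survivors by how strongly they appear to co-occur.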

Isn't this likely to be the case for John Milton's work?

no. here again, i would recommend the likelihood ratio test (also
known as G^2).

But if he restricts himself to looking at items that occur more
than 5 times in each corpus, a straight comparison of
occurrences-per-thousand-words in the NS and NNS corpora is
legitimate. Yes?

no. we are talking about collocations here, so the populations being
compared are *much* smaller than the corpus as a whole.

Next question: does this form of objection to using normal
approximations relate in any way to "deviations from Poisson" (as
in Church and Gale's recent work)

i think not. my feeling is that these deviations from poisson are
exactly what the likelihood ratio test is ferreting out.

the poisson is the limiting case of the binomial as the number of
trials grows large (with the expected count held fixed). if the size
of the corpus is large, then the binomial and poisson will give
essentially identical results. if not, then they won't. with
collocations, the effective size of the corpus for one row of the
contingency table being analyzed is equal to the number of times one
of the words occurs. this is *far* too small in many cases to consider
using the poisson. for other situations, the poisson is very nice.
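a quick numerical illustration of that point (the example values are mine): with a large number of trials the two distributions agree closely, but with a row total of only a few counts they do not.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# large "corpus": a million trials with the same mean (5)
big = binom_pmf(5, 1_000_000, 5 / 1_000_000)
# small "row" of a collocation table: 10 trials with the same mean (5)
small = binom_pmf(5, 10, 0.5)
# the poisson approximation to both
approx = poisson_pmf(5, 5.0)
```

`big` and `approx` agree to several decimal places; `small` and `approx` differ substantially, which is the situation you are in when one row of the table is governed by a rare word.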

(I don't think so, because the log-likelihood test is based on the
binomial distribution - a one-parameter distribution: the
Church-Gale work (as well as my own) is looking at the inadequacy,
for various purposes, of one-parameter distributions. But I'm
getting out of my depths, and seriously need help...)

if you don't like one parameter distributions, then using the
multinomial can help. it is interesting, however, that if you are
only interested in the frequency of a single word, then using the
binomial likelihood ratio test is equivalent to the multinomial test.
in fact, similar results can be derived for a wide class of Markov
models and probabilistic decision tree models. concisely put, when
examining frequencies of a single word in a model with disjoint
contexts, you might as well use the binomial likelihood ratio test
since it will be equivalent to the full blown likelihood ratio test.
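that equivalence is easy to check numerically. in the sketch below (the counts and the null probability p0 are made up), the multinomial test pins only one word's probability under the null and re-estimates the other categories freely; it comes out identical to the binomial test on "that word vs. everything else":

```python
import math

def binom_lr(k, n, p0):
    """Binomial log-likelihood ratio: observed rate k/n vs. null p0."""
    phat = k / n
    stat = 0.0
    if k:
        stat += k * math.log(phat / p0)
    if n - k:
        stat += (n - k) * math.log((1 - phat) / (1 - p0))
    return 2 * stat

def multinom_lr(counts, w, p0):
    """Multinomial log-likelihood ratio where the null fixes only the
    probability of word w at p0; all other category probabilities are
    re-estimated by maximum likelihood under both hypotheses."""
    n = sum(counts)
    kw = counts[w]
    rest = n - kw
    ll_alt = sum(k * math.log(k / n) for k in counts if k)
    ll_null = (kw * math.log(p0) if kw else 0.0) + sum(
        k * math.log((1 - p0) * k / rest)
        for i, k in enumerate(counts)
        if i != w and k
    )
    return 2 * (ll_alt - ll_null)

counts = [7, 400, 300, 293]   # word of interest is index 0; n = 1000
```

the free parameters for the other words cancel out of the ratio, which is why the two statistics match exactly.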

it should be noted that the IR community has long known that IDF
weighting is a rough measure of deviation from the single Poisson
model. in the IR case, you generally have 1 instance of a term in
a very short query so that the likelihood ratio (for either the
identical Poisson or binomial null hypotheses) is essentially
dependent only on the corpus counts.
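one way to see that connection concretely (the numbers are hypothetical; the comparison is in the Church-Gale style of checking a single-Poisson prediction against observed document frequency):

```python
import math

def idf(df, n_docs):
    """Observed inverse document frequency: -log(df / N)."""
    return -math.log(df / n_docs)

def single_poisson_idf(cf, n_docs):
    """IDF predicted by a single Poisson model: with rate
    lam = cf / N, a document contains the term at least once
    with probability 1 - exp(-lam)."""
    lam = cf / n_docs
    return -math.log(1 - math.exp(-lam))

# hypothetical term: 100 corpus occurrences, but concentrated
# ("bursty") in only 50 of 10,000 documents
observed = idf(50, 10_000)
predicted = single_poisson_idf(100, 10_000)
```

for a bursty term the observed IDF exceeds the single-Poisson prediction, so the size of the gap serves as a rough deviation-from-Poisson score, which is the sense in which IDF is doing this job implicitly.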