Re: scaling/norming

Dan Melamed (melamed@unagi1k.cis.upenn.edu)
Thu, 30 Nov 1995 12:13:44 -0500 (EST)

Any text corpus is but a small sample of whatever language it happens
to be in. You can count features of the corpus to estimate the
distribution of those features in the language (or sublanguage), but
you will have only estimates. The problem is that the accuracy of the
estimates varies non-linearly with the magnitude of the estimate. Low
counts produce much more inflated estimates than high counts.

The problem with comparing features of corpora of different sizes is
that raw frequencies are proportional to the size of the corpus, but
simply normalizing by corpus size results in skewed probability
estimates, especially for low-frequency features. To see the problem
at work, you can run the following experiment:

1) Divide your NS corpus into two parts, label them "control" and
"test."

2) Randomly pick a bunch of words from the control half, with frequencies of 1,
2 and 3.

3) Divide by the size of the control half to arrive at estimated
probabilities of occurrence for those words.

4) Measure the accuracy of the estimates by finding the actual
frequencies in the test half.

5) Find the average accuracy for all words of frequency 1 in the
control, then the average for frequency 2, then 3.

If you pick a large enough sample of words, you will see that the
frequency-1 words are the worst estimators, and the frequency-2 words
are second worst.
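
If you want to automate the experiment, here is a rough sketch in
Python, assuming the corpus is already tokenized into a flat list of
strings (the function name, sample size and output format are just
illustrative):

    import random
    from collections import Counter

    def frequency_estimate_experiment(corpus_tokens, sample_size=200, seed=0):
        """Split a corpus in half and see how well low-count words in the
        control half predict their frequency in the test half."""
        random.seed(seed)
        mid = len(corpus_tokens) // 2
        control, test = corpus_tokens[:mid], corpus_tokens[mid:]
        control_counts = Counter(control)
        test_counts = Counter(test)
        for r in (1, 2, 3):
            # all control words seen exactly r times
            words = [w for w, c in control_counts.items() if c == r]
            sample = random.sample(words, min(sample_size, len(words)))
            if not sample:
                continue
            # probability estimated from the control half...
            est_p = r / len(control)
            # ...versus the average relative frequency observed in the test half
            avg_test_p = sum(test_counts[w] for w in sample) / (len(sample) * len(test))
            print("control freq %d: estimated p = %.2e, average test p = %.2e"
                  % (r, est_p, avg_test_p))

If the effect described above is real, the gap between the two numbers
should be largest for the frequency-1 words and shrink as the control
frequency grows.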

So that's the problem. What's the solution? Smoothing. At least
that's the best solution on "the market" right now. There are several
varieties of smoothing. The most painless to learn for
non-statisticians is called "simple Good-Turing smoothing," and is
described in a paper by Gale & Sampson in an (upcoming?) issue of the
Journal of Quantitative Linguistics. I believe there are also some
preprints floating around on the net. If you really want to dive in at
the deep end, just look up "smoothing" in the statistics section of
your library.
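
For a taste of what Good-Turing does, here is a minimal Python sketch
of the core re-estimation formula, r* = (r + 1) * N[r+1] / N[r], where
N[r] is the number of distinct word types seen exactly r times. Note
that this is only the bare formula: Gale & Sampson's "simple
Good-Turing" method additionally smooths the N[r] values (roughly, by
fitting a line to log N[r] against log r) before applying it, which is
what makes it usable when N[r+1] happens to be zero. The function name
and the raw-count fallback below are my own simplifications:

    from collections import Counter

    def good_turing_adjusted_counts(word_counts):
        """Re-estimate counts with the basic Good-Turing formula:
        a word seen r times gets the adjusted count
        r* = (r + 1) * N[r+1] / N[r]."""
        freq_of_freqs = Counter(word_counts.values())   # N[r]
        adjusted = {}
        for word, r in word_counts.items():
            n_r = freq_of_freqs[r]
            n_r_plus_1 = freq_of_freqs.get(r + 1, 0)
            if n_r_plus_1 > 0:
                adjusted[word] = (r + 1) * n_r_plus_1 / n_r
            else:
                # no words were seen r+1 times; keep the raw count
                # (the real simple GT smooths N[r] so this case disappears)
                adjusted[word] = float(r)
        return adjusted

The adjusted counts for low-frequency words come out smaller than the
raw counts, which deflates exactly the estimates that the experiment
above shows to be inflated.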

Dan Melamed
UPenn

> What's the doctrine on comparing corpora of different sizes?
>
> I want to compare features (wds, n-grams, POS tags etc) from a corpus of .5
> mil words of the writing of NS speakers of English to a 750,000 wd corpus of
> the writing of NNS speakers. I've been told that proportional or scaled
> comparisons are inadvisable (presumably since wd freqs can't be predicted
> proportionally, because of Zipf's curve??). Am I left with no alternative
> but to throw away materials from the smaller corpus?
>
> John Milton
> HKUST
>