Re: scaling/norming

Ted Dunning (ted@crl.nmsu.edu)
Sun, 3 Dec 1995 13:10:08 -0700 (MST)

> Any text corpus is but a small sample of whatever language it happens
> to be in.

true.

> You can count features of the corpus to estimate the
> distribution of those features in the language (or sublanguage), but
> you will have only estimates.

also true.

> The problem is that the accuracy of the
> estimates varies non-linearly with the magnitude of the estimate. Low
> counts produce much more inflated estimates than high counts.

very much so.

but it should be noted that the bootstrap method can be used to
determine (roughly) the accuracy of the probability estimates. this
cannot be done with a multinomial model (where the variance would
simply be np(1-p)) because real text deviates very significantly from
the multinomial model. there is also the problem that variance is
usually interpreted under an assumption of a normal distribution,
which is incorrect for small expected counts.
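
a rough python sketch of the bootstrap idea (the tiny corpus, the word
being counted, and the number of resamples are all made-up examples):

import random

def bootstrap_ci(tokens, word, n_boot=1000, alpha=0.05):
    # resample the corpus with replacement and record the relative
    # frequency of `word` in each resample
    n = len(tokens)
    estimates = sorted(
        random.choices(tokens, k=n).count(word) / n
        for _ in range(n_boot)
    )
    return (estimates[int(n_boot * alpha / 2)],
            estimates[int(n_boot * (1 - alpha / 2)) - 1])

tokens = "the cat sat on the mat the dog sat on the log".split()
print(bootstrap_ci(tokens, "the"))   # a wide interval: the sample is tiny

note that resampling single tokens like this actually reproduces the
multinomial assumption; to capture the deviation from it, one would
resample whole sentences or documents instead.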

The problem with comparing features of corpora of different sizes is
that the frequencies will be proportional to the size of the corpus,
but normalizing w.r.t. the size of the corpus will result in skewed
probability estimates.

right. but you can make the comparison rather handily. one way is to
use the bootstrap to get confidence limits and compare using those,
and another way is to use the G^2 statistic as i recommended in a 1993
article (dunning93 below, which dan knows about, of course).
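
the G^2 computation itself is small. a sketch of the binomial form of
the test for one word's count in two corpora (the counts below are
invented):

from math import log

def ll(k, n, p):
    # binomial log likelihood of k successes in n trials at rate p,
    # with the convention 0 * log(0) = 0
    s = 0.0
    if k > 0:
        s += k * log(p)
    if n - k > 0:
        s += (n - k) * log(1 - p)
    return s

def g2(k1, n1, k2, n2):
    # twice the log of the likelihood ratio: separate rates for the
    # two corpora versus a single pooled rate
    p = (k1 + k2) / (n1 + n2)
    return 2 * (ll(k1, n1, k1 / n1) + ll(k2, n2, k2 / n2)
                - ll(k1, n1, p) - ll(k2, n2, p))

# a word seen 8 times in 1000 tokens vs. 20 times in 10000 tokens;
# the result can be compared against a chi-squared table with 1 df
print(g2(8, 1000, 20, 10000))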

> So that's the problem. What's the solution? Smoothing.

that is one answer, but it doesn't really deal with the problem.
smoothing can give you better estimates, but it doesn't tell you
anything about how good the estimates are. you need to know how good
the estimates are in order to compare them.

> ... There are several varieties of smoothing. The most painless to
> learn for non-statisticians is called "simple Good-Turing smoothing,"
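
the heart of good-turing is the count adjustment r* = (r+1)N_{r+1}/N_r,
where N_r is the number of word types seen exactly r times. a bare
python sketch (the "simple" method also smooths the N_r curve first,
which is omitted here, and the corpus is made up):

from collections import Counter

def good_turing(counts):
    # map each raw count r to the adjusted count (r+1) * N_{r+1} / N_r;
    # without smoothing of N_r this breaks down wherever N_{r+1} = 0,
    # so those counts are left unadjusted
    n_r = Counter(counts.values())
    adjusted = {}
    for word, r in counts.items():
        if n_r.get(r + 1, 0) > 0:
            adjusted[word] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[word] = float(r)
    p_unseen = n_r.get(1, 0) / sum(counts.values())
    return adjusted, p_unseen

counts = Counter("the cat sat on the mat the dog sat on the log".split())
print(good_turing(counts))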

another fairly simple version is deleted interpolation. some of the
mercer+brown+others papers describe this approach quite well.
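
a sketch of the interpolated bigram estimate (the mixing weights below
are fixed constants purely for illustration; the actual method
estimates them by EM on held-out (deleted) data, and the corpus is
made up):

from collections import Counter

LAMBDAS = (0.1, 0.3, 0.6)   # uniform, unigram, bigram weights; sum to 1

def train(tokens):
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def p_interp(w1, w2, unigrams, bigrams, vocab_size):
    l_uniform, l_uni, l_bi = LAMBDAS
    n = sum(unigrams.values())
    # maximum likelihood estimates; each falls back to zero when unseen
    p_bi = bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    p_uni = unigrams[w2] / n
    return l_bi * p_bi + l_uni * p_uni + l_uniform / vocab_size

tokens = "the cat sat on the mat the dog sat on the log".split()
unigrams, bigrams = train(tokens)
print(p_interp("the", "cat", unigrams, bigrams, vocab_size=len(unigrams)))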

here are some references which describe the bootstrap, G^2, deleted
interpolation and the good-turing estimation method.

@article{dunning93,
author={Ted E. Dunning},
title={Accurate Methods for the Statistics of Surprise and Coincidence},
journal={Computational Linguistics},
volume={19},
number={1},
year={1993},
pages={61-74},
summary={Recommends the use of G^2 for comparing frequencies}
}

@article{brown92,
author={Peter F. Brown and Stephen A. Della\ Pietra and
Vincent J. Della\ Pietra and Jennifer C. Lai and Robert L. Mercer},
title={An Estimate of an Upper Bound for the Entropy of English},
journal={Computational Linguistics},
volume={18},
number={1},
pages={31-40},
year={1992},
summary={Describes the use of deleted interpolation to form a
language model which is then used to estimate a bound for the entropy
of English.}
}

@article{brown92a,
author={Peter F. Brown and Vincent J. Della\ Pietra and Peter V. deSouza and
Jennifer C. Lai and Robert L. Mercer},
title={Class-Based n-gram Models of Natural Language},
journal={Computational Linguistics},
volume={18},
number={4},
pages={467-480},
year={1992},
summary={Describes how deleted interpolation and a clustering
algorithm can produce a very good language model}
}

@article{church91,
author={Kenneth W. Church and William A. Gale},
title={A Comparison of the Enhanced Good-Turing and Deleted
Estimation Methods for Estimating Probabilities of English Bigrams},
journal={Computer Speech and Language},
volume={5},
number={1},
pages={19-54},
year={1991},
summary={Describes how to use Good-Turing to estimate
probabilities. Unfortunately, the resulting estimates are compared
directly instead of examining the impact on a language model into
which they might be incorporated.}
}

@article{magerman95,
author={David M. Magerman and Eugene Charniak},
title={Statistical Language Learning},
journal={Computational Linguistics},
volume={21},
number={1},
pages={103},
year={1995},
summary={Points out some fine points in deleted interpolation}
}

@book{efron82,
author={Bradley Efron},
title={The Jackknife, the Bootstrap and Other Resampling Plans},
publisher={SIAM},
year={1982},
summary={Excellent treatment of the bootstrap}
}

@article{efron91,
author={Bradley Efron and Robert Tibshirani},
title={Statistical Data Analysis in the Computer Age},
journal={Science},
volume={253},
number={5018},
pages={390-395},
year={1991},
summary={Wonderful introduction to the bootstrap}
}