Re: Corpora: Perplexity and corpus size

Ted E. Dunning (ted@aptex.com)
Tue, 23 Dec 1997 12:07:51 -0800

There are very close relationships between a variety of entropic
measures, the generalized log-likelihood ratio test, and the chi^2
distribution. I can describe the basis of these relationships here;
substantially more detail is available in my dissertation. This sort
of development can, no doubt, be found elsewhere as well, although I
have never found it all in one place.

The basis of the relationship is the behavior of the log-likelihood
ratio

\lambda = - 2 \log [ max_{\omega \in H_0} p(X | \omega) / max_{\omega \in H} p(X | \omega) ]

where \omega represents the model parameters, X the observations, H_0
the null hypothesis parameter space, and H the entire parameter space.
With a few reasonable (for our field) assumptions, \lambda is
asymptotically \chi^2 distributed with degrees of freedom equal to the
difference in dimension between H and H_0.
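
To make the shape of \lambda concrete, here is a minimal sketch in
Python for the simplest setting: a multinomial with a fully specified
null distribution. The names (llr_statistic, counts, p0) and the
numbers are purely illustrative.

  # Sketch of the log-likelihood ratio for a multinomial with a fully
  # specified null distribution p0 (an illustrative special case).
  import numpy as np
  from scipy.stats import chi2

  def llr_statistic(counts, p0):
      """lambda = 2 * sum_x k_x * log(phat_x / p0_x), phat = counts / N."""
      counts = np.asarray(counts, dtype=float)
      p0 = np.asarray(p0, dtype=float)
      n = counts.sum()
      nz = counts > 0                      # 0 * log 0 contributes nothing
      return 2.0 * np.sum(counts[nz] * np.log((counts[nz] / n) / p0[nz]))

  counts = [30, 25, 25, 20]                # observed symbol counts
  p0 = [0.25, 0.25, 0.25, 0.25]            # null hypothesis distribution
  lam = llr_statistic(counts, p0)
  # dim(H) - dim(H_0) = (K - 1) - 0 degrees of freedom for K symbols
  print(lam, chi2.sf(lam, df=len(counts) - 1))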

In the case that the model is a multinomial or something very much
like a multinomial (e.g. approximately a Markov model), \lambda
expands into terms which look generally like \sum_x k_x log k_x/k_*,
where k_x is the number of times x was observed and k_* is the number
of times anything was observed. This already suggests a connection
between \lambda and mutual information (and cross-entropy and the
other similar measures).
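
Spelling the expansion out for the fully specified multinomial null of
the sketch above (one particular case, but it shows the pattern):

  \lambda = 2 \sum_x k_x log (k_x/k_*)/p_x
          = 2 k_* \sum_x (k_x/k_*) log (k_x/k_*)/p_x
          = 2 k_* D(\hat p || p)

where \hat p_x = k_x/k_* is the empirical distribution, p_x is the
null probability of x, and D is the relative entropy, so \lambda is
just 2 k_* times the divergence between the observed and hypothesized
distributions.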

If we are actually dealing with multinomial distributions and our null
hypothesis is that separate sets of observations have the same
underlying distribution, then we can arrange our observations in a
contingency table. The mutual information between the multinomial
symbols (often taken as the columns of the contingency table) and the
different sets of observations (often taken as the rows) can easily be
shown to be exactly equal to K \lambda, where the constant K depends
only on the total number of observations.
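
A small check of that equivalence in Python (my own sketch with an
arbitrary 2x2 table; measuring the mutual information in nats, the
constant works out to K = 1/(2N), with N the total count):

  # The contingency-table G statistic (lambda) and the empirical mutual
  # information of rows vs. columns agree exactly, up to the factor 2*N.
  import numpy as np

  def g_statistic(table):
      """lambda = 2 * sum_ij k_ij * log(k_ij / E_ij), E_ij = row_i * col_j / N."""
      table = np.asarray(table, dtype=float)
      n = table.sum()
      expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0) / n
      nz = table > 0
      return 2.0 * np.sum(table[nz] * np.log(table[nz] / expected[nz]))

  def mutual_information(table):
      """I(rows; cols) in nats of the empirical joint distribution."""
      p = np.asarray(table, dtype=float) / np.sum(table)
      outer = p.sum(axis=1, keepdims=True) * p.sum(axis=0)
      nz = p > 0
      return np.sum(p[nz] * np.log(p[nz] / outer[nz]))

  table = np.array([[10, 20], [30, 40]])
  n = table.sum()
  print(g_statistic(table), 2 * n * mutual_information(table))  # agree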

For models more complex than the multinomial, this sort of exact
equivalence can be much more complex to demonstrate, but it will still
hold as long as the underlying assumptions are reasonable. We can thus
reasonably expect the mutual information between two sets of such
observations to be asymptotically \chi^2 distributed if the null
hypothesis holds. This statement pertains to held-out data as well as
to any other separated set of observations.
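
A quick simulation of the multinomial case (again just a sketch, with
arbitrary choices of distribution, sample size, and number of
replications) shows the expected behaviour under the null:

  # Draw pairs of samples from the same multinomial, compute lambda
  # (equivalently 2 * N * I) for each pair, and compare the observed
  # mean with the chi^2 mean, i.e. the K - 1 degrees of freedom.
  import numpy as np
  from scipy.stats import chi2

  rng = np.random.default_rng(0)
  p = np.array([0.5, 0.3, 0.15, 0.05])     # shared underlying distribution
  k = len(p)

  def g_two_samples(a, b):
      table = np.vstack([a, b]).astype(float)
      n = table.sum()
      expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0) / n
      nz = table > 0
      return 2.0 * np.sum(table[nz] * np.log(table[nz] / expected[nz]))

  lams = [g_two_samples(rng.multinomial(1000, p), rng.multinomial(1000, p))
          for _ in range(2000)]
  print(np.mean(lams), chi2.mean(df=k - 1))  # both should be near K - 1 = 3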

Interestingly, other measures such as cross-entropy and perplexity can
also be brought into this argument by considering only the held-out
data. In this case, the null hypothesis typically specifies all of the
parameters, so the number of degrees of freedom is equal to the number
of parameters in the model. This number must generally be modified
somewhat, since it is often the case that not all of the apparently
free parameters are really as free as they look. This can change the
effective degrees of freedom, but the fact remains that if the null
hypothesis holds, then the cross-entropy (or the log of the
perplexity) should be chi-squared distributed, subject to scaling by
the number of observations.
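
Concretely, in a toy unigram setting where the model q really is the
distribution of the held-out data (an assumption made just for this
sketch), the gap between the held-out cross-entropy and the held-out
sample's own entropy, scaled by 2N, has the \chi^2 mean of K - 1:

  # Simulate held-out counts from the model q and check that
  # 2 * N * (H(phat, q) - H(phat)) = 2 * N * D(phat || q) has the
  # chi^2 mean of K - 1 (all numbers here are arbitrary).
  import numpy as np
  from scipy.stats import chi2

  rng = np.random.default_rng(1)
  q = np.array([0.4, 0.3, 0.2, 0.1])       # model, taken to be the truth
  k, n_test = len(q), 5000

  def scaled_excess_cross_entropy(counts, q):
      counts = np.asarray(counts, dtype=float)
      n = counts.sum()
      phat = counts / n
      nz = counts > 0
      return 2.0 * n * np.sum(phat[nz] * np.log(phat[nz] / q[nz]))

  stats = [scaled_excess_cross_entropy(rng.multinomial(n_test, q), q)
           for _ in range(2000)]
  print(np.mean(stats), chi2.mean(df=k - 1))  # both should be near 3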

Thus, if the test corpus *is* drawn from the same distribution as was
estimated from the training data, we can expect the mean of the
cross-entropy to decrease in inverse proportion to the size of the
test corpus. The situation is a bit more complex when the training set
is small enough that the variation caused by errors in estimation is
comparable to the variation caused by the small size of the test set.
Ultimately, if you really want to analyze the situation, you might
consider switching to mutual information instead of perplexity. If you
make the training and test sets roughly equal in size, then this
becomes a test of the amount of variation in estimated models.
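
One way to see that scaling numerically (my reading of the claim, in
the same toy unigram setting as before): the average excess
cross-entropy D(\hat p || q) shrinks roughly like (K - 1)/(2N) as the
test corpus grows.

  # For test corpora of increasing size N drawn from the model q, the
  # mean of D(phat || q) = H(phat, q) - H(phat) tracks (K - 1) / (2 * N).
  import numpy as np

  rng = np.random.default_rng(2)
  q = np.array([0.4, 0.3, 0.2, 0.1])
  k = len(q)

  def excess_cross_entropy(counts, q):
      counts = np.asarray(counts, dtype=float)
      phat = counts / counts.sum()
      nz = counts > 0
      return np.sum(phat[nz] * np.log(phat[nz] / q[nz]))

  for n_test in (100, 1000, 10000):
      mean_d = np.mean([excess_cross_entropy(rng.multinomial(n_test, q), q)
                        for _ in range(2000)])
      print(n_test, mean_d, (k - 1) / (2 * n_test))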

This discussion is necessarily quite a bit more abbreviated than is
good for really understanding things. Anyone who would like more
information should feel free to email me (or to continue the
discussion on this mailing list). Other people are bound to have very
cogent contributions, especially people from the speech recognition
community where information theoretic measures of performance have
been used since the dawn of time (as it were).

>>>>> "ak" == Adam Kilgarriff <Adam.Kilgarriff@itri.brighton.ac.uk> writes:

ak> Can anyone point me to results/discussions of how perplexity
ak> (and related info-theoretic measures, eg cross-entropy) vary
ak> with size of training and test corpora?