Re: Corpora: Perplexity and corpus size

Bill Fisher (william.fisher@nist.gov)
Tue, 23 Dec 1997 13:43:13 -0500

Adam -

On Dec 23, 6:09pm, you wrote:
> Subject: Corpora: Perplexity and corpus size
>
> Can anyone point me to results/discussions of how perplexity (and
> related info-theoretic measures, e.g. cross-entropy) vary with
> size of training and test corpora?
>
> Adam Kilgarriff

There's a pretty good discussion of these matters in the book
"Corpus-Based Methods in Language and Speech Processing", ed.
Steve Young and Gerrit Bloothooft, Kluwer, 1997, ISBN 0-7923-4463-4.

On p. 178, Herman Ney et al. give the basic equations defining
corpus (i.e., test-set) perplexity. Then on pp. 204-205 they give
tables showing corpus perplexities for several different language
modeling techniques, measured against a test corpus of 324k words
from the WSJ task, with each language model trained on corpora of
three sizes: 1, 4, and 39 million words.
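
In case it helps, the definition they use amounts to this: for a test
corpus of N words, perplexity is the inverse of the probability the
model assigns to the whole corpus, normalized per word, i.e.
PP = P(w_1 ... w_N)^(-1/N) = exp(-(1/N) * sum_i log P(w_i | history)).
Here is a minimal sketch of that computation in Python (my own
illustration, not code from the book; it assumes you already have
natural-log per-word probabilities from whatever model you're testing):

    import math

    def corpus_perplexity(word_log_probs):
        # word_log_probs: one natural-log probability, log P(w_i | history),
        # per word of the test corpus, produced by your language model.
        n = len(word_log_probs)
        return math.exp(-sum(word_log_probs) / n)

    # Sanity check: if the model gives each of 3 words probability 0.1,
    # the test-set perplexity is exactly 10:
    #   corpus_perplexity([math.log(0.1)] * 3)  ->  10.0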

For instance, their table 6.3 (p. 204) shows:

                               Training corpus size
Language modeling technique      1 M      4 M     39 M
------------------------------------------------------
Katz discount, CMU toolkit     250.8    163.6    102.3
Katz discount, this work       250.9    163.5    102.3
Absolute discounting           248.8    163.4    102.5

- Bill F.