Re: Corpora: Perplexity and corpus size

Bill Fisher (william.fisher@nist.gov)
Tue, 23 Dec 1997 13:43:13 -0500

Adam -

On Dec 23, 6:09pm, you wrote:
> Subject: Corpora: Perplexity and corpus size
>
> Can anyone point me to results/discussions of how perplexity (and
> related info-theoretic measures, e.g. cross-entropy) vary with
> size of training and test corpora?
>
> Adam Kilgarriff

There's a pretty good discussion of these matters in the book
"Corpus-Based Methods in Language and Speech Processing", ed.
Steve Young and Gerrit Bloothooft, Kluwer, 1997, ISBN 0-7923-4463-4.

On p. 178, Herman Ney et al. give the basic equations defining
corpus (i.e., test-set) perplexity. Then on pp. 204-205 they give
tables showing corpus perplexities for several different language
modeling techniques, measured against a test corpus of 324k words
from the WSJ task, with each language model trained on corpora of
three sizes: 1, 4, and 39 million words.
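
In case it helps, the definition they use amounts to this: for a test
corpus of N words, perplexity is the inverse of the probability the
model assigns to the whole corpus, normalized per word, i.e.
PP = P(w_1 ... w_N)^(-1/N) = exp(-(1/N) * sum_i log P(w_i | history)).
Here is a minimal sketch of that computation in Python (my own
illustration, not code from the book; it assumes you already have
natural-log per-word probabilities from whatever model you're testing):

    import math

    def corpus_perplexity(word_log_probs):
        # word_log_probs: one natural-log probability, log P(w_i | history),
        # per word of the test corpus, produced by your language model.
        n = len(word_log_probs)
        return math.exp(-sum(word_log_probs) / n)

    # Sanity check: if the model gives each of 3 words probability 0.1,
    # the test-set perplexity is exactly 10:
    #   corpus_perplexity([math.log(0.1)] * 3)  ->  10.0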

For instance, their table 6.3 (p. 204) shows:

                               Training corpus size
Language modeling technique      1 M      4 M     39 M
------------------------------------------------------
Katz discount, CMU toolkit     250.8    163.6    102.3
Katz discount, this work       250.9    163.5    102.3
Absolute discounting           248.8    163.4    102.5

- Bill F.