Re: Corpora: estimate of grammatical sentences

Ted E. Dunning (ted@aptex.com)
Wed, 27 Aug 1997 15:02:31 -0700

my strong assertion would be that the best empirical estimate would be
based on the perplexity of the best-known probabilistic language
models. there is the problem that "well-formed sentences" is not
really a well-formed concept. "reasonably complete grammars" are
typically neither reasonable nor complete nor grammars; instead they
tend to be much more complex data structures which incorporate all
kinds of semantic information and other constraints.

as a rough estimate, the perplexity for an interpolated trigram model
of english reported some years ago by brown, mercer, et al. was about
240. since perplexity is roughly analogous to the average fanout in
the word lattice, you could say that there are roughly 240^N sentences
of length N. of course, you will have a hard time reconciling
counting with a probabilistic model which assigns non-zero
probabilities to all strings of words. furthermore, this is a Very
Large Number and is almost certainly an over-estimate.
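
for concreteness, here is a minimal python sketch of that arithmetic;
the 240 figure and the lengths are just the numbers quoted above, and
the function name is mine, not anything from the brown et al. work:

    import math

    # perplexity is 2**H where H is the per-word cross-entropy in bits;
    # it behaves like an average fanout, so perplexity**N is a crude
    # count of word sequences of length N.
    def sentences_of_length(perplexity, n):
        return perplexity ** n

    trigram_ppl = 240.0   # trigram perplexity quoted above
    for n in (5, 10, 20):
        print("N = %2d  ->  about 10^%.1f sentences"
              % (n, math.log10(sentences_of_length(trigram_ppl, n))))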

perhaps a closer estimate would be to take shannon's original estimate
of just more than 1 bit of entropy per letter in English. with an
average of about 5 letters per word (roughly 6 characters once you
count the space), this gives about 6 bits of entropy per word, or a
perplexity of 2^6 = 64, which gives the estimate of 64^N sentences of
length N. this is still a Very Large Number, but it has the minor
virtue of being much smaller than the first number.
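
and as a rough side-by-side of the two estimates (again just a sketch,
using 6 bits/word for the shannon figure and 240 for the trigram
perplexity):

    import math

    trigram_ppl = 240.0     # interpolated trigram perplexity quoted above
    shannon_ppl = 2.0 ** 6  # ~1 bit/letter * ~6 chars/word = 6 bits -> 64

    for n in (5, 10, 20):
        print("N = %2d   240^N ~ 10^%.1f    64^N ~ 10^%.1f"
              % (n, n * math.log10(trigram_ppl), n * math.log10(shannon_ppl)))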

>>>>> "sj" == Stephen Johnson <johnson@cucis.cis.columbia.edu> writes:

sj> Given a language with a vocabulary of W words, what is a rough
sj> approximation of the number of well-formed sentences of length
sj> N or smaller? (Well-formedness is determined by any
sj> reasonably complete grammar of the language of your choosing.)

sj> Clearly the number of grammatical sentences is much smaller
sj> than W^(N+1). What is a better approximation? Does anyone
sj> have an empirical method for estimating this number using
sj> corpus-based techniques?