Corpora: estimate of grammatical sentences

Stephen Johnson (johnson@cucis.cis.columbia.edu)
Wed, 27 Aug 1997 13:45:56 -0400

I would like to find out what is known about the following
combinatorial question about natural language sentences:

Given a language with a vocabulary of W words, what is a rough
approximation of the number of well-formed sentences of length N or
less? (Well-formedness is determined by any reasonably complete
grammar of the language of your choosing.)

Clearly the number of grammatical sentences is much smaller than
W^(N+1), which bounds the number of all word strings of length N or
less. What is a better approximation? Does anyone have an
empirical method for estimating this number using corpus-based
techniques?
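
To make the scale of the gap concrete, here is a minimal Python sketch.
It is only an illustration under stated assumptions: the toy grammar,
the category sizes, and the function names are all invented for this
example and are not taken from any real grammar or corpus. The first
function counts every word string of length 1..N, which equals
(W^(N+1) - W)/(W - 1) and is therefore below W^(N+1). The second counts
derivations of a small grammar in Chomsky normal form; this equals the
number of distinct sentences only because the toy grammar is
unambiguous and its word categories are assumed to be disjoint.

```python
from functools import lru_cache


def all_strings(W, N):
    """Number of word strings of length 1..N over a vocabulary of W words."""
    return sum(W ** k for k in range(1, N + 1))


# Hypothetical toy grammar in Chomsky normal form (an assumption for this sketch).
# Binary rules: parent -> (left child, right child)
BINARY = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("V", "NP")],
}
# Lexical rules: how many distinct words each category can rewrite to.
# "VP": 1 stands for a single intransitive verb. Categories are assumed disjoint,
# so the vocabulary size is W = 2 + 3 + 2 + 1 = 8.
LEXICON = {"Det": 2, "N": 3, "V": 2, "VP": 1}


@lru_cache(maxsize=None)
def derivations(symbol, length):
    """Number of derivations of word strings of exactly `length` from `symbol`."""
    if length == 1:
        return LEXICON.get(symbol, 0)
    total = 0
    for left, right in BINARY.get(symbol, []):
        # Try every split of the string between the left and right child.
        for k in range(1, length):
            total += derivations(left, k) * derivations(right, length - k)
    return total


def grammatical(N):
    """Number of sentences of length 1..N (exact here: the grammar is unambiguous)."""
    return sum(derivations("S", n) for n in range(1, N + 1))


W = sum(LEXICON.values())
print(all_strings(W, 5), grammatical(5))
```

For an ambiguous grammar the same recurrence would overcount, since a
sentence with two parses is counted twice, so the result should then be
read as an upper bound on the number of distinct sentences rather than
an exact figure.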

-- 

Stephen B. Johnson, Ph.D.
Associate Professor
Department of Medical Informatics
Columbia University