I have just examined the distribution of words in a transcribed corpus
of conversational speech and (surprisingly) found it to be very
different from text.
The log-log graph of rank (X) vs frequency (Y) seems to asymptote to
a slope of roughly -1.5 for speech, but, as is well known, only
about -1.1 for text.
This clearly has major implications for the number of words you have
to consider to get, say, 98% coverage.
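To put a rough number on that coverage effect, here is a small sketch under an idealized assumption: word frequencies follow a pure Zipf distribution p(r) proportional to r^(-s) over a finite vocabulary. The vocabulary size of 50,000 and the exponents are illustrative choices, not measurements from my corpus.

```python
# Sketch: how many word types are needed for 98% token coverage,
# assuming an idealized Zipf distribution p(r) proportional to r^(-s)
# over a finite vocabulary? V and the exponents are illustrative.

def types_for_coverage(s, V=50_000, target=0.98):
    """Smallest k such that the top-k ranks carry `target` of the mass."""
    weights = [r ** -s for r in range(1, V + 1)]
    total = sum(weights)
    cum = 0.0
    for k, w in enumerate(weights, start=1):
        cum += w
        if cum / total >= target:
            return k
    return V

# A steeper exponent (speech-like, ~1.5) concentrates mass in far
# fewer types than a shallower one (text-like, ~1.1):
print(types_for_coverage(1.5))  # on the order of a thousand types
print(types_for_coverage(1.1))  # a large fraction of the vocabulary
```

On this toy model the difference between exponents of 1.1 and 1.5 is more than an order of magnitude in the number of types needed, which is why the slope matters so much for coverage.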
This result must be well-known (in fact Redington found it earlier for
"child-speak" from the CHILDES corpus of mother/child interactions,
but I thought it was peculiar to that type of discourse; apparently
not!). Could anyone enlighten me on where I could read about it
(confirmation, disconfirmation, implications for ramblings on the
a priori necessity of Zipf's law, etc.)?
Cheers,
Steve
------------------------------------------------------------------------
When you steal from one person, it's called plagiarism;
When you steal from many, it's research. -- Wilson Mizner
------------------------------------------------------------------------
Steve Finch http://www.thomtech.com/nlp/steve.html
Thomson Labs/NLP | sfinch@thomtech.com
1375, Piccard Drive, | +1 301 548 4093 (voice)
Rockville, MD, 20850 | +1 301 527 4080 (paper)