Re: Corpora: phrase (n-gram) frequency information

Chris Brew (Chris.Brew@edinburgh.ac.uk)
Tue, 29 Jun 1999 09:21:45 +0100

>Hello the list! Does anyone have information to offer on the most
>common English phrases in use in a given body of text? That is, what
>4-word, 5-word (10-word, whatever) phrases appear most frequently in the
>Bible, in Shakespeare, in Tom Clancy novels, in newspapers, in any known
>corpora? Any information on this would be greatly appreciated.
>
>Thanks.
>
>David Sarokin
>sarokin.david@epa.gov
>202-260-6396
See:

Slava Katz (1996)
Distribution of content words and phrases in text and language modelling
Natural Language Engineering 2(1): 15-59

for a variety of unexpected facts about the way in which words and
phrases are distributed. The headline claim is that the probability
of repeat occurrences of a word in a document does not depend on the
relative frequency of the word in a larger corpus. Phrases are also
discussed.
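
For anyone who wants to count phrases directly on their own texts, here is a
minimal Python sketch. It is not taken from Katz's paper; the function names
(tokenize, ngram_counts, repeat_rate) and the crude regular-expression
tokenizer are assumptions made for illustration only. ngram_counts tallies
every n-word sequence in a token list, and repeat_rate gives a rough
per-word estimate of the quantity in the headline claim: among documents
that contain a word at least once, the fraction that contain it at least
twice.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous n-token sequence in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def repeat_rate(docs, word):
    """Among documents containing `word` at least once, return the fraction
    containing it at least twice; return None if no document contains it."""
    counts = [tokenize(doc).count(word) for doc in docs]
    containing = [c for c in counts if c >= 1]
    if not containing:
        return None
    return sum(1 for c in containing if c >= 2) / len(containing)


if __name__ == "__main__":
    text = "to be or not to be that is the question to be or not"
    # Print the three most frequent 4-word phrases with their counts.
    for phrase, count in ngram_counts(tokenize(text), 4).most_common(3):
        print(count, " ".join(phrase))
```

On a real corpus one would read each document from disk, pass the token
lists through ngram_counts for n = 4, 5 and so on, and compare repeat_rate
for words of differing overall frequency.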