Re: word frequency lists?

Ted Dunning (ted@crl.nmsu.edu)
Fri, 24 Nov 1995 08:46:13 -0700

Date: Fri, 24 Nov 1995 09:42:25 GMT
Reply-To: Jem Clear <jem@cobuild.collins.co.uk>
From: Jem Clear <jem@cobuild.collins.co.uk>
Sender: owner-corpora@lists.uib.no

Ted Dunning pointed out that a word frequency list from some corpus
simply reflects the constitution of that corpus. But ... there is a
demand among writers of NLP software, language teachers,
linguistics students, lexicographers and others [for these
frequency lists] ... [users complain about them] In what sense,
though, is the data deficient?

Ted Dunning's answer is presumably that the wordlists are deficient
because they are drawn from a corpus which is not appropriate for the
task/purpose that the user has in hand.

i have more of an answer than that!

my two major objections to these lists are

a) people often use them for specialized purposes, for which
generalized lists are not well suited.

b) average frequency is a poor measure of prevalence. i suspect that
rank order statistics might well capture more of the language-centric
intuitions we have. but to present a list like that, you can't just
say "the frequency of 'right' is such and so". the answer you give
has to be much more complex.
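
A minimal sketch of objection (b), using three tiny invented texts rather than any real corpus. The helper names (rank_table, pooled_frequency, ranks_per_text) are made up for illustration, and breaking ties alphabetically is an arbitrary assumption. It shows two words with the same pooled frequency whose per-text ranks tell quite different stories:

```python
from collections import Counter

def rank_table(tokens):
    """Map each word to its rank (1 = most frequent) within one text."""
    counts = Counter(tokens)
    # Descending count; ties broken alphabetically (an arbitrary choice).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}

def pooled_frequency(texts, word):
    """Relative frequency of `word` after lumping all texts together."""
    total = sum(len(t) for t in texts)
    hits = sum(t.count(word) for t in texts)
    return hits / total

def ranks_per_text(texts, word):
    """Rank of `word` in each text, or None where it does not occur."""
    return [rank_table(t).get(word) for t in texts]

# Toy data, invented for this example.
texts = [
    "the right way is the right way for the job".split(),
    "the cat sat on the mat".split(),
    "the dog saw the cat".split(),
]

for word in ("right", "cat"):
    print(word, pooled_frequency(texts, word), ranks_per_text(texts, word))
```

Here "right" and "cat" each occur twice in the pooled text, so a single frequency figure cannot tell them apart, while the per-text ranks show that one is concentrated in a single text and the other is spread across two.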

I would bet 10 pounds that the most frequent word of the British
National Corpus is the same as that in the Bank of English and in
the MAP corpus.

right you are! "the" is number 1.

i wouldn't take that bet.

But does the rank-ordering of words remain stable through the
wordlist?

no. it doesn't. note the list that henry posted. there were some
domain- and task-specific words that made it up to position two.

on the other hand, words common in the brown corpus were, for the most
part, common in the map corpus. that is the key intuition which makes
rank order statistics of various sorts very useful.
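
One possible way to put a number on "common here, common there" is sketched below. It is only an illustration, not a method taken from this thread: it compares two token lists by the overlap of their top-k words and by a Spearman rank correlation over the shared vocabulary. The function names are hypothetical, and tied counts are given strict ranks by alphabetical order, which is a simplifying assumption.

```python
from collections import Counter

def ranked_words(tokens):
    """Words ordered from most to least frequent (ties alphabetical)."""
    counts = Counter(tokens)
    return sorted(counts, key=lambda w: (-counts[w], w))

def top_k_overlap(tokens_a, tokens_b, k):
    """Fraction of the k most frequent words that both lists share."""
    top_a = set(ranked_words(tokens_a)[:k])
    top_b = set(ranked_words(tokens_b)[:k])
    return len(top_a & top_b) / k

def spearman_on_shared(tokens_a, tokens_b):
    """Spearman rank correlation over words occurring in both lists."""
    order_a = ranked_words(tokens_a)
    order_b = ranked_words(tokens_b)
    shared = set(order_a) & set(order_b)
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared words")
    # Re-rank inside the shared vocabulary so both rankings cover 0..n-1.
    shared_a = [w for w in order_a if w in shared]
    shared_b = [w for w in order_b if w in shared]
    rank_a = {w: i for i, w in enumerate(shared_a)}
    rank_b = {w: i for i, w in enumerate(shared_b)}
    d2 = sum((rank_a[w] - rank_b[w]) ** 2 for w in shared)
    # Standard formula for rankings without ties.
    return 1 - 6 * d2 / (n * (n * n - 1))
```

With real corpora one would pass in the full token streams; a high top-k overlap alongside a modest correlation would match the picture above, where the same words are common in both corpora but their exact positions shift.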

but when people ask me for a word frequency list, i still get the
hives.