Re: word frequency lists?

Jem Clear (jem@cobuild.collins.co.uk)
Fri, 24 Nov 1995 09:42:25 GMT

Ted Dunning pointed out that a word frequency list from some
corpus simply reflects the constitution of that corpus. But I
agree with the subsequent comments from Piepenbrock that there
is a demand among writers of NLP software, language teachers,
linguistics students, lexicographers and others for lists of
words which reflect their perceptions concerning the "centrality"
or "coreness" of lexical items. At Cobuild we are constantly
being asked for word frequency information from the Bank of
English -- I find it interesting that when we provide straight
frequency data the recipient often reports back to us that the
lists are "faulty" in some way! E.g. "too many Briticisms in the
lists" (the corpus is 70% British sources!), "too many placenames,
personal names and other rubbish", "can't tell whether the
high frequency of 'right' is due to its use as a discourse marker",
and so on. In what sense, though, is the data deficient?

Ted Dunning's answer is presumably that the wordlists are deficient
because they are drawn from a corpus which is not appropriate for the
task/purpose that the user has in hand.

I share the intuitive feeling that there is some validity in the
notion that there are frequent, core words of English at one end of a
scale and other rare, specialist, peculiar words at the other end --
and I think it valuable research to attempt to model this by
collecting large corpora and studying lexical patterns. What is surely
needed is *more research* on word frequency in
relation to text types, corpus size, sampling procedures, etc. I would
bet 10 pounds that the most frequent word of the British National
Corpus is the same as that in the Bank of English and in the MAP
corpus. But does the rank-ordering of words remain stable through the
wordlist? If it does, then we may be getting close to this core
vocabulary -- and if not, then there is no option but to build specific
corpora for each application area and task.
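
One rough way to put the rank-stability question to the test is
sketched below in Python. It is only an illustration, not a method
used at Cobuild: the function names (top_ranks, rank_stability) are
made up for this example, each corpus is assumed to be already
tokenised into a plain list of word forms, and Spearman's rank
correlation over the shared words is just one possible measure of
agreement. It reports what fraction of the top-n words the two lists
share, and how similarly those shared words are ordered.

```python
from collections import Counter


def top_ranks(tokens, n):
    """Map each of the n most frequent tokens to its rank (1 = most frequent)."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered[:n], start=1)}


def rank_stability(tokens_a, tokens_b, n):
    """Compare the top-n frequency lists of two tokenised corpora.

    Returns (overlap, rho):
      overlap -- shared words as a fraction of the shorter top-n list
      rho     -- Spearman rank correlation over the shared words,
                 or None if fewer than two words are shared
    """
    ranks_a = top_ranks(tokens_a, n)
    ranks_b = top_ranks(tokens_b, n)
    if not ranks_a or not ranks_b:
        return 0.0, None

    shared = set(ranks_a) & set(ranks_b)
    overlap = len(shared) / min(len(ranks_a), len(ranks_b))
    if len(shared) < 2:
        return overlap, None

    # Re-rank the shared words 1..m within each list so that the
    # no-ties Spearman formula applies.
    order_a = {w: i for i, w in enumerate(sorted(shared, key=ranks_a.get), start=1)}
    order_b = {w: i for i, w in enumerate(sorted(shared, key=ranks_b.get), start=1)}

    m = len(shared)
    d_squared = sum((order_a[w] - order_b[w]) ** 2 for w in shared)
    rho = 1 - (6 * d_squared) / (m * (m * m - 1))
    return overlap, rho
```

Running this for increasing n (say 100, 1000, 10000) on two corpora
would show whether agreement holds only at the very top of the list
or persists further down. It says nothing about why the lists differ,
which still needs the text-type and sampling information.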