Re: word frequency lists?

Alex Chengyu Fang (ucleacf@ucl.ac.uk)
Mon, 27 Nov 1995 11:10:07 +0000

Ted wrote,

>b) average frequency is a poor measure of prevalence. i suspect that
>rank order statistics might well capture more of the language centric
>intuitions we have. but to present a list like that, you can't just
>say "the frequency of 'right' is such and so". the answer you give
>has to be much more complex.

At 11:39 PM 26/11/95 -0500, Judith Klavans wrote:

>My understanding of Ted's comment (and my own opinion) is that
>he is not denying the usefulness of ``general'' or ``balanced''
>corpora, but is simply pointing out some oddities of the data
>ht sent out. Indeed, deviation from the norm is a common way
>of determining e.g. topic, domain, etc. But the establishment
>of ``the norm'' or the baseline is what he was commenting on.

>It's not as easy a task as it seems; one might have a difficult
>time judging when one has attained the ``right balance''.
>However, the alternative is not to have to collect specific
>corpora for each app, but simply be aware of deficiencies and
>limitations.

I agree with Ted that frequency lists (ranked or not) can be misleading,
especially those generated from so-called balanced or representative
corpora, IF the frequencies don't reflect the composition of these corpora.

I've been following the discussion, but no one seems to have mentioned John
Carroll's work on the American Heritage Intermediate Dictionary Corpus. His
rank order list was based not on absolute or average frequency, but on a
value adjusted by a distribution index, which reflects how evenly a
particular lexical item is distributed across the text categories. Unevenly
distributed words were moved down in the list.
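
To make the idea concrete, here is a rough Python sketch of a
dispersion-adjusted ranking. It is only an illustration under my own
assumptions, not Carroll's published formula: the index used here is the
normalised entropy of a word's per-category rates, and the adjusted value is
simply the raw count multiplied by that index. The function names
(dispersion, adjusted_frequency, rank_by_adjusted_frequency) and the toy
counts are hypothetical.

```python
import math


def dispersion(rates):
    """Normalised entropy of a word's per-category rates.

    Returns 1.0 when the word is spread perfectly evenly across the
    categories and 0.0 when it occurs in a single category only.
    (Assumed index for illustration, not the published one.)
    """
    total = sum(rates)
    if total == 0 or len(rates) < 2:
        return 0.0
    entropy = 0.0
    for rate in rates:
        if rate > 0:
            p = rate / total
            entropy -= p * math.log2(p)
    return entropy / math.log2(len(rates))


def adjusted_frequency(counts, sizes):
    """Raw count scaled down by how unevenly the word is distributed.

    counts[i] is the word's count in category i, sizes[i] is the number
    of tokens in category i.
    """
    rates = [count / size for count, size in zip(counts, sizes)]
    return sum(counts) * dispersion(rates)


def rank_by_adjusted_frequency(word_counts, sizes):
    """Sort words from highest to lowest adjusted frequency."""
    return sorted(
        word_counts,
        key=lambda word: adjusted_frequency(word_counts[word], sizes),
        reverse=True,
    )


# Toy, made-up counts over four equally sized categories.
sizes = [1000, 1000, 1000, 1000]
word_counts = {
    "right": [30, 28, 32, 30],  # common everywhere
    "quark": [150, 0, 0, 0],    # frequent, but in one category only
}
ranking = rank_by_adjusted_frequency(word_counts, sizes)
```

On raw counts "quark" (150) would outrank "right" (120); once the index is
applied, the word confined to one category drops below the evenly spread
one, which is the effect described above.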

--------------------------------------------------------------
Alex Chengyu Fang               E-Mail: ucleacf@ucl.ac.uk
Survey of English Usage         Voice:  0171 380 7777 Ext. 3120
University College London               0171 419 3120
Gower Street, London WC1E 6BT   Fax:    0171 916 2054
--------------------------------------------------------------