Re: Corpora: MWUs and frequency

James L. Fidelholtz (jfidel@siu.buap.mx)
Wed, 7 Oct 1998 09:27:23 -0500 (CDT)

On Wed, 7 Oct 1998, Jean Hudson wrote:

>Ted Dunning is right in saying that "frequency lists for single words are
>highly suspect, especially below roughly the thousandth most common word."

Well, yes and no. There is certainly a lot to investigate with
respect to the 1000 or so most frequent words, but there is also a lot
to look at with respect to infrequent words (personally, I'm interested
in the low-frequency end). In a 1975 Chicago Linguistic Society article, I
refer to these as 'familiar' and 'unfamiliar', respectively, and present
evidence that, at least for English, the dividing line (or at least A
dividing line) is at about 5 occurrences per million words (5/M), based
on the Thorndike-Lorge 1940s count. I never actually counted, but I would
guess this puts a little less than 10K words in the 'familiar' category
(e.g., 'astronomy' [ca. 5/M] is relatively familiar, 'gastronomy'
[ca. 1/M] is unfamiliar).
Also, confidence in the frequency ranking (taking into account, of
course, the selection of the texts, etc.) goes up with the amount of
text. So a figure of '5/M' based on a million-word corpus, where it rests
on only about 5 occurrences, is much less certain than a figure of '5/M'
based on an 18M-word corpus (about 90 occurrences) or, say, a gigaword
corpus (about 5,000 occurrences).
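
As a rough illustration of the corpus-size point above, here is a
minimal Python sketch. It assumes (this is an assumption for
illustration, not something taken from the Thorndike-Lorge count) that a
word's raw count behaves like a Poisson variable, and it uses a normal
approximation for a 95% interval, which is crude for very small counts.
The function name rate_interval and the three corpus sizes are
hypothetical. Real words cluster within texts, so the true uncertainty is
probably larger than this suggests.

```python
import math


def rate_interval(rate_per_million, corpus_words, z=1.96):
    """Return (expected_count, low, high) for a per-million rate.

    Assumption: the raw count is Poisson-distributed, so its standard
    deviation is sqrt(count). The interval uses a normal approximation
    and is clipped at zero.
    """
    expected_count = rate_per_million * corpus_words / 1_000_000
    half_width = z * math.sqrt(expected_count)
    # Factor that converts a raw count back to a per-million rate.
    per_million = 1_000_000 / corpus_words
    low = max(0.0, (expected_count - half_width) * per_million)
    high = (expected_count + half_width) * per_million
    return expected_count, low, high


# Illustrative corpus sizes: 1M, 18M, and a gigaword (1,000M) corpus.
for size in (1_000_000, 18_000_000, 1_000_000_000):
    count, low, high = rate_interval(5, size)
    print(f"{size:>13,} words: {count:>6.0f} hits, "
          f"roughly {low:.2f} to {high:.2f} per million")
```

Under these assumptions, '5/M' comes out as roughly 0.6 to 9.4 per
million at a million words, about 4.0 to 6.0 at 18M words, and about 4.9
to 5.1 at a gigaword.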
Selection of texts is of course difficult, but WE as speakers/readers
are faced with that problem in our daily lives, so as researchers we have
to face the same problem, and just 'grin and bear it'. For example, some
words are 'naturally' underrepresented in the data (e.g., colloquial
words like 'berserk').

>Interpreting the data is another matter. I'd say that even the most
>frequent words are suspect, viewed as single words.

Well, yes, but even when you subtract the MWU occurrences, the
most frequent words will still be way up there.

>Finally, what does it mean that an MWU is frequent?

Interesting suggestions. I guess the point is that the
confidence you can have in the 'frequency' of a given word or unit
really depends a lot on what you are doing with the data.
Sorry that I have answered a different question than the one you
asked, but it is related, and the whole topic is interesting.
Jim

James L. Fidelholtz e-mail: jfidel@siu.buap.mx
Maestría en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO