Re: Corpora: MWUs and frequency

Bruce L. Lambert (lambertb@uic.edu)
Fri, 09 Oct 1998 11:02:35 -0500

(I originally sent this to Ted directly, but I meant it to go out to the
whole list.)

Ted Dunning said:
>But what I was trying to say had more to do with the futility of using
>such frequency sorted lists as generalizations. The features that I
>pointed out and that you pointed out demonstrate exactly this point.
>Essentially all of these points of interest are due *precisely* to the
>specific nature of the text that I analysed. The fact that the
>particular nature of the text I used is this prominent is a strong
>argument *against* the general utility of such frequency sorted lists
>of collocates.

The question of whether frequency lists can be 'trusted' or used as the
basis for generalizations is a very important one in several areas of
psycholinguistics. As many of you probably know, one of the most durable
findings in psycholinguistics is the so-called word frequency effect. In a
wide variety of tasks (lexical decision, speeded naming, perceptual
identification), higher-frequency words are responded to more quickly and
accurately than lower-frequency words. Explaining this phenomenon is the
central task for most current models of word recognition.

Of course, just what counts as a high-frequency word or a low-frequency
word depends on the source of the frequency data. Most psychologists rely
on the Kucera and Francis counts.

In my own work, I am interested in the effect of frequency on the
perceptibility of drug names. Here frequency means prescribing frequency,
an even more slippery notion than printed or spoken frequency. Prescribing
frequency data are hard to obtain (except for a very high price). Those
that are available from government sources are reliable only at the very
highest end of the frequency scale (because of large sampling error
elsewhere).

If frequency counts from a single corpus are not to be trusted as the basis
for generalization, what does this imply for frequency-based theoretical
generalizations such as the ones I described above? Relatedly, how well do
frequency counts from different large corpora correlate?
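
To make the last question concrete, here is a rough sketch of one way it
could be checked: a Spearman rank correlation between the word counts of
two corpora. Everything in it is an assumption made for illustration, not an
established procedure: the whitespace tokenizer, the function names, and
especially the choice to compare over the union of the two vocabularies
(so a word missing from one corpus counts as zero there).

```python
from collections import Counter
import math


def word_counts(text):
    """Count lowercased, whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one side has no variance")
    return cov / math.sqrt(vx * vy)


def spearman_between_corpora(text_a, text_b):
    """Spearman correlation of word counts over the union vocabulary (assumed choice)."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    freq_a = [counts_a[w] for w in vocab]  # Counter yields 0 for unseen words
    freq_b = [counts_b[w] for w in vocab]
    return pearson(average_ranks(freq_a), average_ranks(freq_b))
```

Restricting the comparison to words that occur in both corpora would be
an equally defensible choice and may give a noticeably different figure,
which is itself part of the problem being asked about.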

-bruce

Bruce Lambert, PhD
Acting Head and Director of Graduate Studies
Department of Pharmacy Administration
University of Illinois at Chicago
833 S. Wood St. (M/C 871)
Chicago, IL 60612-7231

phone: 312-996-2411
fax: 312-996-0868