Re: word frequency lists?

Ted Dunning (ted@crl.nmsu.edu)
Thu, 23 Nov 1995 10:45:13 -0700

henry thompson approximately wrote:

================================================================
Here's the list from the HCRC Map Task Corpus, roughly 150,000 tokens
of task-oriented dialog:

12863 the      1863 i              1403 you're
 6866 right    1779 down           1300 no
 4653 you      1702 yeah           1223 left
 4347 to       1685 got            1094 side
 3988 of       1621 about          1078 well
 3454 and      1619 so             1016 at
 3390 a        1613 that            986 is
 2580 go       1601 then            976 on
 2445 okay     1591 up              953 from
 2102 it       1544 have            928 round
 2093 just     1437 {gg|uh-huh}     895 do
================================================================

henry has eloquently made the point that frequency lists of words are
almost useless if they are taken from a domain different from the one
under consideration. in this list, the most common word (not
surprisingly) is "the". this is normal in english. but the second
most frequent word is "right". the map task involves one
person trying to give directions to another, so the high rank of
"right" is essentially a domain-specific deviation from the english
"norm". so are the high ranks of "go", "okay", "just", "down",
"yeah", "got", "up", "{gg|uh-huh}", "no", "left", "side", "at", "on",
"from", and "round".

it is not that good information cannot be had from such counts; it is
just that the information that can be had is much less universal than
many people would think.

moreover, this coin of domain specificity has another side. it also
invalidates counts taken from the so-called "balanced" corpora such as
the brown corpus or the british national corpus. by conjoining data
from diverse sources, an average count is obtained which might be
supposed to be better in some sense than the counts obtained from any
domain-specific source.

this is not true, however. the act of balancing has created a corpus
which is utterly unlike any real bit of text. thus the frequency
counts taken from such a balanced corpus cannot be taken as a
characterization of any real text. these counts may, perhaps, be used
to highlight the deviations in a particular sample, but even this use
is subject to serious error.
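
to make the "highlight the deviations" use concrete, here is a minimal
sketch in python. it compares each word's relative frequency in a
sample against a reference list. the function name, the add-one
smoothing, and the toy counts are all assumptions made for
illustration; none of them come from the map task data or from any
real corpus.

```python
from collections import Counter

def relative_freq_ratios(sample, reference, smoothing=1.0):
    """Ratio of each word's smoothed relative frequency: sample / reference."""
    vocab = set(sample) | set(reference)
    # additive smoothing (an assumption) so unseen words do not divide by zero
    n_sample = sum(sample.values()) + smoothing * len(vocab)
    n_ref = sum(reference.values()) + smoothing * len(vocab)
    ratios = {}
    for w in vocab:
        p_sample = (sample[w] + smoothing) / n_sample
        p_ref = (reference[w] + smoothing) / n_ref
        ratios[w] = p_sample / p_ref
    return ratios

# toy counts invented for illustration, not real corpus data
sample = Counter({"the": 10, "right": 6, "go": 3})
reference = Counter({"the": 12, "right": 1, "of": 6})

ratios = relative_freq_ratios(sample, reference)
for w in sorted(ratios, key=ratios.get, reverse=True):
    print(w, round(ratios[w], 2))
```

in this toy data "go", with only three occurrences, comes out above
"right". ratios built on small counts are unstable, and the choice of
reference list changes the result, so a ranking like this is at best
a rough pointer.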

the moral?

do your own counts on material appropriate to the task at hand!
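
doing your own counts takes only a few lines. the following is a
rough python sketch, not a recommended tokenizer: the regular
expression, the function name, and the sample sentence are
assumptions made for illustration. real material usually needs
tokenization suited to its own conventions (transcription marks such
as "{gg|uh-huh}", for instance, would need their own rule).

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercased word tokens in a string."""
    # crude tokenizer (an assumption): runs of letters, allowing
    # internal apostrophes or hyphens as in "you're" or "uh-huh"
    tokens = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    return Counter(tokens)

# made-up sample text; replace with material from your own domain
sample_text = "go right, right down past the mill. okay, you're right at the mill."
counts = word_counts(sample_text)

# print the most frequent words, count first, as in the list above
for word, n in counts.most_common(3):
    print(n, word)
```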