Re: word frequency lists?

Ken Church (kwc@research.att.com)
Mon, 27 Nov 95 10:10 EST

A small point in favor of frequency lists...

In trying to come up with frequency lists for bigrams and trigrams, I
find that when the corpus size hits 100,000 words I run out of memory
on the computer. While I might be able to tweak my program and get
that number up to 200,000 or maybe 500,000 (I doubt it), I think the
system limitations here will prevent me from coming up with bigram
and trigram counts for a 1,000,000 word corpus.

So...if someone with much greater computing resources than mine has
come up with bigram and trigram frequency lists, I'd love to hear
about it. It would be ideal if such counts were available for the
ACL/DCI WSJ corpus, as that is the corpus I've been working with.

Regards
Ted

--
* Ted Pedersen pedersen@seas.smu.edu *
* http://www.seas.smu.edu/~pedersen/ *
* Department of Computer Science and Engineering, *
* Southern Methodist University, Dallas, TX 75275 (214) 768-3712 *

Over the years, I've spent a fair bit of time thinking about efficient
ways to count things like this, but for a mere million words or so, it
really isn't worth the effort. I'm sure you can do it yourself, even
on a modest PC. People used to do these kinds of calculations on a
PDP-11, which is much more modest in almost every respect than
whatever computing resources you are currently using.

I assume that you have limited memory, but plenty of disk space.
Let's suppose that you were willing to buy 100 megabytes of disk
($50) for the experiment. Then you ought to be able to do something
like this:

echo "this is a large corpus -- to be or not to be -- to be or not to be " |
tr ' ' '\12' |
awk '{x=y; y=z;z=$1; print x, y, z}' |
sort |
uniq -c

1 this
1 this is
2 -- to be
1 a large corpus
1 be -- to
2 be or not
1 corpus -- to
1 is a large
1 large corpus --
2 not to be
2 or not to
1 this is a
1 to be
1 to be --
2 to be or

This pipeline should work as is under almost any Unix system. (The
entries with fewer than three words come from the start of the
stream, before the three-word window has filled, and from the
trailing space in the echo.) If you have DOS, I'd recommend that you
get some sort of package of Unix-style tools, like mkstools (the MKS
Toolkit).

Alternatively, the basic idea is so straightforward that you could
probably recode it in your favorite language in a few hours at most.
The only hard part is the external sort, since I'm assuming you don't
have enough physical memory for an internal sort. (Even this
assumption is questionable; you'd need only about 15 megabytes of
memory for an internal sort, and that's not such a big deal anymore.)
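
In fact, if you do have the memory, you can skip the sort entirely
and count in an awk associative array instead. A minimal sketch,
where corpus.txt is just a stand-in for whatever your input file is
called:

tr ' ' '\12' < corpus.txt |
awk '{x=y; y=z; z=$1; count[x " " y " " z]++}   # hash each trigram in memory
     END {for (t in count) print count[t], t}'  # dump count and trigram

This trades sort's temp space for one hash-table entry per distinct
trigram, so it only pays off when the set of distinct trigrams fits
in memory.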

Estimate of 100 megabytes: Suppose the input is a million words, or
about 5 megabytes. The output of the awk step should be about 3
times larger, or 15 megabytes, since each word gets printed in three
trigrams. The output of the sort is another 15 megabytes (though the
temp files could double that requirement), and the output of the uniq
should be smaller still. So, I would expect that you'd need something
like:

 5 megabytes for input
 5 megabytes for tr
15 megabytes for awk
30 megabytes for sort
10 megabytes for uniq
--
65 megabytes total

I then round this up to 100 megabytes, figuring that I probably
forgot about something or other. You could get by with less if you
were running an operating system that supported real pipes (Unix), or
if you spent a little time thinking about it, but 100 megabytes is so
cheap ($50), why bother?
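
One practical note: if the big disk isn't where your sort normally
writes its temp files, most Unix sorts take a -T flag that points
them somewhere else. A sketch, with corpus.txt, /big/scratch, and
trigram.counts all standing in for your own names:

tr ' ' '\12' < corpus.txt |
awk '{x=y; y=z; z=$1; print x, y, z}' |
sort -T /big/scratch |            # write sort's temp files on the big disk
uniq -c > trigram.counts          # final counts land in trigram.counts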

I wouldn't bring out the big guns (fancy machines, fancy algorithms,
data collection committees, bigtime favors) unless you had a lot more
text (e.g., 100 million words or more), or you were trying to count
really long n-grams (e.g., 50-grams).

I'd suggest that you try to do these things yourself for basically
the same reason that home repair stores like DIY and Home Depot are
as popular as they are. You can always hire a pro to fix your home
for you, but a lot of people find that it is better not to, unless
they are trying to do something moderately hard.

Similar arguments probably apply to counting unigrams (word
frequencies) as well. On the other hand, though, if we all used the
same standard tables, then it would be easier to compare results from
one study to the next. But on the other hand, as many have pointed
out, there are also reasons to be concerned that standard tables are
never quite right for any particular study. But as in ``Fiddler on
the Roof,'' there are no other hands...
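
For the record, the unigram version of the recipe is even shorter:
drop the awk step. A sketch, again with corpus.txt as a placeholder,
using tr -cs to treat anything that isn't a letter as a word
separator:

tr -cs 'A-Za-z' '\12' < corpus.txt |   # -c complements the set, -s squeezes runs
sort |                                 # group identical words together
uniq -c |                              # count each distinct word
sort -rn                               # most frequent words first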

We'd probably be better off if most people could try it both ways.
Sometimes the standard list is better, and sometimes a custom list is
better. We'll never know which is right for a particular application
unless most of us have the skills and resources to try it both ways
for a while. At least that is an empiricist's response to the debate.

Ken Church