Re: word frequency lists?

Ted Dunning (ted@crl.nmsu.edu)
Mon, 27 Nov 1995 11:06:58 -0700 (MST)

> In trying to come up with frequency lists for bigrams and trigrams I
> find that when the corpus size hits 100,000 words I run out of memory
> on the computer. While I might be able to tweak my program and get
> that number up to 200,000 or maybe 500,000 (doubt it), I think the
> system limitations here will prevent me from coming up with bigram
> and trigram counts for a 1,000,000 word corpus.

noo..... tell me it isn't so.

you can count trigrams for as large a corpus as you like if you have
time and disk space. there are several approaches which vary somewhat
in speed and programming effort. i know that with my counting package,
you should be able to handle corpora of 10^6 words without much
difficulty if you have even a moderate-sized workstation.

a) count in batches of about 10^6 words for a 64MB machine, then sort
and merge the resulting counts. (see the first sketch after this list.)

b) as in a, but avoid the sort: depend on the characteristics of the
counting program and do the merge in hashing order. this will save
*substantial* amounts of time. (second sketch below.)

c) write a custom counting program which maintains a cache of counts.
when the cache fills to a preset level, scan a disk copy of all
counts, augmenting them by the amounts in the cache. for extra speed,
the disk copy can be kept in some sort of ISAM database such as a
berkeley db; that would be faster than scanning all previous counts,
since so many items are seen only once. (third sketch below.)

d) use a text retrieval system which retains full positional
information. in addition to the inverted index, you will need a list
of all bigrams. using this list you can explore the space of n-grams
relatively efficiently. note that producing an inverted index is only
somewhat harder than counting the words in a corpus, and that, once
produced, the index can yield counts for *any* particular n-gram,
regardless of length. (fourth sketch below.)
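
here is a minimal sketch of approach a (in python; the file layout and
batch size are illustrative, not from my counting package):

    import heapq
    from collections import Counter
    from itertools import groupby

    BATCH = 1_000_000  # roughly 10^6 tokens per batch on a 64MB machine

    def dump_batch(tokens, path):
        # count one batch in memory, then write it out in sorted order
        counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
        lines = sorted("%s\t%d\n" % (" ".join(g), c)
                       for g, c in counts.items())
        with open(path, "w") as f:
            f.writelines(lines)

    def merge_dumps(paths, out_path):
        # heapq.merge keeps the combined stream sorted, so identical
        # n-grams from different batches arrive adjacent to each other
        files = [open(p) for p in paths]
        with open(out_path, "w") as out:
            for gram, group in groupby(heapq.merge(*files),
                                       key=lambda l: l.split("\t")[0]):
                total = sum(int(l.split("\t")[1]) for l in group)
                out.write("%s\t%d\n" % (gram, total))
        for f in files:
            f.close()

the merge touches each line once and holds only one line per batch in
memory, so disk space really is the only limit.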
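
approach b assumes the counter dumps its table in hash order for free.
python dicts don't expose that, so this stand-in imposes a
deterministic hash order (crc32) on each dump instead; the point
survives: every batch dumps in the *same* order, so the merge needs no
sort pass over the concatenated dumps.

    import heapq
    from zlib import crc32
    from collections import Counter
    from itertools import groupby

    def dump_batch_hash_order(tokens, path):
        counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
        entries = sorted((crc32(" ".join(g).encode()), " ".join(g), c)
                         for g, c in counts.items())
        with open(path, "w") as f:
            for h, gram, c in entries:
                # zero-padded hash first, so plain line order is hash order
                f.write("%010d\t%s\t%d\n" % (h, gram, c))

    def merge_hash_dumps(paths, out_path):
        files = [open(p) for p in paths]
        with open(out_path, "w") as out:
            key = lambda l: tuple(l.split("\t")[:2])  # (hash, gram)
            for (h, gram), group in groupby(heapq.merge(*files), key=key):
                total = sum(int(l.split("\t")[2]) for l in group)
                out.write("%s\t%d\n" % (gram, total))
        for f in files:
            f.close()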
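
a sketch of approach c. berkeley db isn't assumed here; python's stdlib
dbm module stands in for the ISAM database. the in-memory cache soaks
up the huge population of trigrams seen only once, and a flush does
point updates against the disk copy rather than a full scan:

    import dbm
    from collections import Counter

    CACHE_LIMIT = 500_000  # illustrative flush threshold

    class CachedCounter:
        def __init__(self, path):
            self.db = dbm.open(path, "c")  # create the db if missing
            self.cache = Counter()

        def add(self, gram):
            self.cache[gram] += 1
            if len(self.cache) >= CACHE_LIMIT:
                self.flush()

        def flush(self):
            # fold the cache into the on-disk counts, then start over
            for gram, n in self.cache.items():
                key = " ".join(gram).encode()
                old = int(self.db[key]) if key in self.db else 0
                self.db[key] = str(old + n).encode()
            self.cache.clear()

    def count_trigrams(tokens, db_path):
        counter = CachedCounter(db_path)
        for gram in zip(tokens, tokens[1:], tokens[2:]):
            counter.add(gram)
        counter.flush()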
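
and a toy version of the machinery behind d: a positional inverted
index answers a count query for an n-gram of any length. a real
retrieval system keeps the postings on disk; a dict will do for
illustration.

    from collections import defaultdict

    def build_positional_index(tokens):
        index = defaultdict(list)
        for pos, tok in enumerate(tokens):
            index[tok].append(pos)
        return index

    def ngram_count(index, gram):
        # for each position of the first word, check that every later
        # word occurs at the matching offset
        postings = [set(index.get(w, ())) for w in gram]
        return sum(1 for p in postings[0]
                   if all(p + i in postings[i]
                          for i in range(1, len(gram))))

    # e.g. ngram_count(build_positional_index(text.split()),
    #                  ("of", "the", "people"))

the bigram list mentioned in d tells you which two-word prefixes occur
at all, so you can walk the n-gram space without enumerating the full
cross product of the vocabulary.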

with a and b, you can use any counting program. my own programs are
based on a very simple in-memory hash table implementation which is
pretty darned fast. i have implemented a version of c, but haven't
tuned it much yet; it currently runs about 3-4 times slower than the
hashed counter and doesn't yet do the cache flush which is needed for
counting trigrams.
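
for concreteness, here is the shape of the in-memory hashed counting
that a and b assume (a stand-in, not my actual code; python's Counter
is a hash table underneath):

    from collections import Counter

    def trigram_counts(tokens):
        # slide a 3-token window over the text and count each window
        return Counter(zip(tokens, tokens[1:], tokens[2:]))

    # trigram_counts("the cat sat on the mat".split())
    # -> Counter({('the', 'cat', 'sat'): 1, ('cat', 'sat', 'on'): 1, ...})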

all of the software that i have developed is freely available. good
text retrieval systems are also available for research purposes; the
most notable of these are probably the PRISE system from NIST and the
SMART system from cornell.