Corpora: wordcounts

David Carlson (carlson@po.mdu.ac.jp)
Thu, 15 Apr 1999 09:51:58 +0900

I have a question about how various programs count words.

I am aware that different programs will give different word counts depending
on what each program considers a word.
However, when I ran three different programs on the same file, I got rather
different results even for the top ten function words: "the" (16,321 vs.
15,852 vs. 15,872 tokens), "of," "and," etc.

TRIAL #1 (Using WordSmith)
THE 16,321 6.10
OF 11,578 4.33
AND 9,273 3.47
IN 6,007 2.25
TO 5,354 2.00
A 4,543 1.70
WERE 3,241 1.21
WITH 3,167 1.18
WAS 2,952 1.10
FOR 2,535 0.95

TRIAL #2 (Using MonoConc)
15852 6.1944% the
11277 4.4067% of
9026 3.5271% and
5816 2.2727% in
5170 2.0203% to
4400 1.7194% a
3143 1.2282% were
3077 1.2024% with
2855 1.1156% was
2464 0.9628% for

TRIAL #3 (Using Eric Johnson's WORDS)
15,872 the
11,290 of
9,039 and
5,845 in
5,226 to
4,422 a
3,152 were
3,079 with
2,859 was
2,471 for

And not much farther down the list, the order of lexical items is different.
Q: What is it about the way these (and other) programs identify and tally
words that could differ so?
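
One partial clue is in the percentage columns of the first two trials. If
each percentage is the word's count divided by that program's own running
total of tokens (an assumption; I do not know how either program actually
computes it), then the total each program found can be backed out from the
"the" row. A rough sketch in Python, with names of my own choosing:

```python
# (count of "the", percentage of all tokens) as reported in Trials 1 and 2.
# WORDS (Trial 3) printed no percentages, so it cannot be included.
reports = {
    "WordSmith": (16321, 6.10),
    "MonoConc": (15852, 6.1944),
}


def implied_total(count, percent):
    """Total tokens implied by a count and its share of the text.

    Assumes percent = 100 * count / total, which is a guess about how
    the programs define the column.
    """
    return count / (percent / 100.0)


for name, (count, percent) in reports.items():
    total = implied_total(count, percent)
    print(f"{name}: about {total:,.0f} tokens in total")
```

Under that assumption the figures come to roughly 267,600 tokens for
WordSmith and roughly 255,900 for MonoConc, so the two programs seem to
disagree about the size of the whole file, not only about individual words.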

I also constructed my own test sentence:
"The aim of this test is to count the number of occurrences of the word
"the" in the file THE.TXT using several different word-counting programs."
This contains 6 occurrences of "the": The (1), the (3), "the" (1), and THE (1).
I next copied the same sentence without leaving a space between the first
sentence and the second (...programs.The aim of...). I copied that pair 4
more times. (So far, 10 sentences and 60 tokens.) I added to that the
following sentence, all in caps: "THE SAME SENTENCE OCCURS 10 TIMES, BUT
EVERY OCCURRENCE IS WITHOUT A SPACE AT THE BEGINNING." (2 more tokens of
THE, for 62 in all). Finally, I copied the whole block 10 times. By my
calculation, this should include 620 tokens of "the". MS-Word "FIND" also
counted 620 occurrences.

WordSmith identified 620 occurrences; MonoConc 620; and WORDS 450.
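
To show how much the definition of "a word" matters on this test text, here
is a minimal Python sketch that counts "the" under three different
tokenization rules. The three rules and the function names are illustrations
of my own; they are not a description of how WordSmith, MonoConc, or WORDS
actually work. The sketch also assumes the ten copies of the block are
separated by a line break.

```python
import re
import string

SENTENCE = ('The aim of this test is to count the number of occurrences of '
            'the word "the" in the file THE.TXT using several different '
            'word-counting programs.')
CAPS = ('THE SAME SENTENCE OCCURS 10 TIMES, BUT EVERY OCCURRENCE IS WITHOUT '
        'A SPACE AT THE BEGINNING.')

# One block: ten sentences, every second one glued to the one before it
# with no space ("...programs.The aim of..."), then the all-caps sentence.
pair = SENTENCE + SENTENCE
block = " ".join([pair] * 5 + [CAPS])

# Assumption: the ten copies of the block are separated by a newline.
text = "\n".join([block] * 10)


def count_letter_runs(text, word="the"):
    """Rule 1: every maximal run of ASCII letters is a word."""
    return sum(tok == word for tok in re.findall(r"[a-z]+", text.lower()))


def count_whitespace_exact(text, word="the"):
    """Rule 2: split on whitespace only; punctuation stays attached."""
    return sum(tok == word for tok in text.lower().split())


def count_whitespace_stripped(text, word="the"):
    """Rule 3: split on whitespace, then trim punctuation from both ends."""
    return sum(tok.strip(string.punctuation) == word
               for tok in text.lower().split())


for rule in (count_letter_runs, count_whitespace_exact,
             count_whitespace_stripped):
    print(rule.__name__, rule(text))
```

On this text, rule 1 finds 620 tokens of "the", rule 2 finds 370, and rule 3
finds 470. The differences come entirely from three spots: the quoted "the",
THE.TXT, and the glued "programs.The".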

D. Carlson
JAPAN
- - - - - - - - - - - - - - - - - - - - - - - - - -
MY TEST TEXT (then copied 10x)
The aim of this test is to count the number of occurrences of the word "the"
in the file THE.TXT using several different word-counting programs.The aim
of this test is to count the number of occurrences of the word "the" in the
file THE.TXT using several different word-counting programs. The aim of this
test is to count the number of occurrences of the word "the" in the file
THE.TXT using several different word-counting programs.The aim of this test
is to count the number of occurrences of the word "the" in the file THE.TXT
using several different word-counting programs. The aim of this test is to
count the number of occurrences of the word "the" in the file THE.TXT using
several different word-counting programs.The aim of this test is to count
the number of occurrences of the word "the" in the file THE.TXT using
several different word-counting programs. The aim of this test is to count
the number of occurrences of the word "the" in the file THE.TXT using
several different word-counting programs.The aim of this test is to count
the number of occurrences of the word "the" in the file THE.TXT using
several different word-counting programs. The aim of this test is to count
the number of occurrences of the word "the" in the file THE.TXT using
several different word-counting programs.The aim of this test is to count
the number of occurrences of the word "the" in the file THE.TXT using
several different word-counting programs. THE SAME SENTENCE OCCURS 10 TIMES,
BUT EVERY OCCURRENCE IS WITHOUT A SPACE AT THE BEGINNING.