Re: word frequency lists?

Lothar Lemnitzer (lothar@hendrix.uni-muenster.de)
Fri, 24 Nov 1995 11:02:22 +0100

Lothar Lemnitzer

Arbeitsbereich Linguistik der Westf"alischen FAX : +251 83 83 43
Wilhelms-Universit"at M"unster Tel.: +251 83 91 47
H"ufferstr. 27
D-48 149 M"unster
e-mail: lothar@hendrix.uni-muenster.de

Fa. ZERES GmbH Tel.: +234 970 75 12
Universit"atsstr. 142 FAX.: +234 970 75 75
D-44 799 Bochum

** The two most common things in the universe are hydrogen and stupidity **
_______________________________________________________________________________

Richard Piepenbrock wrote:

> Yes, caution should be used even with 'balanced' or 'representative'
> corpora, but such corpora do have their use for applications based on the
> degree of familiarity of language users in general with certain words,
> which for educated speakers would be an amalgam of words from the spoken
> and written medium. I am thinking of studies of the mental lexicon, the
> compilation of learner's dictionaries and general-purpose dictionaries
> (including spelling checkers).

I have been working for quite a while now in bilingual lexicography,
compiling and cleaning, among other things, our corpus-derived lemma lists
(lemmatisation had been performed semi-automatically).

My experience is that corpus-derived frequency lists are helpful for lemma
selection, but need a lot of cleaning up and adjusting. A few examples:

* In our German newspaper corpus the lemma "Bundesministerium" is very
  frequent, whereas "Gabel" (fork) is not. Leaving out "Gabel" (while
  inserting "Messer" - knife, and "Löffel" - spoon) is unacceptable for a
  dictionary. And, to take Richard's point, "Gabel" is very familiar
  according to the psycholinguistic experiments I have seen so far.
* In the lower ranks of the German frequency list you find a lot of odd
  compounds (e.g. "Fernsehhochleistungstheoretiker" - sorry, I cannot
  translate that).
* The character of our material leads to a clear overrepresentation of
  football/soccer terminology in German and of baseball/cricket/golf
  terminology in English.
* Newspaper texts are full of proper names, which quite often interfere
  with general nouns (the problem of homography, in particular in German
  with its capitalization of both general and proper nouns). Examples are
  "Kohl" (cabbage and Prime Minister) and "Dienstbier" ("a beer during
  work" and, of course solely, the (former) Czech Foreign Minister).

In my experience so far, a core vocabulary of 25 000 to 30 000 lexemes can
be derived well from a corpus-based frequency list. Above that limit one
needs a clear strategy for the orientation of the dictionary - additional
lemma selection from sublanguage corpora or careful selection from the
lower-rank items in a general language corpus.
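
The selection step described above can be sketched roughly as follows in
Python. This is only an illustration under stated assumptions: the function
name select_lemmas, its parameters, the cutoff and the toy counts are all
invented here and are not the procedure used for the dictionary discussed
in this message. The idea is to take the top-ranked lemmas from the corpus
list, drop entries flagged by hand (e.g. proper names), and add familiar
words that the corpus underrepresents.

```python
from collections import Counter


def select_lemmas(freq, cutoff, exclude=(), include=()):
    """Top `cutoff` corpus lemmas minus `exclude`, plus hand-picked `include`.

    Hypothetical helper for illustration; not taken from any real tool.
    """
    excluded = set(exclude)
    # Rank by descending frequency; ties are broken alphabetically.
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    core = [lemma for lemma, _ in ranked if lemma not in excluded][:cutoff]
    # Append hand-picked lemmas that did not make it into the core.
    chosen = set(core)
    extra = [lemma for lemma in include if lemma not in chosen]
    return core + extra


# Toy counts, invented for illustration only.
freq = Counter({
    "Bundesministerium": 950,
    "Messer": 40,
    "Löffel": 35,
    "Dienstbier": 30,
    "Gabel": 3,
})
lemmas = select_lemmas(freq, cutoff=3, exclude=["Dienstbier"], include=["Gabel"])
print(lemmas)
```

With real data the cutoff would be in the range of 25 000 to 30 000, and
the exclude and include lists would come from manual review, not from the
corpus itself.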

Another mismatch, which might not concern lexicographers but does concern
NLP researchers, is the clear underrepresentation of certain full forms in
the verbal paradigm. There are very few 2nd person and very few subjunctive
forms in German journalese - this is quite obvious, but I am sure that
there are more subtle gaps which one should be aware of BEFORE starting
corpus-based linguistic research.
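
One way to look for such gaps before starting is to count how often each
paradigm slot is attested in a morphologically tagged sample. The sketch
below is in Python and rests on assumptions: the function name
paradigm_gaps, the simplified slot labels and the toy tokens are all
invented for illustration, and it presupposes that the corpus has already
been tagged by some other means.

```python
from collections import Counter


def paradigm_gaps(tagged_tokens, expected_slots, min_count=1):
    """Return the expected slots attested fewer than `min_count` times.

    `tagged_tokens` is an iterable of (word_form, slot_label) pairs.
    Hypothetical helper for illustration only.
    """
    counts = Counter(slot for _, slot in tagged_tokens)
    return [slot for slot in expected_slots if counts[slot] < min_count]


# Toy sample with invented, simplified slot labels.
tokens = [
    ("sagt", "3.sg.ind"),
    ("sagt", "3.sg.ind"),
    ("sagen", "3.pl.ind"),
    ("sage", "1.sg.ind"),
]
slots = ["1.sg.ind", "2.sg.ind", "3.sg.ind", "3.pl.ind", "3.sg.subj"]
gaps = paradigm_gaps(tokens, slots)
print(gaps)
```

Raising min_count turns this from an "is it attested at all" check into a
check for slots that are too sparse to support reliable statistics.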

LL