Re: requests for corpora

Deborah Krause (Deborah_Krause@inso.com)
Tue, 25 Feb 1997 09:26:51 -0400

Gunnel Kallgren wrote:

"This is another example of Anglosaxon ethnocentrism. This statement holds
for English, not for languages with complex inflectional morphology. To
lemmatize all the words in a large corpus is a major undertaking that takes
time and resources and can never be satisfactorily done in a wholly
automatical way. To ask for lemmatized frequency lists for such languages
is neither silly nor lazy."

I completely agree with Gunnel! The same argument applies to languages that
do not use overt word delimiters (such as spaces) at all, e.g., Japanese and
Chinese. It is impossible to count words in a Japanese or Chinese corpus
without word-breaking it first, and there is not yet a wholly automatic
word-breaker that is 100% consistent and accurate. One should not make light
of the problems of languages one may not speak.
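
To make the point concrete, here is a minimal sketch in Python. It is only an
illustration, not any particular product's word-breaker: the toy lexicon, the
function name max_match, and the sample sentence are all assumptions chosen for
the example. It shows that splitting on spaces yields no words at all for a
Chinese sentence, and that a simple greedy longest-match word-breaker can
produce a plausible-looking but wrong segmentation, which then corrupts the
word counts.

```python
from collections import Counter

# Hypothetical toy lexicon, invented for this illustration only.
LEXICON = {"研究", "研究生", "生命", "命", "起源", "的"}


def max_match(text, lexicon):
    """Greedy forward longest-match segmentation (a deliberately naive sketch).

    At each position, take the longest lexicon entry that matches; if nothing
    matches, fall back to a single character.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Intended reading: 研究 / 生命 / 的 / 起源 ("to study the origin of life").
sentence = "研究生命的起源"

# Splitting on whitespace returns the whole sentence as one "word".
whitespace_tokens = sentence.split()

# Greedy matching grabs 研究生 ("graduate student") first and derails.
greedy_tokens = max_match(sentence, LEXICON)

# Frequency counts inherit the segmentation error: 研究 and 生命 are missing.
counts = Counter(greedy_tokens)

print(whitespace_tokens)
print(greedy_tokens)
print(counts)
```

With this toy lexicon the greedy pass segments the sentence as
研究生 / 命 / 的 / 起源, so a frequency list built from it would credit
"graduate student" and miss "study" and "life" entirely. A real word-breaker
would use a far larger lexicon and some form of disambiguation, but the
underlying ambiguity is the same.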

Debbie Krause
Software Linguist
Inso Corporation