lemmatized word lists

Lou Burnard (lou@vax.ox.ac.uk)
Tue, 25 Feb 1997 14:58:24 +0000

|Gunnel Kallgren wrote:
|
|"This is another example of Anglosaxon ethnocentrism. This statement holds
|for English, not for languages with complex inflectional morphology. To
|lemmatize all the words in a large corpus is a major undertaking that takes
|time and resources and can never be satisfactorily done in a wholly
|automatical way. To ask for lemmatized frequency lists for such languages
|is neither silly nor lazy."

And Debbie Krause further wrote:

|I completely agree with Gunnel!! The same argument goes for languages which
|do not use overt word delimiters (such as spaces) at all (e.g., Japanese,
|Chinese). It is impossible to count words in a Japanese or Chinese corpus
|without word-breaking it first, and there is not yet a wholly automatic
|word-breaker that is 100% consistent and accurate. One should not trifle
|with the problems of the languages he/she may not speak.

I don't think Jacques Guy was being particularly anglocentric, nor do I think
that his arguments implied any trifling with the problems of other languages.
The point originally made related to simple word-form frequency lists, and said
nothing about lemmatized lists. Even the most xenophobic of anglophones would
readily agree that lemmatizing a word form list for a substantial corpus is a
major undertaking, no matter what language it's in. Moreover, it's an
undertaking on which theoretical principles differ. All the more reason, in my
view, for being deeply skeptical about frequency lists that purport to say
something objective about the language as a whole -- no matter what the
language is.

Lou Burnard

Debbie Krause
Software Linguist
Inso Corporation