Re: Corpora: lemma vs lexeme

Paul Hays (hays@lit.sugiyama-u.ac.jp)
Fri, 05 Nov 1999 10:20:54 +0900

I did a Ph.D. with Sinclair at Birmingham in the early 1990s which
revolved around this topic. At that time, there were no efficient POS
taggers, so lemmatization could not be carried out on POS-tagged text.
The task was to work from an unmarked corpus using collocations. In
fact, it became a project in disambiguation. However, several aspects
of lemmas and lexemes needed to be resolved. I used theoretical work by
Firth, Halliday and Sinclair and developed the following framework.

A lemma is a set of related morphological forms. These are related by
orthography in most cases. A lexeme is a meaning realized by a set of
forms. A lexical scatter set is the set of members of a lemma which
realize a particular lexeme.

For example, there is a lemma {water, watered, watering, waters,
watery}. There is a lexeme for a substance which is realized by the
lexical scatter set {water, waters}. There is also a lexeme for the
adding or giving of that substance which is realized by the lexical
scatter set {water, watered, watering, waters}.
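
As a minimal sketch of the framework above, the lemma and its lexical
scatter sets can be modeled as plain sets. The names LEMMA_WATER,
SCATTER_SETS and candidate_lexemes, and the lexeme labels, are
hypothetical and chosen only for illustration; they are not taken from
the original work. Only the word forms come from the example.

```python
# Illustrative sketch only: identifiers and lexeme labels are assumptions.
# The word forms are copied from the "water" example.

# A lemma: a set of related morphological forms.
LEMMA_WATER = frozenset({"water", "watered", "watering", "waters", "watery"})

# Each lexeme (a meaning) maps to its lexical scatter set: the members
# of the lemma that realize that meaning.
SCATTER_SETS = {
    "substance": frozenset({"water", "waters"}),
    "adding or giving the substance": frozenset(
        {"water", "watered", "watering", "waters"}
    ),
}


def candidate_lexemes(form, scatter_sets):
    """Return, in sorted order, the lexemes whose scatter set contains form."""
    return sorted(name for name, forms in scatter_sets.items() if form in forms)


# By definition, every lexical scatter set is a subset of its lemma.
assert all(forms <= LEMMA_WATER for forms in SCATTER_SETS.values())

# "watered" belongs to one scatter set only, so the form alone decides.
print(candidate_lexemes("watered", SCATTER_SETS))

# "water" belongs to both, so the form alone cannot decide between them.
print(candidate_lexemes("water", SCATTER_SETS))
```

A form such as "water" that appears in more than one scatter set is
where the disambiguation problem arises: the form itself does not pick
out the lexeme, so some other evidence, such as collocations, is needed.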

This work may have been superseded by the prior marking of texts for
POS, but I think that from a theoretical standpoint, parts of speech
are problematic.

Paul R. Hays
Sugiyama Women's University
Nagoya, Japan