Re: Corpora: lemma vs lexeme

Marco Antonio Esteves da Rocha (marcor@cce.ufsc.br)
Thu, 4 Nov 1999 06:19:00 -0600 (CST)

On Wed, 3 Nov 1999, Przemyslaw Kaszubski wrote:

> Hello,
>
> Can anyone enlighten me definitively (or refer me to a source) on the
> distinction between lemma and lexeme? After reading tons of material
> all I can think of is that lemma is just the corpus linguists' term
> for apparently the same thing - only traditional lexicologists
> preferred the term lexeme instead.

I don't think this is true. When the word *lemma* is used, it should refer
to the set of word forms grouped under a *lexeme*. Thus, when the text
encoding operation called *lemmatisation* is carried out, it means reducing
the words in a corpus to their lexemes. More specifically, the *lexeme*
"sing" is normally used as the head word of the lemma which includes the
variants "sing", "sings", "sang", "sung".
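
To make the grouping concrete, here is a minimal sketch in Python. It
assumes a small hand-made table of forms; the names FORM_TO_HEADWORD and
group_by_lemma are my own illustrations, not part of any existing tool,
and a real lemmatiser would need a far larger lexicon.

```python
from collections import defaultdict

# Hypothetical, hand-made table mapping each variant to its head word.
FORM_TO_HEADWORD = {
    "sing": "sing",
    "sings": "sing",
    "sang": "sing",
    "sung": "sing",
}

def group_by_lemma(tokens):
    """Collect the tokens of a text under their head word.

    Forms missing from the table are filed under their own
    lower-cased form (an assumption made for this sketch).
    """
    lemmas = defaultdict(list)
    for token in tokens:
        form = token.lower()
        headword = FORM_TO_HEADWORD.get(form, form)
        lemmas[headword].append(token)
    return dict(lemmas)
```

Calling group_by_lemma(["sang", "sings", "sung"]) files all three tokens
under the head word "sing".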

> AND - does lemmatisation involve any of the disambiguations
> (POS, sense, word family assignment) or is it only surface form
> based? Do we all do the same when we 'lemmatise'? (I've also
> seen the term 'lexematisation' somewhere... Well..)
>

You may have found literature or text in messages which use the term
"lemmatisation" in a loose way, but I would be surprised to see the
operation called "lemmatisation" refer to the grouping of nouns and verbs
of the same surface form, such as "beat", under the same lemma. As I
understand it, all of us think of lemmatisation as an operation carried
out on a POS-tagged text, even if you have a program that apparently does
it all in one go.
Disambiguation involving sense is not usually included in the operation
called lemmatisation. So, as far as sense is concerned, lemmatisation could
be said to be based on surface form. Nonetheless, whenever I hear the
phrase "surface form only", I get a bit suspicious, as there is no
guarantee that human processing of language is not based on the
distributional features of surface forms in collocations, even for
complex processing, such as sense disambiguation and anaphora resolution.
Personally, I would be inclined to support this view, although I
haven't been working in psycholinguistics lately. Perhaps you meant
"surface forms in isolation"?
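
As a rough illustration of lemmatisation carried out on a POS-tagged text,
here is a minimal Python sketch. It assumes the tagging has already been
done and uses a tiny hypothetical lexicon keyed on (form, tag); the tag
names NOUN and VERB and the names LEXICON and lemmatise are assumptions of
mine, not any particular tagset or tool.

```python
# Hypothetical lexicon: (surface form, POS tag) -> head word.
LEXICON = {
    ("beat", "VERB"): "beat",
    ("beats", "VERB"): "beat",
    ("beaten", "VERB"): "beat",
    ("beat", "NOUN"): "beat",
    ("beats", "NOUN"): "beat",
}

def lemmatise(tagged_tokens):
    """Map (form, tag) pairs to (head word, tag) pairs.

    The tag stays part of the result, so the noun "beat" and the
    verb "beat" remain distinct lemmas. Sense is not disambiguated.
    Unknown pairs fall back to the lower-cased form (an assumption).
    """
    result = []
    for form, tag in tagged_tokens:
        key = (form.lower(), tag)
        headword = LEXICON.get(key, form.lower())
        result.append((headword, tag))
    return result
```

With this sketch, the tokens "beats" and "beaten" tagged as verbs both
come out as ("beat", "VERB"), while "beat" tagged as a noun comes out as
("beat", "NOUN"), so the two are not grouped under the same lemma.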

> I've browsed previous discussions on lemmas in the corpora list
> archive, but the above doubts have been left unanswered...
>

I hope the above helps.

Marco Rocha
marcor@cce.ufsc.br