Re: Corpora: lemma vs lexeme

Monnier (monnier@inf.enst.fr)
Wed, 03 Nov 1999 21:43:24 +0100

Hi,

> Can anyone enlighten me definitively (or refer me to a source) on the
> distinction between lemma and lexeme? After reading tons of material
> all I can think of is that lemma is just the corpus linguists' term
> for apparently the same thing - only traditional lexicologists
> preferred the term lexeme instead.

Lexeme: In linguistics you talk about lexemes, and in French there are
also "lexies" and other terms. These terms reflect a linguistic point of
view on the articulation of language into meaningful units such as words
and morphemes, but considered as lexical units rather than grammatical
ones. They are strongly tied to units of sense: you could say that there
are 10 lexemes for "fly" if you find 10 entries for "fly" in a
dictionary, or 10 different meanings.

Lemma: In corpus analysis, I attach several forms (sequences of
alphabetic characters separated by spaces or punctuation) to a single
unit, the lemma. I have heard of two ways of using "lemma":
1) The more common one is to merge forms that differ only in inflection,
such as case, gender and number, or verb conjugation. You would have one
lemma "simulation+" for the forms {"simulation", "simulations"}.
2) A much less common use of lemma is to group every form derived from
the same root. For example, in a corpus it could be useful to attach
"simulator", "simulation", "simulate" and also "simulators",
"simulated", "simulations" and other words to the same unit "simul+".

The second lemmatisation process could be described by the term
"rootinization".

It seems to me that this term is close to "stemming", which seems to be
the term used for English.
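
As a rough illustration of the two uses of lemma described above, here
is a minimal Python sketch. The lookup tables INFLECTIONAL_LEMMA and
ROOT and the two function names are hypothetical, hand-written for this
example only; a real tool would need proper linguistic resources.

```python
from collections import defaultdict

# Hypothetical toy table: surface form -> inflectional lemma.
INFLECTIONAL_LEMMA = {
    "simulation": "simulation",
    "simulations": "simulation",
    "simulator": "simulator",
    "simulators": "simulator",
    "simulate": "simulate",
    "simulated": "simulate",
}

# Hypothetical toy table: inflectional lemma -> shared root.
ROOT = {
    "simulation": "simul",
    "simulator": "simul",
    "simulate": "simul",
}


def group_by_inflection(forms):
    """Use 1: merge forms that differ only in inflection."""
    groups = defaultdict(set)
    for form in forms:
        form = form.lower()
        # Unknown forms fall back to themselves.
        lemma = INFLECTIONAL_LEMMA.get(form, form)
        groups[lemma + "+"].add(form)
    return dict(groups)


def group_by_root(forms):
    """Use 2: merge every form derived from the same root."""
    groups = defaultdict(set)
    for form in forms:
        form = form.lower()
        lemma = INFLECTIONAL_LEMMA.get(form, form)
        # Unknown lemmas fall back to themselves.
        root = ROOT.get(lemma, lemma)
        groups[root + "+"].add(form)
    return dict(groups)


forms = ["simulation", "simulations", "simulator",
         "simulators", "simulate", "simulated"]

by_inflection = group_by_inflection(forms)  # three units
by_root = group_by_root(forms)              # one unit, "simul+"
```

With these toy tables the first grouping keeps "simulation+",
"simulator+" and "simulate+" apart, while the second puts all six forms
under "simul+".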

> AND - does lemmatisation involve any of the disambiguations
> (POS, sense, word family assignment) or is it only surface form
> based? Do we all do the same when we 'lemmatise'? (I've also
> seen the term 'lexematisation' somewhere... Well..)

In a French content-analysis tool named Alceste there is a module
called "lemmatisation". As I remember, in its 2nd version the process
was based only on rules applied to forms; in later versions it also used
a dictionary.
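
To make the difference between the two approaches concrete, here is a
small Python sketch. It is not Alceste's actual implementation, which I
do not know in detail; SUFFIX_RULES, LEXICON and both function names are
assumptions invented for the example, and the data is English only to
keep it short.

```python
# Hypothetical suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [("ies", "y"), ("s", "")]

# Hypothetical dictionary of forms that the rules alone get wrong.
LEXICON = {"mice": "mouse", "corpora": "corpus", "analysis": "analysis"}


def lemmatise_rules_only(form):
    """Form-based lemmatisation: strip the first matching suffix."""
    form = form.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three characters of stem before the suffix.
        if form.endswith(suffix) and len(form) > len(suffix) + 2:
            return form[: len(form) - len(suffix)] + replacement
    return form


def lemmatise_with_dictionary(form):
    """Dictionary first, with the suffix rules as a fallback."""
    form = form.lower()
    if form in LEXICON:
        return LEXICON[form]
    return lemmatise_rules_only(form)
```

The rules alone handle "simulations" but turn "analysis" into "analysi"
and leave "mice" untouched; the dictionary lookup is what repairs those
cases.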

I would say that if you use "lemmatisation" as a functional term,
meaning to attach different forms to the same unit in a corpus analysis,
then as you look for the most useful process you will be trying to bring
your lemmas closer and closer to lexemes, and you will be using more and
more linguistic resources that are not based on forms alone.

Philippe.

--
mailto:monnier@inf.enst.fr Philippe Monnier mailto:phmonnier@cmc.fr
http://www.inf.enst.fr ENST Infres 46 rue Barrault 75634 Paris cedex 13
http://www.cmc.fr *CMC Dir. Rech.60, rue de Ponthieu 75008 Paris France
ENST Tél. (33)145817588 Fax ~3119 * CMC Tél. (33)156432405 Fax ~2425