Lemmatize -- Summary

Ray Liere (lierer@mail.CS.ORST.EDU)
Fri, 10 Nov 1995 17:25:06 -0800

Next message: Nancy Ide: "Call: Word Sense Disambiguation"
Previous message: Rovny Ferenc: "Course in Computational Lexicography"

A few days ago, I posted to corpora:
= I am relatively new to natural language processing -- I am doing
= research in the application of machine learning to information
= retrieval.
=
= I have encountered the word "lemmatize", but no definition. I believe
= (from context) that it is a very generalized form of stemming ... ?
= But I would like to know specifically what it is and some details
= about common techniques for lemmatization.
=
= I have tried to find out this information on my own, but am obviously
= looking in all of the wrong places. The few references I did find
= assume that one knows the details, and argue whether or not it is good
= compared to other methods, etc.
=
= Could someone familiar with lemmatization provide a definition, or example,
= or suggest a good reference?
=
= Email if you prefer, and I will post a summary.

I have summarized the responses that I received below. My thanks to everyone
who sent information -- it has been very helpful. It is gratifying to
receive so much help! I especially noticed the comparison, as I happened
at about the same time to post a different inquiry to another mailing
list ... and got only one response. Thanks to everyone for sharing
their knowledge.

Ray Liere
lierer@mail.cs.orst.edu
============================
>From: Thomas Bilgram <bilgram@ling.hum.aau.dk>

A method for doing this has been developed by Lingsoft Inc. (Helsinki, Finland).
Have a look at http://www.lingsoft.fi
============================
>From: Ela Dura <sveed@svenska.gu.se>

The input to a lemmatizer is a text word, the output is a lexical
form of a word (basic form) and a description of the grammatical
form of the text word. For instance, "wrote" would be "write" plus
indicative active preterite imperfect. A soft introduction in:
Oostdijk, Nelleke. 1991. Corpus Linguistics and the Automatic Analysis
of English. Amsterdam: Rodopi.
============================
>From: Gregory Grefenstette <Gregory.Grefenstette@Xerox.fr>

Lemmatization, as we use it at Xerox, means reducing
the surface form of the word to its canonical dictionary
entry form.

Ex.
'screeching', 'screeches', 'screeched' and 'screech' lemmatize to 'screech'
'were' lemmatizes to 'be'
'dogs' lemmatizes to 'dog'
============================
>From: LJUNGM@engelska.su.se (Magnus Ljung)

About the terms LEMMA and LEMMATIZATION:
There are a couple of definitions that have been used but to most
people in the field - certainly the ones I know - lemmatization is
simply the subsuming of inflected forms under a 'head word' i.e. what
dictionaries generally do.
============================
>From: Adam Kilgarriff <ak28@it-research-institute.brighton.ac.uk>

lemmati{sz}e = find the lemma, eg the dictionary headword.
A simple process for English (eg stemming + a few spelling rules + a few
exceptions) but not for more morphologically complex languages.
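
The recipe above (exceptions + stemming + a few spelling rules) can be
sketched roughly as follows. This is only a toy illustration: the exception
list, the suffix rules and the function names are all made up for the
example, and the rules will both over- and under-apply on real text.

```python
# Toy English lemmatizer: exception list first, then suffix stripping
# followed by a couple of spelling repairs. All tables are illustrative.

EXCEPTIONS = {"went": "go", "were": "be", "thought": "think", "mice": "mouse"}

# (suffix, replacement, apply spelling repair?) tried in this order.
SUFFIX_RULES = [
    ("ies", "y", False),    # parties -> party
    ("ches", "ch", False),  # screeches -> screech
    ("shes", "sh", False),  # wishes -> wish
    ("ing", "", True),      # screeching -> screech
    ("ed", "", True),       # screeched -> screech
    ("s", "", False),       # dogs -> dog
]

VOWELS = set("aeiou")
NEVER_UNDOUBLE = VOWELS | set("lsz")   # keep "kill", "miss", "buzz" intact
NO_SILENT_E = VOWELS | set("wxy")      # "play", "fix" take no final e


def repair_spelling(stem: str) -> str:
    """Undo two common spelling changes made when a suffix is attached."""
    if len(stem) < 3:
        return stem
    # Doubled final consonant: "runn" -> "run".
    if stem[-1] == stem[-2] and stem[-1] not in NEVER_UNDOUBLE:
        return stem[:-1]
    # Dropped silent e in one-vowel stems: "lov" -> "love", "hop" -> "hope".
    one_vowel = sum(ch in VOWELS for ch in stem) == 1
    if (one_vowel and stem[-1] not in NO_SILENT_E
            and stem[-2] in VOWELS and stem[-3] not in VOWELS):
        return stem + "e"
    return stem


def lemmatize(word: str) -> str:
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement, needs_repair in SUFFIX_RULES:
        # Require at least three letters before the suffix ("thing", "is").
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: len(word) - len(suffix)] + replacement
            return repair_spelling(stem) if needs_repair else stem
    return word
```

With these tables, lemmatize("screeching") and lemmatize("screeches") both
give "screech", lemmatize("loving") gives "love", and lemmatize("went")
gives "go" via the exception list. Words such as "this" or "added" come out
wrong, which is where a larger exception list would have to take over.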
============================
>From: olenc@coco.ihi.ku.dk (Ole Norling-Christensen)

In our vocabulary this word has two meanings:

1. (dictionary making) You lemmatize a word (or a multiword unit) when you
decide to make it a headword (lemma) in your dictionary.

2. (corpus/text work) You lemmatize a running word (a token), or a
multiword unit, when you assign a lemma to it. According to criteria
like meaning, syntactic function etc., different instances of the
same _type_ (i.e. word form) may be assigned different lemmas, e.g.
(financial) 'bank' / 'bank' (of a river); 'work' (noun), 'work' (verb).
Multiword units may pose special problems: in 'has always been',
'has..been' may be lemmatized as an inflectional form of 'be'.

Some authors separate the process into two: lemmatization is the assigning
of all possible lemmas to each word form (type) - by morphological
analysis and/or dictionary look-up, after which, as a separate process,
follows a part-of-speech disambiguation which will select the most
plausible analysis.
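
A minimal sketch of this two-step view, assuming a small hand-made lexicon
and a part-of-speech tag supplied from elsewhere (for example by a tagger).
The lexicon entries, tag names and function names are hypothetical.

```python
# Step 1: look up every (lemma, part-of-speech) reading of a word form.
# Step 2: select the reading that agrees with a tag obtained from context.

LEXICON = {
    "work": [("work", "NOUN"), ("work", "VERB")],
    "does": [("do", "VERB"), ("doe", "NOUN")],
    "thought": [("think", "VERB"), ("thought", "NOUN")],
    "been": [("be", "VERB")],
}


def candidate_lemmas(form: str) -> list:
    """Return all readings of a word form (type), ignoring context."""
    form = form.lower()
    return LEXICON.get(form, [(form, "UNKNOWN")])


def disambiguate(form: str, pos_tag: str) -> str:
    """Pick the lemma whose part of speech matches the tag for this token."""
    candidates = candidate_lemmas(form)
    for lemma, pos in candidates:
        if pos == pos_tag:
            return lemma
    # No reading matches the tag: fall back to the first listed reading.
    return candidates[0][0]
```

Here candidate_lemmas("does") lists both "do" and "doe", and
disambiguate("does", "NOUN") then selects "doe".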

The lemmatizing-2 may be done automatically or by hand or by some
intermediate procedure - typically one by which the machine gradually
learns what you expect it to do.

Now, what _is_ a lemma? - in principle, you decide. It depends on
what you will use the lemmatized text for, which set of word classes
and which grammar you want to apply, etc. etc.
============================
>From: Ken Beesley <Ken.Beesley@Xerox.fr>

There are terminological swamps to negotiate in natural-language processing.

"Stemming" is often used to refer to rather primitive "Porter-stemmer"
approaches that rip off letters that look like suffixes and
give you back the remainder as the "stem".

"Lemmatization" or "baseform reduction" takes a word and returns
a baseform or "citation form" that, by convention, represents
the whole Lemma.

The different surface forms "loving," "loved", "loves" and "love" all
belong to the same verb lemma. By convention, this lemma is represented
in dictionaries under the citation form "love". In French, the infinitive
form "aimer" is, by tradition, the citation form for the dozens of forms
representing the lemma meaning "to love". Although Latin has a cognate
infinitive form amare, the convention is to store verbs in dictionaries
under the 1st person singular present indicative form, e.g. amo ("I love").

So, a typical "stemming" program might take "loving" and give you back
"lov" as the stem (by a simple process of lopping off the -ing). "love"
and "loves" might be stemmed back to "love". "go" and "went" will have
no common stem at all. "thought" will not be mapped back to "think", etc.

A typical "lemmatizer" would take any form of the lemma "to love" and
return the baseform "love". It will take "went" and return "go". It will
take "thought" and return "think" (for the verb reading).
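
The contrast can be made concrete with a small sketch. The stemmer below is
a deliberately crude suffix stripper (it is not the Porter algorithm), and
the lemmatizer is just a lookup table; both are invented for illustration.

```python
# A crude "stemmer" that lops off anything that looks like a suffix,
# next to a table-driven "lemmatizer" that knows the irregular forms.

SUFFIXES = ("ing", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, leaving at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table mapping surface forms to citation forms.
LEMMA_TABLE = {
    "loving": "love", "loved": "love", "loves": "love", "love": "love",
    "went": "go", "go": "go",
    "thought": "think", "think": "think",
}


def lemmatize(word: str) -> str:
    """Return the citation form if known, else the word unchanged."""
    return LEMMA_TABLE.get(word, word)
```

crude_stem("loving") returns "lov" while crude_stem("loves") returns
"love", and "went" and "go" keep different stems. lemmatize maps all four
forms of "love" to "love" and "went" to "go".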

A typical "morphological analyzer" will return the baseform, like a lemmatizer,
and typically other information like part of speech, tense, mood, number,
person, etc. E.g. the Spanish word "amas" might be returned as

Baseform: amar
Tense: Present
Mood: Indicative
Person: 2
Number: Singular

A good lemmatizer or morphological analyzer will also be able to handle all
the irregular words correctly.
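
As a rough sketch of the difference in output, an analyzer can be modelled
as returning full records, with a lemmatizer obtained by keeping only the
baseform field. The record layout, the table entries and the function names
are assumptions made for this example, not any product's actual format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Analysis:
    baseform: str
    pos: str
    tense: Optional[str] = None
    mood: Optional[str] = None
    person: Optional[int] = None
    number: Optional[str] = None


# Hand-written analyses; a real analyzer would compute these.
ANALYSES = {
    # Spanish "amo": verb form ("I love") or noun ("master, owner").
    "amo": [
        Analysis("amar", "Verb", "Present", "Indicative", 1, "Singular"),
        Analysis("amo", "Noun", number="Singular"),
    ],
    "went": [Analysis("go", "Verb", tense="Past")],
}


def analyze(word: str) -> List[Analysis]:
    """Return every analysis of the word, or an empty list if unknown."""
    return ANALYSES.get(word.lower(), [])


def lemmas(word: str) -> List[str]:
    """Discard the grammatical information and keep only the baseforms."""
    return sorted({a.baseform for a in analyze(word)})
```

analyze("went") yields a single record with baseform "go" and tense
"Past"; lemmas("amo") yields ["amar", "amo"], one baseform per reading.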

This terminology is not standard. I work for XSoft (a division of Xerox) which
builds and sells morphological analyzers, which can also easily be stripped
of the grammatical information so that they become the lemmatizers that I
describe above. But my bosses often refer to these lemmatizers as "stemmers,"
whereas I would like to distance them from the primitive "stemmers" that just
chop off anything that looks like a suffix.

At XSoft, we use Xerox Finite-State Morphology to build our products. The
results are small and extremely fast. They can also be run "backwards" to
do generation. We have products for English, French, German, Spanish,
Portuguese, Dutch and Italian, and more languages will be developed
next year.
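
The idea of running an analyzer "backwards" can be illustrated without any
finite-state machinery: store the relation between surface forms and
analyses once, and index it in both directions. The plain table below only
stands in for a transducer, and the tag strings are invented for the example.

```python
from collections import defaultdict

# One relation between surface forms and (baseform, tags) analyses.
PAIRS = [
    ("loves", ("love", "Verb+Pres+3sg")),
    ("loved", ("love", "Verb+Past")),
    ("loving", ("love", "Verb+Prog")),
    ("goes", ("go", "Verb+Pres+3sg")),
    ("went", ("go", "Verb+Past")),
]

# Index the same pairs in both directions.
ANALYZE = defaultdict(list)
GENERATE = defaultdict(list)
for surface, analysis in PAIRS:
    ANALYZE[surface].append(analysis)
    GENERATE[analysis].append(surface)


def analyze(surface: str) -> list:
    """Surface form -> list of (baseform, tags)."""
    return list(ANALYZE.get(surface, []))


def generate(baseform: str, tags: str) -> list:
    """(baseform, tags) -> list of surface forms: analysis run in reverse."""
    return list(GENERATE.get((baseform, tags), []))
```

analyze("went") returns [("go", "Verb+Past")], and generate("go",
"Verb+Past") returns ["went"].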

For more information on the XSoft lexical products, contact
Daniella_S._Russo.OSBU_North@xerox.com
============================
>From: oliver@clg.bham.ac.uk (Oliver Jakobs)

Lemmatization is the reduction of inflected forms to a common base form,
eg
runs, ran, running -> run
bigger, biggest -> big

The base form is usually the infinitive for verbs, the positive for
adjectives and the nominative case for nouns.

As opposed to stemming, lemmatization is linguistically motivated (and
therefore more difficult to accomplish). Common techniques employ
lists of endings and rules for how to deal with them; a good example is in
a book by Mary Dee Harris, I think the title is "Natural Language
Processing" or something similar. The difficulties lie in ambiguities,
eg "does" can be a form of both "do" and "doe".
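
A small sketch of the ending-list approach, showing how the "does"
ambiguity arises. The list of endings, the base-form lexicon and the
irregular table are made up for this example and are far from complete.

```python
# Strip each known ending, optionally undouble a final consonant, and keep
# only candidates that are attested base forms in a lexicon.

BASE_FORMS = {"do", "doe", "run", "big", "dog", "bus"}
IRREGULAR = {"ran": "run"}
ENDINGS = ["es", "s", "er", "est", "ing"]


def candidates(word: str) -> list:
    """Return every base form the word could be an inflection of."""
    word = word.lower()
    found = set()
    if word in IRREGULAR:
        found.add(IRREGULAR[word])
    if word in BASE_FORMS:
        found.add(word)
    for ending in ENDINGS:
        if word.endswith(ending):
            stem = word[: -len(ending)]
            variants = {stem}
            # Doubled final consonant: "bigg" -> "big", "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                variants.add(stem[:-1])
            found |= variants & BASE_FORMS
    return sorted(found)
```

candidates("does") returns ["do", "doe"] because both the -es and the -s
rule lead to an attested base form, whereas candidates("bigger") and
candidates("biggest") both return ["big"]. Choosing between "do" and "doe"
needs information from the context.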
============================
>From: maria@ling.ed.ac.uk

the lemma (if I remember it correctly) is the part of a word serving
as a dictionary entry. So for example, if you have an inflected
word, you first have to lemmatize it before you look it up.

A good reference might be any older book on computational linguistics
or for that matter computational morphology/lexicography.
Can you read German?
============================
>From: gfowler@indiana.edu (George Fowler)

Lemmatization means uniting all forms of a word in the compilation of
concordances. This would put "is", "are", "be", "were", "was", "been" (and
maybe some others!) into one entry. This is close to trivial for English,
but much harder for extensively inflected languages like Russian (my own
experience) or Georgian.
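
For illustration, a lemmatized concordance can be sketched as an index keyed
by lemma rather than by word form. The lemma table covers only forms of "be"
and the tokenizer is a bare regular expression; both are assumptions made
for the example.

```python
import re
from collections import defaultdict

# Surface forms that should all be filed under the entry "be".
LEMMAS = {
    "am": "be", "is": "be", "are": "be", "be": "be",
    "was": "be", "were": "be", "been": "be", "being": "be",
}


def lemmatized_concordance(text: str, width: int = 2) -> dict:
    """Map each lemma to (word form, context) pairs for its occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    index = defaultdict(list)
    for i, token in enumerate(tokens):
        lemma = LEMMAS.get(token, token)
        # Context: up to `width` tokens on either side of the hit.
        context = " ".join(tokens[max(0, i - width): i + width + 1])
        index[lemma].append((token, context))
    return dict(index)
```

For the text "She was here. They were late. It is fine." the entry for "be"
collects the forms "was", "were" and "is" with their contexts, and there is
no separate entry for "was".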
============================
>From: Judith Klavans <klavans@cs.columbia.edu>

In the Encyclopedia of Artificial Intelligence, there is a complete
entry for Morphology which gives examples in many languages, esp.
European languages but also Oriental, etc. (not just English)
If you cannot get this, I can send you a reprint.

The entry is ``Morphology'', authored by Judith Klavans and
Evelyne Tzoukermann. We cover some theoretical and applications
oriented aspects of morphology.

The article is not long (10 pages) and to the point.
============================