Corpora: POS tagging vs. lemmatization

Diana Sousa Marques Pinto Dos Santos (diana.santos@ilf.uio.no)
Mon, 9 Feb 1998 11:06:58 +0100 (MET)

On the discussion about the relative difficulty of POS tagging and
lemmatization, I would like to note that lemmatization is not settled
by prior POS attribution, contrary to what Kilgarriff and Ridings
seem to imply:

Even after the correct POS has been attributed, it can still take quite
some work to know which lemma you should choose, and this is independent
of the theoretically problematic cases mentioned by Ridings (or the
lack of agreement on which choice is best).

Some examples from English:
"saw" has been appropriately tagged as a verb. Which is the
lemma: SEE or SAW? (True, you may already have decided whether it was
past or present, based on statistical considerations that actually
involved the different frequencies of SEE and SAW in the first place,
but then you were also doing implicit lemmatization.)

"lying" has been appropriately tagged as a verb in the gerund. Which is
the lemma: LIE-LIED... or LIE-LAY...? That is, assuming that you would
want to consider the two verbs as two different lemmas(?)
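
To make the two English cases concrete, here is a minimal sketch in Python. The lexicon, the tag names ("VERB.past" etc.) and the lemma labels are invented for this illustration and do not come from any particular tagger. It shows that a bare VERB tag leaves "saw" with two lemmas, that adding tense resolves it, and that "lying" stays ambiguous even with the gerund tag.

```python
# Toy lexicon: (surface form, fine-grained tag) -> possible lemmas.
# Tag names and lemma labels are hypothetical, chosen for this sketch only.
LEXICON = {
    ("saw", "VERB.past"): {"SEE"},
    ("saw", "VERB.present"): {"SAW"},
    ("lying", "VERB.gerund"): {"LIE (lied)", "LIE (lay)"},
}


def candidate_lemmas(form, tag):
    """Collect the lemmas of every entry whose tag equals `tag` or refines it."""
    result = set()
    for (entry_form, entry_tag), lemmas in LEXICON.items():
        if entry_form != form:
            continue
        if entry_tag == tag or entry_tag.startswith(tag + "."):
            result |= lemmas
    return result


if __name__ == "__main__":
    print(candidate_lemmas("saw", "VERB"))           # POS only
    print(candidate_lemmas("saw", "VERB.past"))      # POS plus tense
    print(candidate_lemmas("lying", "VERB.gerund"))  # POS plus verb form
```

With POS alone, "saw" yields both SEE and SAW; once the tense is fixed only SEE remains, which is the implicit lemmatization mentioned above. For "lying" no refinement of the tag helps.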

This may not be a big problem for English, but for morphologically
more complex languages it definitely is, as our experience with
Portuguese shows.
For example,
"vendo" can be
    1st person singular of present indicative of VENDER
    1st person singular of present indicative of VENDAR
    gerund of verb VER

"venda" can be
    1st/3rd person singular of present subjunctive of VENDER
    3rd person singular of present indicative of VENDAR
    singular of noun VENDA (actually, two different nouns ("selling"
        and "veil"), but with no morphological difference, so no
        problem for a lemmatizer)

etc.
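
The Portuguese analyses listed above can be put in a small table to check which ones survive full morphosyntactic disambiguation. This is only a sketch: the tag strings ("V.pres.ind.1sg" and so on) are my own hypothetical notation, and the table holds nothing beyond the analyses given in this message.

```python
from collections import defaultdict

# (form, lemma, morphosyntactic tag) for the analyses listed in the message.
# The tag notation is hypothetical, not that of any existing tagset.
ANALYSES = [
    ("vendo", "VENDER", "V.pres.ind.1sg"),
    ("vendo", "VENDAR", "V.pres.ind.1sg"),
    ("vendo", "VER", "V.ger"),
    ("venda", "VENDER", "V.pres.subj.1sg"),
    ("venda", "VENDER", "V.pres.subj.3sg"),
    ("venda", "VENDAR", "V.pres.ind.3sg"),
    ("venda", "VENDA", "N.sg"),
]


def lemmas_by_tag(form):
    """Group the possible lemmas of `form` by full morphosyntactic tag."""
    table = defaultdict(set)
    for entry_form, lemma, tag in ANALYSES:
        if entry_form == form:
            table[tag].add(lemma)
    return dict(table)


def ambiguous_tags(form):
    """Keep only the tags that still leave more than one lemma open."""
    return {
        tag: lemmas
        for tag, lemmas in lemmas_by_tag(form).items()
        if len(lemmas) > 1
    }


if __name__ == "__main__":
    print(ambiguous_tags("vendo"))
    print(ambiguous_tags("venda"))
```

For "vendo", the tag for 1st person singular present indicative still leaves both VENDER and VENDAR open, so no tag, however detailed, selects the lemma. For "venda", restricted to the analyses listed here, each full tag points to a single lemma, but only once mood and person have been disambiguated.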

This shows, I think, that for Portuguese, and probably for all
languages with a rich morphology, lemmatization still remains a
problem even if you include morphosyntactic disambiguation (and not
only POS attribution) in POS tagging.

In other words, you need full syntactic processing, and sometimes
even semantic information, to get the correct lemma. POS tagging
(or morphological tagging) alone is thus not enough!

--
This remark does not in any way refute Schulze's
and Ridings's claims on the appropriateness of the IMS Corpus
Workbench to get at those lists, provided they have been encoded
with the corpus.

Diana
------------------------------------------------------------------------
Diana Santos                  Tel: +47-22 85 71 10
The Text Laboratory           E-mail: diana.santos@ilf.uio.no
Department of linguistics     Fax: +47-22 85 69 19
University of Oslo            http://www.uio.no/~dianasa/
P.O.box 1102 Blindern
N-0317 Oslo, Norway           http://www.hf.uio.no/tekstlab/
------------------------------------------------------------------------