Re: Corpora: POS tagging vs. lemmatization

Eckhard Bick (lineb@hum.aau.dk)
Tue, 10 Feb 1998 12:33:16 +0100

As to the morphosyntactic tagging of Portuguese as an example of a
morphologically rich language, I fully agree with Diana Santos: PoS is not
enough for lemmatization, and syntactic function and morphology/PoS are
heavily interdependent in any full analysis. Even for Portuguese,
though, automatic analysis is not too difficult, at least for a
rule-based system (which is what I know about), since the additional
morphological details not only enlarge the tag set (and thus make
things more difficult), but also provide more specific context for
disambiguation (and thus make things easier). Interestingly, overall
morphological ambiguity in Portuguese before contextual disambiguation
is similar to what is described for English (roughly 2 readings per
word, on average).
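
To make the point about context concrete, here is a minimal sketch in
Python of how a Constraint Grammar style REMOVE rule discards readings.
It is not the actual Portuguese grammar: the tag names, the four toy
readings for 'venda', the unambiguous article 'uma', and the single
rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    lemma: str
    tags: tuple  # hypothetical tag names, not the real tag set


# Toy input: each word comes with its cohort of candidate readings.
# The readings listed for 'venda' are illustrative assumptions.
sentence = [
    ("uma", [Reading("um", ("ART", "F", "S"))]),
    ("venda", [
        Reading("venda", ("N", "F", "S")),
        Reading("vender", ("V", "PR", "SUBJ", "3S")),
        Reading("vendar", ("V", "PR", "IND", "3S")),
        Reading("vendar", ("V", "IMP", "2S")),
    ]),
]


def mean_readings(cohorts):
    """Average number of readings per word (morphological ambiguity)."""
    return sum(len(readings) for _, readings in cohorts) / len(cohorts)


def remove_verb_after_article(cohorts):
    """Hypothetical rule: REMOVE (V) IF the previous word is an
    unambiguous article. The last remaining reading is never removed."""
    out = []
    for i, (word, readings) in enumerate(cohorts):
        prev = out[i - 1][1] if i > 0 else []
        prev_is_article = bool(prev) and all("ART" in r.tags for r in prev)
        if prev_is_article:
            kept = [r for r in readings if "V" not in r.tags]
            if kept:  # keep the cohort intact rather than empty it
                readings = kept
        out.append((word, readings))
    return out


disambiguated = remove_verb_after_article(sentence)
```

In this toy case the detailed tags on the left-hand word are what
allow the rule to fire at all, which is the sense in which a richer
tag set can make disambiguation easier.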

I have written a Constraint Grammar based automatic tagger/parser for
Portuguese, which handles both levels with low error rates, if anybody
is interested. At near 100% precision, recall is over 99% for
morphology/PoS, and 97-98% for flat syntax (both numbers are obviously
tag set dependent). For 'venda', it finds a fourth morphological
reading, the imperative singular. The parser is on the web, as part of a
multi-lingual teaching tool:

http://visl.hum.ou.dk/Linguistics.html
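
For readers unfamiliar with how precision and recall are usually
counted when a tagger may leave more than one reading on a word, here
is a small Python sketch. It assumes the common convention (recall =
share of words whose correct reading survives, precision = share of
surviving readings that are correct); the function name and the toy
data are hypothetical and are not taken from the parser's evaluation.

```python
def precision_recall(output, gold):
    """output: one set of surviving readings per word.
    gold: the single correct reading per word."""
    retained = sum(len(readings) for readings in output)
    correct = sum(1 for readings, g in zip(output, gold) if g in readings)
    precision = correct / retained  # correct readings / readings kept
    recall = correct / len(gold)    # words whose correct reading survived
    return precision, recall


# Toy example: four words, one left ambiguous, one wrongly resolved.
toy_output = [{"N"}, {"V", "N"}, {"ART"}, {"ADJ"}]
toy_gold = ["N", "V", "ART", "N"]
toy_precision, toy_recall = precision_recall(toy_output, toy_gold)
```

Under this convention, leaving a word ambiguous costs precision but
not recall, while removing the correct reading costs both.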

Similar systems exist for other languages, such as English and German
(Lingsoft's engcg and gercg; links at the same site and at lingsoft.fi).

Eckhard

-- 
Eckhard Bick, cand.med., cand.mag.
web: http://ling.hum.aau.dk/~eckhard/index.html
work: Dpt. of Linguistics, Århus University, Tel. +45 89422131
home: Rugbjergvej 98, DK-8260 Viby J, Tel. +45 86283524, Fax. 1397