Corpora: Tagging is not...

Rodolfo Delmonte (delmont@helios.unive.it)
Tue, 16 Feb 1999 09:59:18 +0100

Hi all, I sympathize strongly with all the people who contributed to the
current debate and to the circulated summary about POS taggers. However, I
must say that it does not fit into our framework, either theoretical or
implementational. In fact, it seems flawed by a number of false
premises/presuppositions.
We assume that tagging cannot and should not be regarded as a
self-contained, self-sufficient processing task/module: we regard it as just
the first important module/process in a wider and deeper text processing
system. We also assume it must be in a strict feeding relationship with a
(shallow) syntactic parser/chunker, which is then used either for text
understanding and generation/summarization or for other, more complex
tasks.
Since tagging cannot be regarded as an end in itself, restrictions on its
output should be targeted to the goals the tagging is intended for, i.e. it
should obey the following criteria:
A. It should be over 99% correct, i.e. the error rate should be below 1%
   - errors are here intended as words unknown to the Guesser which can be
     tagged neither as proper names nor as foreign words;
B. It should be generative, in order to be adaptable to, and cope with,
   different domains/genres/corpora
   - this means that the tagger is actually a morphological analyser with
     linguistic rules, a root dictionary, a list of the affixes of the
     language, and constraints on the generation process, and not a list of
     wordforms with one single tag;
C. It should produce lemmata, which is a trivial task with morphologically
   poor languages like English or Chinese, but not so trivial with all
   remaining languages (and we work on Italian)
   - lemmata are essential in all tasks of semantic/conceptual information
     retrieval;
D. It should allow subcategorization information to be encoded in verbal
   tags, to serve further processing modules. It should incorporate a
   minimum of efficient and necessary semantic information in the tags
   requiring it, in order to produce sensible tagging disambiguation: e.g.
   temporal nouns, common nouns, nouns denoting human beings, proper nouns,
   etc.
E. Disambiguation should be syntactically targeted and pragmatically
   constrained on the basis of genre/corpus type: the word "La" in Italian
   is a clitic pronoun, a definite article, and a common noun (meaning the
   note A), but this latter tag is rare or specific to a certain domain,
   and occurs only with initial uppercase "L".
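A generative guesser in the sense of criterion B can be sketched as follows.
This is a minimal illustration, not our actual analyser; the roots, suffixes,
and tag names are invented for the example:

```python
# Toy generative guesser (criterion B): instead of listing wordforms
# with a single tag, strip known affixes and check the remaining root
# against a root dictionary, with constraints on the generation process.
# The roots, suffixes, and tags below are invented for illustration.
ROOTS = {"parl": "V", "cas": "N"}     # hypothetical root dictionary
SUFFIXES = {"are": "V", "a": None}    # suffix -> required root class (or any)

def guess(wordform):
    """Return the possible tags for an unknown wordform via affix stripping."""
    tags = set()
    for suffix, required in SUFFIXES.items():
        if wordform.endswith(suffix):
            root = wordform[: -len(suffix)]
            if root in ROOTS and (required is None or required == ROOTS[root]):
                tags.add(ROOTS[root])
    return sorted(tags)

print(guess("parlare"))   # root "parl" + infinitive "-are" -> ['V']
print(guess("casa"))      # root "cas" + "-a" -> ['N']
```

Because the analysis is rule-driven rather than list-driven, the same
machinery extends to new domains by enlarging the root dictionary rather
than re-listing every wordform.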

If a tagger obeys all the above general criteria, the evaluation of its
performance could be organized as follows:
- percentage of unknown vs. out-of-vocabulary words as a percentage of the
  total number of tokens (punctuation excluded - even though punctuation is
  important for disambiguation!). Out-of-vocabulary words include proper
  nouns automatically labeled on the basis of an initial uppercase letter
  and words legitimately guessed by means of generative processes, but also
  wrongly tokenized items and other misspelled words. In an experiment we
  carried out on a corpus of 1 million words of Italian we got 5000 unknown
  words out of 35000 out-of-vocabulary words, i.e. 14%, and we assume that
  a good guesser should approximate that rate;
- number of ambiguously tagged tokens vs. unambiguous ones, to test the
  Disambiguator's efficiency rate. This proportion is language-dependent
  and, in some measure, also possibly pragmatically variable in the sense
  specified above;
- the number of tagging errors should be weighted by the number of tags in
  the tagset. However, we feel that errors should always be related to the
  type of syntactic framework the tagging is targeted for. Errors should
  also be weighted by the level of ambiguity as measured by the previous
  count.
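The first of these counts can be made concrete with the figures from the
Italian experiment reported above:

```python
# Unknown-word rate from the 1-million-word Italian experiment:
# 5000 words unknown to the Guesser out of 35000 out-of-vocabulary words.
unknown_words = 5000
oov_words = 35000
rate = unknown_words / oov_words
print(f"unknown/OOV rate: {rate:.1%}")   # -> unknown/OOV rate: 14.3%
```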
More could be said on the use of HMMs and other statistical approaches in
automatic tagging processors: even though the use of probability measures
is important in increasing the efficiency of the Disambiguator, we feel
that HMMs are not suitable for tasks different from the ones for which
they were conceived, i.e. speech analysis and recognition. The reason is
very simple: neither bigram nor trigram models are sufficient to encode
the complexity of syntactic structure. As for our Disambiguator, we use
augmented Finite-State Transducers.
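As a toy illustration of syntactically targeted, pragmatically constrained
disambiguation (the "La" case of criterion E) - a plain rule sketch, not our
actual augmented FST, with invented tag names:

```python
# Toy disambiguation rule for Italian "la"/"La" (criterion E): not the
# augmented Finite-State Transducer we actually use, just a sketch of
# how right context and domain/genre can constrain the choice. The tag
# names ("PRON_CLITIC", "DET", "NOUN") are invented for the example.
def disambiguate_la(token, next_tag, domain="news"):
    """Choose one reading for the ambiguous wordform 'la'/'La'."""
    if token == "La" and domain == "music":
        return "NOUN"            # 'the note A': rare, domain-specific reading
    if next_tag == "VERB":
        return "PRON_CLITIC"     # e.g. "la vedo" ('I see her')
    return "DET"                 # e.g. "la casa" ('the house')

print(disambiguate_la("la", "VERB"))   # -> PRON_CLITIC
print(disambiguate_la("la", "NOUN"))   # -> DET
```

The point of the sketch is only that the decision depends both on local
syntactic context (the following tag) and on a pragmatic parameter (the
domain), which is what bigram/trigram models alone cannot capture.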
This is a rather long message, more to come in case the debate continues!!
R.D.

**********************************************************
rodolfo delmonte Ph.D.
Associate Professor of Computational Linguistics
Section of Linguistic Studies
Ca' Garzoni-Moro, San Marco 3417
Universita' Ca' Foscari
30124 - VENEZIA (It)
tel.:39-41-2578464
lab.:39-41-2578452/19
fax.:39-41-5287683
website: http://byron.cgm.unive.it
**********************************************************