Re: Corpora: Summary of POS tagger evaluation

José Gabriel P. Lopes (gpl@di.fct.unl.pt)
Wed, 10 Feb 1999 14:23:56 +0000

There is no such thing as an unambiguous word. For instance, in "the word
the is an article", the second "the" is a proper noun. Probably one should
write it surrounded by ' ', but in real text we will face many occurrences
of similar phenomena: line 5, document 18999, function f, letter c, ... A
single 'comma' can be just a punctuation sign, but it can also be a
coordinating conjunction or a proper noun. Of course, it depends on the kind
of texts one is looking at.

Sorry for the interruption.

Best regards,

Gabriel Pereira Lopes

Thorsten Brants wrote:

> O.Mason@bham.ac.uk wrote:
> > This raises an issue which is slightly more complex: if you exclude
> > punctuation (presumably on the grounds that a comma is always tagged
> > as `comma' and there is no ambiguity), why include other unambiguous
> > tokens in the scoring? If `the' always gets assigned `DET', and no
> > other tags for it are possible, then why count it and not the comma?
>
> one reason for _not_ excluding unambiguous words is sparse data: how do
> you know that a word is unambiguous? Just that it has only one tag in
> the lexicon is not sufficient because the correct tag may not be listed.
>
> If you exclude unambiguous words from scoring, you really would need two
> different accuracy results in order to describe the performance of a
> tagger: one for ambiguous words, the other one for ``unambiguous''
> words.
>
> -Thorsten
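
The two accuracy figures suggested in the quoted message could be computed
roughly as follows. This is only a minimal sketch: the function name
split_accuracy, the tag names (DET, NN, VB, VBZ, NP) and the toy lexicon are
assumptions made for illustration, not taken from any tagger discussed in
this thread. A token is treated as "unambiguous" when the lexicon lists
exactly one tag for it, which is precisely the assumption questioned above.

```python
from collections import defaultdict


def split_accuracy(gold, predicted, lexicon):
    """Return (accuracy on ambiguous tokens, accuracy on "unambiguous" tokens).

    gold, predicted: equal-length lists of (word, tag) pairs.
    lexicon: dict mapping each word to the set of tags listed for it.
    A token counts as "unambiguous" only if the lexicon lists exactly one
    tag for it; words missing from the lexicon count as ambiguous.
    A group with no tokens yields None instead of a number.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for (word, gold_tag), (_, pred_tag) in zip(gold, predicted):
        group = "unambiguous" if len(lexicon.get(word, ())) == 1 else "ambiguous"
        total[group] += 1
        correct[group] += gold_tag == pred_tag  # True adds 1, False adds 0

    def accuracy(group):
        return correct[group] / total[group] if total[group] else None

    return accuracy("ambiguous"), accuracy("unambiguous")


# Hypothetical lexicon: "the" is listed with a single tag only.
lexicon = {
    "the": {"DET"}, "word": {"NN", "VB"}, "is": {"VBZ"},
    "an": {"DET"}, "article": {"NN"},
}
# "the word the is an article": the second "the" is a proper noun (NP).
gold = [("the", "DET"), ("word", "NN"), ("the", "NP"),
        ("is", "VBZ"), ("an", "DET"), ("article", "NN")]
# A tagger that trusts the lexicon tags the second "the" as DET.
predicted = [("the", "DET"), ("word", "NN"), ("the", "DET"),
             ("is", "VBZ"), ("an", "DET"), ("article", "NN")]

ambiguous_acc, unambiguous_acc = split_accuracy(gold, predicted, lexicon)
print(ambiguous_acc, unambiguous_acc)
```

In this toy example the only error falls in the "unambiguous" group, because
the correct tag for the second "the" is not listed in the lexicon.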