Re: Corpora: Summary of POS tagger evaluation

Thorsten Brants (thorsten@CoLi.Uni-SB.DE)
Tue, 9 Feb 1999 16:52:00 +0100 (MET)

Adam.Kilgarriff@itri.brighton.ac.uk wrote:
> The 90%/10% strategy implies that
> training and test data are of identical text-type.

Absolutely correct. One can go a step further: experiments using this
strategy yield results for a particular _source_, since using another
source for the same type of text may change results (e.g. different
newspapers differ in style even if they report on the same topics).

> This makes sense
> for developer-oriented evaluation, but overestimates the performance
> of a system in real-life situations, when it will *not* be the case
> that training data is of exactly the same type as the data on which
> the algorithm will be used in earnest.

If you are lucky enough to find a tagger that is trained on the same
source you will be using (or if you have tagged data from the intended
source and a trainable tagger), then the 90%/10% strategy comes very
close to the real-life application.
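
To make this concrete, here is a minimal sketch of the 90%/10% strategy
in pure Python (not tied to any particular tagger; the toy corpus and
the most-frequent-tag model are only stand-ins for illustration):

from collections import Counter, defaultdict

def train(tagged_sents):
    """Learn the most frequent tag per word and an overall default tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lexicon, default

def accuracy(lexicon, default, tagged_sents):
    """Per-token accuracy of the model on held-out tagged sentences."""
    correct = total = 0
    for sent in tagged_sents:
        for word, gold in sent:
            correct += (lexicon.get(word, default) == gold)
            total += 1
    return correct / total

# 'corpus' stands in for tagged sentences from a single source/text type.
corpus = [[("the", "DET"), ("cat", "N"), ("sleeps", "V")],
          [("a", "DET"), ("dog", "N"), ("barks", "V")]] * 50
split = int(0.9 * len(corpus))
lexicon, default = train(corpus[:split])           # 90% for training
print(accuracy(lexicon, default, corpus[split:]))  # 10% held out for testing

Since all sentences come from a single source, the held-out 10% follows
the same distribution as the training portion, which is exactly the
optimistic setting discussed above.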

> In most uses in earnest, the
> tagger will be applied to a stream of texts, some of which do not exist
> yet, and the properties of which we can only guess at, on the basis of
> the samples we have so far.

One implication of the above is the necessity to adapt a tagger to a
particular source in order to avoid errors stemming from cross-domain or
cross-source usage (BTW: everybody tries to avoid cross-language errors
by training and testing on the same language :-).

Adaptation may include extending the lexicon, re-estimating parameters
on data from the target source, ...
Yet, for all taggers that I am aware of, this requires either large
amounts of data or an expert user.
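
As an illustration only (reusing train() and the toy corpus from the
sketch above; the target-source sample is invented), parameter
re-estimation on a small hand-tagged sample from the target source
could look like this, which also extends the lexicon with words seen
only in the new domain:

# Hypothetical tagged sample from the target source.
target_sample = [[("shares", "N"), ("fell", "V")]] * 5
# Re-estimate on the original training data plus the target-source sample.
lexicon, default = train(corpus[:split] + target_sample)
# Evaluation should now use held-out data from the *target* source,
# not from the original training source.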

-Thorsten