Absolutely correct. One can go one step further: experiments
using this strategy yield results that hold only for a particular
_source_, since using another source for the same type of text may
change the results (e.g. different newspapers differ in style even if
they report on the same topics).
> This makes sense
> for developer-oriented evaluation, but overestimates the performance
> of a system in real-life situations, when it will *not* be the case
> that training data is of exactly the same type as the data on which
> the algorithm will be used in earnest.
If you are lucky enough to find a tagger that is trained on the same
source that you will be using (or if you have tagged data from the
intended source and a tagger that is trainable), then the 90%/10%
strategy is very close to the real-life application.
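To make the in-source vs. cross-source distinction concrete, here is a
minimal sketch in Python. Everything in it is an illustrative
assumption: the two toy "sources", the function names, and the trivial
most-frequent-tag tagger with "NN" as the fallback for unknown words.
It is not any particular tagger, only a way to show how the same model
can be scored on a held-out 10% of its own source and on a different
source.

```python
from collections import Counter, defaultdict

def split_90_10(sentences):
    """Hold out the last 10% of a tagged corpus (at least one sentence)."""
    n_test = max(1, len(sentences) // 10)
    return sentences[:-n_test], sentences[-n_test:]

def train_unigram(sentences):
    """Toy tagger: remember the most frequent tag seen for each word."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def accuracy(model, sentences, default_tag="NN"):
    """Per-token accuracy; unknown words get default_tag (an assumption)."""
    pairs = [(w, t) for sent in sentences for w, t in sent]
    correct = sum(model.get(w, default_tag) == t for w, t in pairs)
    return correct / len(pairs)

# Two hypothetical sources of the same text type (newspaper prose).
source_a = [[("the", "DT"), ("market", "NN"), ("fell", "VBD")]] * 10
source_b = [[("the", "DT"), ("striker", "NN"), ("scored", "VBD")]] * 10

train_a, test_a = split_90_10(source_a)
model = train_unigram(train_a)

in_source = accuracy(model, test_a)       # 90%/10% on the same source
cross_source = accuracy(model, source_b)  # same model, other source

print(f"in-source accuracy:    {in_source:.2f}")
print(f"cross-source accuracy: {cross_source:.2f}")
```

On this toy data the held-out 10% looks perfect, while the other source
loses every token whose word the tagger has never seen with the right
tag. Real corpora will of course show smaller and noisier gaps.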
> In most uses in earnest, the
> tagger will be applied to a stream of texts, some of which do not exist
> yet, and the properties of which we can only guess at, on the basis of
> the samples we have so far.
One implication of the above is the necessity to adapt a tagger to a
particular source in order to avoid errors stemming from cross-domain or
cross-source usage (BTW: everybody tries to avoid cross-language errors
by training and testing on the same language :-).
Adaptation may include extensions of the lexicon, parameter
re-estimation using data from the target source, ...
Yet, for all taggers that I am aware of, this requires either large
amounts of data or an expert user.
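As a rough sketch of what lexicon extension and parameter
re-estimation can look like in the simplest case, consider word/tag
counts that are updated with a small tagged sample from the target
source. The function names, the toy data, and in particular the
up-weighting factor are assumptions made for illustration. Choosing
such a weight well is exactly the kind of decision that needs either
more data or an expert.

```python
from collections import Counter, defaultdict

def count_tags(sentences):
    """Collect word -> tag frequency counts from tagged sentences."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    return counts

def adapt(base_counts, target_sentences, weight=5):
    """Re-estimate counts with target-source data.

    Each target observation counts `weight` times (a hypothetical
    knob, not a recommended value). New words extend the lexicon.
    """
    adapted = defaultdict(Counter)
    for word, c in base_counts.items():
        adapted[word].update(c)
    for word, c in count_tags(target_sentences).items():
        for tag, n in c.items():
            adapted[word][tag] += weight * n
    return adapted

def best_tag(counts, word, default="NN"):
    """Most frequent tag for a word, or a default for unknown words."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default

# Original source: "saves" is always a verb.
base = count_tags([[("she", "PRP"), ("saves", "VBZ"), ("money", "NN")]] * 3)
# Small sample from the target source: a new word and a new reading.
target = [[("two", "CD"), ("offside", "JJ"), ("saves", "NNS")]]

adapted = adapt(base, target)

print(best_tag(base, "saves"), "->", best_tag(adapted, "saves"))
print(best_tag(base, "offside"), "->", best_tag(adapted, "offside"))
```

With a weight of 1 the single target example for "saves" would be
outvoted by the three original ones, which illustrates why small
amounts of target data are not enough on their own.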
-Thorsten