Re: Corpora: Summary of POS tagger evaluation

Andrew Harley (aharley@cup.cam.ac.uk)
Tue, 09 Feb 1999 11:53:53 +0000

At 12:34 PM 09/02/1999 +0100, Thorsten Brants wrote:
>
>> Seeing my summary from over a year ago re-posted, I thought I had better
>> update it with some of our more recent findings. We tested more taggers,
>> and found that the best performers were the CLAWS tagger from Lancaster
>> University and the ENGCG tagger from Lingsoft, although none of the tested
>> taggers scored in the supposed standard 95%+ range (at least not according
>> to our scoring criteria).
>
>It would be very interesting to see your scoring criteria. Could you please
>send a description or a pointer to a description?

Nothing too complex:

(1) Punctuation tags are not considered in the scoring. It is amazing how
many taggers count punctuation marks as tokens when computing accuracy; since
punctuation is almost always tagged correctly, this inflates the reported
scores.

(2) Ambiguous (or unknown) codings are not permitted: only one tag is allowed
per word. For example, with ENGCG we took the first tag given. Unknown tags
are counted as wrong. (A small sketch of this scoring appears after point (4)
below.)

(3) The test data was not of the same type as the training data. In fact,
the test data was meant to be general text from the Cambridge International
Corpus covering all varieties and types of English. Better results are
inevitably achieved if you test on the same type of text as you trained on
(e.g. the standard 90%-10% divide you mention).

(4) This is where we were perhaps a bit unfair. We had strict criteria for
the results (e.g. verb participles used attributively had to be tagged as
attributive adjectives) but were not able to supply large amounts of test data
tagged in that way. I have found that many taggers seem happy to tag verb
participles as participles regardless of their function in the sentence. A
similar problem arises with tagsets that combine preposition and subordinating
conjunction into a single tag.
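
To make points (1) and (2) concrete, here is a rough Python sketch of the kind
of scoring they imply. The tag names, the "UNK" marker and the data layout are
illustrative assumptions only, not our actual evaluation code or tagset.

PUNCT_TAGS = {".", ",", ":", "(", ")"}   # assumed punctuation tag inventory
UNKNOWN = "UNK"                          # assumed marker for an unknown tag

def score(gold, predicted):
    # gold: list of (word, tag); predicted: list of (word, list of tags),
    # where the list may hold several alternatives (ambiguous output).
    correct = total = 0
    for (word, gold_tag), (_, pred_tags) in zip(gold, predicted):
        if gold_tag in PUNCT_TAGS:       # (1) punctuation is not scored
            continue
        total += 1
        pred = pred_tags[0]              # (2) only the first tag counts
        if pred != UNKNOWN and pred == gold_tag:
            correct += 1                 # unknown tags are counted as wrong
    return correct / total if total else 0.0

# Made-up example: an attributive participle ("running") with gold tag JJ,
# which the tagger labels VBG, is counted as wrong (cf. point (4)).
gold = [("The", "AT"), ("running", "JJ"), ("dog", "NN"), (".", ".")]
pred = [("The", ["AT"]), ("running", ["VBG", "JJ"]), ("dog", ["NN"]), (".", ["."])]
print(score(gold, pred))                 # 2 of 3 scored tokens correct: 0.67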

Andrew Harley
Systems Development Manager - ELT Reference
Cambridge University Press

Direct line: (01223)325880
Fax: (01223)325984

http://www.cup.cam.ac.uk/elt/reference