Evaluating taggers

David Graddol (D.J.Graddol@open.ac.uk)
Thu, 13 Jul 1995 11:58:10 +0200

Bill Teahan asks how he can evaluate how good his tagger is. I'm
sure people who have experience in building, training and testing
taggers will be able to help him. I'd just like to make a couple
of observations as a tagged-corpus _user_.

The first (prosaic) point is to mention the British National
Corpus which provides 100 million words of material tagged for
POS. But of course it is not yet licensed for use outside Europe,
which will be frustrating for Bill.

The second point is that the use of other tagged corpora merely
means that you're testing one tagger against another. There is no
way of avoiding, at some point, manual checking, and correction
of at least a subset of a corpus. That is, I take it, how
existing taggers are 'trained' - by resubmitting a corrected
version of their output.
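Once you do have a hand-corrected subset, the evaluation itself is straightforward: compare the tagger's output token by token against the corrected version. A minimal sketch in Python (the function name, tag labels, and data here are invented for illustration, not taken from any particular tagger):

```python
# Score a tagger's output against a manually corrected "gold" version
# of the same subset. Both inputs are aligned lists of (word, tag)
# pairs; real evaluations would also have to reconcile tokenisation.

def tagging_accuracy(tagger_output, gold):
    """Return per-token accuracy and the most frequent confusions."""
    assert len(tagger_output) == len(gold), "token streams must align"
    correct = 0
    confusions = {}  # (tagger_tag, gold_tag) -> count
    for (_, guess), (_, truth) in zip(tagger_output, gold):
        if guess == truth:
            correct += 1
        else:
            pair = (guess, truth)
            confusions[pair] = confusions.get(pair, 0) + 1
    accuracy = correct / len(gold) if gold else 0.0
    # Most frequent confusions first
    return accuracy, sorted(confusions.items(), key=lambda kv: -kv[1])

# Toy example: the tagger mis-assigns one tag out of three.
gold = [("they're", "VBP"), ("going", "VBG"), ("home", "NN")]
output = [("they're", "PP$"), ("going", "VBG"), ("home", "NN")]
acc, conf = tagging_accuracy(output, gold)
print(f"{acc:.2f}")   # prints 0.67
```

The confusion counts are arguably the more useful output: they show *which* categories the tagger muddles, which matters when deciding whether errors are random noise or systematic (as with the spoken-language problems below).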

But what interests me more is the _status_ of POS tagging. As
taggers become more freely available, and POS tagging more robust,
so the kind of data getting submitted to them becomes more
heterogeneous. For example, tagging works tolerably well now for
printed text (which has been designed to conform to limited
'standards' of construction). But most taggers fall over pretty
dreadfully when they're asked to deal with transcripts of spoken
language. There are four problems, as far as I can see.

(1) Spoken transcripts may contain orthographically 'distressed'
material which the tagger will not recognise in its look-up
tables. The same element may appear in different parts of the
transcript in different surface form. Transcripts also typically
contain interpolated material (such as that describing context or
the speech event).

(2) Spoken transcripts are often inaccurate, which leads to
incorrect assignment of POS. Sometimes this is as simple as a
transcriber writing 'their' instead of 'they're'. This problem
applies to the spoken material in the first release of the BNC.

(3) Spoken language often contains fragmentary utterances,
restarts, and so on, leading to sequences which the tagger may
regard as anomalous.

(4) Spoken language does interesting things, not comprehended by
traditional linguistic analysis. The traditional concept of POS
can itself look threatened.

Spoken language is only one kind of orthographically non-standard
material which linguists are currently interested in. International
corpora of varieties of English must be testing POS tagging to the
limits. And I have had problems with historical material (Early Modern
English) and texts generated on email & computer conferencing. I rather
hope that POS tagging software has now reached a stage of sophistication
which will allow us to ask interesting questions about the status of
'part of speech' categories in English (or indeed, other languages).

David Graddol