Evaluating taggers

Jem Clear (jem@cobuild.collins.co.uk)
Thu, 13 Jul 1995 14:55:36 +0100

David Graddol writes:

> I rather hope that POS tagging software has now reached a stage of
> sophistication which will allow us to ask interesting questions about
> the status of 'part of speech' categories in English (or indeed, other
> languages).

Hurrah, hurrah! Well said! As a corpus-builder (and -tagger) of a few
years' experience I can't resist adding my wholehearted support to the
view expressed by David Graddol. I have found over the years that the
more one engages with the real-life text that is available through
corpus collections the more the received categories, methodologies,
analytical procedures, etc. of traditional linguistics are called into
question.

It seems to me that corpus linguists have tended in the past to
conform to the expectations of the established majority of
(non-corpus) linguists and have tagged and parsed their corpora
because it was assumed that only when such tagging and parsing were
complete would the corpus be useful. (This is certainly part of the
reason why the British National Corpus has been tagged before release
-- because 99.5% of the anticipated "consumers" of the corpus would
expect/demand it.) The Cobuild "Bank of English" corpus has been
tagged in several different ways over the years and most recently
parsed with the Univ of Helsinki ENGCG parser, and after working with
the data for some years I am confirmed in my hunch that traditional
syntactic analyses seem to be clumsy and inappropriate for large
amounts of real-life text (written and spoken) and surprisingly
unenlightening. Of course, I am partisan: the reason I do corpus
linguistics and computational lexicography is that I have an
entrenched belief that gathering and analysing large amounts of real
data will bring a whole new "linguistics" into play.

In defence of the corpus-builders, I must say that we are *forced* to
do these things. Linguists who come to Birmingham to use the Bank of
English or who in some way or other make use of it assume

a) that it is desirable to tag and parse the corpus
b) because corpora are known to have been tagged before and taggers are
known to exist and be widely used, that we will have tagged the corpus
c) that the accuracy rate ("Oh! only 92% -- that's not very good,
Blenkinsop's GIZMO tagger achieves 97.87% accuracy.") is an important
feature that needs to be quoted and compared.

The accuracy of tagging is a wide open issue. Who says what's correct?
(For an absolutely classic example of the problem, look at the
Wordwatch feature on the Web at
And what categories (the tag set and the meaning of the tags) are
applicable? Is there any consensus? Only a set of historical
precedents. The historical precedents (the Brown tagset, LOB tagset,
CLAWS tagset) are inherently slow to evolve because the statistical
basis of many tagging programs relies on the availability of a
"training set" of data. So there is a built-in constraint on the
development of new models of corpus analysis.
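
To make the "who says what's correct?" point concrete, here is a minimal
sketch in Python. Everything in it is hypothetical: the five-token
sentence, the tag labels, and the two "annotators" are invented for
illustration and are not taken from any real tagset or corpus. It only
shows that a quoted accuracy figure is a measure of agreement with one
particular reference annotation, so the same tagger output can score very
differently against two defensible analyses.

```python
def tagging_accuracy(predicted, reference):
    """Fraction of tokens whose predicted tag equals the reference tag."""
    if len(predicted) != len(reference):
        raise ValueError("tag sequences must have the same length")
    if not reference:
        raise ValueError("tag sequences must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical sentence and tags, invented for illustration only.
tokens = ["the", "running", "water", "is", "cold"]
tagger_output = ["DET", "ADJ", "NOUN", "VERB", "ADJ"]

# Two hypothetical human annotators who disagree about "running"
# (adjective or verb form?) and "is" (verb or auxiliary?).
annotator_a = ["DET", "ADJ", "NOUN", "VERB", "ADJ"]
annotator_b = ["DET", "VERB", "NOUN", "AUX", "ADJ"]

score_a = tagging_accuracy(tagger_output, annotator_a)
score_b = tagging_accuracy(tagger_output, annotator_b)

print(f"accuracy against annotator A: {score_a:.2f}")
print(f"accuracy against annotator B: {score_b:.2f}")
```

Neither score is "the" accuracy of the tagger; each one only says how far
the output agrees with one annotator's choice of categories.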

Well, that's my tuppennyworth!

Jem Clear