Re: Corpora: Brill's POS tags application

eric@scs.leeds.ac.uk
Thu, 25 Jun 1998 10:20:06 +0100

>I'd like to know if Brill's POS tags have been applied to any languages
>other than English. Any references are welcome.

Ketty,
The European-Union-funded EAGLES initiative
(see http://www.ilc.pi.cnr.it/EAGLES96/home.html )
has drawn up recommended linguistic annotations at
a number of linguistic levels, to be applicable to
all EU languages (and, in principle, others). The
recommended POS-tagset is detailed in
"Recommendations for the morphosyntactic
annotation of corpora", by Geoff Leech and Andrew Wilson,
available in HTML or PostScript formats from
http://www.ilc.pi.cnr.it/EAGLES96/browse.html (Text Corpora
section).

If you're looking for a tagset to apply across languages,
I'd recommend this one, as it has been carefully thought
out (by an expert committee) to be `language-neutral'.
As far as I know, "Brill's POS tags" are a (slightly
modified) version of the Brown corpus tagset, the first
attempt to define a tagset for English corpus annotation
(see Greene, B.B. & G.M. Rubin. 1981. "Automatic grammatical
tagging of English" Providence, R.I.: Department of Linguistics,
Brown University) - NOT designed with other languages in
mind, but rather for the pragmatic task of POS-tagging an English
corpus with the computing resources available at the time
(e.g. corpus linguists hadn't stumbled across HMMs etc. yet...)
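
To make the contrast concrete, here is a rough sketch (not taken from
EAGLES or from Brill's distribution) of how a few English-specific
Brown-style tags might be collapsed onto coarse, more language-neutral
categories. The category names, the BROWN_TO_COARSE table and the
to_coarse helper are all my own illustrative assumptions; the real
EAGLES recommendations define their own encoding and many more
attributes, so treat this only as a picture of the idea.

```python
# Illustrative only: a handful of Brown-style tags mapped to coarse
# categories. The category names are assumptions for this sketch and are
# not the official EAGLES encoding.
BROWN_TO_COARSE = {
    "NN": "noun", "NNS": "noun", "NP": "noun",
    "VB": "verb", "VBD": "verb", "VBZ": "verb",
    "JJ": "adjective",
    "RB": "adverb",
    "IN": "adposition",
    "AT": "article",
    "CC": "conjunction",
    "CD": "numeral",
    "UH": "interjection",
}


def to_coarse(tagged, mapping=BROWN_TO_COARSE, unknown="residual"):
    """Replace each fine-grained tag with its coarse category.

    Tags missing from the mapping fall back to the `unknown` label
    instead of raising an error.
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged]


sample = [("the", "AT"), ("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
print(to_coarse(sample))
```

Going in this direction (fine to coarse) is easy; the hard part for
other languages is the reverse, since distinctions such as case or
gender have no slot in an English-only tagset.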

In fact, a range of other tagsets have also been designed
for English, which you *might* consider trying with other
languages. For descriptions of 8 of these (Brown, International
Corpus of English (ICE), London-Lund Corpus (LLC), Lancaster-
Oslo/Bergen (LOB), Unix-Parts, Polytechnic of Wales (PoW),
Spoken English Corpus (SEC), Univ of Pennsylvania (UPenn)), see
the documentation we collected for our AMALGAM project at Leeds,
http://www.scs.leeds.ac.uk/amalgam/tagsets/tagmenu.html
- and you can get a feel for how they apply to ENGLISH
text by mailing your sample to amalgam-tagger@scs.leeds.ac.uk
(see http://www.scs.leeds.ac.uk/amalgam/amalgam/amalghome.htm )

One other tagset you should consider is the ENGCG tagset, see
http://www.lingsoft.fi/doc/engcg/intro/mtags.html - versions
of the English Constraint Grammar tagger/parser have been (or will be)
built for other languages, and I *think* basically the same
tagsets are used, but you'll have to check this yourself.
The Lingsoft homepage is http://www.lingsoft.fi/
(in particular, check out the answer to the last FAQ - cool!)
A possible drawback for University researchers is that
"Lingsoft, Inc. is a linguistic software company", i.e. they
are not funded by Higher Education Funding Councils or equivalent,
so they have to charge for their services to make a living -
but I note you have a commercial email address so you should have
plenty of funds to pay for corpus linguistics research resources?!?

I hope this helps - I'd be interested to hear more of the background
to your question: why do you want a "language-universal" tagset?

Eric

Eric Atwell, Senior Lecturer in Artificial Intelligence,
Director, Centre for Computer Analysis of Language And Speech (CCALAS)
School of Computer Studies, University of Leeds, LEEDS LS2 9JT, England
EMAIL: eric@scs.leeds.ac.uk TEL: (44)113-2335761 FAX: (44)113-2335468
WWW: http://www.scs.leeds.ac.uk/scs/public/staff/eric.html