Re: Corpora: POS disambiguation

Adwait Ratnaparkhi (adwait@unagi.cis.upenn.edu)
Wed, 22 Oct 1997 13:50:02 -0400

D.H. Van Uytsel wrote:

> Dear corpora subscriber,
>
> I would like to tag a running text containing a few M words. It is not the
> focus of my research, so I can't spend too much time on this. As a poor
> researcher, I have looked around for some good freeware. For my purposes, it
> should be
>
> - compilable for Solaris, or easy to port to it
> - allow user-defined tag sets and lexica
> - allow use with any western language
> - allow unsupervised training
> - fast in tagging
> - fast in training
> - accurate (95% correct tagging)
> - robust against overtraining
>

I have written a statistical tagger based on a maximum entropy model,
which I refer to as MXPOST (for lack of a better name).

It is written in Java, and the executable (i.e., "bytecode") is free for
research purposes. It should, in theory, run on any platform with a Java
interpreter.

I distribute a pre-trained model for U.S. financial text (trained from the
Penn Treebank Wall St. Journal). Performance on the WSJ is state-of-the-art
(96.5% - 97.0% word accuracy).
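
For reference, "word accuracy" here is taken to mean the fraction of tokens
that receive the correct tag. The following is only a minimal sketch of that
calculation; the class name WordAccuracy, the method name, and the sample tag
sequences are made up for illustration and are not part of MXPOST.

```java
import java.util.Arrays;
import java.util.List;

public class WordAccuracy {
    // Fraction of tokens whose predicted tag equals the reference (gold) tag.
    // Hypothetical helper for illustration; not part of the tagger itself.
    static double wordAccuracy(List<String> gold, List<String> predicted) {
        if (gold.size() != predicted.size()) {
            throw new IllegalArgumentException("tag sequences differ in length");
        }
        if (gold.isEmpty()) {
            return 0.0; // no tokens to score
        }
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (gold.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / gold.size();
    }

    public static void main(String[] args) {
        // Invented example: one of six tags is wrong.
        List<String> gold = Arrays.asList("DT", "NN", "VBD", "IN", "DT", "NN");
        List<String> pred = Arrays.asList("DT", "NN", "VBN", "IN", "DT", "NN");
        System.out.println(wordAccuracy(gold, pred));
    }
}
```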

The tagger requires text annotated with PoS tags as training material.
(I'm not sure what you mean by "unsupervised" training; most of the
high-performing taggers, e.g. Brill's, require either a tagged corpus to
train from or a large lexicon/dictionary which states the allowable tags
for a given word.)
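
To illustrate what a lexicon of allowable tags looks like, here is a rough
sketch that collects one from a tagged corpus. It assumes, purely for the
example, that each token is written as word_TAG and that tokens are separated
by whitespace; that format, the class name TagLexicon, and the sample
sentences are assumptions and may not match what any particular tagger
expects.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TagLexicon {
    // Maps each word to the set of tags it was seen with in the tagged text.
    // Assumption: tokens look like "word_TAG"; the split is on the LAST
    // underscore so that words containing underscores are kept whole.
    static Map<String, Set<String>> build(List<String> taggedLines) {
        Map<String, Set<String>> lexicon = new TreeMap<>();
        for (String line : taggedLines) {
            for (String token : line.trim().split("\\s+")) {
                int cut = token.lastIndexOf('_');
                if (cut <= 0 || cut == token.length() - 1) {
                    continue; // skip empty or malformed tokens
                }
                String word = token.substring(0, cut);
                String tag = token.substring(cut + 1);
                lexicon.computeIfAbsent(word, k -> new TreeSet<>()).add(tag);
            }
        }
        return lexicon;
    }

    public static void main(String[] args) {
        // Invented sentences; "can" appears once as a modal, once as a noun.
        List<String> corpus = Arrays.asList(
            "They_PRP can_MD fish_VB",
            "The_DT can_NN is_VBZ empty_JJ");
        for (Map.Entry<String, Set<String>> e : build(corpus).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

A tagger that uses such a lexicon would only consider the listed tags for a
known word, and would need some other strategy for words not in it.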

You can use the tagger on any language for which you have a tagged corpus.
Training time is under a day if you have a native Java compiler (i.e., a
"JIT" compiler), and around a week if you are using the usual Java
interpreter. (This is on a 167 MHz UltraSparc.)

You can look at the paper describing the tagger and download the tagger
itself at
http://www.cis.upenn.edu/~adwait/statnlp.html

Hope this helps,

Adwait Ratnaparkhi