Corpora: POS disambiguation

D.H. Van Uytsel (Donghoon.VanUytsel@esat.kuleuven.ac.be)
Wed, 22 Oct 1997 15:42:58 +0200 (MET DST)

Dear corpora subscriber,

I would like to tag a running text of a few million words. Tagging is not the
focus of my research, so I can't spend too much time on it. As a poor
researcher, I have looked around for some good freeware. For my purposes, the
tagger should

- be compilable for Solaris, or easy to port to it
- allow user-defined tag sets and lexica
- work with any Western language
- allow unsupervised training
- be fast in tagging
- be fast in training
- be accurate (95% correct tagging)
- be robust against overtraining

I am aware of the MULTEXT tagger, Eric Brill's tagger and the Xerox tagger
(though I didn't find the source of the latter). I am in favour of Brill's
rule-based tagger because it is much more robust against overtraining than
HMM taggers, and because it is small and elegant. Brill's freeware training
program, however, takes too much time to be practically useful for me. I think
I could speed it up dramatically, but then I would have to rewrite the program.
Before I start doing this, does anybody have a better idea?

Dong Hoon.

____________________________________________________________________________
D.H. Van Uytsel (016)32.1859 http://www.esat.kuleuven.ac.be/~donghoon