Re: Corpora: in-line PoS tagger

From: Satoshi Sekine (sekine@cs.nyu.edu)
Date: Thu Sep 06 2001 - 16:05:42 MET DST

Next message: Susan Pintzuk: "Corpora: annotated OE corpus available"

Previous message: LAWSON, Ann: "Corpora: OUP Text Development Editor"
Maybe in reply to: Matthew Purver: "Corpora: in-line PoS tagger"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Re: POS tagger

Thank you very much for introducing my parser (Apple Pie Parser).
However, I think the accuracy of its tagging output is not as good as
that of Brill's tagger.

By the way, I have improved Brill's tagger by hand. I read through his
rules myself and modified them. For example, I introduced classes for
be-verbs, have-verbs, numbers (and more), and cleaned up the rules for
"there", auxiliary verbs, other verbs (and more and more). The accuracy
improved from 96.5% to 97% on test data (and we have evidence that this
is almost the upper limit because of errors in the Penn Treebank; i.e.,
if you find a better accuracy, the tagger may be overtrained). It works
with stdin/stdout and with files.
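
As a rough illustration of what a word class in a Brill-style
transformation rule might look like, here is a minimal sketch. The
class contents, the rule itself, and the names BE_VERBS, apply_rule and
accuracy are assumptions made for this example; they are not taken from
the modified rule set described above.

```python
# Hypothetical word class. The real classes in the modified tagger are
# not listed in this message, so this set is only an assumption.
BE_VERBS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def apply_rule(tokens, tags, from_tag, to_tag, prev_word_class):
    """Retag from_tag as to_tag when the previous word is in prev_word_class."""
    new_tags = list(tags)
    for i in range(1, len(tokens)):
        if tags[i] == from_tag and tokens[i - 1].lower() in prev_word_class:
            new_tags[i] = to_tag
    return new_tags


def accuracy(predicted, gold):
    """Token-level accuracy: fraction of positions where the tags agree."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Toy example: an initial tagging that mislabels a past participle.
tokens = ["The", "door", "was", "closed"]
initial = ["DT", "NN", "VBD", "VBD"]
gold = ["DT", "NN", "VBD", "VBN"]

# One class-based rule: VBD becomes VBN after any be-verb.
retagged = apply_rule(tokens, initial, "VBD", "VBN", BE_VERBS)
print(retagged)
print(accuracy(initial, gold), accuracy(retagged, gold))
```

A single class-based rule of this kind stands in for one rule per
individual word form (was, were, is, ...), which is one plausible reason
such classes make a rule set easier to clean up by hand.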

I have not published the paper (unfortunately it was rejected by a
conference), but I can provide it to people who really want it. (It is
still at a development stage; I was thinking of making it public
sometime next year.)

The tagger comes with some other functions, such as a sentence splitter,
tokenizer, stemmer, chunker and NE tagger. (Some of them are not
completed yet. I am also working on implementing a dependency analyzer,
parser, function tagger and regularizer.) It supports (or will support)
several formats, including PTB-tree, PTB-bracket, COLLINS parser input
format, MUC format, CoNLL format, (TIPSTER architecture) and (SGML).

The system is called the "OAK system".
You can find an introduction page at
http://cs.nyu.edu/cs/projects/proteus/oak

Satoshi Sekine
sekine@cs.nyu.edu

This archive was generated by hypermail 2b29 : Thu Sep 06 2001 - 23:06:04 MET DST