Re: Corpora: in-line PoS tagger

From: Satoshi Sekine (sekine@cs.nyu.edu)
Date: Thu Sep 06 2001 - 16:05:42 MET DST

    Re: POS tagger

    Thank you very much for introducing my parser (Apple Pie Parser).
    However, I think the accuracy of its POS output is not as good as that
    of Brill's tagger.

    By the way, I have improved Brill's tagger by hand: I read through his
    rules myself and modified them. For example, I introduced word classes
    for be-verbs, have-verbs and numbers (among others), and cleaned up the
    rules for "there", auxiliary verbs and other verbs (and more besides).
    The accuracy improved from 96.5% to 97% on test data (and we have
    evidence that this is almost the upper limit because of errors in the
    Penn Treebank, i.e. if you find a better accuracy, the tagger may be
    overtrained). It works with stdin/stdout and with files.

    I have not published the paper (unfortunately, it was rejected by a
    conference), but I can provide it to people who really want it. (It is
    still at a development stage; I was thinking of making it public
    sometime next year.)

    The tagger comes with some other functions, such as a sentence splitter,
    tokenizer, stemmer, chunker and NE tagger. (Some of them are not
    completed yet. I am also working on implementing a dependency analyzer,
    parser, function tagger and regularizer.) It supports (or will support)
    several formats, including PTB-tree, PTB-bracket, COLLINS parser input
    format, MUC format, CoNLL format, (TIPSTER architecture) and (SGML).
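
    To give an idea of what one of these formats looks like, here is a
    minimal sketch that writes tagged and chunked tokens in the column
    layout used for the CoNLL-2000 chunking shared task: one token per
    line as "word POS chunk", with a blank line between sentences. The
    function name and the sample sentences are invented for this example,
    and whether the system emits exactly this layout is an assumption.

```python
def to_conll(sentences):
    """Render sentences as one token per line, with a blank line between sentences.

    Each sentence is a list of (word, pos_tag, chunk_tag) triples.
    """
    blocks = []
    for sentence in sentences:
        # Join the three columns of each token with a single space.
        lines = [" ".join(token) for token in sentence]
        blocks.append("\n".join(lines))
    # Separate sentences with a blank line and end with a newline.
    return "\n\n".join(blocks) + "\n"


# Made-up sample data, not output from any real tagger.
example = [
    [("He", "PRP", "B-NP"), ("sleeps", "VBZ", "B-VP"), (".", ".", "O")],
    [("Dogs", "NNS", "B-NP"), ("bark", "VBP", "B-VP"), (".", ".", "O")],
]
print(to_conll(example), end="")
```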

    The system is called the "OAK system".
    You can find an introduction page at
    http://cs.nyu.edu/cs/projects/proteus/oak

    Satoshi Sekine
    sekine@cs.nyu.edu


