Re: [Corpora-List] POS tagging without training data?

From: Chris Brew (cbrew@ling.ohio-state.edu)
Date: Wed May 21 2003 - 19:48:45 MET DST


    If you have the patience, and are willing to acquire (or hire)
    the necessary expertise, the tagger described in

    http://citeseer.nj.nec.com/cutting92practical.html

    is likely to do the job for you, without the need for
    significant amounts of tagged training data. This
    tagger needs a large amount of unlabelled text,
    a lexicon, and a little information about morphology
    and possible tag sequences. This has already been done for
    Spanish:

    http://xxx.arxiv.cornell.edu/abs/cmp-lg/9505035

    The original Xerox code is available at:

    ftp://ftp.parc.xerox.com/pub/tagger/

    If you read about how the tagger was built for Spanish, and then do
    likewise for Afrikaans, you'll have a decent POS tagger, which
    may already meet your needs.
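
    To make the ingredients concrete, here is a minimal sketch of the
    general idea: a bigram HMM trained with Baum-Welch on untagged
    text, where a lexicon restricts which tags each word may take.
    It is not the Xerox code and makes no claim to match it in detail.
    The tag set, lexicon and four-sentence corpus are invented for
    illustration, the function names are hypothetical, decoding uses
    per-word posteriors rather than Viterbi to keep it short, and it
    only handles words that occurred in the training text.

```python
from collections import defaultdict

START = "<s>"  # pseudo-state that precedes the first word of a sentence


def normalize(row):
    """Scale a dict of non-negative weights so that they sum to 1."""
    z = sum(row.values())
    return {k: v / z for k, v in row.items()} if z > 0 else dict(row)


def forward_backward(sent, trans, emit, tags):
    """Return per-position tag posteriors and expected transition counts."""
    n = len(sent)
    alpha, scale = [], []
    for i, w in enumerate(sent):
        col = {}
        for t in tags:
            e = emit[t].get(w, 0.0)
            if e == 0.0:
                continue  # this tag is ruled out for this word
            if i == 0:
                prev = trans[START][t]
            else:
                prev = sum(p * trans[s][t] for s, p in alpha[-1].items())
            col[t] = prev * e
        z = sum(col.values())
        alpha.append({t: p / z for t, p in col.items()})  # scaled forward
        scale.append(z)
    beta = [None] * n
    beta[-1] = {t: 1.0 for t in alpha[-1]}
    xi = defaultdict(float)
    for i in range(n - 2, -1, -1):
        nxt = sent[i + 1]
        beta[i] = {}
        for s in alpha[i]:
            total = 0.0
            for t in alpha[i + 1]:
                step = trans[s][t] * emit[t][nxt] * beta[i + 1][t] / scale[i + 1]
                xi[s, t] += alpha[i][s] * step  # expected count of s -> t
                total += step
            beta[i][s] = total
    gamma = [{t: alpha[i][t] * beta[i][t] for t in alpha[i]} for i in range(n)]
    return gamma, xi


def train(sentences, lexicon, tags, iterations=10, smooth=1e-6):
    """Baum-Welch re-estimation starting from a lexicon-constrained model."""
    states = [START] + tags
    trans = {s: {t: 1.0 / len(tags) for t in tags} for s in states}
    emit = {t: {} for t in tags}
    for w in {w for sent in sentences for w in sent}:
        for t in lexicon.get(w, tags):  # word not in lexicon: allow any tag
            emit[t][w] = 1.0
    emit = {t: normalize(row) for t, row in emit.items()}
    for _ in range(iterations):
        tcount = {s: {t: smooth for t in tags} for s in states}
        ecount = {t: defaultdict(float) for t in tags}
        for sent in sentences:
            if not sent:
                continue
            gamma, xi = forward_backward(sent, trans, emit, tags)
            for t, p in gamma[0].items():
                tcount[START][t] += p
            for (s, t), p in xi.items():
                tcount[s][t] += p
            for w, col in zip(sent, gamma):
                for t, p in col.items():
                    ecount[t][w] += p
        trans = {s: normalize(row) for s, row in tcount.items()}
        emit = {t: normalize(row) for t, row in ecount.items()}
    return trans, emit


def tag(sent, trans, emit, tags):
    """Pick the most probable tag for each word (posterior decoding)."""
    gamma, _ = forward_backward(sent, trans, emit, tags)
    return [max(col, key=col.get) for col in gamma]


# Invented toy data: only "walks" is ambiguous in the lexicon.
TAGS = ["DET", "NOUN", "VERB"]
LEXICON = {"the": {"DET"}, "a": {"DET"}, "dog": {"NOUN"}, "cat": {"NOUN"},
           "sleeps": {"VERB"}, "walks": {"NOUN", "VERB"}}
CORPUS = [s.split() for s in ["the dog sleeps", "the cat sleeps",
                              "the dog walks", "a cat walks"]]

trans, emit = train(CORPUS, LEXICON, TAGS)
print(tag("the dog walks".split(), trans, emit, TAGS))
```

    The unambiguous words ("sleeps" after a noun) pull the transition
    probabilities in the right direction, and those in turn decide the
    ambiguous word. A real lexicon and a real corpus are of course
    much messier than this.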

    In fairness, I should also note

    Bernard Merialdo, "Tagging English Text with a Probabilistic Model",
    Computational Linguistics 20(2), 1994

    which points out that if you _do_ have large amounts of reliably tagged
    training data, you may be able to improve your results. But largely
    unsupervised tagging is certainly an option worth exploring.

    Once you have run your text through your version of the Xerox
    tagger, the text will be tagged, probably with decent accuracy.
    If, for some reason, this isn't good enough, you could indeed
    treat the output (possibly after hand editing)
    as training material for some other tagger. It's hard to predict
    how much this would help (but Thorsten Brants did some experiments
    suggesting that really large amounts of imperfect training data
    can be helpful).
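
    As a rough sketch of that last step, assuming the first tagger's
    output is available as lists of (word, tag) pairs: the code below
    trains a trivial most-frequent-tag tagger on automatically tagged
    text. It stands in for whatever second tagger you would really
    use and is not Brill, TnT or any other named system. The sample
    data, the function name and the default tag for unseen words are
    all invented for illustration.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences, default_tag="NOUN"):
    """Learn each word's most frequent tag from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    best = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tagger(words):
        # Words never seen in the training material get the default tag.
        return [(w, best.get(w.lower(), default_tag)) for w in words]

    return tagger


# Invented output of a first-pass tagger; "walks" is mis-tagged once.
auto_tagged = [
    [("the", "DET"), ("dog", "NOUN"), ("walks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("walks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("walks", "NOUN")],
]

second_tagger = train_unigram_tagger(auto_tagged)
print(second_tagger(["The", "dog", "walks", "quickly"]))
```

    With enough material, an occasional error in the first tagger's
    output is outvoted by the correct instances, which is one reason
    imperfect training data can still be useful.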

    Good luck with your enterprise

    Chris

    >
    > We want to develop a POS tagger for Afrikaans. We only have very small
    > corpora (roughly 1.5 million words in total), none of which is
    > annotated (with the exception of a tagged lexicon, without any context).
    > We're considering adapting an existing tagger for, say, English or
    > Dutch, in order to create training data. We want to know:
    >
    > (1) What "shell" (e.g. Brill, TnT, TiMBL, TOSCA, etc.) would be the
    > most effective/efficient to use to create training data? And how much
    > initial training data (i.e. manually tagged data) is needed to do this?
    > (2) How much training data is needed to develop a reasonably accurate
    > (let's say 95%) version of, for example, a Brill tagger for Afrikaans?
    >
    > Thanks in advance for your help. We'll post a summary.
    >
    > Yours,
    > Gerhard van Huyssteen & Sulene Pilon
    >
    >
    >
    > __________________________________________________________
    > __________________________***_____________________________
    > Dr Gerhard B van Huyssteen
    > School for Languages || Potchefstroom University for CHE ||
    > POTCHEFSTROOM || 2531 || South Africa
    > Skool vir Tale || Potchefstroomse Universiteit vir CHO || POTCHEFSTROOM
    > || 2531 || Suid-Afrika
    >
    > Tel: +27 18 299 1488
    > Fax: +27 18 299 1562
    > afngbvh@puknet.puk.ac.za
    > __________________________________________________________
    > __________________________***_____________________________

    -- 
    ==================================================================
    Dr. Chris Brew,  Assistant Professor of Computational Linguistics
    Department of Linguistics, 1712 Neil Avenue, Columbus OH 43210
    Tel:  +614 292 5420 Fax: +614 292 8833
    Web:http://www.ling.ohio-state.edu/~cbrew Email:cbrew@ling.osu.edu
    ==================================================================
    


