Re: [Corpora-List] English POS tagged corpus

From: Eric Atwell (eric@comp.leeds.ac.uk)
Date: Fri Nov 19 2004 - 16:49:06 MET

    Gaurav,

    The SourceForge open-source Python Natural Language Toolkit (NLTK)
    http://nltk.sourceforge.net/
    is a student-oriented teaching resource with a bundle of corpus and
    lexical resources, including the PoS-tagged Brown corpus of US English:

    20_newsgroups genesis lexicon roget treebank
    brown gutenberg names semcor1.7 treebank_swb
    chunking ieer nltk-data-0.3 senseval wordnet
    cmp-lg levin ppattach stopwords words
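
    To give a feel for what a PoS-tagged corpus looks like: the Brown
    corpus is commonly distributed as plain text with each token written
    as word/tag. The following is only a minimal sketch using the Python
    standard library, not NLTK's own reader; the function name
    parse_tagged and the sample line are illustrative assumptions.

```python
def parse_tagged(line):
    """Split a Brown-style 'word/tag' line into (word, tag) pairs.

    Hypothetical helper for illustration; not part of NLTK.
    """
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that words which themselves
        # contain a slash (e.g. "1/2") keep their full form.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: keep the token and mark it untagged.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


# Sample line written in the Brown word/tag style (assumed for the demo).
sample = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd"
tagged = parse_tagged(sample)
print(tagged[:2])  # [('The', 'at'), ('Fulton', 'np-tl')]
```

    Splitting on the last slash rather than the first is a design choice
    that keeps tokens such as fractions intact.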

    It also comes with demo software and easy-to-follow tutorials and
    API documentation for tokenization, tagging, parsing, and probabilistic
    modelling. As it's open-source, new contributions keep on coming;
    e.g. the latest News item says "Christopher Maloof's implementation of the Brill
    tagger has been added to the development version of NLTK".
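
    As a rough illustration of the simplest kind of probabilistic
    tagging, here is a most-frequent-tag (unigram) baseline. It is
    neither the Brill tagger nor NLTK's API, just a sketch in plain
    Python; the toy training sentences, the function names and the
    default tag "nn" are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each lower-cased word to its most frequent tag in the data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(words, model, default="nn"):
    """Tag words with the learned tag; unseen words get `default`."""
    return [(w, model.get(w.lower(), default)) for w in words]


# Toy training data in Brown-style tags (invented for this sketch).
train = [
    [("the", "at"), ("dog", "nn"), ("runs", "vbz")],
    [("the", "at"), ("run", "nn"), ("ended", "vbd")],
    [("dogs", "nns"), ("run", "vb")],
    [("they", "ppss"), ("run", "vb")],
]
model = train_unigram(train)
result = tag(["The", "dogs", "run", "home"], model)
print(result)
# [('The', 'at'), ('dogs', 'nns'), ('run', 'vb'), ('home', 'nn')]
```

    Trained on a real tagged corpus such as Brown, a baseline like this
    is the usual starting point that rule-based or statistical taggers
    then try to improve on.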

    Of course, other tagged corpora are available from ICAME, LDC, ELRA etc.,
    but you may have to pay, and they don't come with demo software/tutorials
    (admittedly you didn't say you wanted any associated software/tutorials
    :-)

    hope this helps

    Eric
    -
    Eric Atwell, Senior Lecturer, Computer Vision and Language research group,
    School of Computing, University of Leeds, LEEDS LS2 9JT, England
    TEL: +44-113-2335430 FAX: +44-113-2335468 http://www.comp.leeds.ac.uk/eric
    On Fri, 19 Nov 2004, Gaurav Malhotra wrote:

    > Hi,
    > Is there an English Parts-of-Speech corpus available for download on the internet? I would be very grateful.
    > Gaurav Malhotra


