[Corpora-List] A Christmas Present from Lancaster (Part Two)

From: Mcenery, Tony (eiaamme@exchange.lancs.ac.uk)
Date: Tue Dec 23 2003 - 13:20:45 MET

    Dear All,

    I am delighted to be able to announce the release of the EMILLE/CIIL
    corpus. The corpus contains monolingual written corpus data for 14 South
    Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri,
    Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). It
    also contains orthographically transcribed spoken data and parallel
    corpus data for five South Asian languages (Bengali, Gujarati, Hindi,
    Punjabi and Urdu). In addition, the parallel corpus contains the English
    originals from which the translations stored in the corpus were derived.
    All data in the corpus is compliant with CES (the Corpus Encoding
    Standard) and Unicode. The EMILLE corpus totals some 94 million words.

    The corpora were built as part of a collaboration between Lancaster
    University and the Central Institute of Indian Languages, Mysore.

    As well as the corpora, the following materials are available for
    download from the website:

    i.) documentation relating to the corpus;
    ii.) POS-tagged Urdu corpus data;
    iii.) Hindi corpus data in which demonstrative use has been
    annotated;
    iv.) a prototype POS tagger for Urdu.

    The corpus can be downloaded from:

    http://www.ling.lancs.ac.uk/corplang/emille

    More details of the EMILLE project can be found at:

    http://www.emille.lancs.ac.uk

    The GATE language engineering architecture has also been developed
    further by the University of Sheffield to enable language processing
    tasks using the EMILLE data. For more details on GATE see:

    http://www.gate.ac.uk/

    A new release of the EMILLE corpus, indexed for use with Xara, will be
    made available towards spring 2004.

    Apologies if you receive this message more than once.

    Regards,

    Tony McEnery,
    Professor of English Language and Linguistics,
    Dept. Linguistics and Modern English Language,
    Lancaster University,
    Bailrigg,
    Lancaster,
    LA1 4YT.
