[Corpora-List] ELRA News

From: Magali Jeanmaire (duclaux@elda.fr)
Date: Wed Sep 22 2004 - 17:17:19 MET DST

    *********************************************************
    ELRA - Language Resources Catalogue - Update
    *********************************************************

    We are happy to announce that new Written Language
    Resources are available in our catalogue.

    Short descriptions follow. For more detailed information,
    please visit our on-line catalogue: www.elda.fr and
    www.elra.info.

    *********************************************************
    *** ELRA-W0037 The EMILLE/CIIL Corpus ***

    The EMILLE/CIIL Corpus consists of monolingual corpora
    containing approximately 92,799,000 words for 14 South Asian
    languages (Assamese, Bengali, Gujarati, Hindi, Kannada,
    Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil,
    Telugu and Urdu), including 2,627,000 words of transcribed spoken
    data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel
    corpus of 200,000 words in English with translations into Hindi,
    Bengali, Punjabi, Gujarati and Urdu. The Urdu monolingual and parallel
    corpora are annotated for part of speech, and 20 written Hindi corpus
    files are annotated to show the nature of demonstrative use. All other
    components are annotated at the sentence level. The corpus is marked up
    in CES-compliant SGML and encoded in Unicode.

    *** ELRA-W0038 The EMILLE Lancaster Corpus ***

    The EMILLE Lancaster Corpus consists of monolingual corpora
    containing approximately 58,880,000 words for seven South Asian
    languages (Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil and Urdu),
    including 2,627,000 words of transcribed spoken data for Bengali, Gujarati,
    Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English
    with translations into Hindi, Bengali, Punjabi, Gujarati and Urdu. The Urdu
    monolingual and parallel corpora are annotated for part of speech, and 20
    written Hindi corpus files are annotated to show the nature of demonstrative
    use. All other components are annotated at the sentence level. The corpus is
    marked up in CES-compliant SGML and encoded in Unicode.

    *** ELRA-W0039 The Lancaster Corpus of Mandarin Chinese (LCMC) ***

    The Lancaster Corpus of Mandarin Chinese (LCMC) samples 15 written
    text categories, including news, literary texts, academic prose and official
    documents, published in the People's Republic of China in the early 1990s,
    for a total of approximately 1 million words. LCMC uses the same sampling
    frame and period as FLOB/FROWN. The corpus is encoded in Unicode (UTF-8)
    and marked up in XML.

    *********************************************************

    ---------------------------------------------------------------------------
    ELRA / ELDA

    55-57, rue Brillat-Savarin
    75013 Paris FRANCE
    Tel: (+33) 1 43 13 33 33 / Fax: (+33) 1 43 13 33 30
    URL: http://www.elra.info or http://www.elda.fr

    LREC 2004 conference: www.lrec-conf.org/lrec2004/
    LangTech forum: http://www.lang-tech.org
    ---------------------------------------------------------------------------


