Corpora: ELRA News

From: Magali Duclaux (
Date: Fri Feb 01 2002 - 17:17:19 MET


    [Our apologies if you receive multiple copies of this announcement]

    ELRA - European Language Resources Association

    We are pleased to announce some new resources
    available in our catalogue of language resources:

    S0119 Spanish SpeechDat Database for the Mobile Telephone Network
    W0032 Modern French Corpus including Anaphors Tagging
    W0033 CRATER 2

    A short description of these three new resources is given
    below. Please visit the online catalogue to get further details:

    S0119 Spanish SpeechDat Database for the Mobile Telephone Network
    The Spanish SpeechDat database for the mobile telephone network
    comprises 1066 Spanish speakers (526 males, 540 females) calling
    from GSM telephones and recorded over the fixed PSTN using an
    ISDN-BRI interface. The database was produced by Applied Technologies
    in Language and Speech S.L. (Spain). The MDB-1000 database is
    partitioned into 6 CDs in ISO 9660 format. This database follows the
    specifications given in the framework of the SpeechDat(II) project.
    Speech samples are stored as sequences of 8-bit 8 kHz A-law.
    Each prompted utterance is stored in a separate file. Each signal file
    is accompanied by an ASCII SAM label file which contains the relevant
    descriptive information.
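    As an illustration of the sample format described above, the following
    is a minimal sketch of decoding 8-bit A-law bytes to 16-bit linear PCM
    values, following the standard G.711 A-law expansion. It assumes that a
    signal file is a headerless stream of A-law bytes; the function names
    alaw_to_linear and decode_alaw are hypothetical and are not part of the
    database distribution.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate stated in the database description


def alaw_to_linear(byte: int) -> int:
    """Expand one G.711 A-law byte to a signed 16-bit linear sample."""
    a = byte ^ 0x55               # undo the even-bit inversion used by A-law
    t = (a & 0x0F) << 4           # 4-bit mantissa
    seg = (a & 0x70) >> 4         # 3-bit segment (exponent)
    if seg == 0:
        t += 8
    else:
        t += 0x108
        if seg > 1:
            t <<= seg - 1
    return t if a & 0x80 else -t  # sign bit set means a positive sample


def decode_alaw(data: bytes) -> list[int]:
    """Decode a headerless A-law byte stream (assumed layout) to PCM samples."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """Duration of an utterance, at one byte per sample and 8 kHz."""
    return len(data) / SAMPLE_RATE_HZ
```

    For example, decode_alaw(open(path, "rb").read()) would return the
    samples of one prompted utterance, where path is a placeholder for one
    signal file.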
    Each speaker uttered the following items:
    · 2 isolated digits.
    · 1 sequence of 10 isolated digits.
    · 4 connected digits: 1 sheet number (6 digits), 1 telephone number
    (9-11 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits).
    · 3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date
    (word style), 1 relative and general date expression.
    · 1 word spotting phrase using an application word (embedded).
    · 6 application words.
    · 3 spelled words: 1 spontaneous name (own forename), 1 city
    name, 1 real / artificial word for coverage.
    · 1 currency money amount.
    · 1 natural number.
    · 6 directory assistance names: 1 surname (set of 500), 1 city of
    birth / growing up, 1 most frequent cities (set of 500), 1 most frequent
    company / agency (set of 500), 1 ‘forename surname’ (set of 150), 1
    spontaneous forename.
    · 2 questions including ‘fuzzy’ yes / no: 1 predominantly ‘Yes’ question,
    1 predominantly ‘No’ question.
    · 9 phonetically rich sentences.
    · 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word
    style).
    · 4 phonetically rich words.
    · Call environment.
    The following age distribution has been obtained: 5 speakers are below 16
    years old, 543 speakers are between 16 and 30, 307 speakers are
    between 31 and 45, 202 speakers are between 46 and 60, 9 speakers are
    over 60. A pronunciation lexicon with a phonemic transcription in SAMPA is
    also included.

    W0032 Modern French Corpus including Anaphors Tagging
    This corpus, which includes anaphor tagging, was created by
    the CRISTAL-GRESEC team (Stendhal-Grenoble 3 University, France)
    and XRCE (Xerox Research Centre Europe, France) in the framework of
    the call launched by the DGLF-LF (national institution for the French
    language and the languages spoken in France) for the creation of modern
    French corpora.
    Over 1 million words have been annotated. The corpora have been selected
    so that they represent a wide sampling of the French language (scientific
    and human science articles, extracts from newspapers and magazines,
    legal texts, etc.) and according to the points of interest of the teams working
    on the project. The processed corpora supplied by ELRA are listed below:
    - Two books edited by the CNRS: La protection des oeuvres scientifiques
    en droit d'auteur français, Xavier Strubel. Paris, CNRS Editions, 1997 (77 591
    words) and Cinquante ans de traction à la SNCF. Enjeux politiques, économiques
    et réponses techniques, Clive Lamming. Paris, CNRS Editions, 1997 (124 990
    words).
    - 204 articles extracted from CNRS Info, a magazine which contains short
    popular scientific articles from the CNRS laboratories (201 280 words).
    - 14 articles dealing with Hermès Human Sciences (111 886 words).
    - 136 articles extracted from "Le Monde", dealing with economics (roughly
    180 760 words).
    - 13 booklets of the Official Journal of the European Communities
    (337 000 words).

    The tagged anaphoric elements are listed below:
    - Person pronouns: 3rd person pronoun, anaphoric.
    - Possessive determiners: 3rd person possessive determiner.
    - Demonstrative pronouns: anaphoric pronouns (celui, celle, ceux,
    - Indefinite pronouns: Aucun(e), chacun(e), certain(e)s, l'un(e), les
    tout(es), etc., when they are anaphoric.
    - "Proverbs": "le" + "faire".
    - Anaphoric and cataphoric adverbs: Dessus, dedans, dessous, when
    they have an anaphoric function.
    - Ellipsis of head nouns: Nominal adjectives or quantifiers determiners
    - Textual headers like "ce dernier": Ce dernier, le premier, etc.
    The annotation scheme was defined in XML format. The texts were divided
    into sections, paragraphs (<p>) and sentences (<s>). The sentence
    segmentation was carried out with NLP tools developed by XRCE; the
    annotation was done manually by two qualified linguists. A large subset
    of anaphoric phrases was automatically pre-annotated. The antecedents and
    the tagging of the anaphoric relations were processed manually, with
    editing tools (emacs, macros from Author/Editor software) used to ease
    the task. 5% of the corpora were evaluated to assess the annotation
    reliability.
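    To illustrate the paragraph and sentence segmentation described above,
    here is a minimal sketch that reads the sentences of each paragraph from
    an XML text. Only the <p> and <s> elements come from the description;
    the <text> and <section> wrapper names, the sample sentences and the
    function name sentences_per_paragraph are assumptions made for the
    example, and the actual tags used for anaphors are not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only <p> and <s> are named in the corpus description.
SAMPLE = """<text>
  <section>
    <p><s>Le train arrive.</s><s>Il est en retard.</s></p>
    <p><s>Ce dernier repart demain.</s></p>
  </section>
</text>"""


def sentences_per_paragraph(xml_text: str) -> list[list[str]]:
    """Return, for each <p>, the plain text of each <s> it contains."""
    root = ET.fromstring(xml_text)
    return [
        # itertext() also collects text nested inside any inline markup
        ["".join(s.itertext()).strip() for s in p.iter("s")]
        for p in root.iter("p")
    ]


paragraphs = sentences_per_paragraph(SAMPLE)
print(len(paragraphs), "paragraphs,", sum(len(p) for p in paragraphs), "sentences")
```

    Using itertext() rather than the text attribute keeps the full sentence
    even when inline annotation elements are nested inside <s>.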

    W0033 CRATER 2
    The CRATER corpus was built upon the foundations of an earlier project,
    ET10/63, which was funded in the final phase of the Eurotra programme.
    The Corpus Resources and Terminology Extraction project (MLAP-93 20)
    extended the bilingual annotated English-French International
    Telecommunications Union corpus produced within ET10/63 to include Spanish.
    The CRATER 2 corpus was produced by the Department of Linguistics & Modern
    English Language, Lancaster University (United Kingdom) with funding from
    ELRA. The ELRA funding in turn was provided by the European Commission
    project LRsP&P (Language Resources Production & Packaging - LE4-8335).
    This project has enhanced the CRATER corpus, available under the reference
    ELRA-W0003 in the ELRA catalogue. CRATER 2 has significantly expanded
    the English/French component of the parallel corpus, from 1,000,000
    words per language to approximately 1,500,000 tokens per language.
    CRATER 2 is sold with CRATER in a single package.

    For further information, please contact:

    55-57 rue Brillat-Savarin
    F-75013 Paris, France

    Tel: +33 01 43 13 33 33
    Fax: +33 01 43 13 33 30


    or visit our Web site:
