Corpora: Amaryllis web address

From: Magali Duclaux (duclaux@elda.fr)
Date: Fri Oct 05 2001 - 10:48:53 MET DST


    [Our apologies if you receive multiple copies of this announcement]

    ************************************************************
    ELRA - European Language Resources Association
    ************************************************************

    A new resource is available in our catalogue of
    Language Resources:

    ELRA-W0029 Amaryllis Corpus

    ***********************************************************
    ERRATUM

    The URL to the Amaryllis web site is: http://amaryllis.inist.fr/
    ***********************************************************
    A description of this new resource is given
    below:

    Launched at the end of 1995, the AMARYLLIS project
    aimed at evaluating information retrieval software for
    French text corpora in order to provide a methodology
    for the evaluation of other similar tools. AMARYLLIS
    was organised by the Institut de l'Information Scientifique
    et Technique (INIST) with the support of the Agence
    francophone pour l'enseignement supérieur et la
    recherche (AUPELF-UREF) and the French Ministère de
    l'Education Nationale, de la Recherche et de la Technologie
    (MERT).
    More specifically, the objective was to create document
    corpora, questions and answers, in the framework of the
    Action de Recherche Concertée (ARC A1, renamed
    Amaryllis - Access to text information in French), in order
    to produce work comparable to the United States TREC project.
    For more information about the AMARYLLIS project,
    please visit the following web site:
    http://amaryllis.inist.fr/

    All corpora are structured as SGML files with ISO Latin character
    encoding.
    The available corpora were provided by:
    - INIST (Institut de l'Information Scientifique et Technique)
    - OFIL (Observatoire Français et International des Industries de
    la Langue)
    - ELRA (European Language Resources Association)
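
    Because the corpora are distributed as SGML files in an ISO Latin
    encoding, a reader working with Unicode tools may want to decode them
    explicitly. The sketch below is only an illustration: it assumes that
    "ISO Latin" means ISO-8859-1 (Latin-1), which the announcement does not
    state, and the function names are hypothetical.

```python
from pathlib import Path

# Assumption: "ISO Latin" means ISO-8859-1 (Latin-1). The announcement
# does not name the exact variant, so adjust this if the files differ.
ASSUMED_ENCODING = "iso-8859-1"


def read_corpus_file(path: Path, encoding: str = ASSUMED_ENCODING) -> str:
    """Return the decoded text of one corpus file, markup included."""
    return path.read_text(encoding=encoding)


def to_utf8(source: Path, target: Path, encoding: str = ASSUMED_ENCODING) -> None:
    """Write a UTF-8 copy of one corpus file without touching the markup."""
    target.write_text(read_corpus_file(source, encoding), encoding="utf-8")
```

    Decoding with an explicit encoding avoids relying on the platform
    default, which matters for accented French text.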

    Each provider supplied three types of corpora: text documents,
    search topics, and answers to these topics in the corresponding
    text corpora (with frames of reference for the answers).

    1- Text documents in French
    The text documents in French comprise:
    - Articles (titles and texts) extracted from the newspaper
    "Le Monde"; each batch contains three months of documents,
    provided by OFIL (01-01-93/31-03-93, 01-04-93/30-06-93),
    - Titles and summaries of scientific articles covering every
    domain from the Pascal bibliographical databases (from 1984
    to 1995) and Francis (from 1992 to 1995), provided by INIST.
    The tagging of the documents conforms to a simplified version
    of a DTD from the TEI, which makes it possible to represent
    the logical structure.

    2- Multilingual text documents
    The multilingual text documents have been provided by ELRA,
    and comprise documents in 6 languages (French, English,
    Italian, Spanish, German and Portuguese), extracted from the
    parallel corpus MLCC, which contains documents translated into the
    official European languages (from 1992 to 1994). The corpus was
    divided into two sub-corpora: written questions (10 million words)
    and debates of the European Parliament (5 to 8 million words
    per language).

    3- Search topics
    The topics derive from questions asked by end users, and should
    contain all the information necessary to understand the issue
    they deal with and to assess relevance. They comprise
    the following items:
    - A domain: the field of knowledge the topic belongs to,
    - A topic: equivalent to a title defining the subject,
    - A question: matching the question the user may ask,
    - Complementary information: details on further documents
    that should be selected from the corpus,
    - Concepts: a set of descriptors used to set the limits of the
    search.
    The topics were built by OFIL, by documentalists working for
    Le Monde who used requests from journalists, and by engineers responsible
    for documentation at INIST (experts in their domain) who used requests from
    end users. These topics were meant to cover numerous application fields and
    to yield a large number of relevant results in each corpus. The topics were
    tested on the corpora to check their relevance. In some cases the query had
    to be modified, or further details had to be added.
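
    The five items that make up a topic can be pictured as a simple record.
    The sketch below is illustrative only: the class name, field names and
    sample values are invented for the example and are not the SGML tag
    names or the content of the actual topic files.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SearchTopic:
    """One search topic; field names are hypothetical, not the real tags."""
    number: int                      # topics are numbered in the answer files
    domain: str                      # field of knowledge
    topic: str                       # title defining the subject
    question: str                    # question the user may ask
    complementary_info: str = ""     # details on further documents to select
    concepts: List[str] = field(default_factory=list)  # descriptors bounding the search

    def is_complete(self) -> bool:
        """True when all five descriptive items are filled in."""
        return all([self.domain, self.topic, self.question,
                    self.complementary_info, self.concepts])


# Invented sample values, for illustration only.
example = SearchTopic(
    number=1,
    domain="Linguistics",
    topic="Evaluation of retrieval software",
    question="How is information retrieval software evaluated for French?",
    complementary_info="Documents describing evaluation campaigns are relevant.",
    concepts=["evaluation", "information retrieval", "French"],
)
```

    A check such as is_complete() reflects the requirement that a topic
    carry all the information needed to judge relevance.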

    4- Frames of reference for the answers
    The answer files contain, for each numbered topic, the numbers of all
    relevant documents. Initial frames of reference for the answers were
    established before the participants proceeded to the tests. The answers
    were selected by the providers (OFIL and INIST) with the appropriate
    methodology and adequate tools: they made a pre-selection of documents
    that was as broad as possible, based not only on titles and summaries
    but also on the key words and classification codes used in the Pascal
    and Francis databases. These key words and classification codes cannot
    be accessed by the participants. The resulting set of documents was then
    sorted manually so that it matched the query as closely as possible.
    The initial frames of reference were checked manually by the providers
    (INIST and OFIL), using the answers given by the participants. These
    answers were collected when the tests were finished. This allowed us to
    review and correct the frames of reference for the answers, making
    their content more detailed. The illustration below
    shows how the review was performed.

    Each of the 4 CDs contains a corpus for the two phases of the two
    campaigns which took place.
    TrecEval is also provided.
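
    To make the role of the frames of reference concrete, the sketch below
    scores one system's answers against the relevant document numbers for
    each topic, using set-based precision and recall. It is a minimal
    illustration and not TrecEval itself; the function names, the in-memory
    data layout and the sample document numbers are assumptions made for
    the example.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall for a single topic."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate_run(run: Dict[int, List[str]],
                 reference: Dict[int, Set[str]]) -> Dict[int, Tuple[float, float]]:
    """Score every topic in the frame of reference.

    A topic the system did not answer counts as an empty result list.
    """
    return {topic: precision_recall(run.get(topic, []), relevant)
            for topic, relevant in reference.items()}


# Invented document numbers, for illustration only.
reference = {1: {"d1", "d2", "d3", "d4"}, 2: {"d7"}}
run = {1: ["d1", "d5", "d2"]}
scores = evaluate_run(run, reference)
```

    Here topic 1 has two relevant documents among three retrieved, and
    topic 2, which the system left unanswered, scores zero on both measures.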

    =====================================
    For further information, please contact:
    ELRA/ELDA
    55-57 rue Brillat-Savarin
    F-75013 Paris, France
    Tel.: +33 01 43 13 33 33
    Fax : +33 01 43 13 33 30
    Email: mapelli@elda.fr
    or consult our catalogue at the following address:
    http://www.icp.grenet.fr/ELRA/home.html
    or http://www.elda.fr
    =====================================


