[Corpora-List] New Corpora from the LDC

From: LDC Office (ldc@ldc.upenn.edu)
Date: Mon Jan 06 2003 - 17:35:01 MET

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of three new corpora.

                 ** 1997 HUB5 Spanish Evaluation **

                 ** 2000 Communicator Evaluation **

        ** Grassfields Bantu Fieldwork: Ngomba Tone Paradigms **

    1. The 1997 Hub-5 Spanish evaluation is part of an ongoing series
    of periodic evaluations conducted by NIST. This evaluation focused
    on the task of transcribing conversational speech into text. Each
    conversation is represented as a "4-wire" recording, that is, with
    two distinct sides, one from each end of the telephone circuit. Each
    side is recorded and stored as a standard telephone codec signal
    (8 kHz sampling, 8-bit mu-law encoding). The 1997 HUB5 Spanish
    Evaluation contains 426 Mbytes of SPHERE data.
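
    For readers unfamiliar with the telephone codec mentioned above, the
    following is a minimal sketch of how a single 8-bit mu-law byte
    expands to a 16-bit linear sample, following the usual G.711-style
    procedure. The function name mulaw_to_linear is hypothetical and is
    not part of any LDC or NIST tool; the sketch also assumes the SPHERE
    header has already been stripped from the data.

```python
def mulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample.

    Illustrative sketch only (hypothetical helper, not an LDC/NIST API).
    """
    u = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = u & 0x80               # top bit: sign
    exponent = (u >> 4) & 0x07    # next three bits: segment number
    mantissa = u & 0x0F           # low four bits: step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_mulaw(data: bytes) -> list[int]:
    """Decode a headerless run of mu-law bytes into linear samples."""
    return [mulaw_to_linear(b) for b in data]
```

    At 8 kHz sampling, each decoded sample represents 1/8000 of a second
    of one side of the conversation.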

    For further information, including a link to additional documentation on
    the NIST web site, please visit:
     
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002S25

    Institutions that have membership in the LDC during the 2002
    Membership Year will be able to receive this corpus free of charge.
    Nonmembers may purchase this publication for $1000.

    2. The original goals of the Communicator program were to support the
    creation of speech-enabled interfaces that scale gracefully across
    modalities, from speech-only to interfaces that include graphics,
    maps, pointing and gesture. The original vision of the Communicator
    systems included the ability of a user, during one ten-minute session,
    to plan a three-leg trip, with the three flights/legs on three different
    days, with rental car and hotel in each of the two "away" cities, plus
    dictating/sending a voice-mail message.

    The actual research that led to the data collections in 2000 and 2001
    explored ways to construct better spoken-dialogue systems, with which
    users interact via speech alone to perform relatively complex tasks such
    as travel planning. During 2000 and 2001 two large data sets were
    collected, in which users used the Communicator systems built by the
    research groups to do travel planning. The 2000 Communicator Evaluation
    publication consists of all the data from the 2000 collection.

    For the 2000 evaluation, each user called the nine different automated
    travel-planning systems to make simulated flight reservations. All audio
    files are in SPHERE format, recorded as 8-bit mu-law and PCM at 8 kHz.
    The two-channel SPHERE files total ~62 hours of audio (3415 MB),
    representing ~317K words in transcription.
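
    As a rough sanity check on the figures above, the sketch below
    estimates the raw size of a given duration of audio. It assumes every
    file is two-channel, 8 kHz, one byte per sample, and it ignores
    SPHERE headers; any 16-bit PCM files would be larger. The function
    name estimated_size_bytes is hypothetical.

```python
def estimated_size_bytes(hours, sample_rate_hz=8000, channels=2,
                         bytes_per_sample=1):
    """Estimate raw audio size in bytes (headers not included).

    Illustrative helper only; the defaults are assumptions taken from
    the format description, not from the corpus documentation.
    """
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * channels * bytes_per_sample)


size = estimated_size_bytes(62)
print(size)                    # 3571200000 bytes
print(round(size / 2**20))     # about 3406 when counted in units of 2**20 bytes
```

    Under these assumptions the estimate lands close to the stated
    3415 MB, which suggests the total is dominated by 8-bit two-channel
    data.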

    Institutions that have membership in the LDC during the 2002
    Membership Year will be able to receive this corpus free of charge.
    Nonmembers may purchase this publication for $900.

    3. Grassfields Bantu Fieldwork: Ngomba Tone Paradigms contains tone
    paradigms of the language Ngomba, a Bamileke (Grassfields Bantu)
    language spoken by some 63,000 people in the Western Province of
    Cameroon. Ngomba's tone system is undescribed, but it has many
    similarities with the closely related Yémba language (also known as
    Bamileke Dschang).

    This publication contains 755 audio files. The files in rawdata are 21
    extended audio and laryngograph recordings with ESPS xlabel files; each
    one of the raw sound files contains the complete recording of one of the
    tenses. Transcriptions are provided for the audio clips using the
    IPA-based orthography, and using phonetic and tonological transcription
    systems. The verbal tone paradigms are also accessible over the
    internet, along with an interface for browsing and editing
    transcriptions, at http://www.ldc.upenn.edu/Projects/grassfields

    For further information, please visit:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S16

    This publication is free of charge to 2001 and 2002 members. The cost
    of the first 100 copies of this publication (not counting the copies
    distributed to LDC members) is covered by NSF Grant Number 9983258.
    These copies are, therefore, free of charge to qualified researchers;
    a $30 shipping and handling fee applies. After these first 100 copies
    are distributed, additional copies will be available for the production
    cost of $150 per CD-ROM.

                               **

    If you need additional information before placing your order, or
    would like to inquire about membership in the LDC, please send email to
    <ldc@ldc.upenn.edu> or call (215) 573-1275.

    ---------------------------------------------------------------------
    Linguistic Data Consortium     Phone: (215) 573-1275
    3600 Market Street             Fax:   (215) 573-2175
    Suite 810                      email: ldc@unagi.cis.upenn.edu
    Philadelphia, PA 19104-2653    www:   http://www.ldc.upenn.edu


