Corpora: New Corpora from the LDC

From: LDC Office (ldc@ldc.upenn.edu)
Date: Thu Jan 03 2002 - 16:43:19 MET

                 ** Chinese Treebank Version 2.0 **

               ** Switchboard Cellular Part 1 Audio **

            ** Switchboard Cellular Part 1 Transcription **

         ** Switchboard Cellular Part 1 Transcribed Audio **

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of four new corpora.

                              **

    1. Chinese Treebank Version 2.0 is the continuation of a project
    started in Summer 1998; the project's goal is the creation of a
    100,000-word corpus of Chinese with syntactic bracketing. The corpus
    contains approximately 100,000 words drawn from 325 Xinhua newswire
    articles dating from 1994 to 1998. Version 2.0 is GB encoded and
    formatted similarly to the UPenn English Treebank, except that some
    original file information, such as "SRCID" and "DATE", is retained in
    the data files.
    Please note that Chinese Treebank 2.0 supersedes and replaces the
    Chinese Penn Treebank Final Release (LDC2000T48).

    For more information, including samples and a link to the Chinese
    Treebank Project website, please visit:

    http://www.ldc.upenn.edu/Catalog/LDC2001T11.html

    Institutions that have membership in the LDC during the 2001
    Membership Year will be able to receive this corpus free of charge.
    Nonmembers may purchase this publication for $200.

    2. The Switchboard Cellular Part 1 project focused primarily on GSM
    cellular phone technology. The project's goal was to recruit 190
    subjects, balanced by gender, to participate in ten or more 5-6 minute
    conversations each on GSM cellular phones under varied environmental
    conditions. The data was collected for research, development, and
    evaluation of automatic systems for speech-to-text conversion, talker
    identification, language identification, and speech signal detection.

    Part 1 consists of three corpora: Audio, Transcriptions, and Transcribed
    Audio. All three corpora contain documentation describing speaker
    information, call information, and audit information.

    The Switchboard Cellular Part 1 Audio release is a 13 CD-ROM publication
    that contains approximately 65 hours of speech audio. The Audio corpus
    totals 1309 calls, or 2618 sides (1957 GSM), from 254 participants
    (129 male, 125 female). The data files are not compressed.

    For further information, please visit:

    http://www.ldc.upenn.edu/Catalog/LDC2001S13.html

    Institutions that have membership in the LDC during the 2001
    Membership Year will be able to receive this corpus free of charge.
    Nonmembers may purchase this publication for $2600.

    3. Switchboard Cellular Part 1 Transcription is distributed as an FTP
    file containing the 250 transcripts that correspond to the speech data
    files in Switchboard Cellular Part 1 Transcribed Audio (LDC2001S15).
    Calls were transcribed using conventions similar to HUB-5 English.

    For more information, including an example transcript, please visit:

    http://www.ldc.upenn.edu/Catalog/LDC2001T14.html

    Institutions that have membership in the LDC during the 2001
    Membership Year will be able to receive this corpus free of charge.
    Nonmembers may purchase this publication for $1000.

    4. Switchboard Cellular Part 1 Transcribed Audio, a 3 CD-ROM
    publication, contains the 250 speech data files that correspond to the
    Switchboard Cellular Part 1 Transcription (LDC2001T14). The data files
    are not compressed. There are approximately 12 hours of audio data.

    For more information, please see:

    http://www.ldc.upenn.edu/Catalog/LDC2001S15.html

    Institutions that have membership in the LDC during the 2001
    Membership Year will be able to receive this corpus free of charge.
    Nonmembers may purchase this publication for $600.

                               **

    If you need additional information before placing your order, or
    would like to inquire about membership in the LDC, please send email to
    <ldc@ldc.upenn.edu> or call (215) 573-1275.

    ---------------------------------------------------------------------
    Linguistic Data Consortium      Phone: (215) 573-1275
    3615 Market Street              Fax:   (215) 573-2175
    Suite 200                       email: ldc@unagi.cis.upenn.edu
    Philadelphia, PA 19104-2608     www:   http://www.ldc.upenn.edu


