Corpora: Membership Renewal

From: LDC Office (ldc@unagi.cis.upenn.edu)
Date: Thu Jul 27 2000 - 21:23:01 MET DST

    Dear Member,

    This message is to remind you that the LDC membership year now
    matches the calendar year. Membership years 1999 and 2000 are
    currently open. If you have not already done so, you may join for
    either 1999 or 2000 or both. Remember that joining the LDC for a
    membership year is almost always preferable to buying corpora
    outright. There are two reasons for this. First, the cost of a
    membership is typically less than the cost of buying several LDC
    corpora. Second, due to restrictions imposed by some of our
    information providers, some LDC corpora are not available to
    non-members. Membership Year 1999 will close at the end of this
    calendar year.

    We will be sending out the membership renewal notices for the
    2001 Membership Year in early December. Of course, you may join a
    future membership year at any time.

    If you have already joined for membership year 2000, thank you
    for your patronage. If you have not yet joined, I would like to
    remind you of the benefits of a 2000 membership to LDC. So far
    we have released 6 collections this year. They are:

    Chinese Treebank (preliminary release)
    Hong Kong Laws Parallel Text
    Hong Kong News Parallel Text
    Korean Newswire Text
    BLLIP 1987-89 WSJ Corpus Release 1
    Santa Barbara Corpus of Spoken American English Part-I

    You can find a link to a description page for each of these
    corpora at: http://morph.ldc.upenn.edu/Catalog/by_year.html#2000

    Current year members also have access to LDC Online and the
    ability to purchase corpora from previous membership years at the
    media cost of $100 per CD.

    We also plan to release the following corpora this year:

    1998 HUB 4 Broadcast News Evaluation English Test Material
    (LDC2000S86) - The evaluation test material used in the 1998
    DARPA/NIST Continuous Speech Recognition Broadcast News Hub-4
    English Benchmark Test administered by the NIST Spoken Natural
    Language Processing Group. Approximately three hours of English
    Broadcast News from PRI, ABC News, Cable News Network, and the
    University of Southern California with UTF Transcripts.

    NRL Speech in Noisy Environments (SPINE) Audio Training Data -
    The training data set of audio files of multiple speakers using
    various vocoder and microphone headsets to communicate in
    coordinated tasks at remote locations. Approximately 140
    conversations of five minutes each.

    NRL Speech in Noisy Environments (SPINE) Training Data
    Transcripts - The transcript files for the previous SPINE audio
    publication.

    TDT-2 Careful Transcriptions - 10 hours of BC audio from the TDT-2
    corpus transcribed to Hub-4 specification for use in ASR.

    TDT-2 Audio Mandarin - Audio of the VOA Chinese Broadcasts from
    Feb-Jun 1998. The transcripts are provided in the TDT-2 Mandarin
    Text or TDT-2 Multilanguage Text.

    Czech VOA Audio and Transcripts - Approximately 30 hours of VOA
    broadcast news in Czech collected during the summer of 1999 with
    the associated transcripts created at the University of West
    Bohemia in the Czech Republic (used in the JHU 1999 Summer
    Workshops).

    1999 HUB 4 Broadcast News Evaluation English Test Material - The
    evaluation test material used in the 1999 DARPA/NIST Continuous
    Speech Recognition Broadcast News Hub-4 English Benchmark Test.
    Approximately one and a half hours of broadcast news audio and
    transcripts.

    1999 HUB 4 Broadcast News Evaluation Non English (Mandarin) Test
    Material - The evaluation test material prepared in accordance
    with the DARPA/NIST Continuous Speech Recognition Broadcast News
    Hub-4 Non English Benchmark Test; the test itself, however, was
    not conducted. Approximately one and a half hours of broadcast
    news audio and transcripts.

    TREC Chinese - This is the set of documents used for the Chinese
    task in TRECs 5-6. It consists of approximately 170 megabytes of
    articles drawn from the People's Daily newspaper and the Xinhua
    newswire formatted to include TREC document ids. The text is
    Mandarin and is encoded using the Big 5 encoding scheme. The
    topics (questions) and relevance judgments (right answers) that
    complete the test collections can be downloaded from the TREC web
    site (http://trec.nist.gov) in the Data/Non-English section.

    TREC Spanish - This is the set of documents used for the Spanish
    task in TRECs 3-5. It consists of approximately 250 megabytes of
    the Mexican newspaper El Norte and 300 megabytes of Agence France
    Presse 1994 newswire text formatted to include TREC document ids.
    The El Norte documents were used for TRECs 3-4, and the Agence
    France Presse documents for TREC 5. The topics (questions) and
    relevance judgments (right answers) that complete the test
    collections can be downloaded from the TREC web site
    (http://trec.nist.gov) in the Data/Non-English section.

    Japanese Lexicon - This is a revised version of the CallHome
    Japanese lexicon. Revisions include tagging of obsolete forms in
    the original and additions of common place names, days of the
    week, etc., that happen not to occur in the CallHome Japanese
    transcripts.

    Spanish Lexicon - A revised version of the CallHome Spanish
    lexicon containing additional lexical items from recent
    transcription efforts.

    Mandarin Lexicon - A pronunciation dictionary containing 44,404
    words. It covers both telephone and broadcast speech transcripts
    and text data (newswire) of hub4 Mandarin and hub5 Mandarin. A
    new version - covering ALL of the transcripts and text data - is
    being compiled at present. The lexicon is text-based and
    GB-encoded.

    Thai Newswire - The Thai newswire Krungthep Turakij was collected
    from May, 1997 until July, 1999. It is encoded in TIS-620, and
    has been tagged using a simple, Tipster-style tagging scheme.

    TDT-3 Text/Audio - Audio and text from TDT-1999, comprising 8
    English and 3 Mandarin sources (television, radio and newswire)
    collected from Oct-Dec 1998, divided into stories and exhaustively
    annotated for relevance to 60 topics selected from the corpus.

    Classified Ads - Collected from the Internet sites of several
    major newspapers. The ads have been annotated to show the full
    reading of both standard and non-standard abbreviations used. The
    corpus was collected for the JHU 1999 Summer Workshop and
    annotated by Richard Sproat and his group at the workshop.

    LDC Named Entity Tags - Named entity style annotations following
    the TREC Named-Entity task definition.

    If you would like to receive further information about these
    corpora or request an invoice for the 2000 membership year,
    please write to <ldc@ldc.upenn.edu> or call 215.573.1275.

    Best,

    Shannon Sears
    Manager, Intellectual Property Rights and Membership
    ----------------------------------------------------------------------
    Linguistic Data Consortium     Phone: (215) 573-1275
    3615 Market Street             Fax:   (215) 573-2175
    Suite 200                      email: ldc@ldc.upenn.edu
    Philadelphia, PA 19104-2608    www:   http://www.ldc.upenn.edu


