Corpora: New Releases from the LDC

From: LDC Office (ldc@ldc.upenn.edu)
Date: Tue May 08 2001 - 21:20:57 MET DST

    The Linguistic Data Consortium (LDC) is pleased to announce the
    release of three resources to support research in Topic Detection and
    Tracking (TDT) and information retrieval:

    1. TDT2 Multilanguage Text Corpus, Version 4.0
       LDC2001T57, ISBN 1-58563-183-3, 1 CD-ROM
       http://www.ldc.upenn.edu/Catalog/LDC2001T57.html
    2. TDT3 Multilanguage Text Corpus, Version 2.0
       LDC2001T58, ISBN 1-58563-193-0, 1 CD-ROM
       http://www.ldc.upenn.edu/Catalog/LDC2001T58.html
    3. TDT3 English Audio Corpus
       LDC2001S94, ISBN 1-58563-185-X, 55 CD-ROMs
       http://www.ldc.upenn.edu/Catalog/LDC2001S94.html

    You may refer to the LDC's online catalog pages for full
    documentation: http://www.ldc.upenn.edu/Catalog/

    Topic Detection and Tracking refers to automatic techniques for
    finding topically related material in streams of data such as newswire
    and broadcast news. These corpora were created to support three TDT
    tasks: finding topically homogeneous sections (segmentation),
    detecting the occurrence of new events (detection), and tracking the
    reoccurrence of old or new events (tracking). Taken together, the
    corpora contain audio of broadcast news, news texts (including
    transcripts of all audio), and annotation tables indicating story
    boundaries and the relevance of each story to news topics selected
    from the collection. The TDT corpora have also been used for
    information retrieval, spoken document retrieval, and information
    extraction.
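
    To make the detection task concrete, here is a toy sketch in Python.
    It is not the method used in any TDT evaluation: the whitespace
    tokenizer, the cosine-similarity measure, the 0.3 threshold, and the
    three sample stories are all assumptions chosen only for illustration.
    A story is flagged as a new event when it is not similar enough to any
    story seen earlier in the stream.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, whitespace-tokenized term counts (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_new_events(stories, threshold=0.3):
    """Return one flag per story: True if no earlier story reaches the threshold.

    The threshold value is an arbitrary assumption, not a tuned parameter.
    """
    seen = []
    flags = []
    for text in stories:
        vec = bag_of_words(text)
        best = max((cosine(vec, old) for old in seen), default=0.0)
        flags.append(best < threshold)
        seen.append(vec)
    return flags


# Invented example stories, processed in stream order.
stories = [
    "earthquake strikes coastal city overnight",
    "rescue teams reach city after earthquake",
    "central bank raises interest rates",
]
flags = detect_new_events(stories)
print(flags)
```

    In this sketch the second story shares "earthquake" and "city" with
    the first and is therefore treated as a continuation, while the first
    and third are flagged as new events. Tracking could be sketched the
    same way by comparing each incoming story against a fixed set of
    stories known to be on the topic.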

    For further information on TDT please visit:
    http://www.ldc.upenn.edu/Projects/TDT. Brief descriptions of each
    corpus are provided below, with information on how to order them.
    -------
    1. TDT2 Multilanguage Text Corpus, Version 4.0 contains news
    data collected daily from nine news sources in two languages (American
    English and Mandarin Chinese), over a period of six months (January -
    June, 1998). Both manually-created reference text and automatically-
    generated text (ASR and/or machine translation) are provided for all
    broadcast and all Mandarin data.

    This version has been prepared to complement the first general release
    of the TDT3 Multilanguage Text Corpus, providing new enhancements to
    make the data content more accessible to a broader research community.

    The news sources, and approximate number of stories per source (in
    thousands), are as follows:

    English sources                              Thousands of stories
    -----------------------------------------------------------------
     New York Times Newswire Service                             11.8
     Associated Press Worldstream Service                        12.8
     Cable News Network, "Headline News"                         15.8
     American Broadcasting Co., "World News Tonight"              2.1
     Public Radio International, "The World"                      2.9
     Voice of America, English news programs                      8.2
     Total English stories                                       53.6

    Mandarin sources                             Thousands of stories
    -----------------------------------------------------------------
     Xinhua News Agency                                          11.3
     Zaobao News Agency                                           5.2
     Voice of America, Mandarin Chinese news programs             2.3
     Total Mandarin stories                                      18.8

    Institutions that have membership in the LDC during the
    2001 Membership Year will be able to receive this corpus
    free of charge. The non-member cost is $2,500.

    -------

    2. TDT3 Multilanguage Text Corpus, Version 2.0 is the first
    general release of this collection (version 1 was made available only
    to participants in the TDT 1999 and 2000 evaluation tests). It
    contains data from the same nine sources found in TDT2, plus two
    additional English television sources. Like TDT2, it provides both
    manually-created and automatically-generated text for most sources.

    For TDT3, the daily collection took place over a period of three
    months (October - December, 1998). The sources and approximate number
    of stories per source are as follows:

    English sources                              Thousands of stories
    -----------------------------------------------------------------
     New York Times Newswire Service                              6.9
     Associated Press Worldstream Service                         7.3
     Cable News Network, "Headline News"                          9.0
     American Broadcasting Co., "World News Tonight"              1.0
     Public Radio International, "The World"                      1.6
     Voice of America, English news programs                      3.9
     MS-NBC, "News with Brian Williams"                           0.7
     National Broadcasting Co., "NBC Nightly News"                0.8
     Total English stories                                       31.2

    Mandarin sources                             Thousands of stories
    -----------------------------------------------------------------
     Xinhua News Agency                                           5.2
     Zaobao News Agency                                           3.8
     Voice of America, Mandarin Chinese news programs             3.8
     Total Mandarin stories                                      12.8

    Institutions that have membership in the LDC during the
    2001 Membership Year will be able to receive this corpus
    free of charge. The non-member cost is $2,500.

    -------

    3. TDT3 English Audio Corpus contains the audio (in compressed
    NIST SPHERE format) of news broadcasts collected daily from the six
    American English broadcast sources over the three-month collection
    period (October - December 1998). The sources and amounts are as follows:

    Code     Source                                           Hours  CDs
    --------------------------------------------------------------------
    CNN_HDL  Cable News Network, "Headline News"              174.6   19
    ABC_WNT  American Broadcasting Co., "World News Tonight"   38.6    5
    NBC_NNW  National Broadcasting Co., "NBC Nightly News"     44.6    6
    MNB_NBW  MS-NBC, "News with Brian Williams"                51.8    6
    PRI_TWD  Public Radio International, "The World"           63.9    7
    VOA_ENG  Voice of America, English news programs          102.2   12

    Total                                                     475.7   55

    The files in this publication are complete single-channel recordings
    of the (thirty or sixty minute) broadcasts listed above. Each one has
    been digitized at a sample rate of 16 kHz using 16-bit samples, and
    compressed using the "shorten" algorithm.
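
    For anyone planning disk space, the uncompressed size of the audio
    follows directly from the figures above. The Python sketch below is
    only back-of-the-envelope arithmetic: it uses the hours listed per
    source, ignores file headers, and makes no claim about the compression
    ratio that "shorten" achieves on this data. The function and variable
    names are made up for the example.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated in the announcement
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # single-channel recordings

# Hours of audio per source, copied from the table in the announcement.
hours_by_source = {
    "CNN_HDL": 174.6,
    "ABC_WNT": 38.6,
    "NBC_NNW": 44.6,
    "MNB_NBW": 51.8,
    "PRI_TWD": 63.9,
    "VOA_ENG": 102.2,
}


def uncompressed_bytes(hours):
    """Raw PCM size in bytes for the given duration, excluding any headers."""
    return round(hours * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


total_hours = sum(hours_by_source.values())
total_gb = uncompressed_bytes(total_hours) / 1e9  # decimal gigabytes

for source, hours in hours_by_source.items():
    print(f"{source}: {uncompressed_bytes(hours) / 1e9:.1f} GB")
print(f"Total: {total_hours:.1f} hours, about {total_gb:.1f} GB uncompressed")
```

    One hour of audio at these settings is 115.2 MB of raw samples, so
    the full 475.7 hours come to roughly 54.8 GB once decompressed.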

    Institutions that have commercial membership in the LDC during the
    2001 Membership Year will be able to receive this corpus free of
    charge. Institutions that have non-profit membership in the 2001
    Membership Year will need to pay a media fee of $1,100 for the full
    set of 55 CD-ROMs. The non-member cost for the full set is $11,000.

    (The audio CD-ROMs are grouped into subsets by broadcast source, and
    the LDC will support the option of purchasing one or more subsets,
    e.g. just the VOA data. We regret that we cannot provide "customized"
    subsets.)

    If you would like to order a copy of any of these corpora, please
    email your request to ldc@ldc.upenn.edu. If you need additional
    information before placing your order, or would like to inquire
    about membership in the LDC, please send email or call
    (215) 573-1275.


