Re: Corpora: Broadcast corpus

From: David Graff (graff@unagi.cis.upenn.edu)
Date: Mon Jan 17 2000 - 19:58:25 MET


    I'd like to clarify the availability of corpora from the LDC for working with
    topic-sensitive language modeling. Of the corpora mentioned in earlier
    messages (thanks Raman!):

               LDC98T31 1996 CSR Hub-4 Language Model
               LDC97T22 1996 English Broadcast News Transcripts (Hub-4)
               LDC98T28 1997 English Broadcast News Transcripts (Hub-4)
               LDC98T24 1997 Mandarin Broadcast News Transcripts (Hub-4NE)
               LDC98T29 1997 Spanish Broadcast News Transcripts (Hub-4NE)

               LDC99T36 USC Marketplace Broadcast News Transcripts

    Only the first of these contains enough bulk to support experiments on
    adaptive LMs: it contains a 4.5-year archive of broadcast transcripts
    (1992/01 - 1996/06), originally derived from Primary Source Media (PSM). The
    LDC has not done any topic annotation on this collection, but the
    SGML-formatted text files do preserve a variety of information about each
    story that was supplied by PSM, including keywords, story titles and/or
    headlines.

    Unfortunately, due to constraints imposed by copyright owners, all Hub-4
    corpora (including the LM collection) are available only to LDC members. (The
    "USC Marketplace" transcripts are available to non-members, but account for
    only about 40 hours' worth of broadcasts.)

    Other corpora that might be useful for topic-based LM research are the TDT
    collections:

             LDC99T39 TDT2 Multilanguage Text
             LDC99T38 TDT2 Mandarin Text
             LDC99T37 TDT2 English Text
             LDC98T25 TDT Pilot Study Corpus

    The "Multilanguage" collection is simply the combination of the TDT2 Mandarin
    and English collections. These are available to non-members (please check our
    catalog -- www.ldc.upenn.edu/Catalog). The time span covered is only six
    months (1998/01-06), but the collection includes over 700 hours of English
    broadcasts (though only about 60 hours of Mandarin broadcasts), plus an
    equivalent amount of newswire data. All English stories in the collection are
    labeled with respect to 100 selected topics, and all Mandarin stories are
    labeled with respect to 20 topics (selected from the 100 topics defined for
    English).

            Dave Graff
            LDC


