[Corpora-List] New Corpora from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed May 26 2004 - 18:40:04 MET DST

    LDC2004S04
    2002 NIST Speaker Recognition Evaluation (SRE)

    LDC2004T11
    Arabic Treebank: Part 3 v 1.0

    LDC2004S05
    ISL Meeting Corpus Speech Part 1

    LDC2004T10
    ISL Meeting Corpus Transcripts Part 1


    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of four new corpora.

    *

    (1) The 2002 NIST Speaker Recognition Evaluation is part of an ongoing
    series of yearly evaluations conducted by NIST. These evaluations
    provide an important contribution to the direction of research efforts
    and the calibration of technical capabilities. They are intended to be
    of interest to all researchers working on the general problem of
    text-independent speaker recognition.

    The main data for the 2002 NIST Speaker Recognition Evaluation was
    extracted from Switchboard Cellular Part 2. The extended data task used
    Switchboard II Phases 2 and 3. This evaluation also included the first
    multi-modal task, using data from the FBI voice database. There are
    9,153 speech files in SPHERE format, totaling approximately 156 hours.
    2002 NIST Speaker Recognition Evaluation is distributed on 2 DVDs.

    For further information, including a link to the 2002 NIST Speaker
    Recognition Evaluation website, please visit:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S04

    Institutions that have membership in the LDC for the 2004 Membership
    Year will be able to receive this corpus free of charge. Nonmembers may
    license this data for US$1000.

    *

    (2) Arabic Treebank: Part 3 v 1.0 is the third part of a
    1,000,000-word Arabic Treebank corpus, designed to support language
    research and the development of language technology for Modern Standard
    Arabic. This corpus includes 600 stories from the An Nahar News Agency.
    There are a total of 340,281 words (counting non-Arabic tokens such as
    numbers and punctuation) in the 600 files - one story per file. New
    features of annotation include complete vocalization (including case
    endings), lemma IDs, and more specific POS tags for verbs and particles.

    The corpus contains 293,035 Arabic-only word tokens (prior to the
    separation of clitics), of which 290,842 (99.25%) were provided with an
    acceptable morphological analysis and POS tag by the morphological
    parser, and 2,193 (0.75%) were items that the morphological parser
    failed to analyze correctly. Arabic Treebank: Part 3 v 1.0 is
    distributed on 1 CD.
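
    The token figures quoted above can be cross-checked with a few lines
    of arithmetic. The sketch below is illustrative only: the variable
    and function names are made up for this example, and the counts are
    copied from the announcement, not read from the corpus itself.

```python
# Token counts as quoted in the announcement (not computed from the corpus).
total_tokens = 340_281    # all tokens, including numbers and punctuation
arabic_tokens = 293_035   # Arabic-only word tokens, before clitic separation
analyzed = 290_842        # tokens given an acceptable analysis and POS tag
failed = 2_193            # tokens the morphological parser failed to analyze
stories = 600             # one story per file


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


coverage = percent(analyzed, arabic_tokens)
failure = percent(failed, arabic_tokens)
non_arabic_tokens = total_tokens - arabic_tokens
mean_tokens_per_story = round(total_tokens / stories)

print(f"analyzed: {coverage}%  failed: {failure}%")
print(f"non-Arabic tokens: {non_arabic_tokens}")
print(f"mean tokens per story: {mean_tokens_per_story}")
```

    The analyzed and failed counts sum to the Arabic-only total, and the
    rounded percentages match the 99.25% and 0.75% given above.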

    For further information, including online documentation, please visit:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11

    Institutions that have membership in the LDC for the 2004 Membership
    Year will be able to receive this corpus free of charge. Nonmembers may
    license this data for US$3000.

    *

    (3) ISL Meeting Speech Part 1 is the first subset of the ISL Meeting
    Corpus (112 meetings). It contains 18 meetings collected at the
    Interactive Systems Laboratories at Carnegie Mellon University. The
    recorded meetings were either natural meetings where participants needed
    to meet in the real world, or artificial meetings, which were designed
    explicitly for the purposes of data collection but still had real topics
    and tasks. The duration of the meetings in this corpus ranges from 8 to
    64 minutes and averages 34 minutes. Word-level orthographic
    transcriptions are available as ISL Meeting Transcripts Part 1
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10>.

    ISL Meeting Speech Part 1 includes 105 speech files, for a total of
    approximately 10 hours of meeting speech. There are 31 unique speakers
    in the corpus. Meetings involved 3 to 9 participants, averaging 5. The
    corpus contains a significant proportion of non-native English
    speakers, varying in fluency. ISL Meeting Speech Part 1 is distributed
    on 2 DVDs.

    For further information, including a link to the ISL Meeting Room
    project page, please visit:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05

    Institutions that have membership in the LDC for the 2004 Membership
    Year will be able to receive this corpus free of charge. Nonmembers may
    license this data for US$1500.

    *

    (4) The ISL Meeting Transcripts Part 1 is the corresponding
    transcription for ISL Meeting Speech Part 1
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05>.
    This corpus consists of 19 word-level transcripts of 18 meetings,
    time-synchronized to digitized audio recordings. There are
    approximately 116,200 word tokens and 5,850 unique word types in the
    transcripts.

    Transcriptions were prepared with the TransEdit transcription
    application. This application was developed for the transcription of
    multi-channel recordings. It displays a synchronized multi-track view
    of all channels of a meeting, with listening and segmentation functions
    for each channel separately. ISL Meeting Transcripts Part 1 is
    distributed by FTP transfer.

    For further information, including a sample transcript, please visit:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10

    Institutions that have membership in the LDC for the 2004 Membership
    Year will be able to receive this corpus free of charge. Nonmembers may
    license this data for US$500.

    *

    If you need additional information or would like to inquire about
    membership in the LDC, please send email to <ldc@ldc.upenn.edu> or call
    1 (215) 573-1275.

    ----------------------------------------------------------------------------------------------------
    Linguistic Data Consortium        Phone: 1 (215) 573-1275
    University of Pennsylvania        Fax:   1 (215) 573-2175
    3600 Market St., Suite 810        email: ldc@ldc.upenn.edu
    Philadelphia, PA 19104-2653       www:   http://www.ldc.upenn.edu


