Corpora: Two New Releases from the LDC

From: LDC Office (ldc@ldc.upenn.edu)
Date: Tue Jul 17 2001 - 22:11:09 MET DST

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of two new releases.

    1. Message Understanding Conference (MUC) 7
    LDC2001T02, ISBN 1-58563-205-8, ftp file
    http://www.ldc.upenn.edu/Catalog/LDC2001T02.html

    2. CALLHOME Spanish Dialogue Act Annotation
    LDC2001T61, ISBN 1-58563-197-3, ftp file
    http://www.ldc.upenn.edu/Catalog/LDC2001T61.html

    --
    

    1. The Message Understanding Conference (MUC) 7 corpus contains texts and annotations of newswire files drawn from the 1996 NY Times News Wire. These newswire files were used in the Message Understanding Conference (MUC) 7 proceedings for the development of information extraction systems.

    The following excerpts are taken from the NIST Information Extraction web page (http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html):

    Information extraction systems have been evaluated under the support of DARPA and other government agencies for almost a decade. Since early 1990, the MUC evaluations have been funding the development of metrics and statistical algorithms to support government evaluations of emerging information extraction technologies.

    In the mid-nineties MUC evaluations began to provide prepared data and task definitions in addition to providing fully automated scoring software to measure machine and human performance. The tasks grew from just production of a database of events found in newswire articles from one source to the production of multiple databases of increasingly complex information extracted from multiple sources of news in multiple languages. The databases now include named entities, multilingual named entities, attributes of those entities, facts about relationships between entities, and events in which the entities participated.

    The results of these evaluations were reported at conferences during the 1990s, where developers and evaluators shared their findings and government specialists described their needs. These conferences were called 'Message Understanding Conferences (MUC)' as a result of the use of such technology to process military messages.

    Institutions that have membership in the LDC during the 2001 Membership Year will be able to receive this corpus free of charge. The non-member cost is $100. Please note that there is also an associated user agreement for both members and non-members.

    2. CALLHOME Spanish Dialogue Act Annotation was developed under Project CLARITY. The goal of CLARITY was to glean discourse information from unrestricted conversational speech using shallow corpus-based analysis. The annotation was carried out at Interactive Systems Labs at Carnegie Mellon University.

    This ftp publication uses a three-level coding scheme to manually tag the LDC publication CALLHOME Spanish Transcripts (http://www.ldc.upenn.edu/Catalog/LDC96T17.html). The three levels of the coding scheme are:

    1. a dialogue act level consisting of a tag set extended from DAMSL and Switchboard

    2. a dialogue game level featuring short sequences of dialogue acts

    3. a genre level similar to topical segments.

    All 120 dialogues have been annotated. This publication contains approximately 11,835 unique words and 211,940 total words.

    Dialogue games are short sequences of dialogue acts, such as question/answer pairs. Genres include storytelling, discussion, and planning; genre segmentation also takes topics into account. Genres, games, and dialogue acts are each annotated by type. Genres are additionally annotated for activities and topics (on a 0-5 scale) for the central object or person being discussed (the 'who' or 'what' category); they also contain a short synopsis of the segment.
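
    For readers who want to picture how the three levels nest, the following is a minimal sketch in Python. It is not the file format of the LDC release: all class names, field names, and label strings are hypothetical, and treating activity and topic as two separate 0-5 scores is only one reading of the description above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogueAct:
    """Lowest level: one tagged utterance (field names are hypothetical)."""
    speaker: str
    text: str
    act_type: str  # illustrative label, not taken from the actual tag set


@dataclass
class DialogueGame:
    """Middle level: a short sequence of dialogue acts, e.g. a question/answer pair."""
    game_type: str
    acts: List[DialogueAct] = field(default_factory=list)


@dataclass
class GenreSegment:
    """Top level: a genre segment such as storytelling, discussion, or planning."""
    genre_type: str
    activity_score: int  # assumed to lie on the 0-5 scale
    topic_score: int     # assumed to lie on the 0-5 scale
    synopsis: str
    games: List[DialogueGame] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject scores outside the 0-5 range mentioned in the announcement.
        for name in ("activity_score", "topic_score"):
            value = getattr(self, name)
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be between 0 and 5, got {value}")


def count_act_types(segment: GenreSegment) -> Counter:
    """Count how often each dialogue act type occurs within one genre segment."""
    return Counter(act.act_type for game in segment.games for act in game.acts)
```

    A segment built this way can be traversed from the genre level down to the individual dialogue acts, which is the kind of access a per-type frequency count needs.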

    Papers on annotation schemes from the 1999 ACL Workshop for Discourse Tagging and LREC-2000 and technical papers on automatic detection are available at the Interactive Systems Labs site: http://www.is.cs.cmu.edu

    Institutions that have membership in the LDC during the 2001 Membership Year will be able to receive this corpus free of charge. The non-member cost is $600.

    --

    If you would like to order a copy of either corpus, please email your request to <ldc@ldc.upenn.edu>. User agreements may be faxed to 215.573.2175.

    If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call 215.573.1275.


