[Corpora-List] New Data from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Thu Jan 06 2005 - 20:29:53 MET

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of three (3) new databases.

    ------------------------------------------------------------------------

    (1) The Buckwalter Arabic Morphological Analyzer Version 2.0
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02>
    consists primarily of three Arabic-English lexicon files: prefixes (299
    entries), suffixes (618 entries), and stems (82158 entries representing
    38600 lemmas). The lexicons are supplemented by three morphological
    compatibility tables used for controlling prefix-stem combinations (1648
    entries), stem-suffix combinations (1285 entries), and prefix-suffix
    combinations (598 entries). The documentation consists of a readme file
    with a description of the lexicon files, the morphological compatibility
    tables, the morphology analysis algorithm, a summary of stem
    morphological categories, and a table with the author's Arabic
    transliteration system.
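
    As a rough illustration of how three lexicons and three compatibility
    tables of this kind can work together, the sketch below tries every
    prefix/stem/suffix split of a word and keeps only the splits whose
    categories are allowed by all three tables. It is not taken from the
    distributed analyzer: the lexicon entries, category names, and the
    function name analyze are invented for the example.

```python
from itertools import product

# Toy lexicons: surface string -> list of morphological categories.
# Every entry and category name here is invented for illustration only.
PREFIXES = {"": ["P_none"], "wa": ["P_conj"]}
STEMS = {"ktb": ["S_noun"]}
SUFFIXES = {"": ["X_none"], "At": ["X_plural"]}

# Toy compatibility tables: the category pairs that may co-occur.
PREFIX_STEM = {("P_none", "S_noun"), ("P_conj", "S_noun")}
STEM_SUFFIX = {("S_noun", "X_none"), ("S_noun", "X_plural")}
PREFIX_SUFFIX = {("P_none", "X_none"), ("P_none", "X_plural"),
                 ("P_conj", "X_none")}


def analyze(word):
    """Return every (prefix, stem, suffix) split accepted by all three tables."""
    results = []
    n = len(word)
    for i in range(n + 1):          # end of the prefix
        for j in range(i, n + 1):   # end of the stem
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            # Look up each segment; a missing segment yields no candidates.
            candidates = product(PREFIXES.get(prefix, []),
                                 STEMS.get(stem, []),
                                 SUFFIXES.get(suffix, []))
            for p, s, x in candidates:
                if ((p, s) in PREFIX_STEM
                        and (s, x) in STEM_SUFFIX
                        and (p, x) in PREFIX_SUFFIX):
                    results.append((prefix, stem, suffix))
    return results
```

    With these toy tables, analyze("waktb") accepts the split wa + ktb,
    while analyze("waktbAt") returns nothing, because the invented
    prefix-suffix table does not allow P_conj together with X_plural.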

    Institutions that have membership in the LDC for the Membership Year
    (MY) 2004 will be able to receive this corpus free of charge. Please
    note that this corpus is designated 'Members Only' and is, therefore,
    not available for nonmember licensing. You can find information on
    becoming an LDC member at our Members FAQ
    <http://www.ldc.upenn.edu/Membership/FAQ_Members.shtml>.

    *

    (2) Fisher English Training Speech Part 1 Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S13>
    represents the first half of a collection of conversational telephone
    speech (CTS) that was created at the LDC during 2003. It contains 5850
    audio files, each one containing a full conversation of up to 10
    minutes. The individual audio files are presented in NIST SPHERE
    format, and contain two-channel mu-law sample data; "shorten"
    compression has been applied to all files. Fisher English Training
    Speech Part 1 Speech is distributed on seven DVD-ROMs.
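
    For readers unfamiliar with NIST SPHERE, each file begins with a plain
    ASCII header of "name type value" lines terminated by "end_head". The
    sketch below parses such a header into a dictionary. The function name
    parse_sphere_header and the sample header values are assumptions made
    for the example, not copied from the corpus, and the sketch does not
    undo the "shorten" compression of the sample data.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header of a NIST SPHERE file into a dict."""
    lines = data.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {}
    # lines[1] holds the header size in bytes; the fields start after it.
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) < 3 or line.startswith(";"):
            continue  # skip blank, comment, or malformed lines
        name, type_code, value = parts
        if type_code == "-i":        # integer field
            fields[name] = int(value)
        elif type_code == "-r":      # real-valued field
            fields[name] = float(value)
        else:                        # "-sN": string field of length N
            fields[name] = value
    return fields


# Hypothetical header for a two-channel mu-law file (values assumed).
SAMPLE_HEADER = (
    b"NIST_1A\n"
    b"   1024\n"
    b"channel_count -i 2\n"
    b"sample_coding -s27 ulaw,embedded-shorten-v2.00\n"
    b"end_head\n"
)

header = parse_sphere_header(SAMPLE_HEADER)
```

    In practice one would read the first bytes of a file, take the header
    size from the second line, and pass only that many bytes to the parser.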

    Institutions that have membership in the LDC for the Membership Year
    (MY) 2004 will be able to receive this corpus free of charge. Nonmembers
    may license this corpus for US$7000.

    *

    (3) Fisher English Training Speech Part 1 Transcripts
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T19>
    represents the first half of a collection of conversational telephone
    speech (CTS) that was created at the LDC. It contains transcript data
    for 5850 complete conversations, each lasting up to 10 minutes. In
    addition to the transcriptions, there is a complete set of tables
    describing the speakers, the properties of the telephone calls, and the
    set of topics that were used to initiate the conversations. Fisher
    English Training Speech Part 1 Transcripts is distributed on one CD-ROM.

    Institutions that have membership in the LDC for the Membership Year
    (MY) 2004 will be able to receive this corpus free of charge. Nonmembers
    may license this corpus for US$1000.

    ------------------------------------------------------------------------

    For further information on LDC data, please visit our online catalog
    <http://www.ldc.upenn.edu/Catalog/>. If you have any questions about
    licensing data, or if you are interested in membership in the LDC,
    please call +1 215 573 1275 or email ldc@ldc.upenn.edu.

    --------------------------------------------------------------------
    Linguistic Data Consortium        Phone: (215) 573-1275
    3600 Market Street                Fax: (215) 573-2175
    Suite 810                         ldc@ldc.upenn.edu
    Philadelphia, PA 19104            www.ldc.upenn.edu


