[Corpora-List] New LDC Corpus

From: LDC Office (ldc@ldc.upenn.edu)
Date: Thu Jan 30 2003 - 22:58:24 MET

                       * English Gigaword *

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of the English Gigaword corpus.

    English Gigaword is a comprehensive archive of newswire text data
    in English that has been acquired over several years by the LDC. The
    newswire texts are drawn from four international sources:

    Agence France Presse English Service
    Associated Press Worldstream English Service
    The New York Times Newswire Service
    The Xinhua News Agency English Service

    English Gigaword is the first LDC publication to be distributed on
    DVD. Much of the content in this collection has been published
    previously by the LDC in a variety of older corpora, in particular
    the North American News text corpora (LDC95T21, LDC98T30), the
    various TDT corpora, and the AQUAINT text corpus (LDC2002T31). In
    addition to this previously published data, the English Gigaword
    corpus contains a significant amount of previously unreleased data:
    all of the Agence France Presse content, the 1995 and 2001 Xinhua
    content, and portions of NYT and APW dating from February 2001
    forward.

    All text data are presented in SGML form, using a very simple, minimal
    markup structure; all text consists of printable ASCII and whitespace.
    The text formatting is consistent across all sources. The English
    Gigaword corpus has been fully validated by a standard SGML parser
    utility (nsgmls), using a DTD file which is provided as part of this
    publication.
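
    As an illustration of the character-set constraint described above
    (printable ASCII and whitespace only), here is a minimal Python
    sketch of a check a user might run on a text file before processing
    it. It is not an LDC tool and is no substitute for nsgmls validation
    against the supplied DTD. The function name and the sample tag names
    are hypothetical and do not come from the corpus documentation.

```python
import string

# Characters allowed under the stated constraint: printable ASCII plus
# whitespace. string.printable covers ASCII letters, digits, punctuation,
# and the whitespace characters (space, tab, LF, CR, VT, FF).
ALLOWED = frozenset(string.printable)


def find_disallowed(text):
    """Return (line_number, column, character) for each character outside ALLOWED.

    Lines are split on "\n" only, so that other control characters stay
    inside their line and are reported rather than treated as line breaks.
    """
    problems = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch not in ALLOWED:
                problems.append((line_no, col, ch))
    return problems


if __name__ == "__main__":
    # Hypothetical snippet; the tag names are placeholders, not the real DTD.
    sample = "<DOC>\n<TEXT>\nCaf\u00e9 prices rose.\n</TEXT>\n</DOC>\n"
    for line_no, col, ch in find_disallowed(sample):
        print(f"line {line_no}, column {col}: {ch!r} (U+{ord(ch):04X})")
```

    An empty result means every character in the input falls within the
    printable ASCII and whitespace set.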

    For further information, including a link to online documentation,
    please visit:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05

    Institutions that have membership in the LDC during the 2003
    Membership Year will be able to receive this corpus free of charge.
    Nonmembers may license this publication for $2,500.

                               *
                          
    If you need additional information before placing your order, or
    would like to inquire about membership in the LDC, please send email to
    <ldc@ldc.upenn.edu> or call (215) 573-1275.

    ---------------------------------------------------------------------
    Linguistic Data Consortium      Phone: (215) 573-1275
    3600 Market Street              Fax:   (215) 573-2175
    Suite 810                       email: ldc@ldc.upenn.edu
    Philadelphia, PA 19104-2653     www:   http://www.ldc.upenn.edu


