Corpora: Summary on available syntactically parsed corpora

From: Rene.Valdes@lhsl.com
Date: Thu Aug 16 2001 - 18:40:11 MET DST


    Dear list members,

    As requested by some of the respondents, I'd like to summarize the
    responses I got to my inquiry on available syntactically parsed (treebank)
    corpora for English, French, German, and other languages. As reflected
    below, there are a few good options for English and German, as well as
    Chinese. However, I did not receive any replies about French and could
    not locate any such corpus myself. Since we are about to embark on a project that
    would benefit from the availability of such a corpus, I'd really appreciate
    any information about French treebanks of any size and style. And now on
    to the summary:

    1. ICE-GB corpus (British English)

    The ICE-GB corpus is a one-million-word corpus of British English, fully
    parsed for clause and phrase structure. For more info see:
    http://www.ucl.ac.uk/english-usage/ice-gb/index.htm

    Reply from: Dr Gerald Nelson,
              Research Assistant Professor,
              Department of English,
              The University of Hong Kong,
              Pokfulam Road,
              Hong Kong SAR.

              Email: ganelson@hkucc.hku.hk
              Phone: (852) 2241-5141
              Fax: (852) 2559-7139
              http://www.hku.hk/english/staff/ganelson.htm

    2. TIGER project (German)

    In the TIGER project we are creating a large syntactically annotated
    corpus of German newspaper text. A corpus sampler will be released this
    month:
    http://www.ims.uni-stuttgart.de/projekte/TIGER/

    My task is to develop a search tool for syntactically annotated corpora
    - a first beta version will be released in October, the final version in
    November.

    Reply from: Wolfgang Lezius
              IMS, University of Stuttgart
              Azenbergstr. 12
              D-70174 Stuttgart
              Germany

              Email: lezius@ims.uni-stuttgart.de
              Tel.: +49 +711 121-1374
              Fax: +49 +711 121-1366

    3. NEGRA corpus (German)

    The German "NEGRA Corpus" consists of parsed newspaper texts.
    See http://www.coli.uni-sb.de/sfb378/negra-corpus/

    Reply from: Thorsten Brants
              brants@parc.xerox.com

    4. Verbmobil treebanks (German, English, Japanese)

    We could help you with treebanks for English and German (and to some
    degree for Japanese). They were developed in Tuebingen in the framework
    of Verbmobil, a speech-to-speech translation project. For this reason,
    the treebanks contain spontaneous speech data from three domains: business
    appointment scheduling, travel scheduling, and hotel reservations.

    The English treebank contains ca. 30,000 sentences, the German treebank
    ca. 38,000 sentences. The Japanese treebank is somewhat smaller; it
    contains ca. 18,000 sentences. The annotations for all treebanks cover
    the levels of morpho-syntax, syntactic phrase structure, and
    function-argument structure. The annotation schemes are purely
    context-free, i.e. they do not contain crossing branches or traces.
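
    As an illustration of the "no crossing branches" property, here is a
    minimal sketch, not the Verbmobil tooling. It assumes a hypothetical
    Node class in which each leaf stores the position of its token, and it
    treats a tree as free of crossing branches when every constituent covers
    a contiguous run of tokens. It does not check for traces.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """Hypothetical constituent node; leaves carry a 0-based token position."""
    label: str
    children: List["Node"] = field(default_factory=list)
    position: int = -1  # token index for leaves, unused for internal nodes

    def yield_positions(self) -> Set[int]:
        """Return the set of token positions dominated by this node."""
        if not self.children:
            return {self.position}
        positions: Set[int] = set()
        for child in self.children:
            positions |= child.yield_positions()
        return positions


def is_contiguous(positions: Set[int]) -> bool:
    """True if the positions form an unbroken range of integers."""
    return max(positions) - min(positions) + 1 == len(positions)


def has_crossing_branches(root: Node) -> bool:
    """True if any constituent covers a non-contiguous span of tokens."""
    if not is_contiguous(root.yield_positions()):
        return True
    return any(has_crossing_branches(child) for child in root.children)


# A tree whose constituents are all contiguous (context-free style).
continuous = Node("S", [
    Node("NP", [Node("DT", position=0), Node("NN", position=1)]),
    Node("VP", [Node("VB", position=2)]),
])

# A tree in which the VP covers tokens 0 and 2 but not token 1.
discontinuous = Node("S", [
    Node("VP", [Node("VB", position=0), Node("PTK", position=2)]),
    Node("NP", [Node("NN", position=1)]),
])

print(has_crossing_branches(continuous))
print(has_crossing_branches(discontinuous))
```

    The labels (S, NP, VP, PTK, and so on) are placeholders chosen for the
    example and are not taken from the Verbmobil stylebooks.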

    Additionally, for each treebank, there exists an extensive stylebook,
    which describes how different phenomena are annotated.

    As the treebanks are only becoming available now (due to project
    restrictions), I am not sure what the license conditions for commercial
    use will be.

    Reply from: Sandra Kuebler
              University of Tuebingen
              Computational Linguistics
              Wilhelmstr. 113
              D-72074 Tuebingen
              Germany
              phone: +49-7071-2978490
              fax: +49-7071-551335
              email: kuebler@sfs.nphil.uni-tuebingen.de
              URL: http://www.sfs.nphil.uni-tuebingen.de/~kuebler/

    5. BLLIP99 corpora

    Are you aware of the BLLIP99 corpora distributed by LDC? They contain 30
    million words of WSJ text, machine-parsed and coreferenced.

    Reply from: Eugene Charniak
              ec@bohr.cs.brown.edu

    6. Various links to check

    In case no one answers, you may want to check the list archives at:
    http://www.hit.uib.no/corpora/

    Also, the largest collection of corpora I know of is from the Linguistic
    Data Consortium:
    http://www.ldc.upenn.edu/

    Chris Manning also has an extensive list of links to corpus resources:
    http://www-nlp.stanford.edu/links/statnlp.html#Corpora

    Reply from: Daniel Walker
              Mendez, Inc.
              dwalker@lhsl.com

    7. Chinese Penn Treebank

    This one is also available from LDC and contains about 100K words (4185
    sentences from 325 articles from Xinhua newswire between 1994 and 1998).
    It was parsed following the general methodology of the Penn Treebank. It
    costs $100.
    See http://www.ldc.upenn.edu/Catalog/LDC2000T48.html

    (I obtained this information by looking through the LDC catalog.)

    Again, any information on syntactically parsed French corpora would be
    greatly appreciated.

    René J. Valdés
    Mendez, Inc.
    San Diego, California
    USA
    http://www.mendez.com
    rvaldes@lhsl.com
    1-858-737-5216


