Re: Corpora: German corpus

From: LDC Office (ldc@unagi.cis.upenn.edu)
Date: Wed Mar 29 2000 - 21:24:26 MET DST


    Dear Christina,

    The Linguistic Data Consortium (LDC) offers a variety of German
    corpora. We have telephone speech, transcripts, lexicons, and
    newswire text.

    We have two telephone speech collections, CallFriend and CallHome.
    The CallFriend collection consists of 60 unscripted telephone
    conversations lasting between 5 and 30 minutes. The CallHome
    collection consists of 100 telephone conversations lasting up to 30
    minutes each. Transcripts of the CallHome calls are available, as
    is a lexicon.

    The CallHome German lexicon consists of 318,807 words and contains
    tab-separated information fields with orthographic, morphological,
    phonological, stress, source, and frequency information for each
    word. 315,503 words from the CallHome German lexicon are adapted
    from the CELEX German lexicon produced by The Centre for Lexical
    Information, which is also distributed through LDC. CELEX contains
    information on orthography, phonology, morphology, syntax, and word
    frequency.
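
    As a rough illustration of how a tab-separated lexicon of this
    kind could be read, here is a minimal Python sketch. The column
    order simply follows the order in which the fields are listed
    above, which is an assumption; the names LexEntry and
    read_lexicon and the sample row are made up for this example and
    are not taken from the actual CallHome German lexicon files.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class LexEntry(NamedTuple):
    # Assumed column order; check the lexicon's own documentation.
    orthography: str
    morphology: str
    phonology: str
    stress: str
    source: str
    frequency: str


def read_lexicon(handle: TextIO) -> Iterator[LexEntry]:
    """Yield one LexEntry per non-empty tab-separated line."""
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != len(LexEntry._fields):
            raise ValueError(
                f"line {line_no}: expected {len(LexEntry._fields)} "
                f"fields, got {len(fields)}"
            )
        yield LexEntry(*fields)


# Invented sample row, for illustration only.
sample = "Haus\tHaus\thaUs\t1\tCELEX\t120\n"
entries = list(read_lexicon(io.StringIO(sample)))
```

    All fields are kept as strings here, since the exact encoding of
    the stress and frequency columns is not described above.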

    We have two corpora which contain German newstext. ECI
    Multilingual Text consists of roughly 92 million words from 27
    languages. It contains roughly 36 million words in German from
    various news sources. The European Language Newspaper Text
    collection includes roughly 100 million words of French, 90 million
    words of German and 15 million words of Portuguese. Our newstext
    collections are marked up using SGML to identify article boundaries.
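
    To give an idea of how SGML article boundaries can be used to
    split a text file into articles, here is a minimal Python sketch.
    The element name DOC, the function split_articles, and the sample
    text are assumptions made for illustration; the actual tag names
    in each collection should be taken from its documentation.

```python
import re
from typing import List


def split_articles(text: str, tag: str = "DOC") -> List[str]:
    """Return the body of each <tag>...</tag> element found in text.

    The default tag name "DOC" is an assumption, not the documented
    markup of any particular corpus.
    """
    name = re.escape(tag)
    pattern = re.compile(
        rf"<{name}\b[^>]*>(.*?)</{name}>",
        re.DOTALL | re.IGNORECASE,
    )
    return [match.group(1).strip() for match in pattern.finditer(text)]


# Invented sample text, for illustration only.
sample = (
    '<DOC id="a1">\nErster Artikel.\n</DOC>\n'
    '<DOC id="a2">\nZweiter Artikel.\n</DOC>\n'
)
articles = split_articles(sample)
```

    A regular expression is enough for flat, non-nested article
    elements like these; nested or irregular markup would call for a
    proper SGML parser.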

    For more information on these corpora please visit our Catalog
    search page at

    http://morph.ldc.upenn.edu/Catalog/search.html

    and select the language and/or corpus type in which you are
    interested. Please feel free to contact me with any questions.

    Best regards,

    Shannon Sears
    Manager, Intellectual Property Rights and Membership
    ----------------------------------------------------------------------
    Linguistic Data Consortium       Phone: (215) 898-0464
    3615 Market Street               Fax:   (215) 573-2175
    Suite 200                        email: ssears@ldc.upenn.edu
    Philadelphia, PA 19104-2608      www:   http://www.ldc.upenn.edu

    Christina Rosén wrote:

    > Hello,
    >
    > I am doing research on second language acquisition. Could someone tell me
    > if there is an adequate German corpus available somewhere? Most corpora seem
    > to be English!
    > I would be very grateful for help. Thanks!
    >
    > Best regards
    > Christina Rosén
    > Växjö university
    >
    > ----------------------------------
    > Christina Rosén
    > Inst. för humaniora
    > Växjö universitet
    > 351 95 Växjö
    >
    > Phone +46 470 70 88 55
    > Fax +46 470 75 18 88
    > Phone/Fax +46 470 124 27 (home)


