RE: [Corpora-List] Newspaper Corpora

From: Tony Rose (tr@acl.icnet.uk)
Date: Mon Apr 14 2003 - 17:09:40 MET DST

    You could also try the Reuters Corpus:

    http://about.reuters.com/researchandstandards/corpus/

    It's an archive of some 800,000 English-language news stories. It is freely
    available and marked up in XML (NewsML, in fact).

    Regards,
    Tony
      -----Original Message-----
      From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On Behalf Of Jan Strunk
      Sent: 14 April 2003 15:16
      To: CORPORA@HIT.UIB.NO
      Subject: [Corpora-List] Newspaper Corpora

      Hello,

      I would like to evaluate a sentence boundary
      and abbreviation detection algorithm on as
      many different languages as possible.
      Therefore, I am searching for newspaper corpora
      that are either freely available or not too expensive.

      The languages in question should use the period
      as an ambiguous token denoting either a sentence
      boundary, an abbreviation or both.

      I am already using parts of the Wall Street Journal Corpus,
      the Neue Zürcher Zeitung and some corpora
      included in the Multilingual Corpus I from the European Corpus Initiative.
      I also know about TRACTOR.

      I would be very thankful for any suggestions.

      Best regards,

      Jan Strunk
      strunk@linguistics.ruhr-uni-bochum.de
      Sprachwissenschaftliches Institut
      Ruhr-Universität Bochum
      Germany


