Re: [Corpora-List] Looking for super large Russian corpus

From: resnik@umiacs.umd.edu
Date: Thu Oct 28 2004 - 16:22:17 MET DST

Next message: Adam Kilgarriff: "[Corpora-List] Regular expression exercises"

Previous message: Ute Römer: "[Corpora-List] COBUILD back online..."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Well, since you asked for a huge corpus... In case it might be
useful, we have created a very large file (122 MB compressed, 1.8 GB
uncompressed) containing over 25 million URLs, collected from the
Internet Archive (www.archive.org), for pages that were identified as
Russian by automatic language ID. Some percentage of the URLs will be
stale, of course, and language ID is not perfect, but a large
percentage of the pages should still be out there and the language
identification is pretty accurate. You can download any subset of the
URLs you want, convert to plain text, apply your own stricter language
ID if you'd like, and, voila, a huge collection of Russian text.
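
The steps above (pick a subset of URLs, convert pages to plain text, apply a
stricter language check) could be sketched roughly as follows. This is only an
illustration, not the STRAND tooling: it assumes the list has one URL per line,
the helper names (sample_urls, html_to_text, cyrillic_ratio, fetch_text) are
made up for this example, the encoding fallback is a guess, and the
Cyrillic-letter ratio is a crude stand-in for real language ID (it cannot tell
Russian from other languages written in Cyrillic).

```python
import random
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse whitespace; a rough HTML-to-text conversion."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def cyrillic_ratio(text):
    """Fraction of alphabetic characters in the basic Cyrillic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for c in letters if "\u0400" <= c <= "\u04ff")
    return cyrillic / len(letters)


def sample_urls(lines, k, seed=0):
    """Reservoir-sample k non-empty lines, so the 1.8 GB list is never held in memory.

    Assumption: one URL per line.
    """
    rng = random.Random(seed)
    sample = []
    urls = (line.strip() for line in lines)
    for i, url in enumerate(u for u in urls if u):
        if len(sample) < k:
            sample.append(url)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = url
    return sample


def fetch_text(url, timeout=10):
    """Download one page and return its plain text.

    Assumption: try UTF-8 first, then fall back to windows-1251.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
    try:
        html = raw.decode("utf-8")
    except UnicodeDecodeError:
        html = raw.decode("cp1251", errors="replace")
    return html_to_text(html)
```

One might then open the uncompressed list, call sample_urls(f, 1000), run
fetch_text on each URL inside a try/except (many URLs will be stale), and keep
only pages whose cyrillic_ratio exceeds some threshold such as 0.8. That
threshold is arbitrary and would need tuning.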

The URL list is available from the STRAND download page,
http://umiacs.umd.edu/~resnik/strand/ under "Monolingual Russian".

Philip

  ----------------------------------------------------------------
  Philip Resnik, Associate Professor
  Department of Linguistics and Institute for Advanced Computer Studies

  1401 Marie Mount Hall            UMIACS phone: (301) 405-6760
  University of Maryland           Linguistics phone: (301) 405-8903
  College Park, MD 20742 USA       Fax: (301) 314-2644 / (301) 405-7104
  http://umiacs.umd.edu/~resnik    E-mail: resnik@umiacs.umd.edu


This archive was generated by hypermail 2b29 : Thu Oct 28 2004 - 16:40:11 MET DST