Well, since you asked for a huge corpus... In case it might be
useful, we have created a very large file (122 MB compressed, 1.8 GB
uncompressed) containing over 25 million URLs, collected from the
Internet Archive (www.archive.org), for pages that were identified as
Russian by automatic language ID. Some percentage of the URLs will be
stale, of course, and language ID is not perfect, but a large
percentage of the pages should still be out there and the language
identification is pretty accurate. You can download any subset of the
URLs you want, convert to plain text, apply your own stricter language
ID if you'd like, and, voilà, a huge collection of Russian text.
The URL list is available from the STRAND download page,
http://umiacs.umd.edu/~resnik/strand/ under "Monolingual Russian".
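
The workflow described above (pick a subset of URLs, convert pages to plain text, apply a stricter check) might be sketched roughly as follows. This is only an illustration under stated assumptions: the file is assumed to be gzip-compressed with one URL per line, and the path passed to `open_url_list` is whatever you saved the download as. The Cyrillic-ratio test is a crude script check standing in for real language ID; it cannot tell Russian from other languages written in Cyrillic.

```python
import gzip
import random
from html.parser import HTMLParser


def open_url_list(path):
    """Open the URL list as text (assumption: gzip, one URL per line)."""
    return gzip.open(path, "rt", encoding="utf-8", errors="replace")


def sample_urls(lines, k, seed=0):
    """Reservoir-sample up to k non-empty lines in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for line in lines:
        url = line.strip()
        if not url:
            continue
        seen += 1
        if len(reservoir) < k:
            reservoir.append(url)
        else:
            # Keep the new URL with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = url
    return reservoir


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip markup and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def cyrillic_ratio(text):
    """Fraction of alphabetic characters in the basic Cyrillic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    cyr = sum(1 for c in letters if "\u0400" <= c <= "\u04ff")
    return cyr / len(letters)


def looks_russian(text, threshold=0.8):
    """Crude filter; the threshold is an arbitrary illustrative value."""
    return cyrillic_ratio(text) >= threshold
```

Reservoir sampling is used here so that a random subset can be drawn without holding all 25 million URLs in memory. Fetching each sampled page (for example with `urllib.request.urlopen`) and handling stale URLs and character encodings is left out of the sketch.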
Philip
----------------------------------------------------------------
Philip Resnik, Associate Professor
Department of Linguistics and Institute for Advanced Computer Studies
1401 Marie Mount Hall            UMIACS phone: (301) 405-6760
University of Maryland           Linguistics phone: (301) 405-8903
College Park, MD 20742 USA       Fax: (301) 314-2644 / (301) 405-7104
http://umiacs.umd.edu/~resnik    E-mail: resnik@umiacs.umd.edu