Re: [Corpora-List] Looking for super large Russian corpus

From: resnik@umiacs.umd.edu
Date: Thu Oct 28 2004 - 16:22:17 MET DST


    Well, since you asked for a huge corpus... In case it might be
    useful, we have created a very large file (122M compressed, 1.8G
    uncompressed) containing over 25 million URLs, collected from the
    Internet Archive (www.archive.org), for pages that were identified as
    Russian by automatic language ID. Some percentage of the URLs will be
    stale, of course, and language ID is not perfect, but a large
    percentage of the pages should still be out there and the language
    identification is pretty accurate. You can download any subset of the
    URLs you want, convert to plain text, apply your own stricter language
    ID if you'd like, and, voila, a huge collection of Russian text.
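
    The download / convert / re-check steps above can be sketched
    roughly as follows. This is only an illustration, not part of
    STRAND: it assumes the list has one URL per line (the actual file
    layout is not described here), and every function name below is
    made up for the example. The "stricter language ID" is stood in for
    by a crude Cyrillic-letter ratio, which cannot tell Russian apart
    from other languages written in Cyrillic; substitute a real
    language identifier if that matters.

```python
import random
import urllib.request
from html.parser import HTMLParser


def sample_urls(lines, k, seed=0):
    """Reservoir-sample up to k URLs from an iterable of lines.

    Assumes one URL per line (an assumption about the list format).
    Works in a single pass, so the 1.8G file never has to fit in memory.
    """
    rng = random.Random(seed)
    sample, seen = [], 0
    for line in lines:
        url = line.strip()
        if not url:
            continue
        seen += 1
        if len(sample) < k:
            sample.append(url)
        else:
            j = rng.randrange(seen)
            if j < k:
                sample[j] = url
    return sample


def fetch(url, timeout=10):
    """Download one page; stale URLs will raise and should be skipped."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Older Russian pages are often KOI8-R or windows-1251; trust the
        # declared charset when there is one, else fall back to UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Convert HTML to whitespace-normalized plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def cyrillic_ratio(text):
    """Fraction of alphabetic characters in the basic Cyrillic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0400" <= c <= "\u04ff" for c in letters) / len(letters)


def looks_russian(text, threshold=0.8):
    """Crude stand-in for a stricter language-ID step (threshold is arbitrary)."""
    return cyrillic_ratio(text) >= threshold
```

    A typical run would open the uncompressed list, call
    sample_urls(f, k) for whatever subset size you want, then keep
    html_to_text(fetch(url)) for each URL where the fetch succeeds and
    looks_russian() returns True.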

    The URL list is available from the STRAND download page,
    http://umiacs.umd.edu/~resnik/strand/ under "Monolingual Russian".

      Philip

      ----------------------------------------------------------------
      Philip Resnik, Associate Professor
      Department of Linguistics and Institute for Advanced Computer Studies

      1401 Marie Mount Hall            UMIACS phone: (301) 405-6760
      University of Maryland           Linguistics phone: (301) 405-8903
      College Park, MD 20742 USA       Fax: (301) 314-2644 / (301) 405-7104
      http://umiacs.umd.edu/~resnik    E-mail: resnik@umiacs.umd.edu


