Re: [Corpora-List] language-specific harvesting of texts from the Web

From: Kevin Patrick Scannell (scannell@slu.edu)
Date: Tue Aug 31 2004 - 19:16:47 MET DST

    On Tuesday 31 August 2004 11:11 am, Mike Maxwell wrote:
    > Mark P. Line wrote:
    > > I've been playing with Google searches for extracting texts in a
    > > particular language from the Web without a lot of noise (i.e. few
    > > texts that aren't in the desired language). Any comments on the
    > > utility of this approach for more serious corpus research?
    >
    > I've been using basically this approach to find websites for a number of
    > languages (Bengali, Tamil, Panjabi, Tagalog, Tigrinya and Uzbek).

    I have (yet another) tool taking essentially the same approach:

    http://borel.slu.edu/crubadan/

    It is based on the Google API, wget, etc. I mentioned it
    on this list sometime in the spring.

    I am planning on releasing the source code as soon as I get a chance
    to tidy things up a bit. The main feature of the program is that
    it can bootstrap the language model from a pretty minimal amount
    of seed text. The queries are generated automatically by finding
    candidate stopwords from the top of the frequency list (and filtering
    out words near the top of other languages' frequency lists)
    and then randomly adding in words from the rest of the corpus
    "OR"'d together. The crawler is running and collecting text for
    more than 150 languages at the moment:

    http://borel.slu.edu/crubadan/stadas.html
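
    For anyone curious, the query-generation step described above could
    be sketched roughly as follows. This is only an illustration in
    Python, not the crawler's actual code: the function names, the
    parameters (n_stopwords, n_random, top_k), the whitespace
    tokenization, and the exact query syntax are all assumptions.

```python
import random
from collections import Counter


def frequency_list(text):
    """Return the distinct words of `text`, most frequent first."""
    # Naive whitespace tokenization (an assumption for this sketch).
    counts = Counter(text.lower().split())
    return [word for word, _ in counts.most_common()]


def build_query(seed_text, other_texts, n_stopwords=2, n_random=4,
                top_k=20, rng=None):
    """Build one search query string from seed text (hypothetical helper)."""
    rng = rng or random.Random()
    ranked = frequency_list(seed_text)

    # Words near the top of other languages' frequency lists do not
    # discriminate well, so collect them for filtering.
    shared = set()
    for text in other_texts:
        shared.update(frequency_list(text)[:top_k])

    # Candidate stopwords: top of our own list, minus the shared words.
    stopwords = [w for w in ranked[:top_k] if w not in shared][:n_stopwords]

    # Random words from the rest of the corpus, OR'd together.
    rest = ranked[top_k:]
    extras = rng.sample(rest, min(n_random, len(rest)))

    query = " ".join(stopwords)
    if extras:
        query += " (" + " OR ".join(extras) + ")"
    return query.strip()
```

    Calling build_query(seed_text, [english_text, french_text]) would
    then give a string with a couple of required stopwords followed by
    a parenthesized group of OR'd words, different on each call unless
    a seeded random.Random is passed in as rng.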

    I have a small army of open source volunteers who are native speakers
    of one or more of the languages helping create spell checking word
    lists and helping to deal with some of the issues that Mike
    Maxwell raised in his message (odd character encodings,
    separating dialects/orthographies, etc.). Mike covered most
    of the important difficulties that arise so I don't have much
    to add other than the offer to answer any questions about the implementation,
    or in fact to run the crawler on behalf of anyone willing
    to send me some seed text in their target language.

    Kevin


