Re: [Corpora-List] language-specific harvesting of texts from the Web

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Tue Aug 31 2004 - 18:11:28 MET DST


    Mark P. Line wrote:

    > I've been playing with Google searches for extracting texts in a
    > particular language from the Web without a lot of noise (i.e. few
    > texts that aren't in the desired language). Any comments on the
    > utility of this approach for more serious corpus research?

    I've been using basically this approach to find websites for a number of
    languages (Bengali, Tamil, Panjabi, Tagalog, Tigrinya and Uzbek).
    Earlier we used this, or something quite similar, for Hindi and Cebuano,
    and I've experimented with it for Tzeltal and Shuar. It is easy to
    extend to other languages; basically, you just look in a dictionary or
    grammar for a few function words. Once you find a website, tools like
    wget will let you build a corpus; then you can test whether a given
    file from that site is in the language by various other means. (If the
    language has a specific Unicode range, testing is trivial.)
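
    For the Unicode-range case, the test really is a few lines of code.
    Something like the following would do (a rough sketch in Python; the
    0.5 cutoff and the choice of the Bengali block are just for
    illustration):

        # Rough sketch of the Unicode-range test: what fraction of a file's
        # letters fall in the target block? The 0.5 cutoff is arbitrary.
        BENGALI = (0x0980, 0x09FF)   # the Bengali block

        def fraction_in_block(text, block=BENGALI):
            letters = [c for c in text if c.isalpha()]
            if not letters:
                return 0.0
            lo, hi = block
            return sum(1 for c in letters if lo <= ord(c) <= hi) / len(letters)

        def looks_like_target_language(path, cutoff=0.5, encoding="utf-8"):
            with open(path, encoding=encoding, errors="replace") as f:
                return fraction_in_block(f.read()) >= cutoff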

    You may get some interference with closely related languages. Your
    Tagalog search, for example, might be bringing up pages in other
    Philippine languages. (I don't know that it is, since I don't know
    Tagalog--requiring that 'ang' and 'may' be adjacent probably prevents
    this. If you had left off the leading and trailing quotes, I suspect
    your precision would have been lower.)

    You can of course do these sorts of searches with the Google API, which
    allows you to semi-automate the downloads. I've done that to find all
    the pages at a given site that are in some language, where techniques
    like 'wget' didn't work.
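
    The queries themselves are easy to generate; the only part that depends
    on the API is actually submitting them, so that step is left as a stub
    in this sketch (the quoted phrase is the Tagalog one under discussion;
    example.ph is a made-up site name):

        # Sketch: combine a site: restriction with quoted function-word
        # phrases. The search argument stands in for whatever search API is
        # available; it should take a query string and return URLs.
        PHRASES = ['"ang may"']   # add more phrases from a dictionary/grammar

        def build_queries(site, phrases=PHRASES):
            return ['site:%s %s' % (site, p) for p in phrases]

        def harvest(site, search):
            urls = set()
            for q in build_queries(site):
                urls.update(search(q))
            return sorted(urls)

        for q in build_queries("example.ph"):   # hypothetical site
            print(q)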

    More sophisticated methods, i.e. tools like CorpusBuilder, are needed
    when you want to build an exhaustive corpus of some language, and you
    have the time to build a language filter.

    One situation where your approach may not work so well is when a
    language's websites use multiple character encodings. Unfortunately,
    this is quite common in languages that have non-Roman writing systems,
    such as the Indic languages, or Tigrinya (and I imagine Amharic,
    although I haven't tried it there). For Hindi, which is the worst case
    we've seen yet, virtually every newspaper site had its own proprietary
    (=undocumented) encoding, and one site (the Indian parliament) claimed
    to use five different proprietary encodings. (I'm not sure they really
    did, but they did suggest downloading five different fonts.) The
    multiple character encoding problem doesn't reduce your precision, which
    is what you say you're really interested in, but it will definitely
    reduce your recall. When last I looked, the only Hindi news sites using
    Unicode were the Voice of America and the BBC. There were a number of
    other Hindi websites using Unicode, but they tended to be in countries
    other than India; two that come to mind were a museum in Australia, and
    Colgate. I think there's next to nothing in Tigrinya in Unicode,
    whereas there is a fair amount (I won't say a lot) in other encodings.

    Variant spelling systems can also cause problems. You won't run into
    this for major languages, but you may for recently written languages
    (Mayan and Quechuan languages) or languages of the former Soviet Union
    (Chechen is a case in point). I thought it might be the case with
    Nahuatl, but apparently the c/qu vs. k issue isn't as "hot" for the
    Nahuatl languages as it is for some other languages of Latin America.

    The same method can of course be used for non-Unicode non-Roman
    websites; you just have to find some such websites to start with, so you
    know how to spell the words in whatever encoding they're using.
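
    In practice that just means grabbing one or two seed pages and pulling
    out their most frequent short tokens, which will mostly be function
    words in whatever the encoding happens to be. A rough sketch (the tag
    stripping is crude, and the length limits and counts are arbitrary; the
    seed file name is hypothetical):

        # Extract candidate function words from a seed page in an unknown or
        # legacy encoding, to reuse as search terms. With no codec for the
        # encoding, the text is just treated as opaque latin-1 strings.
        import re
        from collections import Counter

        def candidate_function_words(markup, n=10, min_len=2, max_len=5):
            text = re.sub(r"<[^>]+>", " ", markup)       # crude tag stripping
            tokens = re.findall(r"\S+", text)
            short = [t for t in tokens if min_len <= len(t) <= max_len]
            return [w for w, _ in Counter(short).most_common(n)]

        with open("seed_page.html", encoding="latin-1") as f:   # hypothetical file
            print(candidate_function_words(f.read()))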

    I recently ran into some bizarre pseudo-Unicode websites in Bengali.
    They use HTML character entities for Unicode codepoints, but not all the
    codepoints are actually in the Bengali section of Unicode--they appear
    to be using other "Unicode" (scare quotes intentional) codepoints for
    contextual variants of characters. BTW, Google treats HTML character
    entities as if they were ordinary Unicode codepoints, which simplifies
    search.
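
    That also makes the pseudo-Unicode pages easy to spot: decode the
    numeric character references and see which codepoints land outside the
    Bengali block. A quick sketch (the last entity in the sample string is
    made up, just to trigger the check):

        # Decode HTML numeric character references and list non-ASCII
        # codepoints that fall outside the Bengali block (U+0980-U+09FF).
        import html

        BENGALI = (0x0980, 0x09FF)

        def odd_codepoints(markup):
            text = html.unescape(markup)   # &#2453; etc. become real characters
            return sorted({c for c in text
                           if ord(c) > 0x7F
                           and not (BENGALI[0] <= ord(c) <= BENGALI[1])})

        sample = "&#2453;&#2494;&#2472; &#64000;"
        for c in odd_codepoints(sample):
            print("U+%04X" % ord(c), c)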

    I gave a talk at the ALLC/ACH meeting in June on our search technique,
    including its pros and cons. The abstract was published, but not the
    full paper. I suppose I should post it somewhere...

    -- 
         Mike Maxwell
         Linguistic Data Consortium
         maxwell@ldc.upenn.edu
    


