[Corpora-List] language-specific harvesting of texts from the Web

From: Mark P. Line (mark@polymathix.com)
Date: Mon Aug 30 2004 - 22:51:02 MET DST

    I've been playing with Google searches as a way to extract texts in a
    particular language from the Web without a lot of noise (i.e. with few
    texts that aren't in the desired language). Any comments on the utility
    of this approach for more serious corpus research? Any improvements to
    the best search criteria I've been able to come up with, listed below?
    Any good search criteria for languages not listed?
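
One rough way to put a number on "noise" for a set of harvested texts is to count how often the search words themselves occur in each text, and to treat texts where they are rare as probably not in the target language. The sketch below is only an illustration under assumptions not in the original message: the function names `marker_ratio` and `noise_rate`, the 5% threshold, and the two sample sentences are all hypothetical, and a real evaluation would need hand-checked data.

```python
import re

def marker_ratio(text, markers):
    """Fraction of word tokens in `text` that are one of the marker words."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    marker_set = {m.lower() for m in markers}
    hits = sum(1 for t in tokens if t in marker_set)
    return hits / len(tokens)

def noise_rate(texts, markers, threshold=0.05):
    """Share of texts whose marker ratio is below `threshold`.

    The threshold is an arbitrary assumption; such texts are merely
    guessed to be noise, not proven to be in another language.
    """
    if not texts:
        return 0.0
    noisy = sum(1 for t in texts if marker_ratio(t, markers) < threshold)
    return noisy / len(texts)

if __name__ == "__main__":
    # Illustrative samples: one Tok Pisin sentence, one English sentence.
    samples = [
        "Mi laik go long haus bilong mi",
        "I would like to go to my house",
    ]
    print(noise_rate(samples, ["long", "bilong"]))
```

Note that a word like "long" also occurs in English, so a single shared marker can let noise through; combining several markers, as most of the searches below do, is what keeps the ratio meaningful.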

    (If there's any interest at all, I'd be happy to collect searches like
    these on a webpage somewhere.)

    Examples:

    Basque:
    http://www.google.com/search?q=gandik+gana&ie=utf-8&oe=utf-8

    Bislama/Pijin:
    http://www.google.com/search?q=blong+stap&ie=utf-8&oe=utf-8

    Catalan:
    http://www.google.com/search?q=els+uns+unes&ie=utf-8&oe=utf-8

    Indonesian:
    http://www.google.com/search?q=tidak+yang+karena&ie=utf-8&oe=utf-8

    Letzebuergesch:
    http://www.google.com/search?q=fir+eng+dat&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8

    Malay:
    http://www.google.com/search?q=tidak+yang+kerana&ie=utf-8&oe=utf-8

    Malay/Indonesian:
    http://www.google.com/search?q=tidak+yang&ie=utf-8&oe=utf-8

    Mongolian:
    http://www.google.com/search?q=%D0%B1%D0%B0%D0%B9%D0%BD%D0%B0+&ie=utf-8&oe=utf-8

    Nahuatl:
    http://www.google.com/search?q=auh+inic&ie=utf-8&oe=utf-8

    North Frisian:
    http://www.google.com/search?q=%C3%BC%C3%BCb+m%C3%A4+uun&ie=utf-8&oe=utf-8

    Saami:
    http://www.google.com/search?q=atte+son+ja+dat&ie=utf-8&oe=utf-8

    Shona:
    http://www.google.com/search?q=kusvika&ie=utf-8&oe=utf-8

    Sorbian:
    http://www.google.com/search?q=%C5%A1to%C5%BE&ie=utf-8&oe=utf-8

    Swahili:
    http://www.google.com/search?q=ya+ni+katika&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8

    Tagalog:
    http://www.google.com/search?q=%22ang+mga%22&ie=utf-8&oe=utf-8

    Tok Pisin:
    http://www.google.com/search?q=long+bilong&ie=utf-8&oe=utf-8

    Welsh:
    http://www.google.com/search?q=cymraeg+mae&ie=utf-8&oe=utf-8
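
The URLs above all follow the same pattern: a few frequent words of the language joined into the `q` parameter, plus `ie=utf-8` and `oe=utf-8`. A minimal sketch of generating such URLs from word lists might look like the following. The helper name `build_search_url` and its `phrase` flag are hypothetical; the base URL and parameters are simply copied from the examples, and nothing here checks whether the search engine still honors them.

```python
from urllib.parse import urlencode

# Base URL copied from the examples in this message.
GOOGLE_SEARCH = "http://www.google.com/search"

def build_search_url(terms, phrase=False):
    """Build a search URL from a list of marker words.

    With phrase=True the words are wrapped in double quotes so that
    they are searched as an exact phrase (as in the Tagalog example).
    """
    query = " ".join(terms)
    if phrase:
        query = f'"{query}"'
    # urlencode percent-encodes non-ASCII text as UTF-8 and turns
    # spaces into '+', matching the form of the URLs listed above.
    params = {"q": query, "ie": "utf-8", "oe": "utf-8"}
    return f"{GOOGLE_SEARCH}?{urlencode(params)}"

if __name__ == "__main__":
    print(build_search_url(["gandik", "gana"]))         # Basque
    print(build_search_url(["ang", "mga"], phrase=True))  # Tagalog
    print(build_search_url(["üüb", "mä", "uun"]))        # North Frisian
```

Keeping the word lists in a table keyed by language name would make it easy to regenerate or extend the whole set of searches.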

    -- Mark

    Mark P. Line
    Polymathix
    San Antonio, TX


