Re: [Corpora-List] language-specific harvesting of texts from the Web

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Tue Aug 31 2004 - 00:33:01 MET DST

Next message: Stuart A Yeates: "Re: [Corpora-List] Does On-screen Reading Really Work?"

Previous message: Mark P. Line: "[Corpora-List] language-specific harvesting of texts from the Web"
In reply to: Mark P. Line: "[Corpora-List] language-specific harvesting of texts from the Web"
Next in thread: Mike Maxwell: "Re: [Corpora-List] language-specific harvesting of texts from the Web"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Maybe you could extract seeds to be used in new queries from the pages
you found, as suggested in:

R. Ghani, R. Jones, and D. Mladenic. 2001. Mining the web to create
minority language corpora. CIKM 2001, 279–286.
http://citeseer.ist.psu.edu/ghani01mining.html

We have a set of simple tools that partly automate this kind of
procedure (we use them mostly for terminology extraction, but they also
work reasonably well for building general-purpose monolingual corpora):

http://sslmit.unibo.it/~baroni/bootcat.html
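
For illustration only, the seed-recycling step could look roughly like the
Python sketch below. It is not the BootCaT code and not the algorithm of
Ghani et al.; it only shows the general idea of picking frequent words from
the pages already retrieved and recombining them into new queries. The
function names, the frequency-based seed choice, the exclusion list and the
tuple size are all assumptions made for the example, and the step that
actually sends queries to a search engine is left out on purpose.

```python
import itertools
import random
import re
from collections import Counter

# Runs of Unicode letters (no digits, no underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def extract_seeds(pages, exclude=frozenset(), n_seeds=10, min_len=3):
    """Return up to n_seeds of the most frequent words in the given page texts.

    `exclude` is a hypothetical list of words to skip (e.g. function words
    or words shared with other languages).
    """
    counts = Counter()
    for text in pages:
        for tok in TOKEN_RE.findall(text.lower()):
            if len(tok) >= min_len and tok not in exclude:
                counts[tok] += 1
    return [word for word, _ in counts.most_common(n_seeds)]


def build_queries(seeds, tuple_size=3, n_queries=5, rng=None):
    """Combine seeds into at most n_queries distinct space-separated queries."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:n_queries]]


if __name__ == "__main__":
    # Toy "pages" standing in for texts found by an initial search.
    pages = [
        "il gatto dorme sul divano",
        "il gatto mangia il pesce",
        "il cane dorme",
    ]
    seeds = extract_seeds(pages, exclude={"sul"}, n_seeds=4)
    for query in build_queries(seeds, tuple_size=2, n_queries=3):
        print(query)
```

Each query would then be submitted to the search engine, the new pages
added to the collection, and the two functions run again on the larger
collection for as many rounds as needed.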

Regards,

Marco

On Monday, Aug 30, 2004, at 22:51 Europe/Rome, Mark P. Line wrote:

> I've been playing with Google searches for extracting texts in a
> particular language from the Web without a lot of noise (i.e. few texts
> that aren't in the desired language). Any comments on the utility of this
> approach for more serious corpus research? Any improvements to the best
> search criteria I've been able to come up with below? Any good search
> criteria for languages not listed?
>
> (If there's any interest at all, I'd be happy to collect searches like
> these on a webpage somewhere.)
>
>

---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni
