Re: [Corpora-List] language-specific harvesting of texts from the Web

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Tue Aug 31 2004 - 00:33:01 MET DST


    Maybe you could extract seeds to be used in new queries from the pages
    you found, as suggested in:

    R. Ghani, R. Jones, and D. Mladenic. 2001. Mining the web to create
    minority language corpora. CIKM 2001, 279–286.
    http://citeseer.ist.psu.edu/ghani01mining.html

    We have a set of simple tools that partly automate this kind of
    procedure (we use them mostly for terminology extraction, but they also
    work, more or less, for building general-purpose monolingual corpora):

    http://sslmit.unibo.it/~baroni/bootcat.html
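
    As a rough illustration of the seed-extraction idea, here is a minimal
    Python sketch. It is neither the BootCaT code nor the method of Ghani et
    al.; the function names (extract_seeds, build_queries), the ranking by
    document frequency, and the toy pages are all assumptions made for the
    example. It picks words that occur in many of the pages already found but
    not in a small sample of pages in other languages, then combines them into
    new query strings.

```python
import random
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def extract_seeds(target_pages, other_pages, n_seeds=10, min_len=3):
    """Pick candidate seed words from pages already found (hypothetical helper).

    Words are ranked by the number of target pages that contain them.
    Any word that also occurs in other_pages is discarded, as are
    words shorter than min_len characters.
    """
    target_df = Counter()
    for page in target_pages:
        target_df.update(set(tokenize(page)))  # count each word once per page

    other_vocab = set()
    for page in other_pages:
        other_vocab.update(tokenize(page))

    candidates = [
        (word, count)
        for word, count in target_df.items()
        if word not in other_vocab and len(word) >= min_len
    ]
    # Most widespread words first; ties broken alphabetically for stability.
    candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return [word for word, _ in candidates[:n_seeds]]


def build_queries(seeds, tuple_size=3, n_queries=5, rng=None):
    """Combine seeds into random tuples, each joined into one query string."""
    rng = rng or random.Random(0)
    all_tuples = list(combinations(sorted(seeds), tuple_size))
    rng.shuffle(all_tuples)
    return [" ".join(t) for t in all_tuples[:n_queries]]


if __name__ == "__main__":
    # Toy data, invented for the example.
    found = [
        "il gatto dorme sul divano",
        "il cane e il gatto giocano",
        "il gatto mangia",
    ]
    noise = ["the cat sleeps on the sofa", "il est ici"]
    seeds = extract_seeds(found, noise, n_seeds=3)
    for query in build_queries(seeds, tuple_size=2, n_queries=2):
        print(query)
```

    The queries would then be sent to a search engine, the new pages added to
    the collection, and the loop repeated. Sending the queries and fetching
    the pages is deliberately left out of the sketch.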

    Regards,

    Marco

    On Monday, Aug 30, 2004, at 22:51 Europe/Rome, Mark P. Line wrote:

    > I've been playing with Google searches for extracting texts in a
    > particular language from the Web without a lot of noise (i.e. few texts
    > that aren't in the desired language). Any comments on the utility of
    > this approach for more serious corpus research? Any improvements to the
    > best search criteria I've been able to come up with below? Any good
    > search criteria for languages not listed?
    >
    > (If there's any interest at all, I'd be happy to collect searches like
    > these on a webpage somewhere.)

    ---
    Marco Baroni
    University of Bologna
    http://sslmit.unibo.it/~baroni
    


