Re: [Corpora-List] Corpus Mining

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Tue Dec 07 2004 - 19:57:24 MET


    Hi there.

    The CorpusBuilder tool was the main inspiration for BootCaT:

    http://www-2.cs.cmu.edu/~TextLearning/corpusbuilder/

    It is intended for the collection of texts in a specific language,
    rather than about a specific topic, but I suppose it could be tweaked
    to look for specialized texts.

    CorpusBuilder was (is?) part of a larger project about acquiring
    knowledge from the web:

    http://www-2.cs.cmu.edu/~webkb/

    An Crúbadán is another tool for language-specific web-corpus mining
    that could perhaps be adapted for sub-language mining:

    http://borel.slu.edu/crubadan/

    Also somewhat relevant is the notion of ``focused crawling'' in
    information retrieval; see e.g.

    http://www8.org/w8-papers/5a-search-query/crawling/

    Regards,

    Marco

    ---
    Marco Baroni
    University of Bologna
    http://sslmit.unibo.it/~baroni
    


