Hi there.
The CorpusBuilder tool was the main inspiration for BootCaT:
http://www-2.cs.cmu.edu/~TextLearning/corpusbuilder/
It is intended for the collection of texts in a specific language,
rather than about a specific topic, but I suppose it could be tweaked
to look for specialized texts.
CorpusBuilder was (is?) part of a larger project about acquiring
knowledge from the web:
http://www-2.cs.cmu.edu/~webkb/
An Crúbadán is another tool for language-specific web-corpus mining
that could perhaps be adapted for sub-language mining:
http://borel.slu.edu/crubadan/
Also somewhat relevant is the notion of "focused crawling" in
information retrieval; see e.g.
http://www8.org/w8-papers/5a-search-query/crawling/
Regards,
Marco
--- Marco Baroni University of Bologna http://sslmit.unibo.it/~baroni
This archive was generated by hypermail 2b29 : Wed Dec 08 2004 - 08:28:25 MET