Re: Corpora: Help please - downloading text from the Web

From: Christian Coseru (christian.coseru@anu.edu.au)
Date: Mon Mar 27 2000 - 08:17:13 MET DST

Next message: Thorsten Brants: "Corpora: LINC-2000"

Previous message: Nancy M. Ide: "Corpora: Listing of CES-based Corpus Encoding Projects"
Maybe in reply to: Geoff Wilkins: "Corpora: Help please - downloading text from the Web"
Next in thread: Andrew Harley: "Re: Corpora: Help please - downloading text from the Web"

At 11:34 AM 3/23/00 GMT, you wrote:
>
>Hi. Can anyone help me with the following:
>
>I'm looking for software - preferably freeware or shareware - to
>use to download text from Web sites, for use in a corpus.
>Geoff Wilkins

By far the best spider (I have tested over a dozen commercial and
shareware programs) is httrack, developed by Xavier Roche and Yann
Philippot at CERN. The software is freeware and is available for Unix,
Linux, Solaris and Windows platforms. I have archived sites up to 250MB
in size and over 40000 files with no difficulty at all. The spider is
highly customizable, has extensive support for JavaScript and can easily
gather dynamic or database-driven (e.g. asp, cfm) web sites.

The software and the documentation can be found at http://httrack.free.fr
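
Once a site has been mirrored to disk, the pages still need to be reduced
to plain text before they are of much use in a corpus. The following is
only a rough sketch of that step, not part of httrack itself: it assumes
the mirror is a local directory of .htm/.html files readable as UTF-8, and
the names TextExtractor, html_to_text and mirror_to_corpus are
hypothetical, chosen for illustration. It uses only the Python standard
library.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def mirror_to_corpus(mirror_dir):
    """Yield (path, text) for every .htm/.html file under mirror_dir.

    Assumption: files are decodable as UTF-8; undecodable bytes are replaced.
    """
    for path in sorted(Path(mirror_dir).rglob("*.htm*")):
        if path.is_file():
            raw = path.read_text(encoding="utf-8", errors="replace")
            yield path, html_to_text(raw)
```

Calling mirror_to_corpus on the directory the spider wrote to would give
one text per downloaded page. Real sites will usually need more cleanup
(navigation menus, other encodings), so treat this as a starting point.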

Christian Coseru


This archive was generated by hypermail 2b29 : Mon Mar 27 2000 - 08:12:07 MET DST