Re: Corpora: Help please - downloading text from the Web

From: Christian Coseru (christian.coseru@anu.edu.au)
Date: Mon Mar 27 2000 - 08:17:13 MET DST


    At 11:34 AM 3/23/00 GMT, you wrote:
    >
    >Hi. Can anyone help me with the following:
    >
    >I'm looking for software - preferably freeware or shareware - to
    >use to download text from Web sites, for use in a corpus.
    >Geoff Wilkins

    By far the best spider I have found (having tested over a dozen commercial
    and shareware tools) is HTTrack, developed by Xavier Roche and Yann
    Philippot at CERN. The software is freeware and is available for Unix,
    Linux, Solaris and Windows platforms. I have archived sites up to 250 MB
    in size and over 40,000 files with no difficulty at all. The spider is
    highly customizable, has extensive support for JavaScript and can easily
    gather dynamic or database-driven (e.g. asp, cfm) web sites.

    The software and the documentation can be found at http://httrack.free.fr
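
Once a site has been mirrored to disk, the pages still have to be reduced to plain text before they are of much use in a corpus. The following is only a minimal sketch of that step, using Python's standard library. It is not part of HTTrack; the function names (html_to_text, mirror_to_corpus) and the directory layout are assumptions made for illustration, and real pages will usually need more careful handling of encodings and boilerplate.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of one HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def mirror_to_corpus(mirror_dir, out_dir):
    """Write one .txt file per .htm/.html file found under mirror_dir.

    Both directory names are hypothetical; returns the number of files written.
    """
    src, dst = Path(mirror_dir), Path(out_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = 0
    for page in sorted(src.rglob("*.htm*")):
        if not page.is_file():
            continue
        # Assumption: undecodable bytes are replaced rather than rejected.
        html = page.read_text(encoding="utf-8", errors="replace")
        # Flatten the relative path so names stay unique in one directory.
        name = "_".join(page.relative_to(src).with_suffix(".txt").parts)
        (dst / name).write_text(html_to_text(html), encoding="utf-8")
        written += 1
    return written
```

For example, mirror_to_corpus("mirror", "corpus") would walk a local copy saved under "mirror" and write the extracted text files to "corpus".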

    Christian Coseru


