RE: Corpora: Help please - downloading text from the Web

From: Mark Lewellen (lewellen@EROLS.COM)
Date: Mon Mar 27 2000 - 22:56:28 MET DST


    Also, the Perl modules LWP, HTML, and URI provide tools for
    downloading files from the web, processing them as they are being
    downloaded, extracting hyperlinks, and performing other functions.
    I have found them useful for repetitive, site-specific tasks in
    which I want to filter out some of the files being downloaded.
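
    As an illustration, here is a minimal sketch of that approach using
    those three modules. The starting URL and the filename filter are
    hypothetical placeholders; substitute the site and pattern you
    actually want:

        #!/usr/bin/perl -w
        use strict;
        use LWP::UserAgent;
        use HTTP::Request;
        use HTML::LinkExtor;
        use URI;

        # Hypothetical starting page; substitute the site to crawl.
        my $start = 'http://www.example.com/';

        # Fetch the page with LWP.
        my $ua = LWP::UserAgent->new;
        my $response = $ua->request(HTTP::Request->new(GET => $start));
        die 'GET failed: ' . $response->status_line
            unless $response->is_success;

        # Extract hyperlinks with HTML::LinkExtor, resolving relative
        # links to absolute URLs with URI.
        my @links;
        my $extor = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            return unless $tag eq 'a' && defined $attr{href};
            push @links, URI->new_abs($attr{href}, $start)->as_string;
        });
        $extor->parse($response->content);

        # A site-specific filter: keep only internal HTML pages.
        foreach my $url (grep { /^\Q$start\E/ && /\.html?$/i } @links) {
            print "$url\n";   # or fetch each one as above
        }

    HTML::LinkExtor ships with the HTML::Parser distribution; the same
    filtering idea can be applied before saving each file to disk.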

    Mark Lewellen

    > Subject: Corpora: Help please - downloading text from the Web
    >
    > Hi. Can anyone help me with the following:
    >
    > I'm looking for software - preferably freeware or shareware - to
    > use to download text from Web sites, for use in a corpus.
    >
    > This will be from large sites, with a lot of files, sub-directories
    > and internal links. Most basically, the software would simply download
    > HTML files from the site, following internal links from the Home page.
    > I've tried various "bots" that do this, but have had problems with all
    > of them. So I'd welcome recommendations for software that others have
    > found unproblematic (and powerful/multi-functioned) for this purpose.
    >
    > And if anyone knows of packages that are more specifically aimed at the
    > task I'm undertaking, that would be even better.
    >
    > Also useful would be software that mapped out the structure of
    > sites, giving an idea of the size of the files.
    >
    > I have a related question. What tools do you use, once you have
    > downloaded the HTML files, to (batch-)convert them into reasonably
    > clean "plain" text?
    >


