Re: Corpora: Help please - downloading text from the Web

From: Knut Hofland (Knut.Hofland@hit.uib.no)
Date: Mon Mar 27 2000 - 00:45:33 MET DST

    On Thu, 23 Mar 2000, Geoff Wilkins wrote:

    > I'm looking for software - preferably freeware or shareware - to
    > use to download text from Web sites, for use in a corpus.

    I have used w3mir
    http://www.math.uio.no/~janl/w3mir/
    and
    SiteSnagger
    http://hotfiles.zdnet.com/cgi-bin/texis/swlib/hotfiles/info.html?fcode=000P7Z
    Both have shortcomings, but I have downloaded gigabytes of HTML files
    with them.

    With w3mir (and some home-made programs) I have built a fully automatic
    system that downloads all the new articles each day from 10 Norwegian
    newspapers on the Web, strips the HTML tags, indexes the text (with IMS
    CWB) and makes the whole text searchable through a Web browser
    (password-protected for copyright reasons). I will present this project
    at LREC in Athens later this year.
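
To illustrate the HTML-stripping step only, here is a minimal sketch in Python. It is not the home-made program mentioned above; the names TextExtractor and strip_html are hypothetical, and it assumes that dropping all tags plus the contents of script and style elements is enough for corpus text. It uses only the standard library's html.parser module.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML string with markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join fragments with a space so text from adjacent block elements
    # does not run together, then collapse all runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    page = "<html><body><h1>Nyheter</h1><p>Hei &amp; hallo</p></body></html>"
    print(strip_html(page))
```

Joining fragments with a space is a deliberate simplification: it keeps paragraphs apart, but it also inserts a space where inline markup splits a word, so a real pipeline would probably treat block and inline tags differently.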

    Knut Hofland | Knut.Hofland@hit.uib.no
    HIT-Centre (former NCCH) | http://www.hit.uib.no/knut/
    University of Bergen, | Phone: +47 5558 9463
    Allegt. 27, N-5007 Bergen, Norway | Fax: +47 5558 9470


