Re: Corpora: Help please - downloading text from the Web

From: Knut Hofland (Knut.Hofland@hit.uib.no)
Date: Mon Mar 27 2000 - 00:45:33 MET DST

Next message: Nancy M. Ide: "Corpora: Listing of CES-based Corpus Encoding Projects"

Previous message: KORTERM: "Corpora: 2nd CFP: Terminology Resources and Computation, Due: 31/March/2000"
In reply to: Geoff Wilkins: "Corpora: Help please - downloading text from the Web"
Next in thread: Dave Braze: "Re: Corpora: Help please - downloading text from the Web"
Next in thread: Christian Coseru: "Re: Corpora: Help please - downloading text from the Web"
Reply: Dave Braze: "Re: Corpora: Help please - downloading text from the Web"

On Thu, 23 Mar 2000, Geoff Wilkins wrote:

> I'm looking for software - preferably freeware or shareware - to
> use to download text from Web sites, for use in a corpus.

I have used w3mir
http://www.math.uio.no/~janl/w3mir/
and
SiteSnagger
http://hotfiles.zdnet.com/cgi-bin/texis/swlib/hotfiles/info.html?fcode=000P7Z
Both have shortcomings, but I have downloaded gigabytes of HTML files
with them.

With w3mir (and some home-made programs) I have built a fully automatic
system that downloads all the new articles each day from 10 Norwegian
newspapers on the Web, strips the HTML codes, indexes the text (with IMS
CWB) and makes the total text searchable through a Web browser (with a
password, for copyright reasons). I will present this project at LREC in
Athens later this year.
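
For anyone wanting to try the HTML-stripping step, here is a minimal
sketch in Python using only the standard library. It is not one of the
home-made programs mentioned above, just an illustration of the general
idea; the names TextExtractor and strip_html are hypothetical, and the
assumption that only script and style contents need to be dropped is a
simplification (real newspaper pages usually need site-specific rules
for menus, adverts and so on).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML page, ignoring tags (hypothetical helper)."""

    # Assumption: contents of these elements are never wanted in the corpus.
    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    page = "<html><body><h1>Nyheter</h1><p>Hello <b>world</b></p></body></html>"
    print(strip_html(page))
```

The cleaned text would then still have to be tokenised and converted to
whatever input format the indexing tool expects; that step is not shown
here.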

Knut Hofland | Knut.Hofland@hit.uib.no
HIT-Centre (former NCCH) | http://www.hit.uib.no/knut/
University of Bergen, | Phone: +47 5558 9463
Allegt. 27, N-5007 Bergen, Norway | Fax: +47 5558 9470


This archive was generated by hypermail 2b29 : Mon Mar 27 2000 - 00:47:08 MET DST