Re: Corpora: Help please - downloading text from the Web

From: Mark Davies (mdavies@ilstu.edu)
Date: Mon Mar 27 2000 - 18:37:00 MET DST

    Here are some suggestions on components in creating large corpora from
    web-based materials that you might find useful. I've used these to create
    35,000,000 and 25,000,000 word corpora of Spanish and Portuguese newspapers
    (respectively) (http://mdavies.for.ilstu.edu/personal/texts.htm). I also
    presented a paper detailing some of the steps in creating large
    multi-million word web-based corpora at the "North American Symposium on
    Corpora in Linguistics and Language Teaching" at the Univ. of Michigan in
    May 1999, and would be happy to send the handout from that talk to anyone
    who is interested.

    I'm sure that everyone has their own system and preferred software, but
    here's mine:

    DOWNLOADING
    Re. tools for downloading, I've been using Grab-A-Site
    (http://www.bluesquirrel.com). One of the nice features of this program
    (which may be shared by others; I'm not sure) is that you can maintain the
    directory structure of the site from which you're downloading. This is
    particularly useful in the case of newspapers, where you can store
    different days in different directories or different sections of the newspaper in
    separate directories. Several times I've set things up to download 5-6
    newspapers during the night, and come back to find 100-150MB of files
    waiting patiently for me -- it's really been nice.
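
    For anyone scripting the download step instead of using a packaged
    tool, the directory-preserving idea can be sketched roughly as below.
    This is only an illustration in Python, not how Grab-A-Site works
    internally; the function name url_to_local_path, the "mirror" root
    folder and the example.com address are all assumptions made up for
    the example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit, unquote


def url_to_local_path(url: str, root: str = "mirror") -> PurePosixPath:
    """Map a URL to a local path that mirrors the site's directory layout.

    Hypothetical helper: 'root' is an assumed top-level folder name.
    """
    parts = urlsplit(url)
    path = unquote(parts.path)
    # A URL that names a directory gets a default file name.
    if path == "" or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so the result stays under root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return PurePosixPath(root, parts.netloc, *segments)


if __name__ == "__main__":
    # Example address is invented for illustration only.
    print(url_to_local_path("http://www.example.com/2000/03/27/deportes/nota1.htm"))
```

    With a layout like this, each day or section of a newspaper ends up
    in its own directory, as long as the site itself is organized that way.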

    HTML to ASCII
    Re. converting HTML to ASCII, I've found HTMASC32
    (http://www.bitenbyte.com/index.htm) to work very nicely. I've converted
    up to 5000 HTML files at one time, as well as single 20MB HTML files
    (created by concatenating thousands of smaller webpages), and it's never
    had any problem.
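
    As a rough idea of what the conversion step involves, here is a
    minimal sketch using Python's standard html.parser module. It is not
    what HTMASC32 does internally; the class name TextExtractor, the
    function html_to_text and the choice of which tags count as block
    breaks are assumptions for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style content."""

    SKIP = {"script", "style"}
    # Assumed set of tags that should start a new line in the output.
    BLOCK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &oacute; etc.
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return the text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<p>Hola&nbsp;mundo</p><p>Adi&oacute;s <b>ya</b></p>"))
```

    Entity decoding matters a good deal for Spanish and Portuguese text,
    since accented characters are often written as entities in the HTML.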

    MACROS, BATCH FILES, ETC.
    I'd also recommend a nice text editor that can do macros, including
    conditional looping. You'll want something like this to clean up the text
    files, even after the HTML to ASCII conversion. To do these macros, I use
    the old tried-and-true WP 5.1 for DOS, which has a very nice macro language
    and can handle files up to 10MB without much problem. Of course it's a DOS
    program, so there are problems with 8.3 filenames, etc. In addition, you'll
    want to come up to speed (if not already there) on batch files (and using
    macros to create these). When you're dealing with hundreds of thousands of
    files, you need some way to automate file manipulation.
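
    The same kind of looping cleanup can also be scripted. The sketch
    below is a Python stand-in for the macro-plus-batch-file approach,
    not a translation of my WP macros. The cleanup rules (collapsing
    spaces and blank lines), the function names clean_text and
    clean_tree, the "*.txt" pattern and the latin-1 encoding are all
    assumptions; real newspaper text will need its own rules.

```python
import re
from pathlib import Path


def clean_text(text: str) -> str:
    """Apply simple, assumed cleanup rules to one converted file."""
    lines = []
    pending_blank = False
    for raw in text.splitlines():
        # Collapse runs of spaces/tabs and trim both ends.
        line = re.sub(r"[ \t]+", " ", raw).strip()
        if not line:
            pending_blank = True
            continue
        # Keep at most one blank line between paragraphs.
        if pending_blank and lines:
            lines.append("")
        pending_blank = False
        lines.append(line)
    return "\n".join(lines) + "\n" if lines else ""


def clean_tree(src, dst, pattern="*.txt", encoding="latin-1"):
    """Clean every matching file under src, mirroring the tree under dst.

    Returns the number of files written.
    """
    src, dst = Path(src), Path(dst)
    count = 0
    for path in sorted(src.rglob(pattern)):
        if not path.is_file():
            continue
        out = dst / path.relative_to(src)
        out.parent.mkdir(parents=True, exist_ok=True)
        cleaned = clean_text(path.read_text(encoding=encoding))
        out.write_text(cleaned, encoding=encoding)
        count += 1
    return count
```

    Writing the cleaned files to a separate tree, instead of overwriting
    the originals, makes it easy to rerun the whole batch after changing
    a rule.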

    Anyway, just my .02 worth.

    Mark Davies

    =======================================
    Mark Davies, Associate Professor, Spanish Linguistics
    Dept. of Foreign Languages, Illinois State University
    Normal, IL 61790-4300

    Voice:309/438-7975 email:mdavies@ilstu.edu
    Fax:309/438-8038 http://mdavies.for.ilstu.edu/personal/
    =======================================


