Re: [Corpora-List] Analysing Reuters Corpus Using Wordsmith Version 3

From: Mike Scott (mike@lexically.net)
Date: Mon Jun 14 2004 - 10:13:50 MET DST

    Dear All

    At 09:15 11/06/2004, Siew Imm Tan wrote:
    >
    >I am interested in analysing the Reuters Corpus using Wordsmith Tools
    >Version 3. The problem is that Reuters comprises more than 800,000 XML
    >files but Wordsmith can only process up to 16,368 files. Has anybody ever
    >attempted using Wordsmith Version 3 to analyse Reuters? If so, how do you
    >go around this particular limitation? Is it possible to merge the 800,000
    >Reuters files into 16,000 files or so?
    >

    Yes, WordSmith 3 can only handle 16,000 text files or so, and yes, in
    theory it might manage the job (I haven't tried) if the number of files
    were reduced by gluing one text onto another until the whole corpus came
    down to about 16,000 files. WordSmith 4 has no pre-set limit on the number
    of text files.
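
    For what it's worth, gluing the files together is usually done with a
    small script rather than by hand. Below is a minimal sketch in Python,
    assuming the Reuters XML files sit in one directory and that simple
    concatenation is acceptable; the directory names and batch size are
    illustrative only. Note that the merged files are no longer well-formed
    XML (several root elements each), which matters only if you need them as
    XML rather than as text whose tags will be filtered out.

        import glob, os

        SOURCE_DIR = "reuters"        # assumed location of the 800,000 XML files
        OUT_DIR = "reuters_merged"
        BATCH_SIZE = 60               # ~800,000 / 60 gives roughly 13,000 output files

        os.makedirs(OUT_DIR, exist_ok=True)
        files = sorted(glob.glob(os.path.join(SOURCE_DIR, "*.xml")))

        for batch_no, start in enumerate(range(0, len(files), BATCH_SIZE)):
            out_name = os.path.join(OUT_DIR, "batch_%05d.xml" % batch_no)
            with open(out_name, "w", encoding="utf-8") as out:
                for name in files[start:start + BATCH_SIZE]:
                    with open(name, encoding="utf-8") as f:
                        out.write(f.read())
                    out.write("\n")   # keep one text separated from the next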

    However, a really huge corpus will certainly be time-consuming to process
    in either version, since some tasks are computed in memory. In versions 3
    and 4 the WordList procedure stores each new word-form in memory and, every
    time another token of the same word-form type is encountered, adds to the
    frequency information stored with that word-form. So if there are huge
    numbers of different word-forms, the PC will slow down considerably once
    its RAM is exhausted, at which point Windows starts to store information in
    a so-called "swap file" on the hard disk.
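
    In effect it is doing something like the toy Python sketch below (assuming
    plain whitespace tokenisation; WordSmith's real tokeniser and storage are
    of course more elaborate). Each distinct word-form becomes one in-memory
    entry, so it is the number of distinct types rather than the number of
    tokens that eats the RAM:

        from collections import Counter

        def make_wordlist(filenames):
            freq = Counter()            # one entry per distinct word-form
            for name in filenames:
                with open(name, encoding="utf-8") as f:
                    for line in f:
                        # a repeated token only bumps an existing count;
                        # a new word-form adds an entry that stays in memory
                        freq.update(line.lower().split())
            return freq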

    In the case of concordancing, each time a hit is found the concordance
    line and some other bits of data are stored "on the fly", as in
    wordlisting. So if there are lots of hits, as with a common word-form,
    a lot of RAM goes into storing these concordance lines and processing
    eventually slows down.
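
    Roughly speaking it amounts to something like this (again just an
    illustrative Python sketch, not WordSmith's own code): every hit keeps a
    slice of left and right context, so a frequent node word piles up lines
    in memory very quickly.

        def concordance(filenames, node, width=40):
            node = node.lower()
            hits = []                   # every hit is kept in RAM
            for name in filenames:
                with open(name, encoding="utf-8") as f:
                    text = f.read()
                low = text.lower()
                pos = low.find(node)
                while pos != -1:
                    left = text[max(0, pos - width):pos]
                    right = text[pos + len(node):pos + len(node) + width]
                    hits.append((name, left, text[pos:pos + len(node)], right))
                    pos = low.find(node, pos + 1)
            return hits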

    In practice, this means that it is easier to process huge corpora by doing
    the work in chunks, e.g. making separate wordlists of different parts of
    the corpus and later merging them. For concordancing, in most cases there
    is a cut-off point imposed by time; users aren't prepared to wait more
    than, say, 1 or 2 minutes for results. A solution is to make an index of
    the corpus. That is how the CoBuild project tackled the problem: by
    avoiding work "on the fly" and using a standard index, which is lengthy to
    build and hard to edit but which, once made, "knows" about each word-form
    in the whole corpus. Google uses something similar -- when you submit a
    request it doesn't search the Internet but searches only its own index.
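
    Merging the per-chunk wordlists is the cheap part; in the same toy Python
    terms it is just a matter of summing the frequency tables (WordSmith has
    its own facility for merging wordlists, so this is only meant to show why
    working in chunks is feasible):

        from collections import Counter

        def merge_wordlists(chunk_lists):
            total = Counter()
            for wl in chunk_lists:      # one frequency table per corpus chunk
                total.update(wl)        # counts for shared word-forms are summed
            return total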

    As corpora get bigger, this problem gets harder to solve. WordSmith 3 was
    designed to handle corpora up to the size of the BNC (100 million words),
    but in on-the-fly processing it had considerable difficulty with the whole
    BNC; WordSmith 4 handles the whole BNC much more easily but wasn't really
    designed to tackle corpora as big as Reuters on the fly. There's a
    trade-off between having a fixed corpus (which is best indexed) and a
    corpus which isn't so fixed (e.g. only certain parts of the BNC, or one's
    current stock of student EFL writings).

    Mike Scott

    Applied English Language Studies Unit
    University of Liverpool
    Liverpool L69 3BX, UK.

    Mike.Scott@liv.ac.uk
    http://www.lexically.net
    http://www.liv.ac.uk/~ms2928


