RE: [Corpora-List] Analysing Reuters Corpus Using Wordsmith Version 3

From: Ute Römer (ute.roemer@uni-koeln.de)
Date: Fri Jun 11 2004 - 17:58:19 MET DST


    Dear Tan Siew Imm and others,

    > The problem is that Reuters comprises more than 800,000 XML files but
    >Wordsmith can only process up to 16,368 files. Has anybody ever attempted
    >using Wordsmith Version 3 to analyse Reuters?

    Yes, I have run into the same problem and would also be interested in a
    way around it. So far, I've never really needed to use the whole corpus, so
    I have only unzipped some of the archives and worked with random parts of
    the corpus. It would be nice to access it in a more systematic way though.
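
    One way to go through the corpus more systematically would be to split
    the file list into consecutive batches that each stay under the file limit
    and analyse them one batch at a time. The following is only a rough sketch
    in Python, not something WordSmith provides: the names MAX_FILES, batches
    and plan_batches are made up for illustration, and it assumes the XML files
    have already been unzipped somewhere below a single directory.

```python
from pathlib import Path
from typing import Iterator, List

# Per-run file limit quoted above for WordSmith 3.
MAX_FILES = 16_368


def batches(paths: List[Path], size: int = MAX_FILES) -> Iterator[List[Path]]:
    """Yield consecutive slices of at most `size` paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def plan_batches(corpus_dir: str, pattern: str = "*.xml") -> List[List[Path]]:
    """Collect files below corpus_dir (hypothetical location) and batch them.

    Sorting by path keeps the batches reproducible from one run to the next.
    """
    files = sorted(Path(corpus_dir).rglob(pattern))
    return list(batches(files))
```

    With more than 800,000 files this gives at least 49 batches, so results
    such as word lists would still have to be merged across batches afterwards.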

    A possible solution to the problem might be to use a different
    concordancer. As far as I can see, corpus size is unlimited in
    MonoConc Pro 2.2, though I am not 100% sure about the number of individual
    files you can load. WST version 4.0 should also work with a larger
    (unlimited?) number of corpus files.

    A problem with both programs might (or might not) be the overall size of
    the corpus. According to my quick-and-dirty counts and extrapolations, the
    RC has more than half a billion tokens -- which would slow down the more
    complex searches quite a bit (at least with WST 3.0). Btw, have you (or
    anyone else) done a proper word count of the corpus? (The RC distributors
    told me they hadn't.) Using MonoConc Pro 2.2 would of course be a solution
    to that problem, since it does a word count whenever you load a corpus
    anyway.
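
    For a count that does not depend on any concordancer, something along
    the following lines might do. It is a rough sketch only, under several
    assumptions: the corpus is stored as zip archives of XML files, a "token"
    is any run of non-whitespace characters, and all text inside the XML
    (headlines and any other text content included) is counted. The function
    names are made up for illustration, and a count obtained this way will
    differ from what WordSmith or MonoConc report, since each program has its
    own idea of what a word is.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Assumption: a token is any run of non-whitespace characters.
TOKEN = re.compile(r"\S+")


def count_tokens_in_xml(data: bytes) -> int:
    """Count whitespace-delimited tokens in the text content of one XML file."""
    root = ET.fromstring(data)
    return sum(len(TOKEN.findall(text)) for text in root.itertext())


def count_tokens_in_zip(zip_path: str) -> int:
    """Sum the token counts of every .xml member, reading straight from the zip."""
    total = 0
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xml"):
                total += count_tokens_in_xml(archive.read(name))
    return total
```

    Summing count_tokens_in_zip over all archives would give a corpus total
    without having to unzip 800,000 files to disk first.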

    Best wishes... Ute

    ************************************************************
     
    Ute Römer
    English Department
    University of Hanover
    Königsworther Platz 1
    30167 Hannover
    Germany
     
    Phone: +49 (0)511 762 2997
    Fax: +49 (0)511 762 2996
    E-mail: ute.roemer@anglistik.uni-hannover.de
    http://www.fbls.uni-hannover.de/angli/


