RE: [Corpora-List] Analysing Reuters Corpus Using Wordsmith Version 3

From: Tony Rose (tr@acl.icnet.uk)
Date: Fri Jun 11 2004 - 18:22:42 MET DST

    > A problem with both programs might (or might as well not) be the overall
    > size of the corpus. According to my rough-and-dirty counts and
    > extrapolations the RC has more than half a billion tokens -- which would
    > slow down the more complex searches quite a bit (at least with WST 3.0).
    > Btw, have you (or anyone else) done a proper word count of the corpus?
    > (the RC distributors told me they hadn't) -- Using MP2.2 would of course be a
    > solution to that problem since it does a word count whenever you load a
    > corpus anyway.

    FYI you can find lots more statistics on the corpus at:

    http://about.reuters.com/researchandstandards/corpus/statistics/index.asp

    and many pre-processed versions of the raw data are linked from Dave Lewis's
    web page, e.g.

    http://www.daviddlewis.com/resources/testcollections/rcv1/

    Cheers,
    Tony


