Re: Corpora: Using SARA to query other corpora than the BNC

From: Stefan Evert (evert@IMS.Uni-Stuttgart.DE)
Date: Fri Aug 03 2001 - 21:02:26 MET DST

    So here comes an extremely late reply, and that with my being about to
    go on holiday ...

       The BNC sampler disk we made a few years ago was intended to
       provoke some informed discussion of the relative strengths of a
       variety of what were regarded then as state of the art corpus
       access systems (it included Wordsmith, SARA, CWB, and Qwick) when
       handling SGML marked up corpus data. If such discussion has
       happened, I seem to have missed it. Ah well.

    I was going to say that now, perhaps, we have an opportunity to start
    such a discussion; but having myself taken more than a month to write
    an answer, my hopes for a lively discussion aren't that high any more.

       I share your high opinion of Corpus Work Bench by the way; it is
       indeed an excellent piece of software.

    Thanks for the praise. Of course, as with every piece of software, the
    next version is going to be much better. Which brings me back to the
    shameless plug I put into my last e-mail:

       ``Hope you don't mind the plug: the IMS Corpus Workbench was
       designed for corpora of that size and offers both (relatively)
       compact storage and (relatively) efficient access (it isn't
       available for HP-UX either, though).''

    inviting Adam Kilgarriff's riposte

       ah, but that invites the riposte "when?!?!" (for a new interface)

    It seems that the only way of getting the new release out at last is
    to commit myself publicly to a deadline. So here goes: the new version
    of the IMS Corpus Workbench will be released around the end of
    September (2001 -- I shouldn't leave myself any loopholes :o). The
    version number is going to be 3.0, as we skipped version 2.3, which we
    had meant to release about 2 years ago.

    I hope to get many of you interested in the new release (precompiled
    binaries for SUN Solaris and x86-Linux only), and thus make another
    attempt to stir up a discussion about corpus access software.

       I don't think it is much better than SARA with respect to disk
       space usage however --

    I haven't had a close look at how much disk space SARA uses, but I can
    give you some figures for the CWB, which at least allow a comparison
    with plain XML files. For a (German) 40 million token corpus without
    annotations or XML-style markup, the CWB binary format requires about
    150 MB of disk space (using compression), including the index files.
    The same text in plain ASCII (ISO-8859-1, to be precise) encoding
    takes up more than 240 MB, and an XML format would increase the size
    even further. Even when the ASCII text is compressed with gzip, it
    still takes up 97 MB -- and that doesn't give you an index.
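
    To put these figures on a per-token footing, here is a rough
    back-of-the-envelope sketch. It assumes 1 MB = 1,000,000 bytes and
    treats "more than 240 MB" as exactly 240 MB; both are simplifications
    for illustration, and the names used are made up for this example.

```python
# Back-of-the-envelope comparison of the storage figures quoted above.
# Assumptions (not from the original figures): 1 MB = 1,000,000 bytes,
# and "more than 240 MB" is treated as exactly 240 MB.
TOKENS = 40_000_000
MB = 1_000_000

sizes_mb = {
    "CWB binary format (compressed, with index)": 150,
    "plain ISO-8859-1 text": 240,
    "gzip-compressed text (no index)": 97,
}


def bytes_per_token(size_mb, tokens=TOKENS):
    """Average number of bytes of disk space used per corpus token."""
    return size_mb * MB / tokens


per_token = {name: bytes_per_token(mb) for name, mb in sizes_mb.items()}

for name, value in per_token.items():
    print(f"{name}: {value:.2f} bytes/token")
```

    Under these assumptions the CWB format comes to roughly 3.75 bytes
    per token including the index, against about 6 bytes for the plain
    text and about 2.4 bytes for the gzip-compressed text without any
    index.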

       both systems are able to give good performance (for retrieval
       purposes) because both systems make optimised external index files.
       You have to add those into the equation if you are talking about
       efficiency, surely.

    When I talk about XML data format, I usually assume that there are no
    external index files; but that may be an attitude that many of you do
    not share.

       And the last time I looked at it, CWB was less able to make use of
       the SGML markup in a corpus than SARA is.

    The new version will be much better in that respect. However, to be
    fair one has to admit that it still requires a certain amount (and the
    right kind) of preprocessing to make the information from the SGML
    markup readily available in corpus queries.

    Kind regards,
    Stefan.

    -- 
    ``I could probably subsist for a decade or more on the food energy
      that I have thriftily wrapped around various parts of my body.''
                                                    -- Jeffrey Steingarten
    ______________________________________________________________________
    C.E.R.T. Marbach                         (CQP Emergency Response Team)
    http://www.ims.uni-stuttgart.de/~evert                  schtepf@gmx.de
    


