Re: Corpora: Using SARA to query other corpora than the BNC (fwd)

From: Lou Burnard (lou.burnard@computing-services.oxford.ac.uk)
Date: Sat Jun 23 2001 - 18:21:13 MET DST


    When we talk about "efficiency" we usually refer to the performance of
    some activity. A format which is efficient for one purpose/activity (such
    as fast retrieval of context) is generally not efficient for another (such
    as inter-platform communication). This is hardly a new idea!

    I share your high opinion of the Corpus Workbench, by the way; it is indeed
    an excellent piece of software. I don't think it is much better than SARA
    with respect to disk space usage however -- both systems are able to give
    good performance (for retrieval purposes) because both systems make
    optimised external index files. You have to add those into the equation if
    you are talking about efficiency, surely. And the last time I looked at
    it, CWB was less able to make use of the SGML markup in a corpus than SARA
    is. (But as a compensating strength it includes an efficient indexing
    algorithm for POS marks, which SARA didn't.) Another major difference is
    that the SARA system retains the original text files as well as the index,
    whereas I believe CWB discards the text. This certainly reduces the
    overall system size, but the price is that some information in the source
    text has to be lost.
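
    To make the point about index files concrete, here is a toy sketch in
    Python of the general idea (not SARA's or CWB's actual on-disk format):
    the index buys fast retrieval of context at the price of an extra file
    on disk alongside the corpus text.

        # Toy positional index: maps each word form to the positions where it
        # occurs, so concordance lookup is a dictionary access rather than a
        # linear scan of the whole corpus.
        import pickle
        from collections import defaultdict

        tokens = "the cat sat on the mat because the cat was tired".split()

        index = defaultdict(list)
        for position, token in enumerate(tokens):
            index[token].append(position)

        def concordance(word, width=2):
            # Return each occurrence of `word` with `width` tokens of context.
            for p in index.get(word, []):
                yield " ".join(tokens[max(0, p - width):p + width + 1])

        print(list(concordance("cat")))

        # The serialised index is extra disk space on top of the corpus text;
        # this is the part that belongs "in the equation".
        with open("corpus.idx", "wb") as f:
            pickle.dump(dict(index), f)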

    The BNC Sampler disk we made a few years ago was intended to provoke some
    informed discussion of the relative strengths of a variety of what were
    regarded then as state-of-the-art corpus access systems (it included
    Wordsmith, SARA, CWB, and Qwick) when handling SGML-marked-up corpus data.
    If such discussion has happened, I seem to have missed it. Ah well.

    Lou

    On Fri, 22 Jun 2001, Stefan Evert wrote:

    >
    > > > Meta languages are ideal for interchange purposes but I doubt
    > > > that ANY software will handle SGML data describing 100 million
    > > > annotated word forms efficiently. But that's another story.
    >
    > > On what grounds do you make this assertion? I suppose it all
    > > depends what you mean by "handle efficiently", but it's simply not
    > > true that NO software can handle SGML data on that scale.
    >
    > Perhaps he should have written "raw SGML data", in which case I will
    > absolutely second that opinion. All XML encodings that I have seen so
    > far waste more space (in terms of characters) on markup than on the
    > actual data. An XML-encoded version of a 100 million word corpus (with
    > PoS and lemma annotations) will usually take up several gigabytes of
    > disk space.
    >
    > Of course, the corpus size can be drastically reduced with standard
    > compression algorithms (gzip or bzip2), but the compressed corpus
    > cannot be accessed efficiently.
    >
    > > And what
    > > would you advocate as an alternative?
    >
    > Hope you don't mind the plug: the IMS Corpus Workbench was designed
    > for corpora of that size and offers both (relatively) compact storage
    > and (relatively) efficient access (it isn't available for HP-UX either,
    > though).
    >
    > Regards,
    > Stefan.
    >
    > --
    > ``I could probably subsist for a decade or more on the food energy
    > that I have thriftily wrapped around various parts of my body.''
    > -- Jeffrey Steingarten
    > ______________________________________________________________________
    > C.E.R.T. Marbach (CQP Emergency Response Team)
    > http://www.ims.uni-stuttgart.de/~evert schtepf@gmx.de
    >
    >
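
    A quick back-of-the-envelope check of the size estimate quoted above, using
    a hypothetical per-token encoding along the lines of
    <w pos="..." lemma="...">...</w> (an illustration only, not any particular
    corpus's actual markup), shows both where the "more space on markup than on
    the actual data" observation comes from and how the total reaches several
    gigabytes:

        # Rough size arithmetic for a hypothetical per-token XML encoding.
        token, pos, lemma = "corpora", "NN2", "corpus"
        xml = f'<w pos="{pos}" lemma="{lemma}">{token}</w>'

        data = len(token) + len(pos) + len(lemma)   # word form + annotations
        markup = len(xml) - data                    # brackets, attribute names, quotes

        per_token = len(xml) + 1                    # +1 for a separating newline
        corpus_tokens = 100_000_000
        print(f"markup {markup} bytes vs data {data} bytes per token")
        print(f"~{per_token * corpus_tokens / 1e9:.1f} GB for a 100-million-word corpus")

    The gzip/bzip2 point is the usual stream-compression trade-off: the file
    shrinks considerably, but there is no random access, so reading the context
    of, say, the fifty-millionth token means decompressing everything before it.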


