Re: Corpora: Using SARA to query other corpora than the BNC

From: Stefan Evert (evert@IMS.Uni-Stuttgart.DE)
Date: Fri Jun 22 2001 - 19:46:47 MET DST


    > Meta languages are ideal for interchange purposes but I doubt
    > that ANY software will handle SGML data describing 100 million
    > annotated word forms efficiently. But that's another story.

       On what grounds do you make this assertion? I suppose it all
       depends what you mean by "handle efficiently", but it's simply not
       true that NO software can handle SGML data on that scale.

    Perhaps he should have written "raw SGML data", in which case I will
    absolutely second that opinion. All XML encodings that I have seen so
    far waste more space (in terms of characters) on markup than on the
    actual data. An XML-encoded version of a 100 million word corpus (with
    PoS and lemma annotations) will usually take up several gigabytes of
    disk space.
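
    To make the markup-versus-data comparison concrete, here is a rough
    sketch in Python. The element and attribute names (s, w, pos, lemma)
    are hypothetical and not taken from any particular corpus encoding,
    and counting everything inside angle brackets (annotations included)
    as markup is an assumption of this sketch.

```python
import re

# Hypothetical token-level encoding; the element and attribute names
# are illustrative assumptions, not the markup of any real corpus.
SAMPLE = (
    '<s n="1">'
    '<w pos="DT" lemma="the">The</w>'
    '<w pos="NN" lemma="corpus">corpus</w>'
    '<w pos="VBZ" lemma="be">is</w>'
    '<w pos="JJ" lemma="large">large</w>'
    '</s>'
)

def markup_share(xml_text):
    """Return (markup_chars, data_chars) for a string of XML.

    Everything inside <...> counts as markup (including the PoS and
    lemma attribute values); everything else counts as data.
    """
    tags = re.findall(r"<[^>]*>", xml_text)
    markup_chars = sum(len(tag) for tag in tags)
    data_chars = len(xml_text) - markup_chars
    return markup_chars, data_chars

markup, data = markup_share(SAMPLE)
print(f"markup: {markup} chars, data: {data} chars")
```

    On a toy sentence like this one the tags and attributes take up
    several times as many characters as the word forms themselves. The
    exact ratio for a real corpus depends on its encoding scheme.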

    Of course, the corpus size can be drastically reduced with standard
    compression algorithms (gzip or bzip2), but the compressed corpus
    cannot be accessed efficiently.
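
    The access problem can be sketched as follows, assuming a plain gzip
    stream as handled by Python's standard gzip module. The synthetic
    one-token-per-line data and the helper read_at are made up for this
    illustration.

```python
import gzip
import io

# Synthetic stand-in for a corpus: word form, PoS tag and lemma per line.
tokens = [f"token{i}\tNN\ttoken{i}\n" for i in range(100_000)]
raw = "".join(tokens).encode("ascii")
compressed = gzip.compress(raw)

def read_at(blob, offset, length):
    """Read `length` uncompressed bytes starting at `offset` from a gzip blob.

    A plain gzip stream carries no index, so the forward seek below is
    emulated by decompressing and discarding everything before `offset`.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(blob)) as f:
        f.seek(offset)
        return f.read(length)

# Fetching the last 20 bytes still decompresses the whole stream.
offset = len(raw) - 20
tail = read_at(compressed, offset, 20)
print(len(raw), len(compressed), tail)
```

    The compressed copy is much smaller, but the cost of reaching a
    position grows with its offset, which is what makes lookups at
    arbitrary corpus positions slow.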

       And what
       would you advocate as an alternative?

    Hope you don't mind the plug: the IMS Corpus Workbench was designed
    for corpora of that size and offers both (relatively) compact storage
    and (relatively) efficient access (it isn't available for HP-UX either,
    though).

    Regards,
    Stefan.

    -- 
    ``I could probably subsist for a decade or more on the food energy
      that I have thriftily wrapped around various parts of my body.''
                                                    -- Jeffrey Steingarten
    ______________________________________________________________________
    C.E.R.T. Marbach                         (CQP Emergency Response Team)
    http://www.ims.uni-stuttgart.de/~evert                  schtepf@gmx.de
    


