Re: [Corpora-List] On tools for indexing and searching large corpora

From: Olonichev Sergei (olonichev@scnsoft.com)
Date: Wed Nov 20 2002 - 10:51:28 MET


    I agree with Arne Fitschen, and the source code of the system is probably
    available.
    It was used for indexing a 300+ million word corpus of English and showed
    pretty good performance.
    It can be compiled without any problems under Linux and Cygwin.

    BR,
    Sergei

    ----- Original Message -----
    From: "Arne Fitschen" <fitschen@ims.uni-stuttgart.de>
    To: <corpora@lists.uib.no>
    Sent: 19 November 2002, 15:04
    Subject: Re: [Corpora-List] On tools for indexing and searching large
    corpora

    > mdavies@ilstu.edu wrote:
    > >
    > > This is a question that I've asked myself many times. I would love to see a
    > > book that discussed the approach taken by the BNC, the BoE, CREA, corpora based
    > > on the IMS Corpus Workbench (such as O Público), etc to "look under the hood"
    > > and see how each of these corpora and indexing schemes is organized. As you
    > > mentioned, as more and more people start creating 100+ million word corpora, it
    > > would be a shame if they all ended up having to re-invent the wheel.
    >
    >
    > I don't know of such a book, but for the IMS Corpus Workbench I believe
    > that some of the ideas concerning data storage and indexing schemes were
    > taken from this book:
    >
    > Ian H. Witten, Alistair Moffat, and Timothy C. Bell
    > Managing Gigabytes
    > Compressing and Indexing Documents and Images
    > May 1999
    >
    > (here's a link to the second edition of the book:
    > http://www.cs.mu.oz.au/mg/).
    >
    > Regards,
    > Arne Fitschen
    >


