Re: Corpora: Using a relational database to store conc pointers

From: Oliver Mason (oliver@clg.bham.ac.uk)
Date: Fri Mar 31 2000 - 10:47:20 MET DST

    If one implements one's own system instead of using a general-purpose
    database, the definitive guide is

    Witten, I., Moffat, A., Bell, T. (1994)
      Managing Gigabytes: Compressing and Indexing Documents and Images
      Van Nostrand Reinhold, New York.

    Despite its technical topic it is very readable, even for people without
    a mathematical background.

    <shameless plug>
    The CUE system (available from the Birmingham Corpus Research Website, and
    also through an application called QWICK on the BNC Sampler and the latest
    ICAME CD ROM) is a Java implementation of algorithms described there. It
    compresses the text as well as the index, which means that the fully indexed
    corpus is smaller than the uncompressed plain text input file.
    </shameless plug>
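
    To give a rough idea of why a compressed index can be so small, here is a
    minimal sketch of gap coding for one word's list of corpus positions. It is
    not taken from CUE: the class name PostingsCodec and the choice of a simple
    byte-aligned code are assumptions made for brevity, and the book covers
    more effective codes. Storing the difference between successive positions
    instead of the positions themselves gives small numbers for frequent words,
    and small numbers fit into fewer bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical helper (not part of CUE): gap plus variable-byte coding of a position list. */
final class PostingsCodec {

    /** Encodes ascending, non-negative positions as variable-byte coded gaps. */
    static byte[] encode(int[] positions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int position : positions) {
            int gap = position - previous; // distance from the previous occurrence
            previous = position;
            // Emit 7 bits per byte, low-order group first;
            // a set high bit means that more bytes of this gap follow.
            while (gap >= 0x80) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): rebuilds the absolute positions from the coded gaps. */
    static int[] decode(byte[] data) {
        int[] positions = new int[data.length]; // upper bound: at most one position per byte
        int count = 0;
        int previous = 0;
        int gap = 0;
        int shift = 0;
        for (byte b : data) {
            gap |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;                 // continuation byte: keep collecting bits
            } else {
                previous += gap;            // last byte of this gap: restore the position
                positions[count++] = previous;
                gap = 0;
                shift = 0;
            }
        }
        return Arrays.copyOf(positions, count);
    }

    public static void main(String[] args) {
        int[] positions = {3, 7, 150, 100000};
        byte[] coded = encode(positions);
        System.out.println("raw bytes:   " + positions.length * Integer.BYTES);
        System.out.println("coded bytes: " + coded.length);
        System.out.println("round trip:  " + Arrays.equals(positions, decode(coded)));
    }
}
```

    The sketch assumes the positions are sorted in ascending order, which is
    what one gets when indexing a corpus from start to end.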

    Oliver Christ pointed that book out to me about five years ago, and I believe
    the Stuttgart corpus access system is also based on it, as he was working on
    that system at the time.

    Oliver

    -- 
    //\\ computer officer | corpus research | department of english | school of  -
    //\\ humanities | university of birmingham | edgbaston | birmingham b15 2tt  -
    \\// united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668/\  -
    \\// mobile 07050 104504 | http://www.clg.bham.ac.uk | o.mason@bham.ac.uk\/  -
    


