Corpora: Summary: Relational databases and conc pointers

From: Mickel Grönroos (mcgronro@ling.helsinki.fi)
Date: Tue Apr 04 2000 - 09:14:16 MET DST

    Dear colleagues,

    Thanks to Chris Brew, Alexander Clark, David Graff, Jochen Leidner, Oliver
    Mason, Manne Miettinen, Mika Rissanen, Tylman Ule and Tom Vanallemeersch
    (alphabetically listed) for useful tips and a fruitful discussion on how to
    combine a relational database with a concordancer/collocator.

    The suggestions varied somewhat, but mainly they can be divided into the
    following two groups:

    1 All information is stored in the database

    2 Type specific information is stored in the database, with
            pointers to a pointer list containing the information needed
            for token lookup

    I'll try to explain the two approaches lightly:

    1 The first approach makes extensive use of the database architecture. For
    each token in the corpus you generate a row in a database table, e.g.
    something like this:

            | tokenId | typeId | file | byteOffset |
            |---------+--------+------+------------|
            |     120 |      1 |   12 |       1443 |
            
    This says that the 120th occurrence of the word type numbered 1 is found in
    file number 12, starting at byte position 1443.
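
    The first approach could be sketched with SQLite roughly like this (the
    column names follow the example row above; the table name, index and the
    extra sample rows are illustrative, not from the actual system):

```python
# A minimal sketch of approach 1: one database row per token.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tokens (
        tokenId    INTEGER,  -- running number of this occurrence of the type
        typeId     INTEGER,  -- word type (one id per distinct wordform)
        file       INTEGER,  -- corpus file number
        byteOffset INTEGER   -- start position of the token in that file
    )
""")
# An index on typeId is what makes concordance lookup fast.
conn.execute("CREATE INDEX idx_type ON tokens (typeId)")

# One row per token in the corpus:
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?)",
                 [(119, 1, 12, 980), (120, 1, 12, 1443), (1, 2, 12, 15)])

# Concordance lookup: all occurrences of word type 1.
rows = conn.execute(
    "SELECT file, byteOffset FROM tokens WHERE typeId = ? ORDER BY tokenId",
    (1,)
).fetchall()
print(rows)  # [(12, 980), (12, 1443)]
```

    The worry in the text then becomes concrete: a wordform with 50,000
    occurrences means 50,000 such rows for that one typeId.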

    Seems rather straightforward, doesn't it? Well, it still raises the question
    of whether it is sensible to store, say, 50,000 rows in a db table for just
    one high-frequency wordform (since each token in the corpus generates a row
    of its own in the pointer table).

    Databases are of course intended to handle tables with several million rows,
    so technically this should be possible to implement, as long as the corpus
    being indexed does not contain half a billion words or so ... But still, is
    it sensible?

    2 The second approach takes into consideration that it is a waste of db
    space to store a separate record for each and every pointer into the corpus
    files. Instead, the pointers are stored in a file outside the database. The
    database then contains a table with pointers into this external index,
    like this:

            | idx | byteStart | byteOffset |
            +-----+-----------+------------+
            |   1 |      1170 |        251 |

    This says that the pointers needed for word type number 1 are found in the
    index file from byte position 1170, 251 bytes onward (byteOffset here is
    the length of the pointer block). The software uses this information to
    fetch the appropriate pointers from the index and then the matching text
    from the corpus files.
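
    The second approach could be sketched like this (all names and the on-disk
    layout are my own assumptions; here each pointer is simply a pair of 4-byte
    big-endian integers, (file, byteOffset)):

```python
# A minimal sketch of approach 2: pointer blocks in an external index file,
# addressed by (byteStart, byteLength) pairs kept in the database.
import io
import struct

def write_pointer_block(index_file, pointers):
    """Append (file, byteOffset) pairs; return (byteStart, byteLength)
    for storage in the database row for this word type."""
    start = index_file.tell()
    for file_no, offset in pointers:
        index_file.write(struct.pack(">II", file_no, offset))
    return start, index_file.tell() - start

def read_pointer_block(index_file, byte_start, byte_length):
    """Fetch the pointer block that a database row points at."""
    index_file.seek(byte_start)
    data = index_file.read(byte_length)
    return [struct.unpack_from(">II", data, i) for i in range(0, len(data), 8)]

index = io.BytesIO()
index.write(b"\x00" * 1170)  # pretend earlier word types fill 1170 bytes
start, length = write_pointer_block(index, [(12, 1443), (12, 2061), (13, 87)])
# The db row would then read: | idx | byteStart | byteLength |
#                             |  1  |    1170   |     24     |
print(read_pointer_block(index, start, length))
# [(12, 1443), (12, 2061), (13, 87)]
```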

    The elegance of this approach is that largely identical information does
    not have to be stored in the database and, above all, that the index file
    can be compressed. Without compression the index file is likely to be
    almost as large as the corpus itself (since every token generates a
    pointer of its own). With compression the index file shrinks considerably.
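
    One standard way to get that shrinkage (described in Managing Gigabytes;
    the code below is my own illustrative sketch, not from any of the systems
    discussed) is to store the gaps between sorted byte offsets rather than
    the offsets themselves, and to write each gap in as few bytes as possible:

```python
# Gap encoding plus variable-byte codes: small gaps between successive
# offsets compress to one or two bytes instead of a fixed-width integer.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, continuation implied
            n >>= 7
        out.append(n | 0x80)      # high bit marks the final byte
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers

offsets = [1443, 2061, 2100, 9876]  # sorted token positions for one type
gaps = [offsets[0]] + [b - a for a, b in zip(offsets, offsets[1:])]
packed = vbyte_encode(gaps)         # small gaps -> few bytes each
decoded = vbyte_decode(packed)
restored = [sum(decoded[:i + 1]) for i in range(len(decoded))]
print(restored)  # [1443, 2061, 2100, 9876]
```

    For a high-frequency wordform the gaps are small almost by definition, so
    this is exactly where the savings are largest.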

    I don't have any experience with index compression myself, but this
    possibility was raised by Chris Brew, Oliver Mason and Tom Vanallemeersch,
    and it seems rational. See Witten, Moffat & Bell (1994), "Managing
    Gigabytes: Compressing and Indexing Documents and Images", or Baeza-Yates
    & Ribeiro-Neto, "Modern Information Retrieval" (pp. 184 ff.) for more
    information.

    Thank you for reading.

    Cheers,

    Mickel Grönroos
    University of Helsinki

    www.ling.helsinki.fi/~mcgronro/  | Mickel.Gronroos@helsinki.fi
    ---------------------------------|----------------------------
    Inst. för allmän språkvetenskap  | Dep. of General Linguistics
    PB 4 (Fabiansgatan 28)           | tfn/phone +358-9-191 22707
    FI-00014 Helsingfors universitet | fax +358-9-191 23598



    This archive was generated by hypermail 2b29 : Tue Apr 04 2000 - 09:13:25 MET DST