Re: Corpora: Using a relational database to store conc pointers

From: Tom Vanallemeersch (Tom.Vanallemeersch@lant.be)
Date: Thu Mar 30 2000 - 11:08:55 MET DST

    Mickel Grönroos wrote:
    >
    > Dear colleagues,
    >
    > Does anybody have any experience of using a relational database to store
    > index information for a concordance service?
    >
    > I'm building a test interface for the Bank of Finnish and plan to store
    > pointers to specific locations in the corpus in a database column, e.g.
    > something like 344:2555 would point to corpus file number 344, byte
    > position 2555.
    >
    > The obvious problem is how one should handle common words, as every
    > occurrence of a specific type needs a pointer of its own. So, if the
    > frequency of some common word is, say, 50,000, this would generate 50,000
    > pointers as well. Putting these in one field in a column seems to be
    > rather foolish. Does anybody know how to avoid this?
    >
    A possible approach may be to create a list (array) of pointers,
    starting with all pointers for word1, then those for word2, etc.
    Then create two fields for each word, i.e. the position of the
    first pointer for the word in the array, and the position for the
    last one.
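
    The layout above could be sketched roughly as follows. This is only
    an illustration: the table name word_index, its columns first_pos and
    last_pos, and the helpers build_index and lookup are hypothetical
    names, and SQLite merely stands in for whatever database is used.
    Pointers are (file number, byte position) pairs, as in the 344:2555
    example.

```python
import sqlite3

def build_index(occurrences):
    """Build a flat pointer list plus a (word, first_pos, last_pos) table.

    occurrences: dict mapping each word to a list of
    (file_number, byte_position) pointers.
    """
    pointers = []  # all pointers for word1, then all for word2, etc.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word_index ("
        "word TEXT PRIMARY KEY, first_pos INTEGER, last_pos INTEGER)"
    )
    for word in sorted(occurrences):
        first = len(pointers)
        pointers.extend(sorted(occurrences[word]))
        last = len(pointers) - 1
        conn.execute(
            "INSERT INTO word_index VALUES (?, ?, ?)", (word, first, last)
        )
    conn.commit()
    return conn, pointers

def lookup(conn, pointers, word):
    """Return every pointer stored for word (empty list if unknown)."""
    row = conn.execute(
        "SELECT first_pos, last_pos FROM word_index WHERE word = ?", (word,)
    ).fetchone()
    if row is None:
        return []
    first, last = row
    return pointers[first:last + 1]

if __name__ == "__main__":
    conn, pointers = build_index(
        {"ja": [(344, 2555), (1, 10)], "on": [(2, 7)]}
    )
    print(lookup(conn, pointers, "ja"))
```

    Each word then costs one row with two integers, however frequent it
    is; the pointer list itself would live outside the table, for
    instance in a binary file.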

    The array itself can be reduced considerably by compressing the
    ordered list of occurrence positions for each word. I found a
    paper by Alistair Moffat at the Dept. of Computer Science of the
    Univ. of Melbourne describing a method for compressing ordered
    lists of numbers. I implemented it and, generally speaking, the
    information needed for each occurrence gets smaller the more
    occurrences there are (given the same text length). So when using
    these compressed lists, concatenated as a sequence, one could
    create two fields for each word, one specifying the start in the
    sequence and one the end.
    As an example of compression performance, I generated 10,000 random
    numbers between 0 and 100,000,000, and after compression each number
    needed around 18 bits on average, which is little more than half of
    the 32 bits you would need when storing such a list of numbers in
    the obvious way.
    If you want, I can send you the program to have a look at it.
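
    To give an idea of how an ordered list of positions can shrink, here
    is a minimal sketch. It is NOT necessarily the method from the Moffat
    paper, which is not specified above; it swaps in a simple, well-known
    alternative: store the gaps between successive numbers and write each
    gap with a Rice code (quotient in unary, remainder in k bits). The
    function names and the parameter k are assumptions made for the
    illustration, and the bits are kept as a text string for readability.

```python
def rice_encode(sorted_numbers, k):
    """Encode an ascending list of non-negative ints as a '0'/'1' string.

    Each gap to the previous number is split into a quotient (written in
    unary: q ones followed by a zero) and a k-bit remainder. Assumes k >= 1.
    """
    pieces = []
    previous = 0
    for n in sorted_numbers:
        gap = n - previous
        previous = n
        quotient = gap >> k
        remainder = gap & ((1 << k) - 1)
        pieces.append("1" * quotient + "0")
        pieces.append(format(remainder, "0{}b".format(k)))
    return "".join(pieces)

def rice_decode(bits, k):
    """Invert rice_encode: rebuild the ascending list from the bit string."""
    numbers = []
    previous = 0
    i = 0
    while i < len(bits):
        quotient = 0
        while bits[i] == "1":      # read the unary quotient
            quotient += 1
            i += 1
        i += 1                      # skip the terminating zero
        remainder = int(bits[i:i + k], 2)
        i += k
        previous += (quotient << k) | remainder
        numbers.append(previous)
    return numbers

if __name__ == "__main__":
    import random
    sample = sorted(random.sample(range(100_000_000), 10_000))
    encoded = rice_encode(sample, 13)
    assert rice_decode(encoded, 13) == sample
    print("average bits per number:", len(encoded) / len(sample))
```

    The choice k = 13 is a guess based on the average gap of about
    10,000 (close to 2^13); the more occurrences there are in the same
    text, the smaller the gaps and the smaller the k that suits them.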

    Cheers,

    Tom

    -- 
    LANT nv/sa, Research Park Haasrode, Interleuvenlaan 21, B-3001 Leuven
    mailto:Tom.Vanallemeersch@lant.be               Phone: ++32 16 405140
    http://www.lant.be/                             Fax: ++32 16 404961
    


