Re: Corpora: Using a relational database to store conc pointers

From: Tom Vanallemeersch (Tom.Vanallemeersch@lant.be)
Date: Thu Mar 30 2000 - 11:08:55 MET DST

Next message: Christina Rosén: "Corpora: German corp. Thanks"

Previous message: Gordon and Pam Cain: "Corpora: that virus!"
In reply to: Mickel Grönroos: "Corpora: Using a relational database to store conc pointers"
Next in thread: Oliver Mason: "Re: Corpora: Using a relational database to store conc pointers"
Reply: Oliver Mason: "Re: Corpora: Using a relational database to store conc pointers"

Mickel Grönroos wrote:
>
> Dear colleagues,
>
> Does anybody have any experience of using a relational database to store
> index information for a concordance service?
>
> I'm building a test interface for the Bank of Finnish and plan to store
> pointers to specific locations in the corpus in a database column, e.g.
> something like 344:2555 would point to corpus file number 344, byte
> position 2555.
>
> The obvious problem is how one should handle common words, as every
> occurrence of a specific type needs a pointer of its own. So, if the
> frequency of some common word is, say 50,000 this would generate 50,000
> pointers as well. Putting these in one field in a column seems to be
> rather foolish. Does anybody know how to avoid this?
>
One possible approach is to create a single list (array) of pointers,
starting with all pointers for word1, then those for word2, etc.
Then create two fields for each word, i.e. the position of the
first pointer for the word in the array, and the position of the
last one.
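
The layout described above could be sketched in Python roughly as
follows. This is only an illustration under assumptions: the names
build_index, lookup, pointers and ranges are made up here, and the
(file number, byte position) pairs follow the 344:2555 example from
the original question. In a database, ranges would be the table with
two fields per word, and pointers would live outside it as one array.

```python
from collections import defaultdict


def build_index(occurrences):
    """Build a flat pointer array plus a (first, last) range per word.

    occurrences: iterable of (word, file_no, byte_pos) tuples.
    """
    by_word = defaultdict(list)
    for word, file_no, byte_pos in occurrences:
        by_word[word].append((file_no, byte_pos))

    pointers = []  # all pointers for word1, then all for word2, etc.
    ranges = {}    # word -> (first, last) positions in pointers, inclusive
    for word in sorted(by_word):
        first = len(pointers)
        pointers.extend(sorted(by_word[word]))
        ranges[word] = (first, len(pointers) - 1)
    return pointers, ranges


def lookup(word, pointers, ranges):
    """Return every pointer stored for word, or an empty list."""
    if word not in ranges:
        return []
    first, last = ranges[word]
    return pointers[first:last + 1]


if __name__ == "__main__":
    sample = [("talo", 344, 2555), ("ja", 1, 10), ("ja", 344, 2560)]
    pointers, ranges = build_index(sample)
    print(ranges)
    print(lookup("ja", pointers, ranges))
```

However frequent a word is, its database row stays two numbers wide;
only the slice of the array it refers to grows.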

It is also possible to reduce the above-mentioned array considerably
by compressing the ordered lists of occurrence positions. I found a
paper by Alistair Moffat at the Dept. of Computer Science of the Univ.
of Melbourne describing a method for compressing ordered lists of
numbers. I implemented it and, generally speaking, the information
needed for each occurrence gets smaller the more occurrences there are
(given the same text length). So when using these compressed lists,
concatenated as a sequence, one could create two fields for each
word, one specifying the start in the sequence and one the end.
As an example of compression performance, I generated 10,000 random
numbers between 0 and 100,000,000, and after compression each number
needed around 18 bits on average, which is just over half of the
32 bits you would need when storing such a list of numbers in the
obvious way.
If you want, I can send you the program to have a look at it.
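
To give an idea of why ordered lists compress well, here is a minimal
Python sketch. Note that it is NOT the method from the paper mentioned
above, which is not described in this message; it uses a simpler,
well-known stand-in: store the gaps between consecutive positions and
write each gap as a variable-length sequence of bytes (7 data bits per
byte, high bit set when more bytes follow). The function names encode
and decode are made up for the example, and the input is assumed to be
a sorted list of non-negative integers.

```python
def encode(positions):
    """Gap-encode a sorted list of non-negative ints as variable bytes."""
    out = bytearray()
    prev = 0
    for pos in positions:
        gap = pos - prev          # small when occurrences are dense
        prev = pos
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, "more" flag set
            gap >>= 7
        out.append(gap)           # final byte, "more" flag clear
    return bytes(out)


def decode(data):
    """Inverse of encode: rebuild the sorted list of positions."""
    positions = []
    prev = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:           # more bytes of this gap follow
            shift += 7
        else:                     # gap complete: turn it into a position
            prev += gap
            positions.append(prev)
            gap = 0
            shift = 0
    return positions


if __name__ == "__main__":
    sample = [3, 130, 20000, 100000000]
    packed = encode(sample)
    print(len(packed), "bytes for", len(sample), "positions")
    print(decode(packed) == sample)
```

With this scheme a gap below 128 takes one byte and a gap below 16,384
takes two, so the more often a word occurs in a text of given length,
the smaller its gaps and the fewer bits each occurrence costs. The
concatenated byte strings can then be addressed with a start and an
end field per word, as described above.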

Cheers,

Tom

-- 
LANT nv/sa, Research Park Haasrode, Interleuvenlaan 21, B-3001 Leuven
mailto:Tom.Vanallemeersch@lant.be               Phone: ++32 16 405140
http://www.lant.be/                             Fax: ++32 16 404961

This archive was generated by hypermail 2b29 : Fri Mar 31 2000 - 09:03:24 MET DST