Corpora: efficient string matching: summary of replies

Tom Vanallemeersch (Tom.Vanallemeersch@lant.be)
Tue, 14 Apr 1998 10:21:12 +0200

Dear corpora subscribers,

thank you very much for you replies to my question on efficient
retrieval
of strings in corpora, and on approximate string matching. The answers
mainly pointed to the use of PAT trees and PAT arrays, that allow for
the
alphabetical sorting of "semi-infinite strings" or sistrings, that is,
strings that start at some position and extend up to the end of the
document.
It is comparable to sorting variable length records, such as lines from
a text:
in the case of PAT trees or arrays, a record is a string starting at
some
position of the text. The positions of the text to be indexed can also
be
restricted: e.g. only the positions of the text were a word starts.

The two datastructures mentioned have both their pros and cons. PAT
trees
are binary trees, so they take more space than arrays. On the other
hand,
their construction is of order O(n), whereas arrays take order O(n
log(n)).
Of course, a binary tree should preferably be balanced, which also takes
some time, especially in case of already sorted input ! As to inserting
and deleting indexed positions, using PAT trees is much easier than
using
arrays. Retrieval time is of the same order for both structures,
provided
that the binary tree is balanced: O(log(n)).

Of course, both structures take quite a lot of memory in case of very
large corpora, but some work has been done by Baeza-Yates to distribute
the information over several resources (main memory, external devices).

These are some references for PAT trees and PAT arrays (another name for
PAT tree is suffix tree):

G.H. Gonnet, R. Baeza-Yates (1991): "Handbook of Algorithms and Data
Structures," Addison-Wesley.

W. B. Frakes, R. Baeza-Yates (1992): "Information Retrieval: Data
Structures and Algorithms", Prentice Hall.

a web site with Ricardo Baeza-Yates' publications:
www.dcc.uchile.cl/~rbaeza/cv/publ.html

E. McCreight, 1976: "A space-economical suffix tree construction
algorithm". Journal of the ACM, 23(2 :)262-272, April.

D.E. Knuth, 1973: "The Art of Computer Programming", Addison-Wesley.

Some replies to my mail involved the use of hashing tables, but I didn't
examine this in detail yet.

I would like to thank the following people for providing information:
Alex Collier, Arne Fitschen, Max Schulze, Martin Kay, Jason Eisner,
Ted E. Dunning, Mark Johnson, R. Chandrasekar, Miles Osborne, Dekai
Wu and Ken W. Church.

As to my question on approximate string matching, I was pointed to the
work
on DNA sequences (biological sequence matching research), which can be
found
e.g. by looking up the strings "seqaln", "BLAST" and "FASTA" on the
Internet.
I still have to look in more detail into the info I had from a number of
people
working on approximate string matching: Reinhard Rapp and Rob Freeman.
Janne
Lindberg gave me the following address of interest: brodda@ling.su.se
(Benny Brodda).

Tom Vanallemeersch.

-- 
LANT nv/sa, Research Park Haasrode, Interleuvenlaan 21, B-3001 Leuven
mailto:Tom.Vanallemeersch@lant.be               Phone: ++32 16 405140
http://www.lant.be/                             Fax: ++32 16 404961