Corpora: efficient string matching

Tom Vanallemeersch (Tom.Vanallemeersch@lant.be)
Wed, 08 Apr 1998 09:44:24 +0200

Dear subscribers,

I have some questions on string matching, arising from both a practical
and a theoretical interest. Are there programs or algorithms for
retrieving strings from files (corpora or word lists, in my case) that
do not work sequentially? That is, is there a way of encoding a file so
that retrieval speed does not differ drastically between small and large
texts (say, between a 100 kilobyte and a 10 megabyte corpus)? The
difference is noticeable, for instance, with the well-known grep family,
whose running time is O(n) in the size of the text.
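To make the question concrete, the kind of structure I have in mind is
something like a suffix array, where a lookup costs roughly O(m log n)
comparisons instead of a scan of the whole file. Below is a rough Python
sketch (the file name and function names are only illustrative, and the
construction is the naive one):

    def build_suffix_array(text):
        # Sort all suffix start positions by the suffix they begin.
        # Naive O(n^2 log n) construction; enough as an illustration.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def find_occurrences(text, sa, pattern):
        # Binary search for the block of suffixes starting with 'pattern':
        # O(m log n) comparisons rather than reading the whole text.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
                lo = mid + 1
            else:
                hi = mid
        first, hi = lo, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[sa[mid]:sa[mid] + len(pattern)] <= pattern:
                lo = mid + 1
            else:
                hi = mid
        return sorted(sa[first:lo])  # start positions of all matches

    corpus = open("corpus.txt").read()      # hypothetical corpus file
    sa = build_suffix_array(corpus)
    print(find_occurrences(corpus, sa, "equipment"))

The index takes extra space and time to build, of course; the point is
that a single lookup no longer depends on whether the corpus is 100
kilobytes or 10 megabytes.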

Of course, there is a whole range of programs that store texts or other
kinds of sequences as a database, indexing them on one element or
another (word, sentence, phrase, ...). This allows for very fast
retrieval. What interests me, however, is whether there are programs
that in some way store information on the substrings of a text (e.g.
all substrings up to a length of 10), and that can possibly give the
context of those strings.
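Again only as an illustration of what I mean, here is a naive Python
sketch of such a substring index (the file name, the length limit of 10
and the context width are just examples):

    def build_substring_index(text, max_len=10):
        # Map every substring of length 1..max_len to its start positions.
        # The table grows roughly as max_len * len(text) entries.
        index = {}
        for i in range(len(text)):
            for length in range(1, min(max_len, len(text) - i) + 1):
                index.setdefault(text[i:i + length], []).append(i)
        return index

    def show_context(text, index, sub, width=20):
        # Print each occurrence of 'sub' with 'width' characters of context.
        for i in index.get(sub, []):
            print(text[max(0, i - width):i + len(sub) + width])

    corpus = open("corpus.txt").read()      # hypothetical corpus file
    idx = build_substring_index(corpus)
    show_context(corpus, idx, "widely used")

Such a table answers lookups immediately, but the space cost is the
obvious problem, which is exactly why I wonder whether existing programs
do something cleverer.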

This leads me to a further question. There is a wonderful program called
agrep which behaves like grep but also does approximate string matching.
However, the number and type of errors allowed in the input string are
limited. For example, one cannot find "equipment" when given "equipping"
as input, or "widely used" when given "used widely".
Is there any program which can do this? More specifically, is there
a way of doing it efficiently? (I would be surprised, but I still
ask ...).
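To show why these examples fall outside the usual error bounds, here is
a plain Levenshtein edit distance computation (standard dynamic
programming in Python, nothing agrep-specific): "equipment" and
"equipping" are already 3 character edits apart, and swapping two words
costs even more, so a small fixed error limit cannot cover such cases.

    def edit_distance(a, b):
        # Standard Levenshtein dynamic programming, O(len(a) * len(b)).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    print(edit_distance("equipment", "equipping"))      # 3
    print(edit_distance("widely used", "used widely"))  # larger still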

Thanks for help,

Tom.

-- 
LANT nv/sa, Research Park Haasrode, Interleuvenlaan 21, B-3001 Leuven
mailto:Tom.Vanallemeersch@lant.be               Phone: ++32 16 405140
http://www.lant.be/                             Fax: ++32 16 404961