Re: Corpora: efficient string matching

Mark Johnson (mj1@lx.cog.brown.edu)
Sat, 11 Apr 1998 12:04:15 +0000 (EDT)

Tom Vanallemeersch writes:
> What interests
> me ... is to know whether there a programs which some way store
> information on the substrings of a text (e.g. all substrings up to a
> length of 10), and possibly give the context of those strings.

This sounds like an application for a Patricia tree (I think this is the
name). A good book on algorithms should describe this data structure.

Another very simple method that I have found useful is the following.
Think of the corpus as being one long vector, so that each character position
described by an integer. Run through your corpus, constructing another
vector which records the starting positions of all of the substrings you
want to index on (e.g., all of the starting positions of words). Regard these
as pointers into your original corpus. Sort this vector of indices by the
suffix string beginning at the position in the corpus that each index points
to.

This sorted index now gives an alphabetic listing of suffix strings in
the corpus. It is a trivial matter to search for any substring of any
length with this sorted index using a binary search.

Hope this helps,

Mark