Re: Corpora: efficient string matching

Miles Osborne (Miles.Osborne@cl.cam.ac.uk)
Mon, 13 Apr 1998 12:38:23 +0100

>
Another very simple method that I have found useful is the following.
Think of the corpus as being one long vector, so that each character position
described by an integer. Run through your corpus, constructing another
vector which records the starting positions of all of the substrings you
want to index on (e.g., all of the starting positions of words). Regard these
as pointers into your original corpus. Sort this vector of indices by the
suffix string beginning at the position in the corpus that each index points
to.

This sorted index now gives an alphabetic listing of suffix strings in
the corpus. It is a trivial matter to search for any substring of any
length with this sorted index using a binary search.
>

This is similar to the LZ77 family of dictionary-based compression
algorithms (as found in eg gzip). They replace phrases in the input
text with pointers to phrases previously seen in the text. So, if you
want information on substrings in a text, simply work-out how these
substrings are stored ...

Miles Osborne