Re: Corpora: efficient string matching

Dekai Wu (dekai@cs.ust.hk)
Mon, 13 Apr 1998 20:16:31 +0800

>Another very simple method that I have found useful is the following.
>Think of the corpus as being one long vector, so that each character
position
>described by an integer. Run through your corpus, constructing another
>vector which records the starting positions of all of the substrings you
>want to index on (e.g., all of the starting positions of words). Regard
these
>as pointers into your original corpus. Sort this vector of indices by the
>suffix string beginning at the position in the corpus that each index
points
>to.
>
>This sorted index now gives an alphabetic listing of suffix strings in
>the corpus. It is a trivial matter to search for any substring of any
>length with this sorted index using a binary search.

Incidentally, this method's conventional name is "suffix array" following
to Manber and Myers (1990) or "PAT array" following Gonnet (1987) who
discovered it independently.

-Dekai