Re: Corpora: efficient string matching

Rob Freeman (rjfreeman@usa.net)
Thu, 09 Apr 1998 09:59:29 +0900

> What interests
> me ... is to know whether there are programs which in some way store
> information on the substrings of a text (e.g. all substrings up to a
> length of 10), and possibly give the context of those strings.

I don't know of a program like this, but I would be interested to hear if
you find one (or if anyone knows of one). I am trying to do things like this
in my work too.
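
For what it is worth, here is a minimal sketch (in Python, purely
illustrative: the function names, the window size and the example text
are my own assumptions, not any existing program) of the kind of index
the question describes, recording every substring up to length 10
together with its positions so that contexts can be pulled out later.

    from collections import defaultdict

    # Illustrative sketch only: index every substring of a text up to a
    # maximum length, recording start offsets so contexts can be shown.
    def build_substring_index(text, max_len=10):
        index = defaultdict(list)   # substring -> list of start offsets
        for start in range(len(text)):
            for length in range(1, max_len + 1):
                if start + length > len(text):
                    break
                index[text[start:start + length]].append(start)
        return index

    def show_contexts(index, text, query, window=15):
        # Print each occurrence of `query` with `window` characters of
        # context on either side.
        for start in index.get(query, []):
            left = max(0, start - window)
            right = min(len(text), start + len(query) + window)
            print(text[left:right])

    text = "the equipment was widely used, and widely praised"
    show_contexts(build_substring_index(text), text, "widely")

Storing every substring explicitly like this costs space proportional
to the text length times the maximum substring length, so for anything
beyond toy texts you would want a suffix-tree or suffix-array style
structure instead, but the lookup it supports is the one the question
asks for.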

> This leads me to a further question. There is a wonderful program called
> agrep which performs like grep, but also does approximate string
> matching.
> However, the number and type of errors allowed in the input string are
> limited. For example, one cannot find "equipment" if given as input
> "equipping", or "widely used" if given the input "used widely".
> Is there any program which can do this? More specifically, is there
> a way of doing this efficiently? (I would be surprised, but I still
> ask ...).

I looked around fairly widely for string matching algorithms last year,
with a view to applying them to language, and I also looked at agrep.
One thing I found which you might not have come across is the sizable
history of _biological_ sequence matching research. There is a large
biological effort to map the human genome, i.e. DNA sequences, and many
of the problems those researchers face are exactly the ones we face
when searching for language sequences in corpora. I found some
excellent techniques for finding matching sub-sequences of unknown
length in two strings: you can feed in a search string and get back any
sub-strings which match at any point in a database, without specifying
which sub-string to search on or how many errors to allow.
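
The standard technique behind that kind of search is local alignment
(usually credited to Smith and Waterman). The sketch below is only my
own toy version (the scoring parameters, the function name and the
example strings are my assumptions, not anything taken from seqaln),
but it shows the idea: dynamic programming finds the best-matching
sub-strings of two strings with no prior commitment to where the match
starts or how many errors it contains.

    # Minimal Smith-Waterman local alignment sketch (illustrative only;
    # the +2 match, -1 mismatch and -1 gap scores are arbitrary choices).
    def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        best, best_pos = 0, (0, 0)
        for i in range(1, rows):
            for j in range(1, cols):
                sub = match if a[i - 1] == b[j - 1] else mismatch
                score[i][j] = max(0, score[i - 1][j - 1] + sub,
                                  score[i - 1][j] + gap,
                                  score[i][j - 1] + gap)
                if score[i][j] > best:
                    best, best_pos = score[i][j], (i, j)
        # Trace back from the best-scoring cell to recover the two
        # aligned sub-strings ('-' marks a gap).
        i, j = best_pos
        sub_a, sub_b = [], []
        while i > 0 and j > 0 and score[i][j] > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                sub_a.append(a[i - 1]); sub_b.append(b[j - 1])
                i -= 1; j -= 1
            elif score[i][j] == score[i - 1][j] + gap:
                sub_a.append(a[i - 1]); sub_b.append('-'); i -= 1
            else:
                sub_a.append('-'); sub_b.append(b[j - 1]); j -= 1
        return best, ''.join(reversed(sub_a)), ''.join(reversed(sub_b))

    print(smith_waterman("widely used", "used widely"))
    # -> (12, 'widely', 'widely'): the shared piece is found without
    #    being told where it starts or how many errors to allow.

The real packages presumably add a great deal of engineering for speed
and for more careful scoring schemes, but the core idea is the same.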

There was an implementation available free for research, called seqaln, I
think (sequence align?). Do an Internet search on "seqaln" and you should
find it.

If you are interested, write to me and I can try to dig out the other
references I had last year.

Best,

Rob Freeman
rjfreeman@usa.net