Re: Corpora: Suggestor algorithms ?

jadams@lhs.com
Mon, 6 Oct 1997 10:17:08 -0400

> At the back of my mind I had the idea that maybe soundex could limit
> the search.

I'm not sure SOUNDEX is the most appropriate thing here. As I understand
it, SOUNDEX is great for matching words that sound alike (cognates, for
example), but I'm not sure why it would be particularly appropriate for
finding spelling errors. In SOUNDEX, consonants are mapped to each other
according to how phonetically similar they are, and vowels after the
first letter are ignored altogether.
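
For anyone who hasn't looked at it in a while, here is a minimal sketch
of the classic (American) SOUNDEX coding in Python. The function name
`soundex` and the layout of the table are just my choices for
illustration; treat it as a sketch, not a reference implementation.

```python
# Consonant classes of classic Soundex: letters in the same group sound
# alike and share a digit. Vowels (and Y) get no digit at all.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character Soundex key, or "" if word has no A-Z letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]                  # the first letter is kept as-is
    prev = CODES.get(letters[0], "")  # code of the previous coded letter
    for ch in letters[1:]:
        if ch in "HW":
            continue                  # H and W do not break a run of equal codes
        code = CODES.get(ch, "")      # "" for vowels and Y
        if code and code != prev:
            key += code
        prev = code                   # a vowel resets the run
    return (key + "000")[:4]          # pad or truncate to letter + 3 digits


print(soundex("Robert"), soundex("Rupert"))
```

Both names come out as R163, which is exactly the "sounds alike"
grouping the scheme was built for.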

However, the *idea* of SOUNDEX might be applied to part of this
problem, by mapping letters which are frequently confused to some
common symbol. You could probably develop an "OCR-SOUNDEX", for example,
which maps P & R to the same symbol, and similarly with other common
confusions.
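
A rough sketch of what I mean, in Python. Only the P/R pair comes from
the example above; the other confusion groups, the names `ocr_key`,
`build_index` and `suggest`, and the tiny lexicon are made up for
illustration. A real table would have to be estimated from actual OCR
output.

```python
from collections import defaultdict

# Each group lists characters assumed to be confused with one another.
# "PR" is the example from this post; the rest are hypothetical.
CONFUSION_GROUPS = ["PR", "O0Q", "Il1", "S5"]

# Map every character in a group to the group's first character.
CANON = {ch: group[0] for group in CONFUSION_GROUPS for ch in group}


def ocr_key(word):
    """Replace each character by the representative of its confusion group."""
    return "".join(CANON.get(ch, ch) for ch in word)


def build_index(lexicon):
    """Group lexicon words by their OCR key."""
    index = defaultdict(set)
    for word in lexicon:
        index[ocr_key(word)].add(word)
    return index


def suggest(token, index):
    """Return the lexicon words that share the token's OCR key."""
    return sorted(index.get(ocr_key(token), ()))


index = build_index(["PARSE", "RARE", "POOL"])
print(suggest("RARSE", index))   # ['PARSE']
print(suggest("P00L", index))    # ['POOL']
```

The key limits the search to words that differ only by the listed
confusions, which is the part of the SOUNDEX idea worth borrowing.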

As for human misspellings, however, I'm afraid that any substitution
scheme (a la SOUNDEX) would not be able to model simple things like
swapped or omitted characters.
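
To illustrate the kind of error I mean: a measure that counts
insertions, deletions, substitutions and swaps of adjacent characters
treats each of these slips as a single edit. The sketch below is the
"optimal string alignment" variant of Damerau-Levenshtein distance. It
is not something proposed earlier in this thread, just one possible way
to model such errors, and the name `osa_distance` is my own.

```python
def osa_distance(a, b):
    """Edit distance allowing insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                    # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                    # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            # Two adjacent characters swapped count as one edit.
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("cahracters", "characters"))  # 1 (swap)
print(osa_distance("omited", "omitted"))         # 1 (omission)
```

A letter-for-letter mapping cannot do this on its own: "adn" and "and"
contain the same letters, but any key built position by position keeps
them in a different order.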

> tricky stuff

Indeed. But that's why it's so much fun, right?