Re: Corpora: Suggestor algorithms ?

Doug Cooper (doug@nwg.nectec.or.th)
Mon, 6 Oct 1997 19:34:29 +0700 (ICT)

On Mon, 6 Oct 1997, John Aitchison wrote:
> The larger question is how to generate a reasonable number of reasonably
> acceptable/close alternatives in the first instance (ie prior to
> resorting to some evaluation mechanism).

Actually, agrep lets you cap the number of errors (in effect, the edit
distance) in advance, so you can find all matches within n edits of the
search string. Force, yes; brutal, no.
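
To make the bounded search concrete, here is a minimal sketch in Python.
It is not how agrep works internally; it is a plain dynamic-programming
Levenshtein distance applied to every word in a list. The function names
and the sample words are invented for the illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def within_n_edits(query: str, words, n: int):
    """Brute force: keep every word at most n edits away from query."""
    return [w for w in words if edit_distance(query, w) <= n]


# Hypothetical sample list, just to show how the hit count grows with n.
SAMPLE = ["piece", "peace", "pierce", "police"]
print(within_n_edits("peice", SAMPLE, 1))
print(within_n_edits("peice", SAMPLE, 2))
```

With n = 1 only "peace" survives; with n = 2 all four sample words match,
and "piece" is among the late arrivals because a swap of two letters
counts as two substitutions under this distance.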

Of course, in English (high info content per letter), this may generate
lots of unlikely hits. To get more subtle, try to:

a) generate likely transposition errors (e.g. "peice" for "piece"), and
check against the full word list;
b) generate likely key neighborhood / hand-switch errors ("aldo" for
"also"), and check against the full list;
c) generate the Soundex code and check it against the (pre-Soundex'd)
full list.
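
Steps a) to c) can be sketched roughly as follows, in Python. Several
things here are assumptions made for the example: the keyboard model only
knows left/right neighbours on the same QWERTY row (no hand-switch
errors), the Soundex is the common American variant, and all function
names and the sample word list are invented.

```python
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
# Assumption: a "neighbour" is the key immediately left or right on the same row.
NEIGHBORS = {ch: row[max(i - 1, 0):i] + row[i + 1:i + 2]
             for row in QWERTY_ROWS for i, ch in enumerate(row)}

SOUNDEX_CODES = {ch: digit
                 for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"),
                                        ("dt", "3"), ("l", "4"),
                                        ("mn", "5"), ("r", "6"))
                 for ch in letters}


def transpositions(word):
    """a) every string obtained by swapping two adjacent letters."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)}


def neighbor_substitutions(word):
    """b) every string obtained by replacing one letter with a neighbouring key."""
    return {word[:i] + n + word[i + 1:]
            for i, ch in enumerate(word) for n in NEIGHBORS.get(ch, "")}


def soundex(word):
    """c) American Soundex: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                    # h and w do not separate equal codes
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit                    # a vowel resets prev to ""
    return (code + "000")[:4]


def build_soundex_index(words):
    """The pre-Soundex'd full list: code -> set of words."""
    index = {}
    for w in words:
        index.setdefault(soundex(w), set()).add(w)
    return index


def candidates(typo, words, index):
    """Union of the three generators, filtered against the word list."""
    found = (transpositions(typo) | neighbor_substitutions(typo)) & set(words)
    return sorted(found | index.get(soundex(typo), set()))


WORDS = ["piece", "peace", "also", "algo", "police", "pace"]  # toy list
INDEX = build_soundex_index(WORDS)
print(candidates("peice", WORDS, INDEX))
print(candidates("aldo", WORDS, INDEX))
```

The index is built once and reused, which is the point of pre-Soundexing
the list: each lookup is then a handful of set probes rather than a scan.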

Then, take any candidates that pop out of the list and rank-order them
by their edit distance.
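
A possible sketch of the ranking step, in Python. One assumption beyond
the text: the distance used here counts an adjacent transposition as a
single edit (the "optimal string alignment" variant), so that swapped
letters are not penalised twice. Ties are broken alphabetically, which is
an arbitrary choice; the names are invented for the example.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where an adjacent transposition costs one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                     # delete everything in a[:i]
    for j in range(cols):
        d[0][j] = j                     # insert everything in b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def rank_candidates(typo: str, cands):
    """Order candidates by distance from the typo, then alphabetically."""
    return sorted(cands, key=lambda w: (osa_distance(typo, w), w))


print(rank_candidates("peice", ["pace", "piece", "peace"]))
```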

An interesting variation on the problem is phonetic lookup in a second
language; e.g. Westerners seeking Thai words usually make errors in vowel
length and tone, plus certain vowel and consonant substitutions, while
Thais sounding out English words follow Thai phonetic rules, which lead
to a different set of errors (e.g. "n" for "l"). What's interesting is
that (with a little bit of contrastive analysis) these errors appear to
be both predictable and rank-orderable.
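
One way such predictable, rank-orderable errors could be put to work is a
small confusion table with a cost per substitution. The sketch below, in
Python, is only an illustration: the single rule is the "n" for "l" case
mentioned above, its cost of 1.0 is a made-up placeholder, rules are
limited to single characters, and the function names and lexicon are
hypothetical.

```python
from itertools import product

# (written, intended, cost). Only the n/l pair comes from the discussion;
# the cost is a placeholder that contrastive analysis would have to supply.
CONFUSIONS = [("n", "l", 1.0)]


def variants(query, confusions=CONFUSIONS):
    """Map each spelling reachable by undoing confusions to its cheapest cost."""
    options = []
    for ch in query:
        alts = [(ch, 0.0)]              # keep the letter as typed
        alts += [(intended, cost)
                 for written, intended, cost in confusions if written == ch]
        options.append(alts)
    scored = {}
    for combo in product(*options):     # exponential; fine for short words
        form = "".join(c for c, _ in combo)
        cost = sum(k for _, k in combo)
        scored[form] = min(cost, scored.get(form, cost))
    return scored


def lookup(query, lexicon, confusions=CONFUSIONS):
    """Return lexicon words reachable from query, cheapest first."""
    hits = [(cost, form)
            for form, cost in variants(query, confusions).items()
            if form in lexicon]
    return [form for cost, form in sorted(hits)]


print(lookup("schoon", {"school", "spoon"}))
print(lookup("nine", {"nine", "line", "nile"}))
```

A fuller table would carry one row per predicted substitution, with costs
reflecting how likely each error is for the speaker group in question.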

Doug Cooper
_________________________________________________
Southeast Asian Software Research Center, Bangkok
246-9311 (-28), Ext. 1617
doug@nwg.nectec.or.th http://seasrc.th.net
http://seasrc.th.net/sealang --> SEALANG Web site