Re: Corpora: Lexical confusions (was Suggestor algorithms)

John Aitchison (jaitchison@acm.org)
Wed, 8 Oct 1997 11:13:33 +0000

snip, snip .. a lot of interesting ideas..

> A few more comments on the same problem: In this thread, we've been talking
> about spelling errors as they occur primarily in typewritten text. To suggest
> alternatives, one searches a database of correctly spelled words

well, in fact this part of the problem is not completely trivial without some
limitation on the task .. for the sake of argument, searching a
hashed dictionary of 100,000 words on a PC at, say, 20,000 words per
second is too slow. The 'obvious' heuristic (of assuming the first
character is correct) is perhaps adequate, perhaps not (e.g. in an OCR
application), but it is adequate at least for limiting the search.
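By way of illustration, here is a minimal sketch of that first-character heuristic (the word list and the misspelling are made up; Python is used purely for convenience):

```python
# Sketch: limit the candidate search by assuming the first letter is
# correct, then rank the surviving words by Levenshtein edit distance.
# The word list and misspelling below are invented for illustration.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(word, lexicon, max_dist=2):
    # Heuristic: only search the 'bucket' sharing the first character.
    bucket = [w for w in lexicon if w[0] == word[0]]
    ranked = sorted((edit_distance(word, w), w) for w in bucket)
    return [w for d, w in ranked if d <= max_dist]

lexicon = ["receive", "recipe", "relieve", "deceive", "believe"]
print(suggest("recieve", lexicon))
```

Note that raw edit distance ranks 'relieve' (one substitution) ahead of the intended 'receive' (a transposition counts as two edits here), which illustrates why simple edit distance alone gives poor suggestions.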

, returning to
> the user a ranked set of orthographically or phonologically similar
> 'neighbors'.

I would be grateful for any pointers to the computation of
'phonological similarity' .. given two phoneme strings, is there an
established method of computing some reasonable 'distance' ?
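I do not know of an established answer, but one plausible starting point would be a weighted edit distance over phoneme strings, where substituting two phonemes costs less when they share articulatory features. The feature table below is a toy assumption, not an established inventory:

```python
# Sketch: weighted Levenshtein over phoneme sequences, where the
# substitution cost between two phonemes is the fraction of
# articulatory features on which they disagree.
# The feature table is a toy assumption for illustration only.

FEATURES = {
    # phoneme: (voiced, place, manner)
    "p": (0, "labial",   "stop"),
    "b": (1, "labial",   "stop"),
    "t": (0, "alveolar", "stop"),
    "d": (1, "alveolar", "stop"),
    "s": (0, "alveolar", "fricative"),
    "z": (1, "alveolar", "fricative"),
}

def sub_cost(a, b):
    """Fraction of features on which two phonemes disagree."""
    if a == b:
        return 0.0
    if a not in FEATURES or b not in FEATURES:
        return 1.0  # unlisted phonemes: full substitution cost
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)

def phon_distance(seq1, seq2, indel=1.0):
    """Weighted Levenshtein over phoneme sequences."""
    prev = [j * indel for j in range(len(seq2) + 1)]
    for i, a in enumerate(seq1, 1):
        cur = [i * indel]
        for j, b in enumerate(seq2, 1):
            cur.append(min(prev[j] + indel,
                           cur[j - 1] + indel,
                           prev[j - 1] + sub_cost(a, b)))
        prev = cur
    return prev[-1]

# /pat/ vs /bad/ differ only in voicing at two positions,
# so the distance is small compared with, say, /pat/ vs /saz/.
print(phon_distance(["p", "a", "t"], ["b", "a", "d"]))
```

The feature weights would of course need tuning against real confusion data rather than being guessed as here.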

> In my work on look-alike and sound-alike medication errors, the
> problem we've been discussing arises in a very general form. In the domain of
> medication errors, the task is to anticipate (and prevent, if possible),
> lexical confusions involving multiple perceptual modalities and multiple
> communication media.
>
>
> For example, sometimes medication errors occur because a handwritten
> prescription is faxed to a pharmacy, and the blurred fax is misread. This is a
> (visual) perceptual recognition error. To anticipate it, measures of
> orthographic similarity would have to take into account OCR-type error patterns
> for both handwriting and type (i.e., similarity between p and r, m and n,
> etc.). On the other hand, sometimes errors occur when a pharmacist misremembers
> (or mishears) one word (e.g., Zantac) that sounds like another word (e.g.,
> Xanax). This is either a short-term memory error or an auditory perceptual
> error involving phonological similarity. To anticipate it, one would need
> accurate phonological representations of words, and a good way of computing
> similarity between these representations.

Yes please.

> Still other errors can occur when a
> drug name is mistyped into a computer. To anticipate these errors, one would
> need to take into account typical insertions, deletions, transpositions, and
> substitutions as they occur on a standard 'qwerty' keyboard. Finally, errors
> occasionally occur because one drug is confused with another drug that shares
> the same indication, dosage form, manufacturer, mechanism of action, color,
> shape, etc. Here one needs a 'semantic' representation of each drug, not just a
> phonological or orthographic one. On top of all of this, one needs frequency
> data on all possible drug names, because the well-known word frequency effect
> strongly influences the type and direction of confusions that may occur (i.e.,
> rare words are more likely to be misperceived and misremembered if they have
> high frequency neighbors).

Neighbours as in neighbours in a lexicon ? Is this a result from
psychophysics or ?
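As an aside, the qwerty-keyboard part at least lends itself to a simple model: weight substitutions by physical key adjacency. The layout coordinates below are approximate and purely illustrative:

```python
# Sketch: substitution cost weighted by qwerty key adjacency.
# Two characters on neighbouring keys substitute more cheaply than
# distant ones. The row offsets are approximate; illustrative only.

QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each key to an (x, y) position, offsetting lower rows slightly
# to mimic the staggered physical layout.
KEY_POS = {}
for row, keys in enumerate(QWERTY_ROWS):
    for col, ch in enumerate(keys):
        KEY_POS[ch] = (col + 0.5 * row, row)

def key_distance(a, b):
    """Euclidean distance between two keys on the layout above."""
    (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

# 'm' and 'n' are adjacent; 'm' and 'q' are far apart.
print(key_distance("m", "n") < key_distance("m", "q"))
```

A typing-error model would then make substitutions between nearby keys cheap in the edit-distance computation, while insertions/deletions and transpositions would need their own (empirically estimated) costs.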


> Thus, what's needed are measures for predicting the likelihood of lexical
> confusions across media and perceptual modalities. Also needed are strategies
> for preventing these confusions when we predict they are likely to occur. I
> offer these examples because I think they may shed some light on the general
> problem of lexical confusion, and because I could use some feedback on how to
> proceed.

Well, I started out with a nice simple problem <g> (of finding an
efficient algorithm for suggesting reasonable alternatives) and ended
up with a Research Field <g> . Ain't it always the way.

To recap briefly ... the assertion is that spelling checkers do a
poor job of suggesting alternatives to misspelt words, and that brute
force searches and strategies based on strong assumptions and simple
edit distance/approximate string matching are the likely culprits.
The other problem is probably a lack of consideration of context.
No simple approaches suggest themselves at present.
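On the context point, one crude approach would be to rerank otherwise equally-distant candidates by how often each co-occurs with the preceding word. The bigram counts below are invented; in practice they would come from a large corpus:

```python
# Sketch: rerank spelling suggestions using the preceding word as
# context. The bigram counts are invented for illustration; real
# counts would be estimated from a corpus.

BIGRAM_COUNTS = {
    ("the", "form"): 120,
    ("the", "farm"): 15,
    ("a", "farm"): 90,
    ("a", "form"): 40,
}

def rerank(prev_word, candidates):
    """Order candidates by bigram frequency with the preceding word."""
    return sorted(candidates,
                  key=lambda w: -BIGRAM_COUNTS.get((prev_word, w), 0))

# 'form' and 'farm' are both one edit from 'forn'; context decides.
print(rerank("the", ["farm", "form"]))
print(rerank("a", ["farm", "form"]))
```

This also hints at how the word frequency effect mentioned above could be folded in, by backing off to unigram frequencies when no bigram is observed.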

Anyone got a database of documents containing spelling errors ?

It occurs to me that the Corpora list may have had enough of this
topic by now, so I will refrain from further postings on it.

Thanks for all the help.

John Aitchison <jaitchison@acm.org>
Data Sciences Pty Ltd
Sydney, AUSTRALIA.