Re: Corpora: Lexicon development for MT

Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Fri, 11 Sep 1998 19:59:32 +0100

Ted Dunning's comments demand a rejoinder.

I asserted that ... templates, inheritance, and sophisticated
techniques are only accidentally likely to help with disambiguating
source language words for MT: Ted's response --

td> if that is what you meant, then your opinion of the field is odd and
td> idiosyncratic.
td>
td> for instance, the use of inheritance and other similar techniques at
td> New Mexico State is intended precisely to minimize the number of rules
td> which must be crafted to build real MT systems.
td>

Inheritance serves many purposes, and that may well include
disambiguation. However, homonyms do not tend to fall in classes, so
specific rules for disambiguating one homonym will not generally apply
to a class of items. Example: adjectival "mean" can be STINGY or
AVERAGE. It doesn't fall in a natural class of things meaning either
STINGY or AVERAGE. Inherited information might tell you enough to
disambiguate, but it very often won't. Then you'll need word-specific
disambiguation rules and they will only be relevant for disambiguating
"mean".

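A minimal sketch of that point, assuming a toy rule table: the cue
words, the WORD_RULES table and the disambiguate() function are all
invented for illustration and come from no real MT lexicon. The rules
that separate STINGY from AVERAGE are keyed on "mean" itself, so they
give no help with any other homonym.

```python
# Illustrative only: the cue words and the rule table are invented for this
# sketch and are not taken from any real MT lexicon.

# Word-specific rules: sense -> context words that suggest it.
WORD_RULES = {
    "mean": {
        "STINGY": {"miser", "tips", "money", "generous", "landlord"},
        "AVERAGE": {"temperature", "score", "value", "rainfall", "annual"},
    },
}


def disambiguate(word, context):
    """Return a sense label for `word`, or None if the rules cannot decide."""
    rules = WORD_RULES.get(word)
    if rules is None:
        # Rules written for one homonym say nothing about any other word.
        return None
    context_words = {w.lower() for w in context}
    # Count how many cue words for each sense occur in the context.
    scores = {sense: len(cues & context_words) for sense, cues in rules.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_sense, best_score = ranked[0]
    if best_score == 0 or (len(ranked) > 1 and ranked[1][1] == best_score):
        return None  # no evidence, or a tie between senses
    return best_sense


print(disambiguate("mean", "the mean annual rainfall".split()))
print(disambiguate("mean", "too mean to leave tips".split()))
print(disambiguate("bank", "the river bank".split()))
```

The first two calls pick AVERAGE and STINGY; the third returns None,
because nothing written for "mean" carries over to "bank".
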
td> similarly, the use of parallel corpora at NMSU is intended primarily
td> to assist in semi-automated translation, and for cross-lingual text
td> retrieval.

Yes, of course these things assist in tasks which are intimately
linked to MT but ... In semi-automatic MT I'm not sure which part is
the 'semi', but I suspect the business of choosing the correct
translation for ambiguous words is NOT one that the computer does very
well. Cross-lingual text retrieval is not MT - it is a statistical
process which can tolerate a degree of mismatch and hedging (eg by
giving probabilities) in a way that MT cannot. Full MT (eg what you
get when you press the 'translate' button in Altavista) has to
disambiguate and commit itself to its first choice (no probabilities),
and it looks foolish if it gets it wrong. SYSTRAN has 400 rules for
oil/huile vs oil/petrole because it needs to get it right, and to do
that with a confidence ranking far higher than current stats-based MT
can manage.
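
A rough illustration of that contrast, with invented probabilities and
hypothetical function names: a retrieval system can carry both
candidate translations of "oil" forward with their weights, whereas
full MT has to output exactly one of them.

```python
# Illustrative only: the probabilities below are invented, not measured.
CANDIDATES = {"huile": 0.55, "petrole": 0.45}


def retrieval_expansion(candidates):
    """Cross-lingual retrieval can hedge: keep every translation, weighted."""
    return sorted(candidates.items(), key=lambda item: item[1], reverse=True)


def mt_choice(candidates):
    """Full MT must commit: only the top-scoring translation is output."""
    return max(candidates, key=candidates.get)


print(retrieval_expansion(CANDIDATES))
print(mt_choice(CANDIDATES))
```

With a split as close as this one, the committed choice would be wrong
almost as often as it is right, which is the gap a large set of
hand-written rules is meant to close.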

td>
td> ak> Much Word Sense Disambiguation work is in principle relevant,
td> ak> but, with the honourable exception of Dagan and Itai (CL 20
td> ak> (4), 1994) it is not clear whether any of it can be tailored
td> ak> to the specific needs of an MT system (and I do not believe
td> ak> any of it has been).
td>
td> excuse me? i must not understand what you mean by word sense
td> disambiguation or perhaps what you mean by MT system.
td>
td> what about the work of Mercer, Brown and the others at IBM who crafted
td> an entire MT system around the concept that parallel corpora could
td> provide both lexicon and disambiguation? if you look at their work,
td> their methods are fundamentally designed to resolve ambiguity in
td> translation via the use of parallel corpora.

Yes, but a thousand questions remain about how useful a model it is
(like, what do you do for parallel resources if your task is something
other than translating Canadian Hansard from En to Fr or vice versa),
or how well the model performs on the particularly hard subtask of
selecting the correct translation of an ambiguous word. Also it is
not a model which has been adopted for commercial MT or any systems
anywhere near the market. My query was concerned with medium-term,
plausible, practical strategies rather than high theory - I guess I
could have made this clearer.

td>
td> the methods pioneered by the IBM group have been extended greatly by
td> many others. their basic methods were used by a number of other
td> researchers including Gale, Church, Yarowsky, Bruce, Stevenson, Wilks
td> and others. not all of these researchers used the same definition of
td> word sense (some used dictionary senses rather than alternative
td> translations), but essentially all of them used context of usage to
td> statistically resolve ambiguity. all of these systems could easily
td> have been integrated back into the IBM Candide system if desired.

That last sentence is a bit like saying, "we know all about ocean
currents, how air masses move, etc etc so of course we know whether it
will rain tomorrow". Sure they could have been integrated. They
probably would have got the right answer sometimes, too.

So... STILL seeking info on clever approaches for providing
disambiguation information for MT lexicons,

adam

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow                      tel:   (44) 1273 642919
Information Technology Research Institute          (44) 1273 642900
University of Brighton                      fax:   (44) 1273 642908
Lewes Road
Brighton BN2 4GJ                            email: Adam.Kilgarriff@itri.bton.ac.uk
UK                                          http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%