WORD SENSE DISAMBIGUATION SURVEY: SUMMARY OF RESPONSES

Adam Kilgarriff (ak28@it-research-institute.brighton.ac.uk)
Thu, 22 Jun 95 11:46:52 BST

mailings I sent to particular people, and conversations on the theme,
I had 28 replies. Relevant applications fell into three types:

* Information retrieval (IR)
* Machine Translation (MT)
* Residual, 'core' NLP (including database front ends,
dialogue systems, Information Extraction such as MUC)
- which I'll call NLU.

First, the conclusions:

Does WS ambiguity cause problems for NLP applications?

Answers seem to be:

(1) IR: yes, to some moderate degree. Problems can substantially
be overcome by using longer queries. Within IR, WSD features as
something of an alternative to NLP.

(2) MT: yes. Huge problem. Addressed to date by lots and lots of
selection restrictions.

(3) NLU: not much. NLU applications are mostly domain specific, and
have some sort of domain model. It is generally necessary to have a
detailed knowledge of the word senses that are in the domain, so the
knowledge to disambiguate will often be available in the domain model
even where it has not explicitly been added for disambiguation
purposes.

Now, an annotated selection of survey responses:

> Background [from original mailing]
> ==========
>
> While there is now a substantial literature on the problem of word
> sense disambiguation, this is almost always divorced from any
> application. The goal is to disambiguate between the senses given
> in a dictionary or thesaurus on the grounds that that, or something
> similar, is necessary for full understanding (and is an interesting
> problem in its own right). If this work is to feed in to NLP
> applications, we need to ask: where, and how, does word sense
> ambiguity cause problems for applications? We can then address
> questions of how disambiguation work can be made more practical and
> how it can be customised to particular applications. This is one goal
> of SEAL, our EPSRC-funded grant at the University of Brighton.
>
> As a first step, I'm gathering examples and anecdotes, as well as
> references, papers, or figures if you have any, of the sorts of
> problems that ambiguity has actually caused.
>
> Please note that I am not concerned with word class ambiguity (eg
> 'bank' as a verb or a noun) but only ambiguity within a word class
> (nominal 'bank' as money-bank or river-bank).

First, Judith Klavans challenged the premise:

> I disagree with the statement that most research on word
> sense disambiguation is divorced from applications.
> In general, applications have given rise to the headaches that
> drive the research. Any project on text understanding,
> mt, information retrieval, has this as a component. What
> is true is that once this problem is identified, it can then
> be viewed in isolation from the application.

The remainder of this report goes some way towards establishing
whether that's true.

Bob Amsler:

> The arbitrary distinction between word class and word sense seems
> to me to already separate this problem from "real world applications".
> The largest set of ambiguities I see are between proper nouns
> and common nouns, which I guess you could claim is a word class
> problem (e.g. between "Time" (magazine) and "time" (noun/verb)).
> Most databases in commercial use don't distinguish upper/lower
> case or more accurately, proper nouns from common nouns in full-text
> search.

The point about proper names was echoed by several others from work in
IR and Info Extraction (Takahiro Wakao, Lynne Cahill). Indonesian
placenames Lamp, Green, Data and Of caused much distress in some
quarters!

I had separated the within-POS problem from the general disambiguation
problem because, with the accuracy of state-of-the-art taggers (at
least for grammatical standard English, and where some errors can be
tolerated), POS-disambiguation is well on the way to being solved.
Amsler's point showed that current IR technology does not generally
use POS-tagging, and also that, within IR, WSD can be viewed as an
ALTERNATIVE to NLP, rather than a technique within it. One way to get
closer to the meaning of a text than a basic statistical model is WSD.
Another is to use NLP techniques which, for example, extract
head-modifier pairs (see, eg, Strzalkowski, Robust Text Processing in
Automated IR, ANLP 1994).

The use of stemmers also points to how current IR views WSD in
contrast to NLP. The stemmer throws away all linguistic information -
even major word category, and in spite of the fact that this will
often introduce ambiguity where none existed before. Eg "publishing"
as in "publishing industry" gets stemmed to "publish" so is now
ambiguous with "the letters published today". The text is then viewed
as a bag of stems - which is fine, as input to an IR WSD program.
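
The effect can be sketched in a few lines of Python. This is only an
illustration: toy_stem is a made-up suffix stripper (not the Porter
algorithm or any stemmer used by a real IR system), and the suffix
list is an assumption chosen to cover the example above.

```python
from collections import Counter

# Toy suffix list, assumed for illustration only (not a real stemmer).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_stems(text):
    """Reduce a text to stem counts, discarding word order and word class."""
    return Counter(toy_stem(token) for token in text.split())

# The noun-like and verbal uses collapse onto the same stem.
noun_use = bag_of_stems("the publishing industry")
verb_use = bag_of_stems("the letters published today")
print(toy_stem("publishing"), toy_stem("published"))
```

Both bags contain the stem "publish", and nothing in either bag
records which word class the original token belonged to.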

Rebecca Wheeler responded with

> It seems to me, that before proceeding in a discussion of 'word sense',
> one ought define what one means by 'sense'

and an account of how she would go about this, looking in particular
at syntactic distinctions in, eg, subcategorisation, behaviour under
negation. She also argues for different levels of nesting of senses
so that one cannot simply say "same sense or not", since two readings
may be the same sense at one level of generality, but distinguished as
different subsenses.

By way of response, Amsler "started a riot" (his term) with the
following:

> The distinction I usually make is that I expect true
> ambiguity resolution to find the distinctions between senses
> as detailed in a specific published dictionary. However, in
> applications that matter (i.e. commercial information storage and
> retrieval systems fielding keyword queries) that degree of fineness
> is too much to ask for--
>
> I would settle for differences in which the subject domain
> of the sense differs from one sense to another. ... "understand"
> would, off the top of my head, (i.e. not consulting a collegiate
> or unabridged dictionary) have a generic sense and probably one
> in cognitive psychology or artificial intelligence ...
>
> I don't intend this to inflame the linguistic lexical semanticists;
> but I believe there are degrees of lexical meaning that we can
> say have practical distinction for information storage and retrieval
> of text and others that have little or none. The problem is that
> too little attention has been paid by linguists to that type of
> distinction and it has given linguistics a bad reputation in the
> practical application world.

He succeeded in generating a vigorous debate. Ted Dunning:

> there is a problem with this definition, because people
> can't perform this disambiguation task with any reliability or even
> repeatability.
>
> to anyone from a science in which empirical evidence is considered to
> be the basis for theory confirmation, it is hard to take this sort
> of definition of sense distinction very seriously.

Michael Sperberg-McQueen:

> The only problem I have with this is the implicit assumption that
> the senses given in published dictionaries are disjoint. Since
> the senses are often not disjoint, any ambiguity resolution which
> always chooses exactly one active sense is inherently wrong in any
> case where more than one sense applies.

Amsler's reply was that dictionaries are all we have so we need to use
them, despite their flaws, and that lexicographers try to make senses
distinct: the reason they fail is often that they lack the space to
spell out distinctions in full. I replied that 'distinct senses' is
just one of many goals in lexicography.

Dunning was challenged on his claim that human subjects often don't
get the same answers as each other if asked "which of the following
list of dictionary-senses applies to THIS instance of the word?" Some
anecdotal evidence from Matthew Haines (involving Japanese and
English) supported Dunning's claim. Dunning also had two attempts at
doing some CORPORA-list-based research, where he provided list readers
with
(1) the full dictionary definition of (first) "stock" and
(then) "time", and
(2) a set of corpus instances for each word

and asked readers to say which sense was being used in each corpus
instance, and mail the answers back to him. I don't know whether he
got any responses. [For a similar exercise, see my paper in Computers
and the Humanities (26), 1993.]

Steve Finch was sceptical of the idea that humans did anything akin to
WSD, concluding:

> Whether it is a well
> defined problem in its own right is an open question; I think that it
> stems from reading too many dictionaries :-)

IR literature
=============

Many of the responses came from the IR community, and it was clear
that word sense disambiguation (WSD) is a big topic here, and that
most of the WSD sub-industry which has developed over the last ten
years has been driven by hopes of improving IR performance.

I was pointed to various IR papers, notably Krovetz and Croft,
"Lexical Ambiguity and IR" (ACM-Info Systems, 10(2), 1992), Sanderson
(WSD and IR, SIGIR '94) and Schutze and Pedersen (IR Based on Word
Senses; forthcoming). Krovetz and Croft conducted some experiments
which pointed to the conclusion that WS-ambiguity causes remarkably
little degradation of IR performance (eg, for their corpus, it
appeared that a perfect WSD program would only improve performance by
2%). Sanderson's conclusions were similar: WSD is probably only
relevant where the query is very short, and WSD errors may actually
degrade performance. Schutze and Pedersen, on the other hand, found
that the performance of their system was improved by 7-14% over a
'baseline' system by the addition of a disambiguation module.

In view of Krovetz and Croft's well-known findings, one might ask, why
is there so much IR work in WSD? One possible reason is that the
picture wasn't accurate (cf Schutze and Pedersen). Others might
include
(1) that it's an interesting challenge in its own right
(2) humans do it (or, at least, humans don't get confused by
ambiguity ... this might or might not be the same thing!)
(3) the machine readable dictionary was a new and interesting
object looking for a use. It dealt with general English, as did IR,
and, to a first approximation, its conception of a word sense matched
IR's. So disambiguation was an arena to play with a new toy in.
(4) a similar point relates to corpora. The IR community has
them, in huge quantities - what interesting things can we do with them?
Maybe disambiguation.

Machine Translation
===================

The responses here comprised a comment, made by two sources, that "yes,
of course WS ambiguity causes lots of errors" (eg French "repasser"
translated into English by Systran as "pass by again" rather than
"iron (clothes)"), and directions into the literature.

While the literature mournfully agreed that it was a huge problem, it
had little to say about it. Hutchins and Somers (Intro to MT,
Academic Press, 1992) point out the two variants of the problem:
monolingual ambiguity (where, in the source language, the word is
ambiguous) and translational ambiguity (where speakers of the source
language do not consider the word ambiguous but it has two possible
translations - thus English "blue" gets a different Russian
translation if it is light blue or dark blue.)

MT is a technology rather than a science. MT systems take a long time
to build. So the theory available at their inception is destined to
be out of date by the time they can perform. Thus none of the recent
era of WSD work is employed in existing MT systems, all of which use
extensive (often very extensive) sets of selection restrictions paired
with semantic features to make it possible for the system to make the
correct lexical choice. MT systems usually use a number of
very large lexicons where selection restriction information, designed
to resolve ambiguity problems, accounts for a large proportion of the
bulk. The SYSTRAN English-French lexicon responsible for word choice
contains 400 rules governing the one English word, "oil", and when it
should be translated as "huile", when "petrole" (from Hutchins and
Somers).
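
To give a feel for what such rules look like, here is a minimal
sketch of selection restrictions driving lexical choice. Everything
in it is assumed for illustration: the feature names, the context
words and the three rules are invented, and are in no way SYSTRAN's
actual rules (of which, as noted, there are some 400 for this word).

```python
# Hypothetical semantic features for words that may occur near "oil".
# Invented for illustration; not taken from any real MT lexicon.
FEATURES = {
    "olive": {"food"},
    "cooking": {"food"},
    "engine": {"machine"},
    "crude": {"mineral"},
    "well": {"mineral"},
    "tanker": {"mineral"},
}

# Each rule pairs a feature required somewhere in the context with a
# translation. Rules are tried in order; DEFAULT applies if none fires.
OIL_RULES = [
    ("mineral", "petrole"),
    ("food", "huile"),
    ("machine", "huile"),
]
DEFAULT = "huile"

def translate_oil(context_words):
    """Choose a French rendering of "oil" from features of nearby words."""
    context_features = set()
    for word in context_words:
        context_features |= FEATURES.get(word.lower(), set())
    for feature, translation in OIL_RULES:
        if feature in context_features:
            return translation
    return DEFAULT

print(translate_oil(["crude"]), translate_oil(["olive"]))
```

Even this toy version shows why the rule sets grow so large: every
new context word needs features, and every new feature may need a
rule for every ambiguous word it can disambiguate.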

The MT literature is also rather out of date in how much attention it
has accorded the lexicon. In mainstream CL, the lexicon has moved
into the limelight in recent years. Hutchins and Somers devote just 23
pages to it, dotted about a 350-page book. This is all the more
surprising since the bulk of MT system-development person-hours go
into lexicography, and the lexicons are the MT companies' greatest
assets (Christian Boitet, lecture).

One paper which does bring state-of-the-art WSD to bear on Machine
Translation, albeit in experimental mode, is Dagan and Itai (CL,
20(4), 1994).

NLU
===

Anne de Roeck, working on db front ends:

> The underlying assumptions are that (i) for DB interrogation
> you need accurate "deep" understanding (or users get the wrong
> information) and (ii) for reasons of portability, you definitely do
> not ever, ever, want to port a domain model with associated inference
> engine. ... English word senses within a single category
> need to be customised for every new domain. We make no attempt at
> capturing "general" lexical meaning because we believe this to be in
> conflict with the criteria for a reliable but commercially realistic
> application of this kind. Every word sense is mapped onto database
> constructs by means of a table only (call this table the EDM for
> extended data Model). This EDM only comes into play when the FOL
> expression is mapped onto SQL.

Lynne Cahill, working on info extraction:

> A lot of the problems we encountered with POETIC were related to the
> rather strange sublanguage definitions of certain words and
> abbreviations. ... The "normal" words which caused problems were
> things like "out" and "down" which had semantics along the lines of
> "not functioning" and "blocking" respectively (as applied to traffic
> lights and cables respectively). ... The problems you are interested
> in were, I think, mostly avoided by the severe selection restrictions
> the grammar placed on them, so generally the word "side" would only be
> interpreted as something interesting if it was associated with either
> a vehicle (e.g. "car on side") or a road (e.g. "on side of road").

Other comments:

* "at" is a problem for a telephone query system about flight
bookings: is it "at time" or "at place".
* The specific sublanguage you are working in generally means,
if you find a word with a sense you are interested in, the word will
generally be being used in that sense and not some other.
* "We don't have any semantics in our lexicon, we just have
hooks into the knowledge representation"

BUT Roberto Garigliano/Sengan Short, working on the general purpose
LOLITA system:

> Currently, our system disambiguates each word as soon as
> possible. This policy reduces the possible combinatorial explosion of
> interpretations of ambiguous sentences. This is particularly critical
> as apparently simple words such as ``point'' have over 20
> meanings. High degrees of ambiguity are more the rule than the
> exception. This strategy is implemented in a variety of ways ...

For most NLU applications, the domain is a narrow one so there just is
not very much WS ambiguity in it. Also, to *understand* sufficiently
well to do some task, you will need a domain model (or database tables
to connect to, or - as in MUC - templates to fill in). For the most
part the lexicon does not need to contain any lexical semantics. It
just needs pointers to the relevant construct in the domain model.
The domain-specific lexicon needs constructing from the domain model,
rather than, for instance, from a machine-readable dictionary: if you
did the latter, you would be introducing lots of ambiguity that the
application does not need to know about.

Where a word has one sense in the domain model, and one or more
outside it, the system can generally determine whether the word is
being used in the domain sense by identifying whether the entire
sentence/query/input is coherent in terms of the domain model. If it
is, the word is almost certainly being used in the domain sense. Where
a word has more than one domain sense, it is unlikely that both will
produce coherent analyses. The domain model will generally provide
disambiguating material, not because it has been explicitly added, but
because type-checking, coherence-checking etc. which is necessary in
any case will reject invalid senses. (Hobbs makes this point in
his discussion of 'Vanilla Information Extraction', MUC5 proceedings)
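
A minimal sketch of the idea, using the flight-enquiry "at" example
mentioned above. The entity table, the construct names AT_TIME and
AT_PLACE, and the function coherent_senses are all hypothetical,
invented here for illustration; no respondent's system is being
described.

```python
# Hypothetical domain model for a flight-enquiry system: each known
# entity has a type. Invented for illustration.
ENTITY_TYPES = {
    "heathrow": "Place",
    "gatwick": "Place",
    "noon": "Time",
    "9am": "Time",
}

# The lexicon holds no lexical semantics, only pointers to domain
# constructs, each paired with the argument type that construct accepts.
LEXICON = {
    "at": [("AT_TIME", "Time"), ("AT_PLACE", "Place")],
}

def coherent_senses(word, argument):
    """Return the domain senses of `word` that type-check with `argument`."""
    arg_type = ENTITY_TYPES.get(argument.lower())
    return [
        sense
        for sense, required in LEXICON.get(word.lower(), [])
        if required == arg_type
    ]

print(coherent_senses("at", "noon"), coherent_senses("at", "Heathrow"))
```

Nothing here was written for disambiguation as such. The type check
is needed anyway to build a well-formed query, and the wrong reading
of "at" simply fails it. An argument unknown to the domain model
yields no coherent sense at all, which is one way of detecting that
the input falls outside the domain.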

Where NLU systems aim at general coverage (eg LOLITA) they do not have
a 'deep' domain model, nor can they ignore most dictionary senses, so
WS ambiguity is a big problem.

Adam Kilgarriff
23 June 1995