p(w/t) vs p(t/w) in tagging

Bernard Merialdo (merialdo@eurecom.fr)
Fri, 19 Apr 96 12:33:22 +0200

Pete Whitelock writes:
> Can anyone explain to me why in the standard tagger model, the lexical
> probability is defined as the probability of the word given the tag rather
> than the tag given the word. The latter would seem much more intuitive
> (as well as easier to estimate), but is reported to give worse results
> (e.g. the discussion in Charniak's book p.50). Is there a good reason for this?

There is a good reason. The tagger model uses a Hidden Markov Model
where the observed output symbols are the words and the hidden
structure is based on the tags.

The HMM gives an estimate of the probability p(W) that a sequence of
words is produced. Decomposing this probability, the tagger model is
(in the case of a tri-class model):

p(W) = p(w1 w2 ... wn)
     ~= product_i p(wi/wi-2,wi-1)             [each word depends on
                                               the two previous words]
     ~= sum over tag sequences T of
        product_i p(wi/ti) . p(ti/ti-2,ti-1)  [each word depends on its
                                               tag only, each tag on the
                                               two previous tags]

(In practice the sum over tag sequences is computed by dynamic
programming, or approximated by its largest term, the contribution of
the best tag sequence.)

Thus the term that is introduced naturally is p(wi/ti) and not
p(ti/wi).
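
To make the decomposition concrete, here is a small Python sketch (not
from the original post; all probability tables are invented toy values)
that computes p(W) for a three-word sentence exactly as the tri-class
formula reads, by summing the product p(wi/ti) . p(ti/ti-2,ti-1) over
every tag sequence:

from itertools import product

TAGS = ["DET", "N", "V"]

# p(wi/ti): lexical probabilities (word given tag) -- toy values
p_w_t = {
    ("the", "DET"): 0.5,
    ("dog", "N"): 0.01,  ("barks", "N"): 0.002,
    ("dog", "V"): 0.001, ("barks", "V"): 0.01,
}

# p(ti/ti-2,ti-1): tag trigram probabilities; "^" pads the sentence
# start -- toy values, unlisted histories get probability 0
p_t_hist = {
    ("DET", "^", "^"): 0.4, ("N", "^", "DET"): 0.7,
    ("V", "DET", "N"): 0.5, ("N", "DET", "N"): 0.2,
}

def p_W(words):
    """Sum the joint p(W,T) over every tag sequence T (brute force)."""
    total = 0.0
    for tags in product(TAGS, repeat=len(words)):
        padded = ("^", "^") + tags     # so ti-2, ti-1 exist for i = 0, 1
        p = 1.0
        for i, (w, t) in enumerate(zip(words, tags)):
            p *= p_w_t.get((w, t), 0.0)                          # p(wi/ti)
            p *= p_t_hist.get((t, padded[i], padded[i+1]), 0.0)  # p(ti/ti-2,ti-1)
        total += p
    return total

print(p_W(["the", "dog", "barks"]))

The brute-force enumeration is exponential in the sentence length; a
real tagger would compute the same sum with the forward algorithm, or
take the single best tag sequence with Viterbi, but the quantity being
computed is the same.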

However, Bayes' rule gives a simple relation between the two:

p(wi/ti) = p(ti/wi) * p(wi) / p(ti)

so once the model is built, it is straightforward to produce either
set of probabilities.
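
As a sketch of that relation (again with invented numbers): if you have
estimated p(ti/wi), say because it is easier to obtain from a lexicon,
you can recover the p(wi/ti) table the HMM needs, provided you also
have the unigram probabilities p(wi) and p(ti):

# p(wi/ti) = p(ti/wi) * p(wi) / p(ti), applied entry by entry
p_t_given_w = {("dog", "N"): 0.9, ("dog", "V"): 0.1}  # p(ti/wi) -- toy values
p_w = {"dog": 0.001}                                  # unigram p(wi)
p_t = {"N": 0.25, "V": 0.15}                          # unigram p(ti)

p_w_given_t = {
    (w, t): p_tw * p_w[w] / p_t[t]
    for (w, t), p_tw in p_t_given_w.items()
}

print(p_w_given_t[("dog", "N")])   # 0.9 * 0.001 / 0.25 = 0.0036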

-- 
============================================================================
   Bernard Merialdo			!	e-mail : merialdo@eurecom.fr
   Professor				!
   Multimedia Communications Dept	!
   Institut EURECOM			!	tel : +33   93 00 26 29
   2229 Route des Cretes		!	sec : +33   93 00 26 26
   B.P. 193        			!	fax : +33   93 00 26 27
   06904 Valbonne Cedex - FRANCE	!
   http://www.eurecom.fr/Htdocs_media/People/merialdo.html
============================================================================