Re: What makes a Markov model hidden?

Ted Dunning (T.Dunning@dcs.shef.ac.uk)
Thu, 29 Jun 95 18:21:04 BST

(1) He writes:

> taggers like Church's are mixed.
>
> For training on a tagged corpus, one can regard the outputs of such a
> tagger as a pair consisting of a markov model state (the tag) and a word.
> Thus it isn't an HMM because you can tell exactly which state the Markov
> model is in at what time.

Even if this is not an HMM, since the states can be seen, it also does not
fit the definition of a Markov Model in the sense of [1].

Here a Markov Model only contains:
- states that can be seen
- arcs labeled with state transition probabilities (A)
- initial state probabilities (Pi)

this definition is incomplete. in particular, the state transition
probabilities are allowed to depend only on the current state and the
input. this may be implicit in using the graphical definition, but it
isn't clear from the text given here alone.

to show that church's tagger is markov (or more correctly, is
isomorphic to a markov model) we have to show how states can be
labeled so that we get the familiar definition. (i think that this
form of tagger is actually derived from earlier work by mercer and
brown, so calling it church's tagger is a bit of a misnomer).

in any case, let's take the slightly more general case of a tagger in
which the current tag depends only on the k previous tags and the m
previous words plus the current word. this means that the conditional
distribution for the tag on word n is expressed as:

p(tag[n] | tag[n-k] ... tag[n-1], word[n-m] ... word[n])

now, let us label the states of the markov model using the previous
tags and words. if we define a shifting operator Z which works only on
state labels (this is just so that we have a shorter definition
later), then we get

Z(<tag[n-k]...tag[n-1], word[n-m]...word[n-1]>, tag[n], word[n]) =
<tag[n-k+1]...tag[n], word[n-m+1]...word[n]>

p(Z(S, tag[n], word[n]) | S, word[n]) =
p(tag[n] | tag[n-k] ... tag[n-1], word[n-m] ... word[n])

and all other p(S1 | S2, word) are assumed to be zero.

this is just a rewriting; we haven't done any real math yet at all.
but we now have a set of transition probabilities in which each state
is labeled with k tags and m words. furthermore, these transitions
satisfy the markov condition since the distribution of the next state
depends only on the current state and the input.
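as a concrete sketch of this shift-style labelling (the tag and word
names below are invented purely for illustration), the operator Z can
be written directly:

```python
# a sketch of the shift-style state labelling described above, assuming
# k previous tags and m previous words in each state label (the tag and
# word names here are made up for the example).

def Z(state, tag, word):
    """drop the oldest tag and word from the label, append the newest."""
    tags, words = state
    return (tags[1:] + (tag,), words[1:] + (word,))

# with k = 2 tags and m = 1 word in the state label:
state = (("DET", "N"), ("dog",))
state = Z(state, "V", "barks")
# state is now (("N", "V"), ("barks",))
```

note how much of the old label survives into the new one, which is
exactly why remembering the general principle is easier than carrying
the labels around.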

this shift style labelling is a pain in the ass most of the time.
this is because so much of the label on the previous state is repeated
in the label of the current state. it is much easier to just remember
that a discrete system with transition probabilities dependent only on
a finite number of previous states and inputs can be considered
Markov.

note that a part of speech tagger is entirely in the form of an
exposed or visible markov model if we choose to label the transitions
with part of speech tags.

the definition of a hidden markov model adds a probabilistic output
function and generally, we are only allowed to assume that we can
observe the result of the output function.

clearly, the class of hidden markov models subsumes the class of
visible markov models since the output function can be the identity
function.
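a throwaway illustration of that point (the dimensions are arbitrary):
with an identity output distribution, state i emits symbol i with
probability 1, so observing the output reveals the state exactly.

```python
# a visible markov model as a degenerate HMM: the output function is the
# identity, i.e. state i always emits symbol i.
S = 3  # number of states, arbitrary for the example
B = [[1.0 if i == j else 0.0 for j in range(S)] for i in range(S)]
```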

the advantage of using a visible markov model with tagged data is that
the transition probabilities can be estimated directly using any of a
number of methods. if the model depends only on the current word and
the previous two tags and you have a lot of data, then maximum
likelihood estimators are probably practical (that is, divide the count
by the total). if you use a more sophisticated model, then trickier
estimators are likely to be needed. deleted interpolation is likely
to be very helpful in such a case.
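a minimal sketch of the direct estimation, assuming a tiny invented
hand-tagged corpus and a tag-bigram model (deleted interpolation would
then mix this estimate with lower-order ones):

```python
# maximum likelihood estimation of tag transition probabilities from
# tagged data: divide the bigram count by the count of the conditioning
# tag.  the corpus here is made up for the example.
from collections import Counter

tagged = [("the", "DET"), ("dog", "N"), ("barks", "V"),
          ("the", "DET"), ("cat", "N"), ("sleeps", "V")]

tags = [t for _, t in tagged]
bigrams = Counter(zip(tags, tags[1:]))
context = Counter(tags[:-1])

# p(cur | prev) = count(prev, cur) / count(prev)
p = {(prev, cur): c / context[prev] for (prev, cur), c in bigrams.items()}
```

on this toy corpus every DET is followed by N, so p[("DET", "N")] comes
out as 1.0; with real data the counts would of course spread out and
smoothing would matter.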

when the state sequence is not known (possibly because we are using
untagged data for training), then something like the Baum-Welch method
(also called the forward-backward algorithm and a special case of the
E-M algorithm) has to be used to estimate the transition probabilities
for the tagger. this doesn't say much of anything about the form of
the model used, however.
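for completeness, here is a bare-bones sketch of one Baum-Welch
re-estimation step for the transition matrix of a small discrete HMM.
all of the numbers and dimensions are invented for the example, and a
real implementation would rescale to avoid underflow and iterate to
convergence:

```python
# one Baum-Welch (EM) re-estimation of the transition matrix A for a
# discrete HMM with initial probabilities pi and output probabilities B.

def forward(obs, pi, A, B):
    """alpha[t][i] = p(obs[0..t], state at t = i)."""
    S = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(S)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(S))
                      for j in range(S)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = p(obs[t+1..] | state at t = i)."""
    S = len(A)
    beta = [[1.0] * S]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * nxt[j]
                            for j in range(S)) for i in range(S)])
    return beta

def reestimate_transitions(obs, pi, A, B):
    """one EM update of A: expected transition counts, row-normalized."""
    S = len(pi)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    num = [[0.0] * S for _ in range(S)]
    den = [0.0] * S
    for t in range(len(obs) - 1):
        for i in range(S):
            for j in range(S):
                xi = alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                num[i][j] += xi
                den[i] += xi
    return [[num[i][j] / den[i] for j in range(S)] for i in range(S)]

# an invented two-state toy model
obs = [0, 1, 0, 0, 1]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
A_new = reestimate_transitions(obs, pi, A, B)
```

the point is only that the procedure never looks at the state sequence,
which is what makes it applicable to untagged data.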

in the context of a probabilistic parser of sufficiently general form
(see Magerman's recent work for an example), statistical part of
speech taggers may best be subsumed into the parser itself. in fact,
using human assigned parts of speech may not be as useful as merely
allowing the system to make up its own part of speech tags in any way
which optimizes the parsing process.

QUESTION 1:

I agree that taggers like Church's are mixed, i.e. HMM + something
different, but when their training does not fit the definition of a Markov
Model [1], how can those taggers be regarded as markov model
taggers?

this question is essentially vacuous. church's tagger used a visible
markov model for tagging, and training was based on hand-tagged data.
it also trivially and vacuously used a hidden markov model since any
visible MM is also an HMM. based on the availability of tagged data,
maximum likelihood estimators were used rather than the forward-backward
algorithm.

Can the process that is done before tagging with taggers like Church's
be called 'training'?

of course it can.

This process does not start from a Model with default (a priori)
probabilities (e.g. derived from biases like in the XEROX tagger) which
then are automatically refined by training on a corpus.

so what?

Instead of this, we simply count frequencies from the 'training' corpus
and calculate probabilities from them.

to be pedantic, we count occurrences, derive frequencies and then
estimate transition probabilities. this case is so trivial that the
distinctions are difficult to see, but in other cases the distinctions
are important.

With those probabilities we construct the HMM which never is really
'trained'.

it isn't trained using any iterative techniques. instead, cleverer
methods are used which allow training to be done with only one pass
through the data. having only a single pass doesn't keep it from
being training.