Conclusion: What makes a Markov model hidden?

Helmut Feldweg (feldweg@sfs.nphil.uni-tuebingen.de)
Thu, 27 Apr 95 15:50:53 MET DST

I would like to thank all those who have responded so far to my
question about what makes a Markov model hidden: Chris Manning, Carl
de Marcken, Henning Reetz, Oliver Jakobs, Helmut Schmid, Aleksander
Murazku, Adam Kilgarriff, and Ted Dunning.

Here is my summary of the responses I have received. As almost all
contributions were also posted on this list, it is merely a synopsis,
interpretation of, and commentary on, the material you may already
have read in this thread.

Some people pointed me to "A Tutorial on Hidden Markov Models and
Selected Applications in Speech Recognition" by Lawrence R. Rabiner,
which originally appeared in the Proceedings of the IEEE, Vol. 77,
No. 2, pages 257-286. It is also available as a reprint on pages
267-296 of "Readings in Speech Recognition", ed. by Alex Waibel and
Kai-Fu Lee, San Mateo: Morgan Kaufmann, 1990. This tutorial is in fact
an excellent introduction to HMMs and their application to speech
recognition. It also clearly defines the differences between
observable Markov models and hidden Markov models. My question,
however, focussed on the classification of some well-known existing
tagging applications as hidden or non-hidden Markov models.

This question was motivated by the following exchange, quoted from Chris
Brew's <chrisbr@cogsci.ed.ac.uk> posting to the corpora list (dated
Mon, 17 Apr 95 16:02:37 +0100) on the subject "Re: Spanish":
>
> >> 1. Could someone point me toward some literature, documentation or code
> >> (preferably C) relating to part of speech taggers? Specifically, I am
> >> hoping to find info on Spanish taggers;
> >
> > The standard algorithm for part of speech tagging uses Hidden Markov
> > Models, and seems to be applicable to a wide variety of languages.
> >
> > actually *hidden* markov models are generally *not* used.
> I was relying on Cutting, Kupiec and Sibun's description in
> ANLP92, which starts: "We present an implementation of a
> part-of-speech tagger based on a hidden Markov model...".
> I probably should have made this clear, since the use
> of a hidden model and untagged corpora makes the Xerox
> tagger different from earlier ones which need pre-tagged
> input. I was calling this "the standard algorithm", on
> the grounds that it is implemented in the Xerox, Acquilex
> and Multext taggers.

One interpretation of this exchange is that training from untagged
text is a prerequisite for *hidden* Markov model tagging. I argue
that the "hiddenness" of a Markov model has, in principle, nothing to
do with the way the model's parameters were obtained. Thus, the fact
that the Xerox tagger and its successors are able to obtain HMM
parameters from untagged texts rather than from pre-tagged texts
alone does not justify the judgement that they are "more hidden" than
earlier taggers such as Church's and DeRose's.

Chris Manning <manning@lcl.cmu.edu> has pointed out in his response
(dated Mon, 24 Apr 1995 20:44:00 -0400) to my query:
>
> Classifying taggers as HMMs or just markov models is slightly
> complicated because taggers like Church's are mixed.
>
> For training on a tagged corpus, one can regard the outputs of such a
> tagger as a pair consisting of a markov model state (the tag) and a word.
> Thus it isn't an HMM because you can tell exactly which state the Markov
> model is in at what time. However, when you do tagging with the Viterbi
> algorithm, you are giving the tagger only the word and asking it to tell
> you what states the machine passed through, and so you are using the tagger
> as an HMM.

According to this, Church's tagger is a true HMM tagger as far as
tagging (as opposed to training) is concerned.
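
To make the tagging-time situation concrete, here is a minimal
Viterbi sketch in Python with a toy tag set and invented
probabilities (my own illustration, not Church's or any other
tagger's actual parameters). The procedure sees only the observed
words and has to recover the state (tag) sequence through back
pointers, which is exactly the hidden-state setting:

    import math

    # Toy parameters, all invented for illustration.
    states = ["DET", "N", "V"]
    start = {"DET": 0.5, "N": 0.3, "V": 0.2}
    trans = {"DET": {"DET": 0.05, "N": 0.90, "V": 0.05},
             "N":   {"DET": 0.10, "N": 0.30, "V": 0.60},
             "V":   {"DET": 0.50, "N": 0.40, "V": 0.10}}
    emit = {"DET": {"the": 0.90, "dog": 0.05, "barks": 0.05},
            "N":   {"the": 0.05, "dog": 0.80, "barks": 0.15},
            "V":   {"the": 0.05, "dog": 0.05, "barks": 0.90}}

    def viterbi(words):
        # delta[t][s]: log-probability of the best path that ends in
        # state s at position t, given the observed words.
        delta = [{s: math.log(start[s] * emit[s][words[0]])
                  for s in states}]
        back = [{}]
        for t in range(1, len(words)):
            delta.append({})
            back.append({})
            for s in states:
                prev = max(states, key=lambda p: delta[t-1][p]
                           + math.log(trans[p][s]))
                delta[t][s] = (delta[t-1][prev]
                               + math.log(trans[prev][s])
                               + math.log(emit[s][words[t]]))
                back[t][s] = prev
        # Follow the back pointers to recover the best tag sequence.
        tags = [max(states, key=lambda s: delta[-1][s])]
        for t in range(len(words) - 1, 0, -1):
            tags.append(back[t][tags[-1]])
        return list(reversed(tags))

    print(viterbi(["the", "dog", "barks"]))   # -> ['DET', 'N', 'V']

Note that nothing in viterbi() depends on where start, trans and emit
came from; the states are hidden at tagging time either way.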

However, Chris states in the same posting:
> Taggers like the Xerox tagger are true HMM taggers in the sense that the
> training is also done via an HMM -- the tagger sees only the output words,
> and has to guess which part of speech sequence the HMM is moving through.
>
> To the extent that training is the most important part, I think the former
> class should be regarded as markov model taggers and the second as HMM
> taggers, but in reality the first kind is mixed.

I have a different view here. To me, the most important part of a
tagger is the tagging itself. How the parameters for the tagger were
obtained is a different, although very important, question. In
practice, none of the taggers classified by Chris as true HMM taggers
relies solely on Baum-Welch estimation of the HMM parameters when it
comes to high-precision tagging. The Xerox tagger requires manual
tuning via initial biases; Elworthy's Acquilex tagger (as do Helmut
Schmid's extensions for the Xerox tagger, as does the Multext
tagger?) allows initialisation of parameters from pre-tagged texts
before Baum-Welch re-estimation is employed. Thus, if "true HMM"
means Baum-Welch estimation of parameters, none of these would be
true HMM taggers.
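
As an illustration of this mixed strategy, here is a minimal sketch
(the corpus format and all names are my own invention, not taken from
any particular tagger's code) of initialising transition and emission
probabilities from a pre-tagged corpus, where the states are fully
visible, before Baum-Welch re-estimation on untagged text takes over:

    from collections import defaultdict

    tagged = [[("the", "DET"), ("dog", "N"), ("barks", "V")]]  # toy corpus

    trans_counts = defaultdict(lambda: defaultdict(int))
    emit_counts = defaultdict(lambda: defaultdict(int))
    for sentence in tagged:
        for word, tag in sentence:
            emit_counts[tag][word] += 1
        for (_, tag), (_, next_tag) in zip(sentence, sentence[1:]):
            trans_counts[tag][next_tag] += 1

    def normalise(counts):
        # Turn raw counts into conditional probability estimates.
        return {a: {b: n / sum(row.values()) for b, n in row.items()}
                for a, row in counts.items()}

    # These estimates replace a blind initialisation as the starting
    # point for Baum-Welch re-estimation on untagged text.
    trans = normalise(trans_counts)
    emit = normalise(emit_counts)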

In contrast to what Ted Dunning stipulated in his message
(dated Mon, 17 Apr 1995 07:35:50 -0600) on the subject "Spanish":
>
> actually *hidden* markov models are generally *not* used.
>
*hidden* Markov models are generally used in part-of-speech tagging,
and they were generally used even before the advent of the Xerox
tagger. The "H" in HMM has nothing to do with the way the model's
parameters were obtained.

I feel that this view is supported by Ted Dunning's response (dated
Tue, 25 Apr 1995 13:02:46 -0600) to my query:
>
> the distinction between hidden and non-hidden markov models is (as an
> earlier poster mentioned) based on whether or not the output of the
> model is the actual sequence of states of the markov model. for a
> hidden markov model, the output is not the internal state sequence,
> but is a probabilistic function of this internal sequence. some
> people make the output symbols be a function of each internal state,
> some make the output be a function of the transitions. clearly,
> having the output be a deterministic function of the internal state
> sequence is a subset of the probabilistic case. also, hidden markov
> models are a superset of the direct models (since the output function
> can be the identity function).
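
A toy illustration of this subset relation (my own example, not taken
from Ted's message): a visible Markov model is just an HMM whose
output function is the identity, i.e. each state emits its own name
with probability 1:

    states = ["A", "B"]

    # Visible Markov model: each state deterministically emits its
    # own name, so the output *is* the state sequence.
    visible_emit = {s: {s: 1.0} for s in states}

    # Hidden Markov model: the output is a probabilistic function of
    # the state, so one observation may be compatible with several
    # states.
    hidden_emit = {"A": {"x": 0.7, "y": 0.3},
                   "B": {"x": 0.4, "y": 0.6}}

Observing "A" under visible_emit pins the model to state A, whereas
observing "x" under hidden_emit leaves the state uncertain; that
residual uncertainty is what the "H" in HMM refers to.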

Helmut Feldweg