Spanish

Ted Dunning (ted@crl.nmsu.edu)
Mon, 17 Apr 1995 07:35:50 -0600

> 1. Could someone point me toward some literature, documentation or code
> (preferably C) relating to part of speech taggers? Specifically, I am
> hoping to find info on Spanish taggers;

The standard algorithm for part of speech tagging uses Hidden Markov
Models, and seems to be applicable to a wide variety of languages.

actually *hidden* markov models are generally *not* used.

It is quite likely that you will be able to use this algorithm
unchanged. What you may have to worry about is details of the size
and nature of the tag set, the amount of training material which you
have available and the extent to which you give the tagger hints to
bootstrap the training process.

while this is strictly true, i don't think that it is practically
true.

the number of distinct spelling forms in romance languages is very
large and there is a rich and fairly simple (for the most part)
morphology.

as a result, the number of tags required for a traditional
style tagger is also quite a bit larger than for english.

the upshot of these considerations is that the amount of training
material required is much larger than is strictly necessary.

these problems can be largely avoided by tagging with tuples rather
than single tags and by doing some simple morphology so that each word
in the source text is considered to be a pair of stem+apparent
morphology. the tag tuples can contain the gross part of speech
(noun, verb, ...) plus additional information (tense, gender, number).
the advantage here is that the statistical model being learned can be
considerably simpler. for instance, where the ending of a word
strongly limits its part of speech, our statistical model can learn
this fact once, in a relatively universal manner, instead of learning
it over again for each word.
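
as a rough illustration of the idea (a minimal sketch, not the code of
any existing tagger), the python below splits each word into a
stem+ending pair and counts tag tuples per ending rather than per word.
the suffix list, the toy training pairs and the names split_word, train
and guess are all made up for the example; a real system would use a
proper morphological analyser and a sequence model on top.

```python
from collections import Counter, defaultdict

# Hypothetical, tiny suffix list chosen only for this example.
SUFFIXES = ["aron", "ando", "os", "as", "ar", "o", "a", "s"]

def split_word(word):
    """Return (stem, ending); ending is "" if no listed suffix matches."""
    # Try the longest suffixes first and keep at least two stem letters.
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[:-len(suf)], suf
    return word, ""

# Toy hand-tagged data (assumed): word -> (gross POS, feature, number).
TRAINING = [
    ("hablaron", ("verb", "past", "plural")),
    ("cantaron", ("verb", "past", "plural")),
    ("gatos", ("noun", "masc", "plural")),
    ("perros", ("noun", "masc", "plural")),
]

def train(pairs):
    """Count tag tuples per ending, so evidence is shared across stems."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        _, ending = split_word(word)
        counts[ending][tag] += 1
    return counts

def guess(counts, word):
    """Most frequent tag tuple seen with this word's ending, or None."""
    _, ending = split_word(word)
    if ending not in counts:
        return None
    return counts[ending].most_common(1)[0][0]

model = train(TRAINING)
print(guess(model, "bailaron"))  # a word form never seen in training
```

the point of the sketch is only that "bailaron" gets a sensible tag
tuple from what was learned about the ending "aron" on other stems.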

the advantages obtained in this manner can be immense (orders of
magnitude decrease in the size of the statistical model and in the
amount of training data needed).
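
to make the size argument concrete, here is a back-of-the-envelope
count of lexical parameters for the two designs. every number in it is
a hypothetical placeholder picked for illustration, not a measurement
of spanish or of any real tagger, and the factored count assumes that
stems predict only the gross part of speech while endings predict the
full tuple.

```python
# All figures below are assumed, illustrative values.
n_stems = 10_000        # distinct stems
forms_per_stem = 50     # inflected spellings per stem
n_atomic_tags = 200     # tags when each tuple is one opaque tag
n_gross_pos = 10        # noun, verb, ...
n_endings = 100         # distinct endings

# Traditional design: one entry per (word form, atomic tag).
n_forms = n_stems * forms_per_stem
atomic_params = n_forms * n_atomic_tags

# Factored design: (stem, gross POS) entries plus (ending, tuple) entries.
factored_params = n_stems * n_gross_pos + n_endings * n_atomic_tags

print(atomic_params, factored_params, atomic_params / factored_params)
```

with these made-up figures the factored table is smaller by a factor of
several hundred; the real ratio depends entirely on the language and
the tag set.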