Re: Spanish

Wu Zhibiao (wzb@unagi.cis.upenn.edu)
Mon, 17 Apr 1995 10:11:37 -0400 (EDT)

>
> these problems can be largely avoided by tagging with tuples rather
> than single tags and by doing some simple morphology so that each word
> in the source text is considered to be a pair of stem+apparent
> morphology. the tag tuples can contain the gross part of speech
> (noun, verb ...) plus additional information (tense, gender, number).
> the advantage here is that the statistical model being learned can be
> considerably simpler. for instance, where the ending of a particular
> word strongly limits the part of speech of a word, our statistical
> model can learn this fact in a relatively universal manner instead of
> learning this fact over again for each word.
>
> the advantages obtained in this manner can be immense (orders of
> magnitude decrease in the size of the statistical model and amount of
> needed training data).
>
LDC's FST package does exactly the morphological analysis described
above. The Linguistic Data Consortium (LDC) at UPenn offers the LDC-FST
software package to its Comlex members. The first version of the
software and its Spanish morphological transducer description have now
been released. The implementation is based on finite-state transducer
theory. The software is written in C and runs on Sun Unix.
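
To make the quoted idea concrete, here is a minimal Python sketch of
treating each word as stem + apparent morphology and mapping the ending
to tag tuples. It is not the LDC-FST implementation or its interface:
the Tag fields, the SUFFIX_TAGS table and the analyze function are all
hypothetical names made up for illustration, and a real transducer
covers far more than four endings.

```python
from typing import List, NamedTuple, Optional, Tuple


class Tag(NamedTuple):
    """A tag tuple: gross part of speech plus optional features."""
    pos: str
    tense: Optional[str] = None
    gender: Optional[str] = None
    number: Optional[str] = None


# Hypothetical suffix table, for illustration only (not from LDC-FST).
SUFFIX_TAGS = {
    "aron": [Tag("verb", tense="preterite", number="plural")],
    "ando": [Tag("verb", tense="gerund")],
    "os": [Tag("noun", gender="masc", number="plural"),
           Tag("adj", gender="masc", number="plural")],
    "as": [Tag("noun", gender="fem", number="plural"),
           Tag("adj", gender="fem", number="plural"),
           Tag("verb", tense="present", number="singular")],
}

# Try longer suffixes first so "aron" wins over any shorter ending.
SUFFIXES = sorted(SUFFIX_TAGS, key=len, reverse=True)


def analyze(word: str) -> Tuple[str, str, List[Tag]]:
    """Split a word into (stem, suffix, candidate tag tuples)."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Require a non-empty stem so a bare ending is not split.
        if w.endswith(suffix) and len(w) > len(suffix):
            return w[: -len(suffix)], suffix, SUFFIX_TAGS[suffix]
    # No known ending: keep the whole word as the stem.
    return w, "", [Tag("unknown")]


if __name__ == "__main__":
    for word in ("hablaron", "cantaron", "gatos"):
        print(word, analyze(word))
```

In this sketch "hablaron" and "cantaron" share a single entry for the
ending, so whatever a statistical model learns about that ending applies
to both verbs instead of being learned again for each word.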

For more information, send email to ldc@unagi.cis.upenn.edu.

Best,
Zhibiao Wu