Re: Corpora: POS tagger for Tamil

Arash Zeini (ar.zeini@uni-koeln.de)
Wed, 24 Mar 1999 15:14:23 -0800

Dear Gregory,

At 10:24 24.03.99 -0500, you wrote:
>I don't know any Tamil, or anything about the morphology of Tamil. But
>I infer from the above statement that Tamil has somewhat complicated
>verb conjugation. (therefore requiring some morphological analysis
>prior to tagging.)

Yes, you are right. A morphological analysis is required, and that is
actually what my macro does. Being an agglutinating language, Tamil has a
fairly transparent morphology, in this case verb formation. Recognizing
the suffixes is not very difficult, and I already have a lexicon containing
this information.
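
As a minimal sketch of this kind of suffix recognition (in Python, not the
macro itself): the tiny lexicon below uses simplified, transliterated
markers for a single verb class and is an assumption made for illustration,
as are the tag names and the function name `analyze`.

```python
# Toy suffix lexicon: simplified transliteration, illustrative entries only.
# These are assumptions for the sketch, not the lexicon mentioned above.
TENSE_MARKERS = {"kir": "PRES", "t": "PAST", "v": "FUT"}
PERSON_ENDINGS = {"een": "1SG", "aay": "2SG", "aan": "3SG.M", "aal": "3SG.F"}


def analyze(word):
    """Return every (stem, tense, person) split licensed by the toy lexicon."""
    analyses = []
    for ending, person in PERSON_ENDINGS.items():
        if not word.endswith(ending):
            continue
        rest = word[: -len(ending)]
        for marker, tense in TENSE_MARKERS.items():
            # Require a non-empty stem in front of the tense marker.
            if rest.endswith(marker) and len(rest) > len(marker):
                analyses.append((rest[: -len(marker)], tense, person))
    return analyses


print(analyze("ceykireen"))  # stem "cey" + present marker + 1st person singular
```

A word with no matching ending simply yields an empty list, and a word that
matches more than one split yields several analyses, which is where a later
disambiguation step would come in.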

>One question I have is: How ambiguous is (written) Tamil with respect to
>part of speech? i.e. are there (frequent) cases of words such as
>English "present" which can be (for example) both a noun and a verb? If
>written Tamil is unambiguous you may not need a statistical
>disambiguation step -- just the morphological analysis.

There are some cases where "verbal categories" could be ambiguous. I can't
give you any frequencies, but the neuter adjectival participle of the
future tense, for example, is identical to the finite form of the third
person neuter future.
The simple verb conjugation in the three tenses, however, is on the whole
clear and gives little reason for ambiguity.
I can only guess that we will need a statistical disambiguation at some
stage.
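
A minimal sketch of what such a disambiguation step could look like for the
future neuter case, in Python: the tag names, the function `disambiguate`,
and above all the counts are invented for illustration. Real counts would
have to come from a hand-tagged corpus.

```python
# Two candidate tags for one ambiguous future neuter form
# (tag names are hypothetical).
CANDIDATES = ["FUT.ADJ.PART", "FUT.3SG.N.FINITE"]

# Invented toy counts of (tag, tag of the following token).
# "END" stands for the end of the sentence.
TOY_COUNTS = {
    ("FUT.ADJ.PART", "NOUN"): 9,
    ("FUT.3SG.N.FINITE", "NOUN"): 1,
    ("FUT.ADJ.PART", "END"): 1,
    ("FUT.3SG.N.FINITE", "END"): 9,
}


def disambiguate(candidates, next_tag, counts):
    """Pick the candidate seen most often before next_tag (unseen pairs count as 0)."""
    return max(candidates, key=lambda tag: counts.get((tag, next_tag), 0))


print(disambiguate(CANDIDATES, "NOUN", TOY_COUNTS))
print(disambiguate(CANDIDATES, "END", TOY_COUNTS))
```

The sketch only looks at the tag of the next token; whether that much
context is enough for Tamil is an open question here.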

Thank you very much for your suggestions. I will check the links.

With best wishes,
Arash