RE: Corpora: POS tagger

From: Martin Wynne (martin.wynne@ota.ahds.ac.uk)
Date: Tue Apr 30 2002 - 17:35:57 MET DST

Next message: Diego Molla: "Re: Corpora: Counting semantic propositions (was Relatve text length)"

Previous message: David Grant: "Re: Corpora: POS tagger"
Maybe in reply to: Nicole Baumgarten: "Corpora: POS tagger"
Next in thread: Atro Voutilainen: "Re: Corpora: POS tagger"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

I think that, as you suspected, the specification you require is Utopia! I
can see at least four problems with it:

(i) Complete accuracy in automatic procedures is not possible because there
are (in most types of text) a significant number of cases where semantic
ambiguity (mapping onto POS ambiguity) can only be resolved by contextual
knowledge;

(ii) POS-tagging is not a straightforward task with clearly defined
procedures and rules which are accepted and agreed on in the community. Part
of speech categories are based on a bundle of different levels of
categorisation (semantic, syntactic, morphological), and there are many
different theories underlying these systems as well as different ways of
applying the theories. Even if you decide on the theoretical underpinning,
there will be conflicts between different levels of classifying words (e.g.
"it looks like a verb, but it functions like an adjective, yet its meaning
is like a noun"). So tagging can't help being in some senses arbitrary and
inconsistent, and even if you claim >99% accuracy, no-one else will agree
with you. And if you were thinking of having the same or compatible tagsets
for all the languages, then you are multiplying these problems. (This isn't
an argument against POS-tagging per se, rather a plea for clear guidelines
and good documentation);

(iii) You would need not one but four programs for tagging four languages.
While you might get some success with one tagging algorithm, it will depend
on important resources such as lexicons, morphological rules, transition
probabilities, and algorithms for identifying word boundaries, sentence
boundaries, foreign words, proper nouns etc., and these will be different for
each language. And if you want to achieve an extremely high level of
accuracy, it is actually unlikely that the same approach would work for the
four languages you mention. I think you'd get better results by using the
best tagger for each individual language.

(iv) You'll probably find that some of the important resources in this field
don't work under Windows, as they will have been developed under Unix.
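
To make point (ii) concrete, here is a minimal Python sketch of how one
and the same tagger output scores differently against two gold standards
that follow different guidelines. Everything in it is an assumption made
up for illustration: the sentence, the tag names, and the two "guidelines"
are hypothetical and not taken from any real corpus or tagset.

```python
def accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have the same length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Hypothetical five-token sentence (illustration only).
tokens = ["the", "running", "water", "is", "cold"]

# Output of some imagined tagger.
tagger_output = ["DET", "ADJ", "NOUN", "VERB", "ADJ"]

# Hypothetical guideline A: tag "running" by its function (adjective).
gold_a = ["DET", "ADJ", "NOUN", "VERB", "ADJ"]

# Hypothetical guideline B: tag "running" by its form (verb).
gold_b = ["DET", "VERB", "NOUN", "VERB", "ADJ"]

score_a = accuracy(tagger_output, gold_a)  # measured against guideline A
score_b = accuracy(tagger_output, gold_b)  # measured against guideline B
agreement = accuracy(gold_a, gold_b)       # how far the two annotations agree

print(f"accuracy vs. guideline A: {score_a:.2f}")
print(f"accuracy vs. guideline B: {score_b:.2f}")
print(f"agreement between A and B: {agreement:.2f}")
```

The tagger output is unchanged between the two measurements; only the
annotation decision for one token differs. An accuracy figure therefore
only means something together with the guidelines it was measured against.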

Sorry if this is all rather negative - good luck!

Best,
Martin

__
Martin Wynne
martin.wynne@ota.ahds.ac.uk
Linguistics Officer
Oxford Text Archive

Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275

-----
From: Nicole Baumgarten [mailto:se9d030@rrz.uni-hamburg.de]
Sent: 30 April 2002 15:30
To: corpora@hd.uib.no
Subject: Corpora: POS tagger

Dear all,

does anybody know of an automatic, plus-99 per cent accuracy (utopia?),
unidiosyncratic, easy-to-apply POS tagger that can handle German, English
(French, Spanish) and works in an ordinary Windows environment?
ANY ideas are greatly appreciated!

All the best
Nicole.

------------------------------------------
Nicole Baumgarten
SFB 538 Mehrsprachigkeit
Covert Translation
Max-Brauer-Allee 60
22765 Hamburg
nicole.baumgarten@uni-hamburg.de
++49-40-42838 6453

This archive was generated by hypermail 2b29 : Tue Apr 30 2002 - 18:01:11 MET DST