RE: Corpora: POS tagger

From: Martin Wynne
Date: Tue Apr 30 2002 - 17:35:57 MET DST

    I think that, as you suspected, the specification you require is Utopia! I
    can see at least four problems with it:

    (i) Complete accuracy in automatic procedures is not possible because there
    are (in most types of text) a significant number of cases where semantic
    ambiguity (mapping onto POS ambiguity) can only be resolved by contextual

    (ii) POS-tagging is not a straightforward task with clearly defined
    procedures and rules which are accepted and agreed on in the community. Part
    of speech categories are based on a bundle of different levels of
    categorisation (semantic, syntactic, morphological), and there are many
    different theories underlying these systems as well as different ways of
    applying the theories. Even if you decide on the theoretical underpinning,
    there will be conflicts between different levels of classifying words (e.g.
    "it looks like a verb, but it functions like an adjective, yet its meaning
    is like a noun"). Tagging therefore can't help being in some senses arbitrary
    and inconsistent, and even if you claim >99% accuracy, no-one else will agree
    with you. And if you were thinking of having the same or compatible tagsets
    for all the languages, then you are multiplying these problems. (This isn't
    an argument against POS-tagging per se, rather a plea for clear guidelines
    and good documentation);

    (iii) You would need not one but four programs for tagging four languages.
    While you might get some success with one tagging algorithm, it will depend
    on important resources such as lexicons, morphological rules, transition
    probabilities, and algorithms for identifying word boundaries, sentence
    boundaries, foreign words, proper nouns etc., and these will be different for
    each language. And if you want to achieve an extremely high level of
    accuracy, it is actually unlikely that the same approach would work for the
    four languages you mention. I think you'd get better results by using the
    best tagger for each individual language.

    (iv) You'll probably find that some of the important resources in this field
    don't work under Windows, as they will have been developed under Unix.
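
    To illustrate point (iii), here is a minimal sketch of a bigram tagger in
    which the decoding algorithm is generic but the lexicon and transition
    probabilities are language-specific resources. Everything in it is an
    assumption made for illustration: the names LEXICON, TRANSITIONS and
    viterbi are hypothetical, the three-tag tagset is a toy, and the
    probabilities are invented rather than estimated from any corpus.

```python
import math

# Hypothetical toy resources for ONE language (English). A tagger for German,
# French or Spanish would need its own tables, even with the same decoder.
# All numbers are invented for illustration, not estimated from a corpus.
LEXICON = {  # word -> {tag: P(word | tag)}
    "the": {"DET": 1.0},
    "can": {"NOUN": 0.3, "VERB": 0.7},   # ambiguous out of context
    "rusts": {"NOUN": 0.2, "VERB": 0.8},
}

TRANSITIONS = {  # (previous tag, tag) -> P(tag | previous tag)
    ("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.1, ("<s>", "VERB"): 0.1,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.05, ("DET", "DET"): 0.05,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.1,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.2,
}


def viterbi(words, lexicon, transitions, start="<s>"):
    """Return the most probable tag sequence under a bigram model.

    Sketch only: unknown words raise KeyError, and there is no smoothing.
    """
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {start: (0.0, [])}
    for word in words:
        new_best = {}
        for tag, emit in lexicon[word].items():
            candidates = []
            for prev, (score, path) in best.items():
                trans = transitions.get((prev, tag), 0.0)
                if trans > 0.0 and emit > 0.0:
                    new_score = score + math.log(trans) + math.log(emit)
                    candidates.append((new_score, path + [tag]))
            if candidates:
                new_best[tag] = max(candidates, key=lambda c: c[0])
        best = new_best
    # Pick the highest-scoring complete path.
    return max(best.values(), key=lambda c: c[0])[1]


print(viterbi(["the", "can", "rusts"], LEXICON, TRANSITIONS))
# -> ['DET', 'NOUN', 'VERB']
```

    Taken alone, "can" is more likely a verb in this toy lexicon, but the
    transition from DET makes the noun reading win. Both tables would have to
    be rebuilt for every additional language, which is the practical cost
    behind needing "not one but four programs".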

    Sorry if this is all rather negative - good luck!


    Martin Wynne
    Linguistics Officer
    Oxford Text Archive

    Oxford University Computing Services
    13 Banbury Road
    UK - OX2 6NN
    Tel: +44 1865 283299
    Fax: +44 1865 273275

    From: Nicole Baumgarten
    Sent: 30 April 2002 15:30
    Subject: Corpora: POS tagger

    Dear all,

    does anybody know of an automatic, plus-99 per cent accuracy (utopia?),
    unidiosyncratic, easy-to-apply POS tagger that can handle German, English
    (French, Spanish) and works in an ordinary Windows environment?
    ANY ideas are greatly appreciated!

    All the best

    Nicole Baumgarten
    SFB 538 Mehrsprachigkeit
    Covert Translation
    Max-Brauer-Allee 60
    22765 Hamburg
    ++49-40-42838 6453
