Corpora: New tagger

From: Gojol Vlad (gojol@sunu.rnc.ro)
Date: Thu Jan 01 1970 - 03:00:00 MET


       Dear Colleagues,

       Those interested in a new-generation part-of-speech tagger are
    welcome to send their comments to gojol@sunu.rnc.ro, especially if they
    have purchasing or collaboration intentions (or leads on such).
    Thank you.
       Best regards,

    Vladimir V. Gojol

    Senior Software Engineer
    Institutul National de Informatica
    Bucuresti, Romania

    ............................................................................

       I have created a part-of-speech tagger with an unusual capacity for
    dealing with large contexts, especially for German. I used Negra
    (apparently the best-known German corpus, with a freely obtainable
    licence). TnT is perhaps the tagger currently reputed to be the most
    accurate for German; its reported error rate on this corpus is 3.4%.
    But I have found a systematic error in Negra: all occurrences of the
    auxiliary verbs are tagged as auxiliary (VAFIN), although in 50% of the
    cases they function as finite full verbs (VVFIN). I corrected a part of
    the corpus (approx. 40,000 tokens). On this corrected part (where the
    error rate of TnT would probably be around 4.5%), my tagger's error
    rate is 1.7%.
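
       The VAFIN/VVFIN inconsistency described above can be illustrated
    with a rough sketch. It is only an assumption about how candidate
    tokens might be flagged for review, not the procedure actually used for
    the correction: the function name flag_vafin_candidates, the crude
    sentence-level check and the two toy sentences are all hypothetical.

```python
# STTS tags for non-finite verb forms that an auxiliary normally governs.
NON_FINITE = {"VVPP", "VAPP", "VMPP", "VVINF", "VAINF", "VMINF", "VVIZU"}


def flag_vafin_candidates(sentence):
    """Return indices of VAFIN tokens in a sentence with no non-finite verb.

    `sentence` is a list of (word, tag) pairs. This is a hypothetical,
    deliberately crude heuristic: it looks at the whole sentence rather
    than the clause, so it only yields candidates for manual review.
    """
    if any(tag in NON_FINITE for _, tag in sentence):
        return []  # the auxiliary may really be acting as an auxiliary
    return [i for i, (_, tag) in enumerate(sentence) if tag == "VAFIN"]


# Toy examples (assumed, not taken from Negra).
aux_use = [("Er", "PPER"), ("hat", "VAFIN"), ("das", "ART"),
           ("Buch", "NN"), ("gelesen", "VVPP")]
main_use = [("Er", "PPER"), ("hat", "VAFIN"), ("ein", "ART"),
            ("Buch", "NN")]

print(flag_vafin_candidates(aux_use))
print(flag_vafin_candidates(main_use))
```

       In the first toy sentence "hat" governs a participle and is left
    alone; in the second it is the only verb, so it is flagged as a
    candidate for VVFIN.
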
       On another German corpus (which I call X), with comparable content
    (newspaper articles) and tagset, but with an external lexicon attached
    (i.e. not extracted from the corpus), the error rate is 2.4%.
       I also used Susanne (the only English corpus I could get for free).
    The reported error rate for TnT is 3.8%; mine is 2.8%. On the "A"
    texts, which are journalistic and thus the most comparable to those in
    Negra, it is 2.3%.
       Initially I used a Romanian corpus, with an error rate of 0.9%
    (compared to 1.7%, 2.5% and 4.2% obtained by the Xerox, Birmingham and
    Brill taggers respectively).
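
       To make the comparisons above easier to read, the following sketch
    shows how a token-level error rate is commonly computed and how two
    error rates can be compared as a relative error reduction. The function
    names and the toy tag lists are assumptions made for illustration; only
    the percentages in the `quoted` table come from this message, and the
    4.5% figure for TnT is the estimate given above, not a measured result.

```python
def error_rate(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag differs from the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("tag sequences must not be empty")
    wrong = sum(g != p for g, p in zip(gold_tags, predicted_tags))
    return wrong / len(gold_tags)


def relative_error_reduction(baseline, new):
    """Share of the baseline's errors that the new tagger avoids."""
    return (baseline - new) / baseline


# Toy data (assumed): one wrong tag out of four tokens.
toy_rate = error_rate(["ART", "NN", "VVFIN", "ADV"],
                      ["ART", "NN", "VAFIN", "ADV"])
print(f"toy error rate: {toy_rate:.1%}")

# Error rates in percent as quoted in this message: (baseline, new tagger).
quoted = {
    "Negra, corrected part (TnT, estimated)": (4.5, 1.7),
    "Susanne (TnT)": (3.8, 2.8),
    "Romanian corpus (Xerox)": (1.7, 0.9),
}
for name, (baseline, new) in quoted.items():
    reduction = relative_error_reduction(baseline, new)
    print(f"{name}: {reduction:.0%} fewer errors")
```

       Relative reduction is used here because absolute differences between
    small error rates are hard to compare across corpora.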

       The speed is comparable to that of TnT and can be adjusted by
    parameter settings, inversely to the accuracy (though without affecting
    it much).
       The incremental operating mode and the segmentation of the data
    structures allow it to run on computers with very little memory.
       The output has the advantage of being intuitive (no hostile binary
    matrix), in a form analogous to the input of some expert systems.
       Special facilities exist, such as virtual tags and context
    essentialisation (which yields the minimal set of contexts
    characteristic of a certain linguistic style, useful not only for
    maximum accuracy and speed), etc.

       Everything is built on two essentially new concepts: organicity and
    context propagation. I have not published anything about them, in order
    to preserve their commercial appeal. The accuracy, comparable to that
    of manual tagging, led me to find many errors in the corpora used: 98
    in Negra and 36 in Susanne. Prof. G. Sampson replied gratefully, saying
    that it is the first time anybody has reported more than 2 errors, and
    that my findings make a new version of Susanne necessary.
    The handling of very large contexts could even change the design of
    current tagsets, by reversing some unnatural decisions (motivated only
    by the inability of existing taggers to see beyond a 3-token
    neighbourhood), such as those concerning auxiliary verbs, participles
    etc., thus removing some burden from the subsequent stages of text
    processing.
       It is written in C (Linux). Demos for German (Negra or X) and
    English (Susanne) are available.



    This archive was generated by hypermail 2b29 : Wed Feb 16 2000 - 17:30:05 MET