Corpora: New tagger

From: Gojol Vlad (gojol@sunu.rnc.ro)
Date: Thu Jan 01 1970 - 03:00:00 MET

Next message: Tadeusz Piotrowski: "Corpora: language engineering"

Previous message: Termilat: "Re: Corpora: bare nouns in Italian"

Dear Colleagues,

Those interested in a new-generation part-of-speech tagger are welcome
to send their comments to gojol@sunu.rnc.ro, especially if they are
considering a purchase or a collaboration (or have pointers toward
either). Thank you.

Best regards,

Vladimir V. Gojol

Senior Software Engineer
Institutul National de Informatica
Bucuresti, Romania

............................................................................

I have created a part-of-speech tagger with an unusual capacity for
dealing with large contexts, especially for German. I used Negra
(seemingly the best-known German corpus, with a freely obtainable
licence). The tagger currently reputed to be the most accurate for
German is perhaps TnT, which reports an error rate of 3.4% on this
corpus. But I have found a systematic error in Negra: all occurrences
of the auxiliary verbs are tagged as auxiliary (VAFIN), although in 50%
of the cases they function as finite full verbs (VVFIN). I corrected
part of the corpus (ca. 40,000 tokens). In this more correct
environment (where the error rate of TnT would probably be around
4.5%), my tagger gets 1.7%.
   On another German corpus (I call it X), with comparable contents
(newspaper articles) and tagset, but with an external lexicon attached
(i.e. not extracted from the corpus), the error rate is 2.4%.
   I also used Susanne (the only English corpus I could get for free).
The reported error rate for TnT is 3.8%; mine is 2.8%. On the "A"
texts, which as journalistic material are the most comparable to those
in Negra, it is 2.3%.
   Initially I had used a Romanian corpus, with an error rate of 0.9%
(compared to 1.7%, 2.5% and 4.2% obtained by the Xerox, Birmingham and
Brill taggers respectively).
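
For readers who want to compare such figures, here is a minimal sketch
of how a tagging error rate and a relative error reduction could be
computed. It assumes the error rate is the share of tokens whose
assigned tag differs from the reference tag; the message does not state
the exact scoring procedure. The function names and the toy sentence
are hypothetical and are not taken from the tagger or from any corpus.

```python
def error_rate(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag differs from the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot score an empty sequence")
    errors = sum(1 for g, p in zip(gold_tags, predicted_tags) if g != p)
    return errors / len(gold_tags)


def relative_error_reduction(baseline_rate, new_rate):
    """Share of the baseline's errors that the new tagger avoids."""
    return (baseline_rate - new_rate) / baseline_rate


# Toy data, invented for illustration only (STTS-style tags).
gold = ["ART", "NN", "VAFIN", "ART", "NN", "VVPP", "$."]
pred = ["ART", "NN", "VVFIN", "ART", "NN", "VVPP", "$."]
toy_rate = error_rate(gold, pred)  # one mismatch in seven tokens

# Error rates quoted in the message for Susanne: TnT 3.8%, this tagger 2.8%.
susanne_reduction = relative_error_reduction(0.038, 0.028)
```

With the Susanne figures quoted above, the relative reduction comes to
roughly a quarter of the TnT errors. Error rates measured on different
corpora or tagsets are not directly comparable in this way.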

   The speed is comparable to that of TnT and can be adjusted by
parameter setting, trading off against accuracy (though without
affecting it much).
   The incremental operating mode and the segmentation of the data
structures allow running on computers with very little memory.
   There is the advantage of an intuitive output (no hostile binary
matrix), in a form analogous to the input of some expert systems.
   Special facilities exist, such as virtual tags, or context
essentialisation (which yields the minimal set of contexts
characteristic of a certain linguistic style, useful not only for
maximum accuracy and speed), etc.

Everything is built on two essentially new concepts: organicity and
context propagation. I have not published anything about them, in order
to preserve their commercial appeal. The accuracy, comparable to that
of manual tagging, led me to find many errors in the corpora used: 98
in Negra, 36 in Susanne. Prof. G. Sampson replied gratefully, saying
that it was the first time somebody had reported more than 2 errors,
and that my findings make a new version of Susanne necessary.
The handling of very large contexts could even change the design of
current tagsets, by cancelling some unnatural decisions (motivated only
by the incapacity of existing taggers to see beyond a 3-token
neighborhood), such as those concerning auxiliary verbs, participles,
etc., thus removing some burden from the subsequent stages of text
processing.
It is written in C (Linux). Demos for German (Negra or X) and English
(Susanne) are available.


This archive was generated by hypermail 2b29 : Wed Feb 16 2000 - 17:30:05 MET