Corpora: New parsing technology

From: Vlad V. Gojol (gojol@rnc.ro)
Date: Mon Jan 21 2002 - 18:08:38 MET


       Dear list members,

       May I announce the new version of my parser. It is based on
    a part-of-speech tagger that those who have tested it consider the
    best for German (it is currently licensed in Germany). On the
    Negra corpus it has an error rate of 2%, compared to the 3.4%
    reported for DFKI's TnT under comparable conditions. For the
    English Susanne corpus, with the tagset reduced to a manageable
    size, the error rate drops to 1.5%.
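
    To put the two Negra figures side by side, here is a minimal
    sketch in Python that computes the relative error reduction
    implied by the quoted rates (2% versus 3.4%). The function name
    and the choice of relative error reduction as the measure are
    illustrative assumptions, not part of the tagger or of TnT, and
    the result is only meaningful to the extent that both figures
    were obtained under comparable conditions.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the fraction of the baseline's errors that the new system avoids."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Error rates quoted in the announcement for the Negra corpus, as fractions.
TNT_ERROR = 0.034         # 3.4% reported for TnT
NEW_TAGGER_ERROR = 0.02   # 2% reported for the new tagger

reduction = relative_error_reduction(TNT_ERROR, NEW_TAGGER_ERROR)
print(f"Relative error reduction on Negra: {reduction:.1%}")  # 41.2%
```

    In other words, a drop from 3.4% to 2% corresponds to roughly
    two fifths fewer tagging errors.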
       The parser is the only statistical one (according to a Sigparse
    official) that is competitive in accuracy with the top
    grammar-based ones. A comparison with one such state-of-the-art
    parser (regarded by many as the best on the market), which can be
    tested on the web, showed a slight advantage for mine in accuracy
    and a large one (about ten times) in speed. This was achieved by
    training, for German, on a quite small corpus: the first
    2,000 sentences (35,000 words) of Negra (from which 10,000
    rules were derived); for English, on 60,000 words of Susanne
    (20,000 rules). The software is actually a parser generator
    that allows specific parsers to be created for any language
    within a short time: just the time required to annotate a text
    equivalent to 40-60 pocket-book pages (i.e. student-level
    work for a couple of months). At present it runs without
    any lexicon (except the small one extracted from the respective
    corpora: 33,000 word forms for German, 12,000 for English).
    There are three output files: treebank, dependency-oriented
    and graphic. It is licensed, or in the process of being licensed,
    at several institutes and universities in Switzerland, Italy
    and Germany.
       Linux and Windows demos exist for German and English,
    available on request at gojol@sunu.rnc.ro, with a limited
    operating period (three days). The following are open for
    discussion: building versions for other tagsets or languages
    (French, Spanish, Italian), and extending the system towards
    integration into specific customer applications.
       Please reply only to me personally (at gojol@sunu.rnc.ro).
       Regards,
                 Dr.ing. Vlad Gojol


