Re: Corpora: Parsing morphologically rich languages

From: Gabriel Pereira Lopes (gpl@di.fct.unl.pt)
Date: Tue Jan 16 2001 - 15:43:44 MET


    Dear Alexander,

    Working with morphologically rich languages does not necessarily require
    a large number of POS tags. A large tagset would in turn require very
    large hand-tagged corpora in order to train the taggers that POS-tag
    text automatically.

    We work with approximately 40 tags for Portuguese, and that does not
    prevent our parser from starting with automatically POS-tagged text and
    recovering the required morpho-syntactic information. Of course, we are
    working with a kind of Prolog (DyALog) that enables the construction of
    quite effective chart parsers (250 words/sec.) that can moreover be used
    for fault finding and fault repairing. Have a look at the web page of
    Vitor Rocio (http://pc-gpl.di.fct.unl.pt/~vjr/), from where you can pick
    up some of our publications (Publicações) on this subject. See:

    Vitor Rocio and J.G.P. Lopes. 1999. "An infra-structure for diagnosing
        causes for partially parsed natural language input". In: ACTAS-I VI
        Simposio Internacional de Comunicación Social (Proceedings of the
        6th International Symposium on Social Communication). Santiago de
        Cuba, January 25-28, 1999. Santiago de Cuba: Editorial Oriente
        (ISBN 959-11-0250-X). pp. 550-554.
    and

    V. Rocio and J.G.P. Lopes. 1998. "Partial Parsing, Deduction and
        Tabling". In: B. Lang (ed.) Actes des premières Journées sur la
        Tabulation en Analyse Syntaxique et Déduction, April 2-3, 1998.
        Paris. Rocquencourt, France: INRIA. pp. 52-59.

    The following paper is available on request from Vitor Rocio:

    V. Rocio, E. de la Clergerie and J.G.P. Lopes. 2001. "Tabulation for
        multi-purpose partial parsing". Grammars. 4.1. Kluwer Academic
        Publishers. (to appear).
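
    To illustrate the point about small tagsets, here is a minimal sketch
    in Python. It is not our DyALog code, only an assumed toy setup: the
    tagger emits coarse tags (DET, N), a hypothetical lexicon LEXICON maps
    each (word, tag) pair to gender and number features, and a hypothetical
    parse_np function recovers the agreement information by unifying those
    features at parse time, so the tagset itself never has to encode them.

```python
from typing import Dict, List, Optional, Tuple

Features = Dict[str, str]

# Hypothetical toy lexicon: (word, coarse POS tag) -> morphological features.
# The tagger only has to choose between coarse tags such as DET and N.
LEXICON: Dict[Tuple[str, str], Features] = {
    ("o", "DET"): {"gender": "masc", "number": "sg"},
    ("a", "DET"): {"gender": "fem", "number": "sg"},
    ("os", "DET"): {"gender": "masc", "number": "pl"},
    ("gato", "N"): {"gender": "masc", "number": "sg"},
    ("gata", "N"): {"gender": "fem", "number": "sg"},
    ("gatos", "N"): {"gender": "masc", "number": "pl"},
}


def unify(left: Features, right: Features) -> Optional[Features]:
    """Merge two feature sets; return None if any shared feature clashes."""
    merged = dict(left)
    for name, value in right.items():
        if merged.get(name, value) != value:
            return None
        merged[name] = value
    return merged


def parse_np(tagged: List[Tuple[str, str]]) -> Optional[Features]:
    """Recognize a DET N noun phrase over coarsely tagged input.

    Returns the unified agreement features of the phrase, or None when the
    tag sequence is not DET N, a word is missing from the lexicon, or the
    determiner and noun disagree.
    """
    if [tag for _, tag in tagged] != ["DET", "N"]:
        return None
    det_feats = LEXICON.get(tagged[0])
    noun_feats = LEXICON.get(tagged[1])
    if det_feats is None or noun_feats is None:
        return None
    return unify(det_feats, noun_feats)
```

    For example, parse_np([("os", "DET"), ("gatos", "N")]) yields masculine
    plural features, while parse_np([("a", "DET"), ("gatos", "N")]) fails
    on the agreement clash. A failure of this kind is the sort of event a
    fault-finding parser could report instead of silently rejecting the
    input.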

    The work on POS tagging, using a neural-net based POS-tagger generator,
    requires only a small hand-tagged corpus (5,000 words were enough) and a
    large lexicon. The precision we obtained was approximately 94% for very
    badly written Portuguese (without diacritics) and 98% for more carefully
    written text (for this experiment we used 20,000 words of automatically
    POS-tagged, hand-corrected text). This work was used for extracting
    subcategorization patterns. The literature we produced on this subject,
    written in English, can be found in:

    Marques and Lopes and Coelho. (2000b). "Mining Subcategorization
    Information by Using Multiple Feature Loglinear Models". In Paola
    Monachesi (ed.) Computational Linguistics in the Netherlands 1999:
    selected papers from the Tenth CLIN Meeting. Amsterdam-Atlanta, GA 2000:
    Rodopi. Electronic version:
    http://www-uilots.let.uu.nl/publications/clin1999/papers.html.

    Marques and Lopes and Coelho (1998a). “Learning Verbal Transitivity
    using LogLinear Models”. In: Claire Nédelec and Céline Rouveirol (eds.).
    Machine Learning: ECML-98, 10th European Conference on Machine
    Learning, Chemnitz, Germany, April 21-23, 1998, Proceedings. Lecture
    Notes in Artificial Intelligence 1398. Berlin: Springer Verlag. pp. 19-24.

    Marques and Lopes and Coelho (1998b). “Using Loglinear Clustering for
    subcategorization identification”. In: J. Zytkow and M. Quafafou (eds.)
    Principles of Data Mining and Knowledge Discovery, 2nd European
    Symposium, PKDD'98, Nantes, France, September 1998, Proceedings. Lecture
    Notes in Artificial Intelligence 1510. Berlin: Springer Verlag. pp.
    379-387.

    Marques and J.G.P. Lopes. 1996. "Using Neural Nets for Portuguese
    Part-of-Speech Tagging". In: Proceedings of the Fifth International
    Conference on The Cognitive Science of Natural Language Processing,
    Dublin City University, September 2-4, 1996.

    Best regards,

    Gabriel Pereira Lopes

    Alexander Mikhailian wrote:

    > Hello,
    >
    > I am looking for references to syntactic parsers
    > that deal with morphologically rich flexive languages.
    >
    > In particular, I am interested in :
    >
    > 1. Approaches to deal with the number of POS tags
    > (terminals) that would supposedly be larger
    > than for English or French, e.g if one tries
    > to build a list of POS tags for a morphologically
    > rich language in order to follow approaches
    > developed for English, this list may easily grow up
    > to thousands of entries which implies that grammars
    > using such a huge list of terminals would be quite
    > complicated.
    >
    > 2. Approaches to deal with the free or loosely
    > restricted order of words that is often proper to
    > morphologically rich languages and which requires
    > different parsing techniques than for English,
    > where a common shift/reduce parser is often sufficient.
    >
    > Thanks in advance,
    >
    > --
    > Alexander Mikahilian


