Re: Corpora: minimum size of corpus?

From: Gabriel Pereira Lopes (gpl@di.fct.unl.pt)
Date: Mon Feb 14 2000 - 20:10:28 MET


    We used a corpus of approximately 5,000 tagged words to train a neural-net
    based tagger, and the tagging precision we obtained was quite high (94 to 96%,
    on text that was rather faulty). Later we used that tagger to tag text from
    another collection, which was hand corrected (here we used a corpus of a
    different style, with approximately 20,000 hand-corrected tagged words); we
    then retrained our tagger and got 98% precision on well-written text. Using
    both hand-corrected corpora to train a new tagger gave worse precision
    results, but the texts were of different genres. From such small corpora a lot
    can be learned...

     [ML96a] Nuno Marques, José Gabriel Lopes. Using Neural Nets for Portuguese
    Part-of-Speech Tagging. In Proceedings of the Fifth International Conference on
    the Cognitive Science of Natural Language Processing, Dublin City University,
    September 2-4, 1996 (9 pages).
    [ML96b] Nuno Marques, José Gabriel Lopes. A Neural Network Approach for
    Part-of-Speech Tagging. In Proceedings of the Second Workshop on Spoken and
    Written Portuguese Language Processing, pp. 1-9, Curitiba, Brazil, October
    21-22, 1996.

    These results contrast with results obtained using Hidden Markov models:

    [VMLV95] A. Vilavicencio, N. Marques, G. Lopes, F. Vilavicencio.
    Part-of-Speech Tagging for Portuguese Texts. In Jacques Wainer and Ariadne
    Carvalho, editors, Advances in Artificial Intelligence: Proceedings of the XII
    Brazilian Symposium on Artificial Intelligence, Lecture Notes in Artificial
    Intelligence 991, pp. 323-332, Campinas, October 11-13. Springer Verlag.
    1995.

    This enabled us to automatically tag a whole collection of 40,000,000 words
    and to use that collection for extracting verb subcategorization frames. The
    extracted frames were evaluated, which allowed us to identify patterns of
    tagging errors and to obtain extraction precision higher than 90%.

    [MLC98a] Nuno Marques, José Gabriel Lopes and Carlos Agra Coelho. Learning
    Verbal Transitivity Using LogLinear Models. In Claire Nédellec and Céline
    Rouveirol, editors, Proceedings of the 10th European Conference on Machine
    Learning, Lecture Notes in Artificial Intelligence 1398, pp. 19-24, Chemnitz,
    Germany. Springer Verlag. 1998.
    [MLC98b] Nuno Marques, José Gabriel Lopes and Carlos Agra Coelho. Using
    Loglinear Clustering for Subcategorization Identification. In Jan M. Zytkow
    and Mohamed Quafafou, editors, Proceedings of the Second European Conference
    on Principles of Data Mining and Knowledge Discovery, Lecture Notes in
    Artificial Intelligence 1510, pp. 379-387, Nantes, France. Springer Verlag.
    1998.

    On this subject there is a PhD thesis that you can consult (in Portuguese)
    through the web; contact Nuno Marques (nmm@di.fct.unl.pt). It is titled "Uma
    Metodologia Para a Modelação Estatística da Subcategorização Verbal" (A
    methodology for statistical modelling of verbal subcategorization) and was
    defended quite recently.

    In conclusion, a lot of work can be done with rather small corpora. It depends
    on what we want to extract from them and on the methods used to learn from
    them automatically.

    Best regards,

    Gabriel Pereira Lopes

    Daniel Riaño wrote:

    > This is a very interesting thread. I'd like to ask the List another
    > question related to it (three questions, in fact).
    >
    > Let's suppose we have a large corpus of Greek text (or any
    > text from a non-expandable corpus), and we want to do a grammatical analysis
    > of a part of it for a study of a grammatical category (like case, mood,
    > number, etc.) from the syntactic point of view. For the analysis we'll
    > use a computer editor that helps the human linguist tag the text in
    > every imaginable way. The analyst does a complete morphological and
    > semantic description of every word of the text, a skeleton parsing of every
    > sentence, puts a tag on every syntagm indicating its function, plus more
    > information about anaphoric relations, etc., etc. This corpus is homogeneous:
    > I mean it is written by only one author in a given period of his life,
    > without radical departures from the main narrative, either in style or in
    > subject. Now the (first) question: what is the minimum percentage of
    > such a corpus we must analyse in order to confidently extrapolate
    > the results of our analysis to the whole corpus? I bet statisticians have an
    > (approximate) answer for that. Bibliography? I also understand that it is
    > probably methodologically preferable to analyse
    > several portions of the same size from the text, instead of parsing only
    > one longer chunk of continuous text. And the third question: for such a
    > project, what would be the minimum size of the analysed corpus? Any help
    > welcome.
    >
    > ~~~~~~~~~~~~~~~~~~~
    > Daniel Riaño Rufilanchas
    > Madrid, España
    >
    > Please take note of the new e-mail address: danielrr@retemail.es
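
    A rough starting point for the first question quoted above: if the frequency
    of the phenomenon under study is treated as a proportion estimated from a
    simple random sample of tokens, the standard sample-size formula
    n = z^2 * p(1-p) / e^2 gives at least an order of magnitude. This is only a
    back-of-the-envelope sketch under an independence assumption that real texts
    violate (tokens cluster by topic and genre), so the figures should be read as
    optimistic lower bounds; the function names below are illustrative, not taken
    from any of the cited work.

    import math

    def sample_size_for_proportion(p=0.5, margin=0.02, z=1.96):
        """Tokens needed to estimate a proportion p to within +/- margin
        at roughly 95% confidence (z = 1.96), assuming independent draws."""
        return math.ceil(z * z * p * (1.0 - p) / (margin * margin))

    def with_finite_population(n, corpus_size):
        """Finite-population correction for a corpus of corpus_size tokens."""
        return math.ceil(n / (1.0 + (n - 1.0) / corpus_size))

    # Worst case (p = 0.5), +/- 2 percentage points: about 2,401 tokens.
    n = sample_size_for_proportion()
    print(n)
    # For a finite text of 100,000 tokens the correction changes little:
    # about 2,345 tokens.
    print(with_finite_population(n, 100000))

    Because continuous text is exactly where the independence assumption breaks
    down, this also bears on the second point in the quoted message: analysing
    several separated portions of the same size usually comes closer to the
    assumption than parsing one long continuous chunk.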


