Corpora: minimum size of corpus?

From: Daniel Riaño (danielrr@retemail.es)
Date: Thu Feb 10 2000 - 15:36:16 MET

            This is a very interesting thread. I'd like to ask the List another
    question related to it (three questions, in fact).

            Let's suppose we have a large corpus of Greek text (or any
    text from a non-expandable corpus), and we want to analyse part of it
    grammatically for a study of a grammatical category (such as case, mood,
    number, etc.) from the syntactic point of view. For the analysis we'll
    use a computer editor that helps the human linguist to tag the text in
    every imaginable way. The analyst produces a complete morphological and
    semantic description of every word of the text and a skeleton parse of
    every sentence, tags every syntagm with its function, and adds further
    information about anaphoric relations, etc. This corpus is homogeneous:
    I mean it is written by only one author in a given period of his life,
    without radical departures from the main narrative, either in style or in
    the subject. Now the first question: what is the minimum percentage of
    such a corpus that we must analyse in order to extrapolate the results
    of our analysis confidently to the whole corpus? I bet statisticians have
    an (approximate) answer for that. Bibliography? The second question: I
    understand that it is probably methodologically preferable to analyse
    several portions of the same size from the text, instead of parsing only
    one longer chunk of continuous text; is that right? And the third
    question: for such a project, what would be the minimum size of the
    analysed corpus? Any help welcome.
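
    As a rough illustration of the first question, the sketch below uses
    Cochran's sample-size formula for estimating a proportion, with a finite
    population correction. This is an assumption for illustration, not an
    answer from the thread: it treats the analysed units as a simple random
    sample, which running text does not satisfy (neighbouring words are
    correlated), so the figure is best read as an optimistic lower bound.
    The function name, parameter names and default values are hypothetical.

```python
import math


def sample_size_for_proportion(population, margin=0.05, z=1.96, p=0.5):
    """Units to sample in order to estimate a proportion.

    population: number of units (e.g. tokens or sentences) in the whole corpus
    margin:     acceptable half-width of the confidence interval
    z:          normal quantile (1.96 corresponds to about 95% confidence)
    p:          expected proportion; 0.5 is the most conservative choice
    """
    # Cochran's formula for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction: a closed corpus needs fewer units.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    for size in (1_000, 100_000):
        print(size, sample_size_for_proportion(size))
```

    Under these assumptions a corpus of 100,000 units needs 383 sampled
    units for a margin of 5 percentage points, and a corpus of 1,000 units
    needs 278. The required percentage therefore shrinks as the corpus
    grows, which is why a single fixed percentage is hard to state.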

    ~~~~~~~~~~~~~~~~~~~
    Daniel Riaño Rufilanchas
    Madrid, España

    Please note my new e-mail address: danielrr@retemail.es


