Re: Corpora: minimum size of corpus?

From: COMP staff (csrluk@comp.polyu.edu.hk)
Date: Fri Feb 11 2000 - 02:35:59 MET


    > This is a very interesting thread. I'd like to ask the List another
    > question related to it (three questions, in fact).
    >
    > Let's suppose we have a large corpus of Greek text (or any
    > text from a non-expansible corpus), and we want to do a grammatical analysis
    > of a part of it for a study of a grammatical category (such as case, mood,
    > number, etc.) from the syntactic point of view. For the analysis we'll
    > use a computer editor that helps the human linguist tag the text in
    > every imaginable way. The analyst does a complete morphological and
    > semantic description of every word of the text, a skeleton parsing of every
    > sentence, puts a tag on every syntagm indicating its function, plus more
    > information about anaphoric relations, etc. This corpus is homogeneous:
    > I mean it is written by only one author in a given period of his life,
    > without radical departures from the main narrative, either in style or in
    > subject.

    > Now the (first) question: what is the minimum percentage of
    > such a corpus we must analyse in order that we may confidently extrapolate
    > the results of our analysis to the whole corpus? I bet statisticians have
    > an (approximate) answer for that. Bibliography? I also understand that it
    > is probably methodologically preferable to analyse
    > several portions of the same size from the text, instead of parsing only
    > one longer chunk of continuous text. And the third question: for such a
    > project, what would be the minimum size of the analysed corpus? Any help
    > welcome.

    I am not a statistician. However, my view is that the size (even for a
    homogeneous corpus) depends on the outcome unit. For example, the corpus
    sizes needed for estimating the probability distribution of word occurrences
    and of sentence-structure occurrences are quite different: the former uses
    the word as the counting unit and the latter uses the sentence.
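
    To illustrate (a toy Python sketch with made-up text and deliberately
    naive tokenisation): the same passage yields far fewer sentence-level
    observations than word-level ones, so sentence-structure estimates need
    more raw text to reach the same number of counting units.

        # Toy example: how many observations each outcome unit yields.
        text = ("The analyst tags every word. Every sentence is parsed. "
                "Anaphoric relations are also marked.")

        words = text.split()                            # naive word tokenisation
        sentences = [s for s in text.split(". ") if s]  # naive sentence split

        print(len(words), "word-level observations")          # 14
        print(len(sentences), "sentence-level observations")  # 3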

    Because the minimum size depends on the outcome unit and on what we want
    to infer (e.g. distributions, probabilities, etc.), the approximate answer
    and the techniques for deriving it will also differ. As others have pointed
    out, the general rule is to have as large a corpus as possible, so that we
    can always get enough data for most, if not all, of the investigations. If
    we are restricted in data size, perhaps we need to work out what analyses
    are to be done and what confidence limits can be reached for the inferences
    we want to make.
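
    For a single proportion, the textbook sample-size formula gives a feel
    for the numbers (a Python sketch; the 20% proportion, 2-point margin and
    95% confidence level are illustrative, and independent observations are
    assumed):

        import math

        def sample_size(p, margin, z=1.96):
            """Units needed to estimate a proportion p to within +/- margin
            at ~95% confidence (z = 1.96), assuming independent units."""
            return math.ceil(z * z * p * (1 - p) / (margin * margin))

        # A grammatical feature occurring in ~20% of units, estimated to
        # within +/- 2 percentage points:
        print(sample_size(0.20, 0.02))  # -> 1537 units (words, sentences, ...)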

    I remember that there is a book called Sampling (Techniques?) published
    by John Wiley. It covers many sampling techniques (such as bootstrapping)
    and discusses how the size of the sample is determined for a given
    confidence level. I remember that the sample size also depends on the
    value of the proportion, and that an inversion technique is used for small
    proportions. I am not sure whether the Handbook of Statistics has anything
    relevant.
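
    As a rough illustration of the bootstrap idea mentioned above (a Python
    sketch with simulated data, not a recipe from that book): resample the
    observed units with replacement many times and read a confidence interval
    off the resampled estimates.

        import random

        random.seed(0)

        # Simulated per-sentence observations: 1 if the sentence shows the
        # grammatical feature of interest, 0 otherwise (made-up 20% rate).
        sample = [1 if random.random() < 0.20 else 0 for _ in range(500)]

        # Bootstrap: resample with replacement, recompute the proportion.
        boots = []
        for _ in range(2000):
            resample = [random.choice(sample) for _ in range(len(sample))]
            boots.append(sum(resample) / len(resample))

        boots.sort()
        lo, hi = boots[int(0.025 * len(boots))], boots[int(0.975 * len(boots))]
        print("point estimate:", sum(sample) / len(sample))
        print("~95%% bootstrap interval: [%.3f, %.3f]" % (lo, hi))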

    Regards,

    Robert Luk


