Re: Corpora: Corpus size

From: Marco Antonio Esteves da Rocha (marcor@cce.ufsc.br)
Date: Mon Jun 11 2001 - 20:32:25 MET DST

    On Sun, 3 Jun 2001, Norbert Schlueter wrote:

    > Dear all,
    >
    > size, i.e. number of words, is obviously not the only factor when
    > compiling a corpus for special investigations. Far more important
    > seems to be to get at least 400 cases of whatever you are looking for.
    > It can be shown that even in the worst case of a balanced distribution
    > when looking at a variable with two values [e.g. ASPECT:
    > progressive/non-progressive --> 50%/50%] the results will be
    > significant at the alpha=0.05 level (n = (4*p*(1-p))/alpha^2). I
    > wonder if anyone has done some work on this and can comment on the
    > number of necessary cases if the variable has got more than two values
    > (e.g. SUBJECT: 1PSG, 2PSG, etc.)
    >
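
    As a quick check on the figure of 400, the formula in the quoted message
    can be evaluated directly. This is only a sketch: the function name
    cases_needed is an assumption for illustration, and the formula is taken
    exactly as written in the quote, not derived here.

```python
def cases_needed(p, alpha):
    """Evaluate the quoted formula n = (4*p*(1-p)) / alpha^2."""
    return 4 * p * (1 - p) / alpha ** 2


# Balanced two-valued variable (50%/50%) at alpha = 0.05, as in the quote.
print(round(cases_needed(0.5, 0.05)))  # 400
```

    Since p*(1-p) is largest at p = 0.5, any less balanced split gives a
    smaller n under the same formula.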

    One rule of thumb commonly given in statistics textbooks for
    cross-tabulations is:

    - aim for a minimum of ten cases per cell
    - add ten cases for every four cells

    Thus, a 2 x 2 table would require a sample size of 50 cases (4 cells x 10
    cases + 10 cases = 50)

    A 3 x 2 table: 6 cells x 10 cases + 15 cases (10 for the first four
    cells + 5 for the remaining two) = 75 cases

    A 4 x 4 table: 16 cells x 10 cases + 40 cases = 200 cases

    A 20 x 10 table: 200 cells x 10 cases + 500 cases = 2500 cases

    This assumes you are cross-tabulating two variables.
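
    The arithmetic above can be sketched in a few lines of Python. The
    function name min_cases and its parameters are assumptions for
    illustration; the extra ten cases per four cells are applied pro rata
    (as in the 3 x 2 example) and rounded up, which is my reading of the
    rule and not something the textbooks necessarily spell out.

```python
import math


def min_cases(rows, cols, per_cell=10, extra_per_four_cells=10):
    """Rule-of-thumb minimum sample size for a rows x cols cross-tabulation."""
    cells = rows * cols
    base = cells * per_cell
    # Ten extra cases for every four cells, pro rata, rounded up.
    extra = math.ceil(cells * extra_per_four_cells / 4)
    return base + extra


for rows, cols in [(2, 2), (3, 2), (4, 4), (20, 10)]:
    print(f"{rows} x {cols}: {min_cases(rows, cols)} cases")
# 2 x 2: 50 cases
# 3 x 2: 75 cases
# 4 x 4: 200 cases
# 20 x 10: 2500 cases
```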

    It is not particularly sophisticated, but it is reliable in most designs.

    What I find somewhat risky is relying on sample size alone to reach
    significance. Of course there is plenty of debate about that in the
    literature, but running an association test definitely improves the
    reliability of results concerning relationships between two variables.

    The SUBJECT design above might be dealt with by cross-tabulating SUBJECT
    by NON-SUBJECT, with 1PSG, 2PSG, etc., as the categories classifying
    cases in each of those two variables. If I understood the idea
    correctly, the sample size required would be:

    12 cells (3 persons, singular and plural, x SUBJ/NONSUBJ) x 10 + 30 (3
    groups of four cells x 10) = 150 cases

    That is a little more economical than 400 cases. It might yield
    significance results which are not so strongly influenced by sheer sample
    size, and this may be more reliable, although I am not so sure about
    that. I would still prefer to check association with tau or some other
    association measure thought to be more adequate.

    Marco Rocha
    marcor@cce.ufsc.br
        


