RE: Corpora: when does a subcorpus become a corpus

Date: Sat Jan 05 2002

    This is an interesting discussion about the 'representativeness'
    of a corpus and a subcorpus. I'll add my 2 cents here. Surely,
    statisticians have been concerned about getting representative
    samples for some time, and mechanisms are available, though not
    perfect, to address this issue. The one that I can
    think of is sequential (and stratified?) sampling.

    Suppose we have infinite resources! And suppose we have
    a (random or otherwise) sequence of subcorpora s1, s2, ...,
    sn and their associated distributions d1, d2, ..., dn,
    observed for some specific purpose. The distribution could
    be over words, the number of different meanings of a word, etc.
    Then we do sequential sampling as follows:

    Let the merged distribution Di be defined recursively
    as follows:

    D1 := d1
    Di := D(i-1) + di

    where + denotes merging two distributions. The sequential
    sampling could stop when

    a chi-square test finds that Di and D(i-1) are not
    statistically significantly different at the X% level.

    There is a possibility that the sequential sampling
    could never stop.
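
    A rough sketch of the procedure in Python, just to make the
    idea concrete. Everything named here is my own assumption for
    illustration: each di is a Counter of category counts, "+" is
    adding counts, the test is a Pearson chi-square on the 2 x k
    table formed by D(i-1) and Di, and the critical value comes
    from the Wilson-Hilferty approximation rather than exact
    tables. The max_steps cap is there only because the sampling
    might never stop.

```python
from collections import Counter
from statistics import NormalDist


def chi_square_statistic(a, b):
    """Pearson chi-square statistic and degrees of freedom for the
    2 x k table whose rows are the count distributions a and b.
    Assumes both distributions are non-empty Counters."""
    categories = [c for c in set(a) | set(b) if a[c] + b[c] > 0]
    total_a, total_b = sum(a.values()), sum(b.values())
    grand = total_a + total_b
    stat = 0.0
    for c in categories:
        column = a[c] + b[c]
        for observed, row_total in ((a[c], total_a), (b[c], total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat, max(len(categories) - 1, 1)


def chi_square_critical(df, alpha):
    """Approximate upper-alpha critical value of chi-square with df
    degrees of freedom (Wilson-Hilferty approximation, not exact)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    h = 2.0 / (9.0 * df)
    return df * (1.0 - h + z * h ** 0.5) ** 3


def sequential_sample(subcorpora, alpha=0.05, max_steps=1000):
    """Merge distributions d1, d2, ... until Di and D(i-1) are not
    significantly different at level alpha.

    Returns (i, Di) at the stopping step, or (None, merged so far)
    if the input or max_steps runs out first."""
    merged = Counter()
    for i, d in enumerate(subcorpora, start=1):
        if i > max_steps:
            break
        previous = merged.copy()   # D(i-1)
        merged.update(d)           # Di := D(i-1) + di
        if i == 1:
            continue               # D1 := d1, nothing to compare yet
        stat, df = chi_square_statistic(previous, merged)
        if stat < chi_square_critical(df, alpha):
            return i, merged
    return None, merged
```

    For example, sequential_sample([d1, d2, d3]) with Counters of
    word counts returns the step at which merging stopped and the
    merged distribution. Note that Di contains D(i-1), so the two
    are not independent samples and the test here is only a
    heuristic; with very small subcorpora it will also tend to
    stop early simply because the test has little power.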

    Obviously, more sophisticated techniques could be
    applied and more complicated modeling may be needed
    (e.g. taking the time of sampling into account, as
    language change may take place).


    Robert Luk

