[Corpora-List] Are Corpora Too Large?

From: Amsler, Robert (Robert.Amsler@hq.doe.gov)
Date: Wed Oct 02 2002 - 14:38:18 MET DST


    Heresy! But hear me out.

    My question is really whether we're bulking up the size of corpora vs.
    building them up to meet our needs.

    Most of the applications of corpus data appear to me to be lexical or grammatical,
    operating at the word, phrase, sentence or paragraph level. We want examples of
    lexical usage, grammatical constructions, perhaps even anaphora across multiple
    sentences. I haven't heard many talk about corpora as good ways to study the
    higher-level structure of documents--largely because doing so requires whole
    documents, and extracts can be misleading even when they reach 45,000 words (the
    upper limit for samples in the British National Corpus).

    The main question here is this: if we are seeking lexical variety, and if the
    lexicon basically consists of Large Numbers of Rare Events (LNREs), then why aren't
    we collecting language data so as to maximize the variety of that kind of
    information, rather than following the same traditional sampling practices as the
    earliest corpora?
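
    To make the LNRE point concrete, here is a minimal sketch (Python, assuming nothing
    fancier than whitespace tokenisation and a hypothetical sample file) of how one might
    measure what share of a text's vocabulary occurs only once:

        from collections import Counter

        def hapax_ratio(text):
            """Fraction of the word types in a text that occur exactly once."""
            counts = Counter(text.lower().split())    # crude whitespace tokenisation
            hapaxes = sum(1 for c in counts.values() if c == 1)
            return hapaxes / len(counts)

        # "sample.txt" is a hypothetical candidate sample; on any sizeable text a
        # substantial share of its word types will turn out to occur only once.
        sample = open("sample.txt").read()
        print(f"{hapax_ratio(sample):.1%} of the word types occur only once")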

    In the beginning, there was no machine-readable text. This meant that creating a
    corpus involved typing in text, and the amount of text you could put into a corpus
    was limited primarily by the manual labor available to enter data. Because text was
    manually entered, one really couldn't analyze it until AFTER it had been selected
    for use in the corpus. You picked samples on the basis of their external properties
    and discovered their internal composition only after including them in the corpus.

    Today, we largely create corpora by obtaining electronic text and sampling from it.
    This means we have the additional ability to examine a great deal of text before
    selecting a subset to become part of the corpus. While the external properties of
    the selected text are as important as ever, and should be representative of the
    types of text we feel are needed to "balance" the corpus, the internal properties
    of the text are still accepted almost blindly, with little note taken of whether a
    sample increases the variety of lexical coverage or not.

    The question is whether we could track the number of new terms appearing in
    potential samples from a new source and select, each time, the sample that adds the
    most new terms to the corpus, without biasing the end result. In my metaphor,
    whether we could add muscle to the corpus rather than just fatten it up.
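
    Concretely, the bookkeeping could be as simple as the following sketch (again Python,
    assuming crude whitespace tokenisation and a hypothetical list of candidate texts
    drawn from the new source):

        def word_types(text):
            """The set of (lower-cased) word types in a text -- crude tokenisation."""
            return set(text.lower().split())

        def pick_best_sample(corpus_vocab, candidates):
            """Return the candidate text that adds the most new types to the corpus."""
            return max(candidates, key=lambda t: len(word_types(t) - corpus_vocab))

        # Hypothetical usage: corpus_vocab holds the types already in the corpus,
        # candidates are potential samples drawn from a new source.
        corpus_vocab = word_types("the cat sat on the mat")
        candidates = ["the cat sat on the chair",             # adds 1 new type (chair)
                      "a quick brown fox jumps over my dog"]  # adds 8 new types
        best = pick_best_sample(corpus_vocab, candidates)
        corpus_vocab |= word_types(best)                      # vocabulary grows by 8 types

    Whether such greedy selection can be done without skewing the external balance of the
    corpus is, of course, exactly the open question.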

    This also raises the question of why sample sizes have grown so large. The Brown
    Corpus built a million words from 500 samples of 2,000 words each. Was 2,000 words
    so small that everyone was complaining about how it stifled their ability to use
    the corpus? Or is it merely that, given we want 100 million words of text, it is
    far easier to increase the sample sizes 20-fold than to find 20 times as many
    sources from which to sample?


