[Corpora-List] Re: Are Corpora Too Large?

From: Ramesh Krishnamurthy (ramesh@easynet.co.uk)
Date: Thu Oct 03 2002 - 10:23:10 MET DST


    Dear Robert (if I may)

    Thank you for a stimulating contribution!

    You raise too many interesting and complex issues for me to
    reply to adequately, as I am about to leave for the airport.

    However, here are one or two points which came readily to mind
    (obviously, I'm going to disagree with you about corpus size, but
    one does not often get a chance to reinspect the argumentation).

    > We want examples of lexical usage, grammatical constructions,
    > perhaps even anaphora between multiple sentences.

    Pragmatics and discourse organizers, and even
    semantics, often need a substantial context to
    make it clear what is going on.

    Most current corpora are - for purely technological reasons -
    heavily biased towards the written word. We would obviously like to
    have at least an equal amount of speech, before we can know about
    features of the spoken language which may also require substantial
    context.

    > I haven't heard many talk about corpora as good ways to study the
    > higher level structure of documents--largely because to do so requires
    > whole documents, and extracts can be misleading even when they have
    > reached 45,000 words in size (the upper limit of samples in the British
    > National Corpus).

    Not all corpora use text extracts. Cobuild/Birmingham has always used entire
    texts wherever possible, although the term "text" is itself problematic: do we
    treat a whole issue of a newspaper as a single text, or as a collection of
    smaller texts, i.e. articles? Each article has a certain unity, but so does
    each issue (in-house editorial policies, the day's topics, etc.).

    > The main question here is: if we are seeking lexical variety, and if the
    > lexicon basically consists of Large Numbers of Rare Events (LNREs), then
    > why aren't we collecting language data to maximize the variety of that
    > type of information, rather than following the same traditional sampling
    > practices as the earliest corpora?

    Some of us may want to test the hypothesis that the lexicon consists of
    LNREs. It is often possible to group the LNREs into sets, groups, or classes
    of various kinds, which share some behavioural properties.
    Also, some of us may want to know in more detail about the opposite
    (SNFEs? Small Numbers of Frequent Events?). A few years back, while trying
    to find examples of the difference between British and American usage of
    "have" and "take", I discovered the financial expression "take a bath"
    (unknown to me at the time, and not recorded in any of the reference works
    I had access to). So rare events may be going undiscovered within the bulk
    of what we superficially took to be a frequent event.
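
    (If it helps to make the LNRE point concrete, here is a minimal sketch - in
    Python, purely for illustration - of the sort of quick check one might run
    on a stretch of text: count the types, see what share of them are hapax
    legomena, and glance at the handful of very frequent items at the other
    end. The tokenisation is naive whitespace splitting, and the toy sentence
    is only there so the sketch runs as written; it is not anyone's actual tool.)

        # Minimal sketch: how much of a sample's vocabulary consists of rare events?
        # Naive whitespace tokenisation; pass in any stretch of plain text.
        from collections import Counter

        def frequency_spectrum(text):
            freqs = Counter(text.lower().split())
            types = len(freqs)
            hapaxes = sum(1 for count in freqs.values() if count == 1)
            return {"tokens": sum(freqs.values()),
                    "types": types,
                    "hapax_share": hapaxes / types,   # the LNRE end of the spectrum
                    "top": freqs.most_common(5)}      # the frequent-event end

        # Toy data, only so the sketch runs as written:
        print(frequency_spectrum("the cat sat on the mat and the dog sat too"))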

    > Because text was manually entered, one really couldn't analyze it
    > until AFTER it had been selected for use in the corpus.
    > You picked samples on the basis of their external
    > properties and discovered their internal
    > composition after including them in the corpus.

    As far as I know, most of the software I use to analyse corpus data
    needs the data to be in the corpus before it can perform the analysis.
    That could perhaps easily be redesigned, but it is beyond my expertise. If
    it is easy, I'm surprised that some of my enterprising software colleagues haven't
    done it already. Of course, part of the analysis consists of seeing what effect the
    arrival of new data has had on the whole corpus, which you couldn't do if you
    analysed the new data separately.

    I'm sure experts on seals do not object to checking each seal that comes into
    their survey area, just because they have seen seals before, or even if they have seen
    the same seal many times before. It may always offer something new, or at least serve
    to confirm hypotheses which are well established. Corpora also exist to confirm in
    a more robust way ideas we may have had about language for centuries. It helps us
    to draw a finer distinction between the invariable and the variable. I can make
    some statements about English with greater confidence from a 450m-word corpus
    than I could from a 1m-word corpus. Of course, I may also have gained some insights
    during the time it has taken to increase the corpus by this amount. So there may also
    be qualitative improvements in our analyses.
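
    (A rough back-of-the-envelope on that last point, assuming, purely for the
    sake of illustration, that occurrences of a word could be treated as
    independent - which real text of course is not: the standard error of a
    relative-frequency estimate shrinks with the square root of the corpus
    size, so going from 1m to 450m words tightens it by a factor of about
    sqrt(450), i.e. roughly 21. A word occurring 10 times per million would be
    estimated at 10 +/- 3.2 per million from the 1m-word corpus, but at
    10 +/- 0.15 per million from the 450m-word corpus.)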

    > with little note of whether a sample increases the variety
    > of lexical coverage or not.

    > The question is whether we could track the number of new terms appearing
    > in potential samples from a new source and optimally select the sample
    > that added the most new terms to the corpus without biasing the end
    > result. In my metaphor, whether we could add muscle to the corpus rather
    > than just fatten it up.

    You seem to be overly concerned with lexical variety and new items.
    Many of us are quite happy just to know a little bit more about the old items.
    Every linguistic statement deserves to be reinvestigated, especially those
    that we may have taken as axiomatic in the past. The increasing size of
    corpora adds not only breadth, but also depth.

    > This also raises the question of why sample sizes have grown so large. The
    > Brown corpus created a million words from 500 samples of 2000 words each.
    > Was 2000 words so small that everyone was complaining about how it stifled
    > their ability to use the corpus? Or is it merely that, given we want 100
    > million words of text, it is far easier to increase the sample sizes
    > 20-fold than to find 20 more sources from which to sample?

    Ideally, surely we would want to do both. Depth and breadth again.
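
    (A back-of-the-envelope aside, using only the figures quoted above: Brown's
    500 samples x 2,000 words = 1m words. To reach 100m words with 2,000-word
    samples you would need 50,000 of them, whereas at the BNC's 45,000-word
    ceiling - roughly 22 times Brown's sample size - a little over 2,200 texts
    would suffice.)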

    Best
    Ramesh

    ----- Original Message -----
    From: "Amsler, Robert" <Robert.Amsler@hq.doe.gov>
    To: corpora@hd.uib.no
    Subject: [Corpora-List] Are Corpora Too Large?

    Heresy! But hear me out.

    My question is really whether we're bulking up the size of corpora vs.
    building them up to meet our needs.

    Most of the applications of corpus data appear to me to be lexical or
    grammatical, operating at the word, phrase, sentence or paragraph level. We
    want examples of lexical usage, grammatical constructions, perhaps even
    anaphora between multiple sentences. I haven't heard many talk about corpora
    as good ways to study the higher level structure of documents--largely
    because to do so requires whole documents, and extracts can be misleading
    even when they have reached 45,000 words in size (the upper limit of samples
    in the British National Corpus).

    The main question here is: if we are seeking lexical variety, and if the
    lexicon basically consists of Large Numbers of Rare Events (LNREs), then why
    aren't we collecting language data to maximize the variety of that type of
    information, rather than following the same traditional sampling practices
    as the earliest corpora?

    In the beginning, there was no machine-readable text. This meant that
    creating a corpus involved typing in text, and the amount of text you could
    put into a corpus was limited primarily by the manual labor available to
    enter data. Because text was manually entered, one really couldn't analyze
    it until AFTER it had been selected for use in the corpus. You picked
    samples on the basis of their external properties and discovered their
    internal composition after including them in the corpus.

    Today, we largely create corpora by obtaining electronic text and sampling
    from it. This means that we have the additional ability to examine a lot of
    text before selecting a subset to become part of the corpus. While external
    properties of the selected text are as important as ever and should be
    representative of what types of text we feel are appropriate to "balance"
    the corpus, the internal properties of the text are still taken almost
    blindly, with little note of whether a sample increases the variety of
    lexical coverage or not.

    The question is whether we could track the number of new terms appearing in
    potential samples from a new source and optimally select the sample that
    added the most new terms to the corpus without biasing the end result. In my
    metaphor, whether we could add muscle to the corpus rather than just fatten
    it up.
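
    (To make the proposal concrete, here is a minimal sketch - in Python, purely
    as an illustration of the idea rather than a working corpus-building tool -
    of the kind of greedy, coverage-driven selection described above: from a
    pool of candidate samples, repeatedly pick the one that adds the most new
    word types to the corpus so far. The tokenisation is naive whitespace
    splitting and the candidate list is toy data invented for the example.)

        # Greedy, coverage-driven sample selection: at each step, choose the
        # candidate sample that contributes the most word types not yet in the
        # corpus. "candidates" is a hypothetical list of (source_id, text) pairs.

        def word_types(text):
            return set(text.lower().split())

        def select_samples(candidates, n_to_pick):
            corpus_types = set()
            chosen = []
            pool = list(candidates)
            for _ in range(n_to_pick):
                if not pool:
                    break
                # Greedy step: the candidate adding the most unseen types wins.
                best = max(pool, key=lambda c: len(word_types(c[1]) - corpus_types))
                pool.remove(best)
                chosen.append(best[0])
                corpus_types |= word_types(best[1])
            return chosen, len(corpus_types)

        # Toy data, only so the sketch runs as written:
        candidates = [("sourceA", "the cat sat on the mat"),
                      ("sourceB", "the dog sat on the log"),
                      ("sourceC", "a quick brown fox")]
        print(select_samples(candidates, 2))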

    This also raises the question of why sample sizes have grown so large. The
    Brown corpus created a million words from 500 samples of 2000 words each.
    Was 2000 words so small that everyone was complaining about how it stifled
    their ability to use the corpus? Or is it merely that, given we want 100
    million words of text, it is far easier to increase the sample sizes 20-fold
    than to find 20 more sources from which to sample?


