RE: Corpora: What is a corpus

From: Mills, Carl (MILLSCR) (MILLSCR@UCMAIL.UC.EDU)
Date: Fri Jan 28 2000 - 16:26:30 MET

    While I agree with Susan that

    >One of the real joys of working with corpora is the excitement of finding
    >something you weren't looking for. The more the input to the corpus is
    >filtered by the preconceptions of the researchers, the less likelihood that
    >these unexpected insights will arise

    I have some difficulty with

    >A corpus is a collection of texts, not a list of phrases, verb forms, or
    >other fragments.

    Similarly, Oliver's point

    >The main point I wanted to make was that I understand a
    >corpus to be a lump of real language, not extracts of the same. So you
    >could have a corpus of almost anything that is a text type or genre,
    >but it wouldn't be a corpus any more once you meddle with it, by eg
    >extracting all proverbs, noun phrases or whatnot.
    [snip]
    >By what I rather unprecisely called `filtering' I meant this extraction
    >of elements from a corpus, not the creation of a corpus from the
    >infinite amount of language data by selecting a sample of it.

    raises some interesting, related issues.

    As a statistical linguist, a sociolinguist, I would argue that there may be
    occasions--due to limits of technology, researcher time, or whatever--when one
    might want to create a corpus (or would it be a quasi-corpus? Or a
    corpoid?) by properly designed random sampling from a text or body of
    texts. The result would be a collection of "fragments," in Susan's terms, or
    "extracts" from Oliver's "lump of real language." However, I would argue
    not only that such procedures are methodologically justifiable, but also
    that the results would constitute a corpus. I think such a collection of
    samples could be called a corpus because the sampling would not be defined
    in biased, a priori units like "sentence," "proverb," or whatever.
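
    One way such sampling might be sketched, purely as an illustration: draw
    fixed-length word windows at random positions from a body of texts, so
    that no a priori unit decides where a sample begins or ends. The function
    name, the whitespace tokenization, the window size, and the seed are all
    assumptions made for this sketch, not part of any procedure described in
    this thread.

```python
import random


def sample_windows(texts, window_size=50, n_samples=10, seed=0):
    """Draw fixed-length word windows at random positions across texts.

    Hypothetical helper: window boundaries ignore linguistic units such as
    sentences or proverbs; each valid start position is equally likely.
    """
    rng = random.Random(seed)
    # Naive whitespace tokenization (an assumption for this sketch).
    tokenized = [text.split() for text in texts]
    # Every (text index, start offset) pair where a full window fits.
    starts = [
        (i, j)
        for i, tokens in enumerate(tokenized)
        for j in range(len(tokens) - window_size + 1)
    ]
    if not starts:
        return []
    # Sample start positions without replacement; windows may still overlap.
    chosen = rng.sample(starts, min(n_samples, len(starts)))
    return [tokenized[i][j:j + window_size] for i, j in chosen]
```

    Because every start position is equally likely, longer texts contribute
    proportionally more samples; a stratified design would need a different
    selection step.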

    Oliver's remarks would seem to allow such corpora. I am not sure Susan's
    would.

    Carl


