Corpora: What is a corpus

From: Susan Hays (susanh@naa.att.ne.jp)
Date: Thu Jan 27 2000 - 22:05:20 MET

    Oliver has struck an important chord with my thinking. Many of the questions asked
    on this list request pre-filtered work. A corpus is a collection of texts, not a
    list of phrases, verb forms, or other fragments.

    One of the real joys of working with corpora is the excitement of finding
    something you weren't looking for. The more the input to the corpus is filtered by
    the preconceptions of the researchers, the less likely it is that these unexpected
    insights will arise. Of course, the nature of the storage medium means that
    some filtering must occur, but it is important to keep these technical requirements
    in mind when examining the corpora. Only by looking for things we aren't
    looking for will we gain deep insights into the nature of language.

    -Paul Hays (currently writing from a borrowed eddress)

    Oliver Mason wrote:

    > François Maniez writes:
    > > I wondered whether anybody on the list knows about an online corpus
    > > available for download and consisting of English proverbs and/or set
    > >phrases. The objective is to turn the corpus into a data base that could
    > > [...]
    >
    > Andrew Harley replies:
    > > Instead of a corpus, you might want to consider using an existing
    > > dictionary which gives examples of idioms in context, e.g. the Cambridge
    > > International Dictionary of Idioms. This is available as SGML data for
    >
    > Sorry to appear pedantic, but how would a `corpus of proverbs' look
    > like? I would think no such thing could exist, just like you couldn't
    > have a corpus of past tense sentences. Instead, you have a corpus of,
    > say, written fiction, which you can use to compile a list/database of
    > proverbs, but that would not be a corpus, but a, erm, list or
    > database (or even a dictionary).
    >
    > My understanding of `corpus' is that it is some more or less
    > homogeneous collection of utterances, but not `filtered', ie if you
    > selected all sentences containing proverbs you would end up with a
    > list, not a (sub)corpus.
    >
    > Do other people think different/the same?
    >
    > Oliver
    >
    > --
    > //\\ computer officer | corpus research | department of english | school of -
    > //\\ humanities | university of birmingham | edgbaston | birmingham b15 2tt -
    > \\// united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668/\ -
    > \\// mobile 07050 104504 | http://www.clg.bham.ac.uk | o.mason@bham.ac.uk\/ -


