Corpora: What is a corpus

From: David Powers (David.Powers@flinders.edu.au)
Date: Fri Jan 28 2000 - 03:22:55 MET

    I tend to agree at one level, but a corpus of proverbs is a possibility
    - e.g. the Bible contains one, and the dictionaries and collections of
    proverbs are corpora - not so different from the corpus of Shakespeare
    or a corpus of religious or legal writings or telephone conversations
    or parent-child speech, although the biblical proverbs usually take a
    more extended form than our English ones (some of which come from the
    Bible anyway).

    But once you go below sentence level, you are bringing in the kind of
    assumptions we aim to avoid in corpus work. Even selection at 'sentence
    level' is problematic due to processes of context and elision, and
    stylistic freedom in relation to punctuation and the representation of
    clauses as lists or separate sentences, etc. e.g.

    What time is it? Three thirty!
    I came, I saw, I conquered!
    I came! I saw! I conquered!
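
    As a rough illustration of the punctuation problem, here is a minimal
    Python sketch. The splitting rule and the name naive_sentences are
    assumptions made for this example only, not any particular toolkit's
    segmenter: it simply breaks after '.', '!' or '?' when whitespace
    follows. Under that rule the two renderings of the same three clauses
    come out as one 'sentence' and three 'sentences' respectively.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace.

    A deliberately naive, hypothetical rule used only for illustration.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


examples = [
    "What time is it? Three thirty!",
    "I came, I saw, I conquered!",
    "I came! I saw! I conquered!",
]

# Number of 'sentences' the naive rule finds in each example.
counts = [len(naive_sentences(e)) for e in examples]

for example, count in zip(examples, counts):
    print(count, example)
```

    Any selection made 'at sentence level' inherits whichever choice the
    transcriber or author happened to make here.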

    Another tendency is for statistics about parsers to be based on
    sentences restricted to be less than X words, where X is typically
    around 20 and usually less than the median sentence length for the
    corpus they are extracted from. Such practices should be deprecated
    except when filtering is integral to a theory (e.g. of language
    acquisition - attending to only certain types of utterance - but this
    doesn't alter the corpus).
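
    One way to make the effect of such a cutoff visible is to report, next
    to any parser figures, the median sentence length of the source corpus
    and the share of sentences that survive the filter. The following is
    only a sketch under stated assumptions: length_filter_report is a
    hypothetical helper, words are counted by whitespace splitting, and
    the toy data are made up for the example, not drawn from any real
    corpus.

```python
from statistics import median


def length_filter_report(sentences, max_words=20):
    """Summarise what a 'fewer than max_words words' filter keeps.

    Hypothetical helper: word counts are whitespace-token counts.
    """
    lengths = [len(s.split()) for s in sentences]
    kept = [n for n in lengths if n < max_words]
    return {
        "median_length": median(lengths),
        "kept": len(kept),
        "total": len(lengths),
        "coverage": len(kept) / len(lengths),
    }


# Made-up toy 'corpus': seven sentences of the given lengths in words.
toy = [" ".join(["w"] * n) for n in (5, 12, 18, 22, 25, 31, 40)]

report = length_filter_report(toy, max_words=20)
print(report)
```

    In this toy case the cutoff of 20 lies below the median of 22 words,
    so fewer than half of the sentences are evaluated at all.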

    dP
    -----Original Message-----
    From: Susan Hays <susanh@naa.att.ne.jp>
    To: CORPORA@hd.uib.no <CORPORA@hd.uib.no>
    Date: Friday, January 28, 2000 9:15 AM
    Subject: Corpora: What is a corpus

    >Oliver has struck an important chord with my thinking. Many of the
    >questions asked on this list request pre-filtered work. A corpus is a
    >collection of texts, not a list of phrases, verb forms, or other
    >fragments.
    >
    >One of the real joys of working with corpora is the excitement of
    >finding something you weren't looking for. The more the input to the
    >corpus is filtered by the preconceptions of the researchers, the less
    >likelihood that these unexpected insights will arise. Of course, the
    >nature of the storage medium necessitates that some filtering must
    >occur, but it is important that these technical requirements are kept
    >in mind when examining the corpora. Only by looking for things we
    >aren't looking for will we gain deep insights into the nature of
    >language.
    >
    >-Paul Hays (currently writing from a borrowed eddress)
    >
    >Oliver Mason wrote:
    >
    >> François Maniez writes:
    >> > I wondered whether anybody on the list knows about an online corpus
    >> > available for download and consisting of English proverbs and/or set
    >> > phrases. The objective is to turn the corpus into a data base that could
    >> > [...]
    >>
    >> Andrew Harley replies:
    >> > Instead of a corpus, you might want to consider using an existing
    >> > dictionary which gives examples of idioms in context, e.g. the Cambridge
    >> > International Dictionary of Idioms. This is available as SGML data for
    >>
    >> Sorry to appear pedantic, but what would a `corpus of proverbs' look
    >> like? I would think no such thing could exist, just like you couldn't
    >> have a corpus of past tense sentences. Instead, you have a corpus of,
    >> say, written fiction, which you can use to compile a list/database of
    >> proverbs, but that would not be a corpus, but a, erm, list or
    >> database (or even a dictionary).
    >>
    >> My understanding of `corpus' is that it is some more or less
    >> homogeneous collection of utterances, but not `filtered', ie if you
    >> selected all sentences containing proverbs you would end up with a
    >> list, not a (sub)corpus.
    >>
    >> Do other people think different/the same?
    >>
    >> Oliver
    >>
    >> --
    >> //\\ computer officer | corpus research | department of english | school of -
    >> //\\ humanities | university of birmingham | edgbaston | birmingham b15 2tt -
    >> \\// united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668/\ -
    >> \\// mobile 07050 104504 | http://www.clg.bham.ac.uk | o.mason@bham.ac.uk\/ -



    This archive was generated by hypermail 2b29 : Fri Jan 28 2000 - 12:19:47 MET