Re: Corpora: history of corpora

Bill Fisher (william.fisher@nist.gov)
Tue, 1 Dec 1998 12:49:20 -0500

On Dec 1, 2:29pm, Oliver Mason wrote:

...
> .... A corpus is a special collection of textual material
> collected according to a certain set of criteria, like the BNC or the
> BoE, or Brown, COLT, Flob, LOB, whatever. They all made decisions
> about the composition of their data in advance and selected it
> accordingly.
...
> ... I am worried that the term `corpus' gets watered down too
> much if it is basically used the same way as `archive'. An archive is
> less focussed on doing things with its data, and mainly concerned with
> storage, archival, and retrieval of its elements.
...

"Corpus" has an older and more general use which is captured
very well by your definition of "archive". Why don't we just
go by dictionary definitions? Here are 2 relevant ones, from
Webster's 3rd Unabridged:

"3a: the whole body or total amount of writings of a
particular kind or on a particular subject (as the total
production of a writer or the whole literature of a subject)
...
b: a collection or body esp. of knowledge or evidence;
specif : the collection of recorded utterances that is used
as a basis for the descriptive analysis of a language or dialect"

Other standard dictionaries have similar definitions. Note
that there is no reference to criteria for selection, or to
uniformity of storage and retrieval. I think you're trying
to water up the term 'corpus' unnecessarily.

- Bill F.