Re: Corpora: history of corpora

GCW (williams@ensinfo.univ-nantes.fr)
Wed, 2 Dec 1998 07:58:09 +0100 (MET)

I must say that I wholeheartedly agree with Oliver.
OK, the word corpus has a long history, but we all on this list know we are
talking about electronic corpora, and as specialists talking to
specialists, we should be clear about that. As all specialists know, you cannot
rely on a general-use dictionary to define specialist terminology.
Personally I quote Sinclair 1996, the EAGLES report, as being the nearest we
have, and maybe need, to an 'official' definition. The key words are
selection, representativeness and balance. Of course we all interpret these in
slightly different ways, as we all have radically different projects, but
the most important notion is that there are selection criteria, and we
should be clear about ours, as there is no ideal solution. In my opinion, the
'Le Monde' corpus, much used in NLP in France, is not a
corpus but simply 'Le Monde' for a given period. This is not to deny its
usefulness, just to be aware of the drawbacks. It is possible that not all
in NLP are aware of the drawbacks, as corpus building is not their brief;
it is therefore up to us to be clear.

On the other hand, it seems to me obvious that any history of electronic
corpora must include mention of the archives and collections as these are
part of our history and everyday working lives. The only proviso is to be
clear as to what distinguishes one from the other.

Best wishes

Geoffrey
williams@ensinfo.univ-nantes.fr

Faculte des Sciences et des Techniques
University of Nantes
France

On Tue, 1 Dec 1998, Oliver Mason wrote:

> > > Thesis and I would like to know when the following electronic corpora
> > > were compiled:
> >
> > > "The Oxford Text Archive";
> > > "International Computer Archive of Modern English".
>
> I don't want to split hairs or start an ideological flame war, but I
> personally wouldn't call those two `electronic corpora'. They're (as
> implied by the name) archives, which *contain* (amongst other data)
> corpora. A corpus is a special collection of textual material
> collected according to a certain set of criteria, like the BNC or the
> BoE, or Brown, COLT, Flob, LOB, whatever. They all made decisions
> about the composition of their data in advance and selected it
> accordingly.
>
> Also, they are homogeneous in the way they are stored/accessed. For
> the BNC you have got SARA, there's Lookup for the BoE, and CUP probably
> have their own special software for their corpus.
>
> Now, correct me if I'm wrong, but does the OTA do the same? Again, I
> DON'T want to criticise anything here, it's just a terminological
> distinction. I am worried that the term `corpus' gets watered down too
> much if it is basically used the same way as `archive'. An archive is
> less focussed on doing things with its data, and mainly concerned with
> storage, archival, and retrieval of its elements. If I want an
> electronic copy of a certain book I would use the OTA, but for
> concordance lines of some word I wouldn't.
>
> Anybody else agrees, disagrees?
>
> Oliver
>
> --
> //\\ computer officer | corpus research | department of english | school of -
> //\\ humanities | university of birmingham | edgbaston | birmingham b15 2tt -
> \\// united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668/\ -
> \\// mobile 07050 104504 | http://www-clg.bham.ac.uk | o.mason@bham.ac.uk\/ -
>
>
>