Re: Corpora: Corpus of scientific texts

DL (d.lee@lancaster.ac.uk)
Fri, 23 Oct 1998 07:41:18 +0100 (BST)

Geoffrey Williams wrote:

>Like most people interested in specialised corpora, I am seeking
>domain- and topic-specific material. The BNC is not specialised enough
>to cover this; it wasn't their brief. When working on research it is
>vital to distinguish between the different genres; mixing New Scientist
>and the Lancet would be ridiculous. In addition, the genre types in the
>Lancet differ greatly from those in Nature, which in turn differ from
>those in specialised academic research articles.

Hmmm...perhaps I wasn't clear in my original message that those texts
from the journals and magazines I mentioned were all *separate* texts,
not all bundled into one large file called 'Applied Science'.

This is in keeping with the design criteria of the BNC's builders...
the idea is to make it flexible enough so that people can create their
_own_ subcorpora if they so wish. (The header of each SGML file clearly
documents practically everything you need to know about the source,
audience, author, etc. of the text.)

>Personally, I believe specialisation means building your own corpus. Not
>that tough, takes time but at least you control the parameters. All
>depends on your purposes.

Certainly. But I suspect that very large, general corpora such as the
BNC are actually useful for a lot more specialised research projects
than people imagine... the BNC contains most of the types of text which
a large variety of researchers will want to examine, plus a lot more. By
studying the bibliographical database of BNC texts, it is possible to
pick and choose (nicely prepared and tagged) texts for your own
purposes. Why reinvent the wheel (or, more pointedly, why go through
the expense and trouble of building a second bridge when a perfectly
usable one already exists which everyone else has access to)?
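
To make the pick-and-choose idea concrete, here is a minimal sketch in
Python. It assumes the per-text information has already been copied out
of the headers into simple records; the record fields, the text IDs and
the three sample entries are invented for illustration and are not the
BNC's actual header format or catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """One catalogue entry. All field names here are hypothetical."""
    text_id: str   # made-up identifier, not a real BNC file ID
    source: str    # publication the text was taken from
    domain: str    # broad subject label
    audience: str  # intended readership


def select_subcorpus(catalogue, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record
        for record in catalogue
        if all(getattr(record, field) == wanted
               for field, wanted in criteria.items())
    ]


# Invented sample catalogue: each text is a separate entry, so texts
# from different publications are never forced into one bundle.
CATALOGUE = [
    TextRecord("T001", "New Scientist", "applied science", "general"),
    TextRecord("T002", "The Lancet", "applied science", "specialist"),
    TextRecord("T003", "Nature", "natural science", "specialist"),
]

# Two different subcorpora drawn from the same catalogue.
lancet_only = select_subcorpus(CATALOGUE, source="The Lancet")
specialist = select_subcorpus(CATALOGUE, audience="specialist")
```

Because each text stays a separate entry, the same collection can
yield a Lancet-only subcorpus for one project and a broader specialist
one for another, without anyone having to rebuild the corpus.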

David Lee

P.S. I am not being paid to advertise the BNC :-). And actually, I
would like to gripe that the BNC does not have clearly marked,
separate texts of _telephone_ conversations (which I wanted to include
in my own research). It has files of spontaneous conversations which
sometimes include telephone dialogue (e.g. in demographic files where
the respondent is taping everything that goes on in the living room,
including the odd phone call), but these are subsets of files, not
separate files, and it's very difficult to extract these telephone
conversations to form coherent texts.

------------------------------------------------------------------------

David Lee
Dept of Linguistics
Lancaster University
Lancaster LA1 4YT
England, UK.

Email: D.Lee@lancaster.ac.uk
-------------------------------------------------------------------------