Corpora: corpus equilibrium?

Jean Hudson (jhudson@cup.cam.ac.uk)
Thu, 9 Oct 1997 14:03:40 +0100

Whoa there, Jem Clear...

I'm not suggesting for a moment that balancing corpora on the basis of word
frequency in texts is the way to God's truth about language. I agree with
the sea analogy - to the extent that, yes, it's a bucketful here, a
bucketful there, and measuring the magnesium content tells me nothing about
the depth to the sea bottom or what life forms are supported there. But you
have to agree that taking odd bucketfuls from different parts of the globe
on the basis of known facts about local conditions makes laboratory
investigations a squidge more informative than they would be if they were
based on what could be pumped through a direct pipeline in the Thames estuary.

One alternative, of course, is to admit defeat and go read a good book -
only to discover that there are more things in language than were dreamed of
in any philosophy.

Another is to treat your corpus as a tool - no more, no less. If you're
wanting to drive in a nail, you use a hammer, and for a screw, a screwdriver
(I like analogies, too). Some (like my colleague, on whose behalf I posted
the query) use corpora to study the vocabulary of the language. To them, a
balanced corpus is one which has texts that contain 'a bit of everything' in
the way of vocabulary. Some (like me) are interested in the effects of
different interpersonal relationships on form/meaning choices in discourse.
For me a balanced corpus is 'a bit of everyone'.

But seriously, I think there's been a slight miscommunication here,
probably caused by my use of that ever so loaded word 'balance'. In the BNC
debate I, too, would have sided with John and Willem - and yourself. It is
true that all too many corpus users carry out their work in an aura of
religious fervour, where The Corpus is The Bible. Of course it's not - it's
a pair of specs that might help you read a bit more clearly.

But enough's enough - I think I'll toddle off and get me another thimbleful...

Jean