BNC word frequencies

Lou Burnard (lou@vax.ox.ac.uk)
Tue, 25 Feb 1997 11:12:00 +0000

From: OXVAXD::LOU "Lou Burnard" 25-FEB-1997 11:06:44.71
To: MX%"lcjohn@uxmail.ust.hk"
CC: LOU
Subj: Re: requests for corpora

|Agreed, but if you buy into the orthodoxy that there are representative
|corpora that somehow are a snapshot of the language as a whole (e.g., the
|BNC or Cobuild), then it must follow that the wordlists from at least
|these corpora are authoritative in some way.

I can't speak for COBUILD, but I would feel distinctly uneasy about the notion
that any frequency list derived from the BNC was in any way "representative" of
the language, or "authoritative". (Indeed, I have said as much in the past on
this very list.) The only claims the BNC makes are that (a) it is very large;
(b) its designers made a conscious effort to include samples from as many
radically different text varieties as possible; and (c) they also made a
conscious effort to define target proportions for each variety. The criteria by which varieties
were defined and their target proportions are all documented in the BNC manual,
and also on our web site (http://info.ox.ac.uk/bnc). I continue to regard with
deep suspicion the notion (beloved of some) that there is a "core" vocabulary
in any language. There is such a thing in a given sample of the language (such
as the BNC, or the Wall Street Language, or all-the-English-I-need-to-know-to-
pass-my-exams, which is what I assume concerns John here). There isn't any such
thing in the English language, any more than there is such a thing as syntax
independent of lexis!

|But since most of the world
|doesn't have access to these or other large and 'principally' collected
|archives, then is it so unusual that people are looking for some type of
|representative wordlist? I've had to rely on the generosity of members of
|this conference (thanks!) for such wordlists.

I know this is a sore point with several non-Europeans, and we are still trying
our best to get the restrictions on access to the BNC lifted. However, your
note reminds me that there is absolutely no reason why the BNC word frequency
list should not be distributed freely. Indeed, Adam Kilgarriff's web site
already does so, along with several interesting and intelligent papers about
word frequency in the BNC (see http://www.itri.bton.ac.uk/~Adam.Kilgarriff). Be
warned, however, that we are currently recompiling the BNC word index using a
more accurate procedure. This new index will be used by the British Library's
online service, and the word frequencies will all be subtly different as a
result.