Re: Corpora: HELP

Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Mon, 7 Sep 1998 16:41:19 +0100

Duane,

A word frequency list for conversational British English
is retrievable from my website. README attached below - you'll
be interested in the 'regular conversation' files, called 'demog'
(it's a DEMOGraphic sample).

Please credit me, the University of Brighton, and the BNC in any
use of these lists.

Yours,

Adam Kilgarriff

README for ftp.itri.bton.ac.uk/pub/bnc
======================================

Adam Kilgarriff
20 Nov 1995
Updated 15 March 1996

The files in this directory relate to the British National Corpus (BNC).

They are a bibliographical database, various frequency lists,
and a file giving variances of word frequencies (details in
variances.doc).

bib-dbase   a one-line-per-file bibliographic database for the
            4124 files in the BNC. (The first part of the file
            describes the coding scheme.)

Frequency lists:

These are all available in 6 forms:

* sorted alphabetically ("al")
  or by frequency (highest frequency first) ("num");
* the complete lists, or a smaller file containing only those
  items occurring over five times (suffix "o5");
* all lists are available compressed using gzip (".gz"). The
  o5 lists are also available uncompressed (no suffix).
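
A minimal sketch, in Python, of opening one of these lists whether or
not it is gzip-compressed. The helper name open_list and the character
encoding are assumptions for illustration; this README does not state
the encoding or spell out full file names.

```python
import gzip


def open_list(path):
    """Open a frequency list as text, decompressing if the name ends in .gz."""
    # Assumption: the encoding is not documented here. latin-1 maps every
    # byte to a character, so reading never fails on unexpected bytes.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="latin-1")
    return open(path, "r", encoding="latin-1")
```

The same call then works for a compressed complete list and for an
uncompressed o5 list, so later processing does not need to care which
variant was downloaded.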

The frequencies are for <CLAWS-word, POS> pairs. NB some CLAWS words
- eg "in spite of" - are not orthographic words, while others are
numbers etc, and some POS's are CLAWS 'portmanteau tags', eg NN1-VVB,
where CLAWS was uncertain as to whether the word was a singular common
noun or base form of a verb. See the BNC manual for serious documentation,
also my "Putting frequencies in the dictionary" (available via www home
page, see address below) for detailed discussion of frequency lists.

The format is: four fields, separated by spaces.

1: frequency
2: word
3: pos
4: number of files the word occurs in

For non-orthographic words, spaces are replaced by underscore, giving
eg "in_spite_of"
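
The four-field format can be read with a few lines of Python. This is
only a sketch: the names Entry and parse_line are invented here, and it
assumes that every underscore in the word field stands for a space and
that a hyphen in the POS field always marks a portmanteau tag.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    freq: int              # field 1: frequency
    word: str              # field 2: word, underscores turned back into spaces
    pos: Tuple[str, ...]   # field 3: one tag, or two for a portmanteau tag
    n_files: int           # field 4: number of files the word occurs in


def parse_line(line):
    """Parse one 'frequency word pos files' line into an Entry."""
    freq, word, pos, n_files = line.split()
    return Entry(
        freq=int(freq),
        # Assumption: underscores only ever replace spaces in multi-word units.
        word=word.replace("_", " "),
        # Assumption: a hyphen only appears in portmanteau tags such as NN1-VVB.
        pos=tuple(pos.split("-")),
        n_files=int(n_files),
    )
```

For instance, given a made-up line such as "120 in_spite_of PRP 85" (the
numbers and the tag are invented for illustration), parse_line returns
an Entry whose word is "in spite of"; a POS field of NN1-VVB comes back
as the pair ("NN1", "VVB").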

cg        'context-governed' spoken material
          (eg meetings, lectures etc)        6.2M tokens,   79,906 types
demog     'demographic' spoken material
          (eg conversation)                  4.2M tokens,   54,652 types
written                                     89.7M tokens,  921,074 types
all                                        100.1M tokens,  939,028 types

Sizes in MB ("al" and "num" variants all the same size)

          all uncompressed    .gz      o5     o5.gz
-------------------------------------------------------------
all             18.1          4.8      4.0     1.32
cg               1.4          0.39     0.43    0.15
demog            0.9          0.26     0.25    0.09
written         17.8          4.7      3.9     1.30
-------------------------------------------------------------

For further information on the BNC see

http://info.ox.ac.uk/bnc

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Research Fellow                              tel: (44) 1273 642919
Information Technology Research Institute         (44) 1273 642900
University of Brighton                       fax: (44) 1273 642908
Lewes Road
Brighton BN2 4AT            email: Adam.Kilgarriff@itri.bton.ac.uk
UK                     http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%