Word frequency lists available

Adam Kilgarriff (ak28@it-research-institute.brighton.ac.uk)
Mon, 20 Nov 95 11:32:45 GMT

BNC Word-POS frequency lists available
======================================

Frequency lists for English word-POS pairs, produced from the British
National Corpus, are available for ftp (username: "anonymous") from
our ftp site,

ftp.itri.bton.ac.uk

in directory

pub/bnc

Frequency lists are available for:

cg       'context-governed' spoken material
         (eg meetings, lectures etc)          6.2M tokens,   79,906 types
demog    'demographic' spoken material
         (eg conversation)                    4.2M tokens,   54,652 types
written                                      89.7M tokens,  921,074 types
all                                         100.1M tokens,  939,028 types

They are all available in 2 forms, sorted alphabetically ("al") or
by frequency (highest frequency first) ("num"). They are compressed
using gzip. The frequencies are for <CLAWS-word, CLAWS-POS> pairs.

NB1 Some CLAWS words - eg "in spite of" - are not orthographic words.
For non-orthographic words, spaces are replaced by underscore, giving
eg "in_spite_of".

NB2 Other CLAWS words are numbers or assorted anomalies: in each list,
something like half the types are hapax legomena (ie occur only once)
and most of these are not "words" but may be, eg, numbers, names,
word&punctuation combinations, ...

NB3 Some POS's are CLAWS 'portmanteau tags', eg NN1-VVB, where CLAWS
was uncertain as to whether the word was a singular common noun or
base form of a verb.
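
For anyone post-processing the lists, NB1 and NB3 might be handled
along these lines in Python. This is only a sketch: the helper names
are hypothetical, and it assumes that an underscore always stands for
a space and that a hyphen appears in a POS field only as the separator
in a portmanteau tag.

```python
from typing import List


def orthographic_form(claws_word: str) -> str:
    """Undo the underscore convention of NB1: in_spite_of -> in spite of."""
    return claws_word.replace("_", " ")


def candidate_tags(pos: str) -> List[str]:
    """Split a portmanteau tag (NB3) into the tags CLAWS could not choose between.

    An ordinary tag comes back as a one-element list.
    """
    return pos.split("-")


print(orthographic_form("in_spite_of"))
print(candidate_tags("NN1-VVB"))
```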

See the BNC manual for documentation of the CLAWS tags.

The format is: four fields, separated by spaces.

1: frequency
2: word
3: pos
4: number of files the word occurs in
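
A minimal Python sketch of reading this format from a gzipped list is
given below. The names Entry and read_entries are hypothetical, the
latin-1 encoding is an assumption, and the sample data is invented and
compressed in memory so that the example runs without downloading
anything.

```python
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class Entry(NamedTuple):
    freq: int   # field 1: frequency
    word: str   # field 2: CLAWS word
    pos: str    # field 3: CLAWS POS tag
    files: int  # field 4: number of files the word occurs in


def read_entries(stream: TextIO) -> Iterator[Entry]:
    """Yield one Entry per well-formed four-field line."""
    for line in stream:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        freq, word, pos, files = parts
        yield Entry(int(freq), word, pos, int(files))


# Invented sample in the four-field format; the counts are illustrative only.
sample = "120 in_spite_of PRP 85\n7 walk NN1-VVB 6\n"
compressed = gzip.compress(sample.encode("latin-1"))

# With a real download, a file path would be passed to gzip.open instead.
with gzip.open(io.BytesIO(compressed), "rt", encoding="latin-1") as fh:
    entries = list(read_entries(fh))

for entry in entries:
    print(entry.freq, entry.word, entry.pos, entry.files)
```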

For further information on the BNC see

http://info.ox.ac.uk/bnc

Comments and enquiries most welcome,

Adam Kilgarriff

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff                              tel:   (44) 1273 642919
Research Fellow                                     (44) 1273 642900
Information Technology Research Institute    fax:   (44) 1273 606653
University of Brighton
Lewes Road                                   email:
Brighton BN2 4AT                             ak28@itri.bton.ac.uk
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%