Re: word frequency lists?

Adam Kilgarriff (ak28@it-research-institute.brighton.ac.uk)
Fri, 24 Nov 95 11:04:24 GMT

Henry/Ted,

Of course different discourses have very different word frequency
lists, and this means that, for any NLP application, the most useful
lists are ones drawn from the relevant corpus. But, as Richard
Piepenbrock points out, that's only one sort of purpose and a general
list serves lots of other purposes - such as investigating the mental
lexicon. We presumably access much the same mental representation for
a word used with similar meaning in different sublanguages. Sublanguage
vocabularies are not unrelated to each other. To say "each
sublanguage unto itself" is to throw away the possibility of a general
theory of word frequency. That is not something psycholinguists
would be happy about, nor something NLP should be happy about.

Of course there are problems with lists from general purpose corpora
such as the BNC (my note re availability of BNC freq lists re-posted
below), and maybe the lists should include a measure of variability as
well as the total count (I have variance & IDF figures for parts of the BNC
- mail me if interested) but writing off lists based on general
purpose corpora because they don't accurately reflect the frequencies
of any given sublanguage is like saying the Dow Jones is useless
because it doesn't tell me about Exxon shares.
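
For anyone wanting to try variability measures of this kind themselves,
here is a minimal sketch. It assumes the textbook definitions (IDF as the
natural log of total files over files containing the word, and population
variance of per-file counts); these are not necessarily the definitions
behind my own figures. The function names and the numbers in the usage
lines are hypothetical.

```python
import math

def idf(n_files_total, n_files_with_word):
    """Inverse document frequency: log(total files / files containing the word).

    Assumes the natural log; other bases only rescale the values.
    """
    if n_files_with_word <= 0 or n_files_with_word > n_files_total:
        raise ValueError("file count must be between 1 and the total")
    return math.log(n_files_total / n_files_with_word)

def variance(per_file_counts):
    """Population variance of a word's counts across files."""
    mean = sum(per_file_counts) / len(per_file_counts)
    return sum((c - mean) ** 2 for c in per_file_counts) / len(per_file_counts)

# Hypothetical numbers, for illustration only.
evenly_spread = idf(100, 100)   # word occurs in every file
bursty = idf(100, 10)           # word occurs in one file in ten
spread = variance([0, 4])       # two files, counts 0 and 4
```

A word found in every file gets an IDF of zero; the rarer and burstier the
word, the higher both figures go.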

Of course it's hard arriving at theory - that doesn't mean we shouldn't
try.

Re: "right" and "left" - form words are generally hugely more frequent
than content words so it's not surprising that the 'direction' sense
of "right" is dwarfed by the discourse marker, even in a text about
directions. Hey! It's almost theory!

Re: Jeremy Clear's (rhetorical) question on stability of core vocab -
yes, almost all - 96% - of the top 3,000 words in one large
general-purpose corpus are also in the top 3,000 in another.
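
A minimal sketch of how an overlap figure of this kind could be computed,
assuming each corpus is represented as a list of words sorted by
descending frequency. The function name and toy word lists are
hypothetical, and this is not necessarily how the 96% figure was
produced.

```python
def top_n_overlap(ranked_a, ranked_b, n):
    """Fraction of the top-n words of ranked_a also in the top-n of ranked_b."""
    top_a = set(ranked_a[:n])
    top_b = set(ranked_b[:n])
    if not top_a:
        return 0.0
    return len(top_a & top_b) / len(top_a)

# Toy rankings, invented for illustration; real lists would have n = 3000.
corpus_a = ["the", "of", "and", "to", "right"]
corpus_b = ["the", "and", "of", "a", "to"]
overlap = top_n_overlap(corpus_a, corpus_b, 4)
```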

Re: Mark Johnson's comments:
> What implications does this have for broad-coverage parsing? Should
> we be looking for systems that try to automatically adapt to (i.e.,
> learn) the domain they are given to parse?

Yes, of course we should, but we also need more theory to tell us how
the adaptation should relate to the starting point (eg general purpose
grammars and lexicons) we are adapting from.

Adam

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff                              tel:   (44) 1273 642919
Research Fellow                                     (44) 1273 642900
Information Technology Research Institute    fax:   (44) 1273 606653
University of Brighton
Lewes Road                                   email:
Brighton BN2 4AT                             ak28@itri.bton.ac.uk
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

BNC Word-POS frequency lists available
======================================

Frequency lists for English word-POS pairs, produced from the British
National Corpus, are available for ftp (username: "anonymous") from
our ftp site,

ftp.itri.bton.ac.uk

in directory

pub/bnc

Frequency lists are available for:

list     material                                                tokens    types
cg       'context-governed' spoken (eg meetings, lectures etc)     6.2M   79,906
demog    'demographic' spoken (eg conversation)                    4.2M   54,652
written  written material                                         89.7M  921,074
all      all material combined                                   100.1M  939,028

They are all available in 2 forms, sorted alphabetically ("al") or
by frequency (highest frequency first) ("num"). They are compressed
using gzip. The frequencies are for <CLAWS-word, CLAWS-POS> pairs.

NB1 Some CLAWS words - eg "in spite of" - are not orthographic words.
For non-orthographic words, spaces are replaced by underscore, giving
eg "in_spite_of".

NB2 Others are numbers and assorted other anomalies: in each list,
something like half the types are hapax legomena (ie they occur only once)
and most of these are not "words" but may be, eg, numbers, names,
word&punctuation combinations, ...

NB3 Some POS tags are CLAWS 'portmanteau tags', eg NN1-VVB, where CLAWS
was uncertain as to whether the word was a singular common noun or
base form of a verb.

See BNC manual for documentation.

The format is: four fields, separated by spaces.

1: frequency
2: word
3: pos
4: number of files the word occurs in
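
A minimal sketch of reading this four-field format, assuming fields are
separated by whitespace and that blank or malformed lines can be skipped.
The names `Entry`, `parse_lines` and `read_list` are hypothetical, the
latin-1 encoding is an assumption, and the sample lines are invented to
show the layout; they are not real BNC counts.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple

class Entry(NamedTuple):
    freq: int      # field 1: frequency
    word: str      # field 2: CLAWS word, eg "in_spite_of"
    pos: str       # field 3: CLAWS POS tag, possibly a portmanteau tag
    n_files: int   # field 4: number of files the word occurs in

def parse_lines(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per well-formed line; skip anything else."""
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # blank line or wrong number of fields
        freq, word, pos, n_files = parts
        try:
            yield Entry(int(freq), word, pos, int(n_files))
        except ValueError:
            continue  # non-numeric frequency or file count

def read_list(path: str) -> Iterator[Entry]:
    """Read a gzip-compressed list; the encoding is an assumption."""
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        yield from parse_lines(handle)

# Invented sample lines, for illustration only.
sample = ["120 in_spite_of PRP 45", "7 run NN1-VVB 5", "", "bad line"]
entries = list(parse_lines(sample))
```

Since multi-word items use underscores in place of spaces, splitting on
whitespace keeps each of them in a single field.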

For further information on the BNC see

http://info.ox.ac.uk/bnc

Comments and enquiries most welcome.