Re: comparisons in text corpora: keywords / CHI square

Paul Rayson (paul@comp.lancs.ac.uk)
Thu, 29 Aug 1996 13:49:17 +0100 (BST)

Dear Marc,

> 1 what exactly should I use as reference corpus: the complete corpus
> of abstracts or the complete corpus minus the subset. In the former

You should certainly not use the complete corpus, since it contains the subset.
Doing so will distort the resulting Chi-squared values, because the two samples
are no longer 'independent'. So use the corpus minus the subset ...
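
To make the point concrete, here is a minimal sketch in Python of a Pearson
Chi-squared value for a single word, computed from a 2x2 table (occurrences of
the word vs. all other tokens, in each of the two samples). The function name
and all the counts are made up for illustration; they are not taken from any
real corpus, and no continuity correction is applied.

```python
def chi_squared_2x2(freq_a, size_a, freq_b, size_b):
    """Pearson chi-squared (no continuity correction) for one word.

    freq_a, freq_b: occurrences of the word in samples A and B
    size_a, size_b: total number of tokens in samples A and B
    """
    observed = [
        [freq_a, size_a - freq_a],  # sample A: the word, all other tokens
        [freq_b, size_b - freq_b],  # sample B: the word, all other tokens
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, total - (freq_a + freq_b)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts, for illustration only.
whole_corpus_size = 1_000_000   # tokens in the complete corpus
whole_corpus_freq = 500         # occurrences of the word in the complete corpus
subset_size = 100_000           # tokens in the subset
subset_freq = 120               # occurrences of the word in the subset

# Reference corpus = complete corpus minus the subset (independent samples).
rest_size = whole_corpus_size - subset_size
rest_freq = whole_corpus_freq - subset_freq
chi2_independent = chi_squared_2x2(subset_freq, subset_size, rest_freq, rest_size)

# Reference corpus = complete corpus (the subset is counted in both samples).
chi2_overlapping = chi_squared_2x2(
    subset_freq, subset_size, whole_corpus_freq, whole_corpus_size
)

print(chi2_independent, chi2_overlapping)
```

With these made-up counts the overlapping comparison gives a smaller value than
the independent one, because the subset's own tokens are counted on both sides
of the table.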

> 2 can someone give me more background on the use of keywords as
> means of comparison between two sets (*actually, list of words).
> Comments, references to books, articles, URLs, etc., would be much
> appreciated.

Might be worth looking at:

Adam Kilgarriff and Raphael Salkie. Corpus similarity and homogeneity via word
frequency. Euralex '96, Gothenburg, Sweden.

Adam (Adam.Kilgarriff@itri.brighton.ac.uk) has done a lot of work recently in
this area.

Another obvious one is:

Francis, W.N. and Kucera, H. (1982). Frequency Analysis of English Usage.
Houghton Mifflin Company, Boston.

I have a paper forthcoming in which we used Chi-squared comparison on vocabulary
in the British National Corpus (spoken part):

Rayson, P., Leech, G., and Hodges, M. (forthcoming). Social differentiation in
the use of English vocabulary: some analyses of the conversational component of
the British National Corpus. International Journal of Corpus Linguistics. John
Benjamins, Amsterdam/Philadelphia.

Regards,
Paul.

EMAIL: paul@comp.lancs.ac.uk
Phone: +44 1524 65201 extension 3262
Fax:   +44 1524 593608
WWW:   http://www.comp.lancs.ac.uk/computing/users/paul/
Post:  CSEG Research Centre, Computing Department, Lancaster University,
       Lancaster, LA1 4YR, UK.