> 1 what exactly should I use as reference corpus: the complete corpus
> of abstracts or the complete corpus minus the subset. In the former
You should certainly not use the complete corpus, since it contains the subset.
Doing so will distort the Chi-squared values you get, because the two samples
are no longer 'independent'. So use the corpus minus the subset ...
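
To make the point concrete, here is a minimal sketch in Python of how the comparison might be set up. It assumes a plain Pearson Chi-squared on a 2x2 table (word vs. all other words, subset vs. reference) with no continuity correction, and that the subset's tokens are all contained in the complete corpus. The function names and the toy token lists are hypothetical, chosen only for illustration.

```python
from collections import Counter


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared for the 2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column total is zero, so the statistic is undefined; report 0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def keyword_scores(subset_tokens, full_tokens):
    """Score every word by comparing the subset against the corpus MINUS the subset."""
    subset = Counter(subset_tokens)
    full = Counter(full_tokens)
    # Counter subtraction drops zero counts, leaving only the rest of the corpus.
    reference = full - subset
    n_sub = sum(subset.values())
    n_ref = sum(reference.values())
    scores = {}
    for word in full:
        a = subset[word]       # occurrences of the word in the subset
        c = reference[word]    # occurrences of the word in the reference
        scores[word] = chi_squared_2x2(a, n_sub - a, c, n_ref - c)
    return scores


# Toy data, purely illustrative.
subset_tokens = ["x", "x", "y"]
full_tokens = ["x", "x", "y", "y", "y", "z"]
scores = keyword_scores(subset_tokens, full_tokens)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, round(score, 3))
```

Sorting by score in descending order puts the candidate keywords at the top. With real data the usual caveat applies: the statistic is unreliable where the expected frequencies are very small.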
> 2 can someone give me more background on the use of keywords as
> means of comparison between two sets (*actually, list of words).
> Comments, references to books, articles, URLs, etc., would be much
> appreciated.
Might be worth looking at:
Adam Kilgarriff and Raphael Salkie. Corpus similarity and homogeneity via word
frequency. Euralex '96, Gothenburg, Sweden.
Adam (Adam.Kilgarriff@itri.brighton.ac.uk) has done a lot of work recently in
this area.
Another obvious one is:
Francis, W.N. and Kucera, H. (1982). Frequency Analysis of English Usage. Houghton
Mifflin Company, Boston.
I have a paper forthcoming in which we used Chi-squared comparison on vocabulary
in the British National Corpus (spoken part):
Rayson, P., Leech, G., and Hodges, M. (forthcoming). Social differentiation in
the use of English vocabulary: some analyses of the conversational component of
the British National Corpus. International Journal of Corpus Linguistics. John
Benjamins, Amsterdam/Philadelphia.
Regards,
Paul.
Email: paul@comp.lancs.ac.uk
Phone: +44 1524 65201 extension 3262
Fax:   +44 1524 593608
WWW:   http://www.comp.lancs.ac.uk/computing/users/paul/
Post:  CSEG Research Centre, Computing Department,
       Lancaster University, Lancaster, LA1 4YR, UK.