comparisons in text corpora: keywords / CHI square

Marc Weeber (M.Weeber@farm.rug.nl)
Thu, 29 Aug 1996 12:14:01 CET

Hello corpora people,

At the moment, I'm trying to isolate certain areas in a corpus to
extract area-specific keywords. The corpus consists of abstracts of
medical articles concerning one drug. I'm interested in extracting
the side effects of this drug. I have located the areas concerning
side effects, and I want to compare these areas with the rest of the
corpus. The method I'm using is the keyword program of the WordSmith
Tools package. This program compares the frequencies of words between the
subset and the complete corpus. Words that are significantly more
frequent in the subset than in the complete set (chi-square test) are
called `keywords' of the subset.
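
For readers who want to see the arithmetic, here is a minimal sketch of
such a frequency comparison for a single word. It assumes a plain
Pearson chi-square on a 2x2 table (word vs. all other tokens, subset
vs. reference) without continuity correction; this is not necessarily
what WordSmith Tools computes internally. The function names and the
counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]]. Returns 0.0 if any marginal total is zero."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def keyness(word_in_subset, subset_size, word_in_ref, ref_size):
    """Chi-square score for one word: subset counts vs. reference counts.

    Rows are subset / reference, columns are 'this word' / 'other tokens'.
    """
    a = word_in_subset
    b = subset_size - word_in_subset
    c = word_in_ref
    d = ref_size - word_in_ref
    return chi_square_2x2(a, b, c, d)


# Made-up counts: a word seen 30 times in a 1,000-token subset
# and 40 times in a 10,000-token reference list.
score = keyness(30, 1000, 40, 10000)
# 3.84 is the usual critical value for 1 degree of freedom at p = 0.05.
print(score, score > 3.84)
```

The score alone does not say in which direction the word deviates, so a
keyword list would also need to check that the relative frequency in
the subset is the higher one.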

Now I have two questions:

1 What exactly should I use as the reference corpus: the complete
corpus of abstracts, or the complete corpus minus the subset? In the
former case, words that occur in the subset are counted twice (in the
subset and in the reference corpus), so the results will be more
conservative than in the latter case. However, I don't know which
method to use, which leads to the second question:
2 Can someone give me more background on the use of keywords as a
means of comparison between two sets (actually, lists of words)?
Comments, references to books, articles, URLs, etc., would be much
appreciated.
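
To make the difference in question 1 concrete, here is a small sketch
that scores the same word against both candidate reference lists. It
assumes a Pearson chi-square on a 2x2 table without continuity
correction; the counts and function names are invented for
illustration and do not come from the actual corpus.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def keyness(word_in_subset, subset_size, word_in_ref, ref_size):
    """Chi-square score for one word, subset vs. reference."""
    return chi_square_2x2(
        word_in_subset, subset_size - word_in_subset,
        word_in_ref, ref_size - word_in_ref,
    )


# Hypothetical counts for one word.
subset_hits, subset_size = 30, 1000     # side-effect areas
corpus_hits, corpus_size = 70, 11000    # complete corpus, subset included

# Option A: reference = complete corpus (subset tokens counted twice).
score_whole = keyness(subset_hits, subset_size, corpus_hits, corpus_size)

# Option B: reference = complete corpus minus the subset.
score_rest = keyness(
    subset_hits, subset_size,
    corpus_hits - subset_hits, corpus_size - subset_size,
)

print("reference = whole corpus:", round(score_whole, 2))
print("reference = corpus minus subset:", round(score_rest, 2))
```

With these invented counts the score against the whole corpus is the
lower of the two, which matches the "more conservative" behaviour
described in question 1. Note also that in option A the two word lists
overlap, whereas the chi-square test is normally applied to independent
samples.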

thanks in advance,

Marc Weeber
marc@farm.rug.nl