Re: comparisons in text corpora: keywords / CHI square

Ted E. Dunning (ted@aptex.com)
Thu, 29 Aug 1996 17:31:25 -0700

Ted Dunning writes:

> [... chi^2 is a very bad choice ... G^2 may be good ]

Ted Dunning's message hit the nail on the head. To add another angle
to what Ted has already written, as yet another alternative to chi^2
testing you may want to look at exact tests such as Fisher's Exact
test or the exact conditional. These tests are designed to be used
with very sparse and skewed samples.

Ted Pedersen's message also hit the nail pretty well.

it should be pointed out that Fisher's exact test can be enormously
expensive to compute, particularly when the counts are large or the
table is bigger than 2x2. happily, a good approximation can be had
from the bootstrap technique.

essentially, the bootstrap applied to significance testing amounts to
Fisher's exact test computed by Monte Carlo sampling rather than by
exhaustive enumeration. the bootstrap is considerably more general
than this, but in this application, the analogy holds very well.
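
to make the analogy concrete, here is a minimal sketch in Python of a
Monte Carlo approximation to the one-sided Fisher test on a 2x2 table.
strictly speaking this is a randomization (permutation) test with both
margins held fixed, not a full bootstrap. the function name, the
parameter names and the add-one correction are my own choices for
illustration, and shuffling one label per observation is only practical
for modest totals.

```python
import random


def monte_carlo_fisher(a, b, c, d, n_samples=10000, seed=0):
    """Estimate the one-sided p-value P(A >= a) for the 2x2 table
    [[a, b], [c, d]] with row and column totals held fixed.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    row1 = a + b
    # One label per observation: 1 = first column, 0 = second column.
    labels = [1] * (a + c) + [0] * (b + d)
    hits = 0
    for _ in range(n_samples):
        # Randomly reassign observations to rows; margins stay fixed.
        rng.shuffle(labels)
        # The first row1 shuffled observations form the first row.
        a_sim = sum(labels[:row1])
        if a_sim >= a:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_samples + 1)


if __name__ == "__main__":
    # Made-up counts: a word seen 10 times in 10 tokens of one sample
    # and 0 times in 10 tokens of the other.
    print(monte_carlo_fisher(10, 0, 0, 10))
```

the estimate can never be smaller than 1/(n_samples + 1), so very small
p-values need correspondingly many samples.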

some good references on the bootstrap include the following:

@book{efron82,
author={Bradley Efron},
title={The Jackknife, the Bootstrap and Other Resampling Plans},
publisher={SIAM},
year={1982}
}

@article{efron91,
author={Bradley Efron and Robert Tibshirani},
title={Statistical Data Analysis in the Computer Age},
journal={Science},
volume={253},
number={5018},
year={1991}
}