Re: [Corpora-List] fisher's exact test

From: ted pedersen (tpederse@d.umn.edu)
Date: Fri Nov 12 2004 - 03:02:54 MET

Next message: Marco Baroni: "Re: [Corpora-List] TERM EXTRACTION TOOLS"

Previous message: Beek L.J.van der: "[Corpora-List] fisher's exact test"
In reply to: Beek L.J.van der: "[Corpora-List] fisher's exact test"
Next in thread: Beek L.J.van der: "Re: [Corpora-List] fisher's exact test"
Next in thread: ted pedersen: "Re: [Corpora-List] fisher's exact test"
Next in thread: Marco Baroni: "Re: [Corpora-List] TERM EXTRACTION TOOLS"
Reply: Beek L.J.van der: "Re: [Corpora-List] fisher's exact test"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> Does anyone know a good Perl implementation of Fisher's Exact Test for
> very skewed distributions (with frequencies ranging from 0 to 1000+)?
>
> I've tried the NSP-package (version 0.71), but it doesn't always give the
> correct results. Has anyone noticed (or even better: fixed) this before?
>

NSP's exact tests work pretty well for skewed distributions. However, the
code generally assumes that the data comes from ngram counts, and that
assumption is what has led to the problem above.

In particular, if we have a 2x2 table representing the bigram counts:

n11 n12 | n1p
n21 n22 | n2p
---------
np1 np2 npp

n11 represents the number of times w1 and w2 occur together, n12
represents the number of bigrams where w1 is the first word and w2 is not
the second, etc. Typically n22 is very large (since it counts all the
other bigrams in the sample, those with neither w1 first nor w2 second).
Of course n11 is much smaller than the sample size, making the
distribution quite skewed.
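To make the notation concrete, here is a minimal Python sketch (not NSP code; the function name bigram_table is made up for illustration) that fills in the four cells from the joint count n11 and the marginals n1p, np1 and npp:

```python
def bigram_table(n11, n1p, np1, npp):
    """Return (n11, n12, n21, n22) given the joint count and the marginals.

    Hypothetical helper for illustration only; it is not part of NSP.
    """
    n12 = n1p - n11                # w1 first, something other than w2 second
    n21 = np1 - n11                # something other than w1 first, w2 second
    n22 = npp - n1p - np1 + n11    # neither w1 first nor w2 second
    if min(n11, n12, n21, n22) < 0:
        raise ValueError("counts are inconsistent with the marginals")
    return n11, n12, n21, n22
```

For typical bigram data npp is far larger than n1p and np1, so n22 ends up as by far the largest cell.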

Now, in the case of this user, the data is more like this:

10  2 | 12
 3  1 |  4
----------
13  3   16

which aren't ngram counts, of course. So here n22 < n11, and that
actually causes a problem for our exact test implementation! Implicit in
our code is the faulty assumption that people would only be using this
for collocations, and I'm glad to see I was wrong about that. :) But we
should make this limitation more clearly known, and better yet we should
just fix it, which I think we will!

This is discussed in a bit more detail below:

http://groups.yahoo.com/group/ngram/messages/15
http://groups.yahoo.com/group/ngram/messages/17

Now in the case above, if the table is reorganized so that it is
(equivalently) shown as below, everything is fine. So NSP exact tests
(for now) require that n22 > n11.

 1  2 |  3
 3 10 | 13
----------
 4 12   16
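As a sanity check that the two arrangements really are equivalent, here is a small Python sketch (again not NSP code; the function names are invented for illustration, and it computes a right-sided test, which is an assumption about which tail you want). It sums hypergeometric probabilities with the marginals held fixed, and includes a workaround that swaps n11 and n22 when n22 < n11:

```python
from fractions import Fraction
from math import comb

def fisher_right(n11, n12, n21, n22):
    """Right-sided Fisher's exact test: P(X >= n11) with marginals fixed."""
    n1p, n2p = n11 + n12, n21 + n22
    np1 = n11 + n21
    npp = n1p + n2p
    total = 0
    # Sum the hypergeometric numerators for every table at least as
    # extreme as the observed one (same marginals, larger or equal n11).
    for k in range(n11, min(n1p, np1) + 1):
        total += comb(n1p, k) * comb(n2p, np1 - k)
    return Fraction(total, comb(npp, np1))

def reorient(n11, n12, n21, n22):
    """Hypothetical workaround: swap n11 and n22 when n22 < n11.

    This is the same as swapping both rows and both columns and then
    transposing, none of which changes the association being tested.
    """
    if n22 < n11:
        return n22, n12, n21, n11
    return n11, n12, n21, n22

original = (10, 2, 3, 1)
print(reorient(*original))
print(fisher_right(*original) == fisher_right(*reorient(*original)))
```

Exact fractions are used here only so that the two p-values can be compared without floating point noise; wrap the result in float() if you want a decimal.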

Cordially,
Ted

PS NSP turns 4 years old on November 30. Big party in Duluth, you are
all invited. :)

--
Ted Pedersen
http://www.d.umn.edu/~tpederse


This archive was generated by hypermail 2b29 : Fri Nov 12 2004 - 03:09:52 MET