Re: [Corpora-List] fisher's exact test

From: Beek L.J.van der (vdbeek@let.rug.nl)
Date: Fri Nov 12 2004 - 16:37:15 MET

Next message: Rob Koeling: "Re: [Corpora-List] corpus ------>>>>> thesaurus"

Previous message: Diana Maynard: "Re: [Corpora-List] TERM EXTRACTION TOOLS"
In reply to: ted pedersen: "Re: [Corpora-List] fisher's exact test"
Next in thread: ted pedersen: "Re: [Corpora-List] fisher's exact test"

hi all -again-

In case you have both n11>n22 and n11<n22 in your data set, you might
want to try changing '$i=1' into '$i=($n11 + 1)' in the leftFisher file of
the NSP package (v0.71, line 231). As $n11 is set earlier in the script to
the minimal value that does not lead to a negative n22, this should solve
the problem mentioned in Ted Pedersen's email, pasted below.
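
For readers without the script at hand, the idea behind the change can be
sketched independently of NSP. The following is a minimal Python
illustration, not the leftFisher source: it assumes the left-sided p-value
is the sum of hypergeometric probabilities from the smallest feasible n11
up to the observed n11, with the margins n1p, np1 and npp held fixed. The
function name left_fisher is made up for this example.

```python
from fractions import Fraction
from math import comb


def left_fisher(n11, n1p, np1, npp):
    """Left-sided Fisher's exact test: P(X <= n11) with the margins fixed."""
    # Smallest n11 that keeps n22 = npp - n1p - np1 + n11 non-negative.
    # This is 0 for typical ngram tables but can be larger in general.
    k_min = max(0, n1p + np1 - npp)
    denom = comb(npp, np1)
    total = Fraction(0)
    for k in range(k_min, n11 + 1):
        # Hypergeometric probability of a table whose top-left cell is k.
        total += Fraction(comb(n1p, k) * comb(npp - n1p, np1 - k), denom)
    return total


# The table from the quoted message: 10 2 / 3 1 (margins 12, 13, total 16).
p = left_fisher(10, 12, 13, 16)
print(p, float(p))
```

Exact integer arithmetic is used here so the lower bound of the sum is the
only thing that matters; for this table the smallest feasible n11 is 9, not
0.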

Of course, this does not solve the problem of underflow errors...
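
One common way around underflow, offered here only as a sketch and not as
anything NSP does, is to work with logarithms of the probabilities. The
Python example below assumes the same left-sided sum over feasible values
of n11 and uses the log-gamma function plus a log-sum-exp step; the names
log_comb and log_left_fisher are hypothetical.

```python
from math import exp, lgamma, log


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_left_fisher(n11, n1p, np1, npp):
    """Natural log of the left-sided Fisher p-value, computed in log space."""
    # Smallest n11 that keeps n22 non-negative.
    k_min = max(0, n1p + np1 - npp)
    log_denom = log_comb(npp, np1)
    terms = [
        log_comb(n1p, k) + log_comb(npp - n1p, np1 - k) - log_denom
        for k in range(k_min, n11 + 1)
    ]
    if not terms:
        # Observed n11 is below the feasible range: probability zero.
        return float("-inf")
    # log-sum-exp: factor out the largest term so exp() never underflows
    # for the terms that actually contribute to the sum.
    peak = max(terms)
    return peak + log(sum(exp(t - peak) for t in terms))


# A table whose p-value is far too small for a double, but whose log is fine.
print(log_left_fisher(0, 5000, 5000, 20000))
```

The result stays a log-probability; converting it back with exp() is only
safe when the value is not too negative.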

best
leonoor

--
Leonoor van der Beek, vdbeek@let.rug.nl
http://odur.let.rug.nl/~vdbeek
Rijksuniversiteit Groningen, Informatiekunde
Pb 716, 9700 AS Groningen, The Netherlands
tel. +31.50.3635977, fax  +31.50.3636855

On Thu, 11 Nov 2004, ted pedersen wrote:
> 
> > Does anyone know a good Perl implementation of Fisher's Exact Test for
> > very skewed distributions (with frequencies ranging from 0 to 1000+)?
> >
> > I've tried the NSP-package (version 0.71), but it doesn't always give the
> > correct results. Has anyone noticed (or even better: fixed) this before?
> >
> 
> NSP exact tests work pretty well for skewed distributions. However, in
> general they assume that the data comes from ngram counts, and
> that is what has led to the problem above.
> 
> In particular, if we have a 2x2 table representing the bigram counts:
> 
> n11 n12 | n1p
> n21 n22 | n2p
> ---------
> np1 np2   npp
> 
> n11 represents the number of times w1 and w2 occur together, n12
> represents the number of bigrams where w1 is the first word and w2 is not,
> etc. Typically n22 is very large (since that represents the count of all
> the other bigrams in the sample that aren't w1 and w2). Of course n11 is
> much smaller than the sample size, making the distribution quite skewed.
> 
> Now, in the case of this user, the data is more like this:
> 
>  10 2 | 12
>   3 1 |  4
> -------
>  13 3   16
> 
> which aren't ngram counts, of course. So here n22 < n11, and that
> actually causes a problem for our exact test implementation! Implicit in
> our code is the faulty assumption that people would only be using this
> for collocations, and I'm glad to see I was wrong about that. :) But we
> should make this limitation more clearly known, and better yet we should
> just fix it, which I think we will!
> 
> This is discussed in a bit more detail below:
> 
> http://groups.yahoo.com/group/ngram/messages/15
> http://groups.yahoo.com/group/ngram/messages/17
> 
> Now in the case above, if the table is reorganized so that it is
> (equivalently) shown as below, everything is fine. So NSP exact tests
> (for now) require that n22 > n11.
> 
>   1 2   |  3
>   3 10  | 13
> -------
>   4 12    16
> 
> Cordially,
> Ted
> 
> PS NSP turns 4 years old on November 30.  Big party in Duluth, you are
> all invited. :)
> 
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
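
The reorganization described in the quoted message exchanges n11 and n22
while leaving n12 and n21 in place. A small Python check, written for this
note and independent of NSP, suggests why the two tables are equivalent for
a left-sided test: under the hypergeometric model assumed here, both give
the same p-value. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def left_fisher_table(n11, n12, n21, n22):
    """Left-sided Fisher p-value for a 2x2 table given as its four cells."""
    n1p, np1 = n11 + n12, n11 + n21
    npp = n11 + n12 + n21 + n22
    # Smallest top-left count that keeps the bottom-right cell non-negative.
    k_min = max(0, n1p + np1 - npp)
    denom = comb(npp, np1)
    return sum(
        (Fraction(comb(n1p, k) * comb(npp - n1p, np1 - k), denom)
         for k in range(k_min, n11 + 1)),
        Fraction(0),
    )


def swap_diagonal(n11, n12, n21, n22):
    """Exchange n11 and n22, keeping n12 and n21 where they are."""
    return n22, n12, n21, n11


original = (10, 2, 3, 1)                  # the user's table, n22 < n11
reorganized = swap_diagonal(*original)    # the table with n22 > n11
print(reorganized)
print(left_fisher_table(*original) == left_fisher_table(*reorganized))
```

Exchanging the diagonal cells is the same as swapping both rows and both
columns and then transposing, none of which changes the association being
tested.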


This archive was generated by hypermail 2b29 : Fri Nov 12 2004 - 16:38:19 MET