Re: [Corpora-List] fisher's exact test

From: ted pedersen (tpederse@d.umn.edu)
Date: Fri Nov 12 2004 - 03:02:54 MET


    > Does anyone know a good Perl implementation of Fisher's Exact Test for
    > very skewed distributions (with frequencies ranging from 0 to 1000+)?
    >
    > I've tried the NSP-package (version 0.71), but it doesn't always give the
    > correct results. Has anyone noticed (or even better: fixed) this before?
    >

    NSP exact tests work pretty well for skewed distributions. However, the
    code generally assumes that the data comes from ngram counts, and that
    assumption is what has led to the problem above.

    In particular, if we have a 2x2 table representing the bigram counts:

    n11 n12 | n1p
    n21 n22 | n2p
    ---------
    np1 np2 npp

    n11 represents the number of times w1 and w2 occur together, n12
    represents the number of bigrams where w1 is the first word and w2 is not
    the second, etc. Typically n22 is very large (since it represents the
    count of all the other bigrams in the sample, where the first word is not
    w1 and the second word is not w2). Of course n11 is much smaller than the
    sample size, making the distribution quite skewed.
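
    As a rough illustration (in Python for brevity, and not NSP's actual
    code), the four cells can be filled in from the counts one usually has
    for a bigram: the joint count n11, the marginals n1p and np1, and the
    sample size npp. The function name bigram_table and the example counts
    are made up for this sketch.

```python
def bigram_table(n11, n1p, np1, npp):
    """Fill in a 2x2 contingency table from the joint count and the marginals."""
    n12 = n1p - n11                # w1 is first, second word is not w2
    n21 = np1 - n11                # w2 is second, first word is not w1
    n22 = npp - n1p - np1 + n11    # first word is not w1, second is not w2
    if min(n11, n12, n21, n22) < 0:
        raise ValueError("counts are inconsistent with the marginals")
    return [[n11, n12], [n21, n22]]


# Hypothetical counts for a collocation in a 100,000-bigram sample.
table = bigram_table(n11=20, n1p=150, np1=300, npp=100000)
print(table)   # [[20, 130], [280, 99570]]
```

    With counts like these, n22 dwarfs n11, which is the situation the NSP
    exact tests were written for.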

    Now, in the case of this user, the data is more like this:

     10  2 | 12
      3  1 |  4
    -----------
     13  3   16

    which clearly isn't ngram data. So here n22 < n11, and that actually
    causes a problem for our exact test implementation! Implicit in our code
    is the faulty assumption that people would only be using this for
    collocations, and I'm glad to see I was wrong about that. :) But we
    should make this limitation more clearly known, and better yet we should
    just fix it, which I think we will!

    This is discussed in a bit more detail below:

    http://groups.yahoo.com/group/ngram/messages/15
    http://groups.yahoo.com/group/ngram/messages/17

    Now in the case above, if the table is reorganized so that it is
    (equivalently) shown as below, everything is fine. So NSP exact tests
    (for now) require that n22 > n11.

      1  2 |  3
      3 10 | 13
    -----------
      4 12   16
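
    To check that the two layouts really are equivalent, here is a small
    self-contained sketch of the exact test in Python. It is not the NSP
    implementation; it simply sums hypergeometric probabilities over every
    table with the same marginals, so it makes no assumption about which of
    n11 and n22 is larger. The names reorganize and fisher_exact are
    hypothetical, and exact fractions are used so the comparison is not
    blurred by rounding.

```python
from fractions import Fraction
from math import comb   # requires Python 3.8+


def reorganize(table):
    """Reflect the table so the smaller of n11 and n22 ends up in the n11 cell."""
    (n11, n12), (n21, n22) = table
    if n11 > n22:
        return [[n22, n12], [n21, n11]]
    return [[n11, n12], [n21, n22]]


def fisher_exact(table):
    """Return (left, right) one-sided p-values as exact fractions."""
    (n11, n12), (n21, n22) = table
    n1p, np1 = n11 + n12, n11 + n21
    npp = n1p + n21 + n22
    # Range of n11 values that are possible with these marginals.
    lo, hi = max(0, n1p + np1 - npp), min(n1p, np1)
    total = comb(npp, np1)
    prob = {x: Fraction(comb(n1p, x) * comb(npp - n1p, np1 - x), total)
            for x in range(lo, hi + 1)}
    left = sum(p for x, p in prob.items() if x <= n11)
    right = sum(p for x, p in prob.items() if x >= n11)
    return left, right


original = [[10, 2], [3, 1]]
flipped = reorganize(original)     # [[1, 2], [3, 10]]
print(fisher_exact(original))      # (Fraction(121, 140), Fraction(17, 28))
print(fisher_exact(flipped))       # (Fraction(121, 140), Fraction(17, 28))
```

    Both layouts give a left-sided p-value of 121/140 (about 0.864) and a
    right-sided one of 17/28 (about 0.607), so an implementation that sums
    over the full range of possible n11 values should not care about the
    order of the cells.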

    Cordially,
    Ted

    PS NSP turns 4 years old on November 30. Big party in Duluth, you are
    all invited. :)

    --
    Ted Pedersen
    http://www.d.umn.edu/~tpederse
    


