Re: [Corpora-List] fisher's exact test

From: Stefan Evert (evert@IMS.Uni-Stuttgart.DE)
Date: Fri Nov 12 2004 - 12:22:33 MET


    Hi Leonoor, hi Ted,

    one thing you have to be aware of is that you must use the rightFisher
    measure in NSP to obtain p-values for Fisher's exact test (the
    leftFisher values aren't directly meaningful in the context of
    statistical hypothesis tests).

    Fisher's test is known to be problematic for larger samples, especially
    with skewed distributions, and it is traditionally only applied to
    tables with very small numbers (as in Ted's example). That said, the
    NSP implementation (of rightFisher) uses a "naive" multiplicative
    algorithm, which should give the most accurate results you can
    normally hope to get. This leaves two possible problems:

    a) The naive implementation can be excruciatingly slow for large
    marginal frequencies (I typically have tables where n = 10^6 and np1
    and n1p can be greater than 1000), especially since it's written in
    pure Perl.

    b) In such extreme cases, you might even get an underflow error, when
    the computed p-values are below 10^{-260} or so (I've observed values
    as small as 10^{-10000} for cooccurrence data!).
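
    To illustrate point b), here is a minimal Python sketch (not the NSP
    or UCS code) of the right-sided p-value computed in log space, which
    is one common way to sidestep underflow. It assumes the NSP notation
    (n11 = joint frequency, n1p and np1 = marginal frequencies, n = sample
    size); the function names are made up for this example.

```python
import math

def log_choose(n, k):
    # Natural log of the binomial coefficient C(n, k), via log-gamma.
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_hypergeom_pmf(k, n1p, np1, n):
    # Log-probability of observing joint frequency k under independence,
    # given fixed marginals n1p, np1 and sample size n.
    return log_choose(n1p, k) + log_choose(n - n1p, np1 - k) - log_choose(n, np1)

def right_fisher_log10(n11, n1p, np1, n):
    # log10 of the right-sided p-value P(X >= n11).
    # Hypothetical helper for illustration, not an NSP or UCS function.
    k_min = max(n11, n1p + np1 - n, 0)   # stay inside the valid support
    k_max = min(n1p, np1)
    terms = [log_hypergeom_pmf(k, n1p, np1, n) for k in range(k_min, k_max + 1)]
    # Log-sum-exp: factor out the largest term so the sum cannot underflow.
    m = max(terms)
    log_p = m + math.log(sum(math.exp(t - m) for t in terms))
    return log_p / math.log(10)

# Small table: the p-value can be exponentiated safely.
print(10 ** right_fisher_log10(3, 4, 4, 8))
# Extreme table: only the logarithm is representable as a double.
print(right_fisher_log10(1000, 1000, 1000, 10**6))
```

    The function returns a logarithm rather than the p-value itself,
    because for tables like the second one the p-value is far too small
    to be stored as a double-precision number.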

    You might want to consider using the log-likelihood measure (ll)
    instead, which gives a very good approximation to the exact p-values
    of Fisher's test, is easy to compute, and is numerically stable.
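
    For reference, a minimal Python sketch of the log-likelihood
    statistic (G2) for a 2x2 table, again assuming the NSP notation
    (n11, n1p, np1, n). The function names are made up for this example,
    and the conversion to a p-value uses the usual asymptotic chi-squared
    distribution with one degree of freedom, which is two-sided.

```python
import math

def log_likelihood_g2(n11, n1p, np1, n):
    # Observed cell counts reconstructed from the joint and marginal frequencies.
    observed = [
        n11,                    # n11
        n1p - n11,              # n12
        np1 - n11,              # n21
        n - n1p - np1 + n11,    # n22
    ]
    # Expected counts under independence: row total * column total / n.
    expected = [
        n1p * np1 / n,
        n1p * (n - np1) / n,
        (n - n1p) * np1 / n,
        (n - n1p) * (n - np1) / n,
    ]
    # G2 = 2 * sum O * ln(O / E); empty cells contribute nothing.
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

def g2_pvalue(g2):
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    # Note: this itself underflows to 0.0 for very large G2 values.
    return math.erfc(math.sqrt(g2 / 2.0))

g2 = log_likelihood_g2(3, 4, 4, 8)
print(g2, g2_pvalue(g2))
```

    For ranking word pairs, the G2 score can be used directly, so the
    conversion to a p-value is only needed when a significance threshold
    is wanted.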

    If you really want to use Fisher's test on large samples, there's an
    implementation in my UCS toolkit (sorry for the shameless plug :o),
    which uses statistical functions from R (www.r-project.org) and a Perl
    wrapper to get accurate values even in extreme cases (at least I hope
    it does, I haven't really smoke-tested it yet). It makes the same
    assumption as NSP, though, that n11 < n22 (or rather, it even assumes
    that n11 is small compared to n). If you're interested, you can
    download the UCS toolkit from

    http://www.collocations.de/software.html

    The installation isn't quite as simple as with NSP (since UCS has
    additional requirements), but it is known to run on Linux, Mac OS X,
    and experimentally in a Cygwin environment on Windows. The good news
    is that you can easily import NSP data sets for bigram data. :o)

    Best wishes,
    Stefan

    > Will you send me the input you are giving to Fisher's Left test? I think
    > that's the easiest way to figure things out!
    >
    > Cordially,
    > Ted (of NSP :)
    >
    > On Thu, 11 Nov 2004, Beek L.J.van der wrote:
    >
    > >
    > > Does anyone know a good Perl implementation of Fisher's Exact Test for
    > > very skewed distributions (with frequencies ranging from 0 to 1000+)?
    > >
    > > I've tried the NSP-package (version 0.71), but it doesn't always give the
    > > correct results. Has anyone noticed (or even better: fixed) this before?
    > >
    > > thanks,
    > > Leonoor
    > >
    > > --
    > > Leonoor van der Beek, vdbeek@let.rug.nl
    > > http://odur.let.rug.nl/~vdbeek
    > > Rijksuniversiteit Groningen, Informatiekunde
    > > Pb 716, 9700 AS Groningen, The Netherlands
    > > tel. +31.50.3635977, fax +31.50.3636855
    > >
    > >
    >
    > --
    > Ted Pedersen
    > http://www.d.umn.edu/~tpederse
    >

    -- 
    ______________________________________________________________________
    Stefan Evert                                     purl.org/stefan.evert
    http://www.collocations.de/                             schtepf@gmx.de
    


