[Corpora-List] Mann-Whitney ranks test

From: Don Hardy (Don.Hardy@Colostate.edu)
Date: Mon Nov 22 2004 - 23:15:01 MET

    Hi, everyone:

    Can anyone advise me as to the use of the Mann-Whitney ranks test to
    determine lexical differences between a homogeneous collection (such as
    about 260,000 words of a single author) and a heterogeneous corpus (such
    as the fiction subcorpora of the Brown Corpus)? Or perhaps can anyone
    point me in the direction of a good resource that discusses the issue?
    I had thought about splitting each corpus into segments of about 20,000
    words and then running Mann-Whitney tests against lexical items of
    interest (body parts in particular). Having read Adam Kilgarriff
    ("Comparing Corpora," 2001, and others), I know that with heterogeneous
    corpora the Mann-Whitney goes some way towards countering the ease with
    which the null hypothesis is rejected for high-frequency words, an ease
    that makes hypothesis testing with chi-square or log-likelihood
    inappropriate. If a homogeneous corpus is split into 20,000-word
    adjacent segments for the Mann-Whitney, isn't it likely that the
    bunchiness characteristic will still be present in the homogeneous
    samples? And, furthermore, is it appropriate to use statistical tables
    to test the null hypothesis given that the samples from the homogeneous
    corpus are all from the same author while the heterogeneous samples are
    from different authors, on average about 10 different ones per
    20,000-word sample?
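
    The procedure described above (split each corpus into adjacent
    segments, count the items of interest per segment, then rank the
    pooled counts) might be sketched roughly as follows. This is only an
    illustration under stated assumptions: the function names
    segment_counts and mann_whitney_u are hypothetical, tokenization is
    assumed to have been done already, and the code computes the U
    statistic only. It does not settle whether adjacent segments from a
    single author can be treated as independent samples.

```python
def segment_counts(tokens, targets, size=20000):
    """Count target-word hits in each full, adjacent segment of `size` tokens.

    A trailing partial segment is dropped so that all segments are the same
    length (an assumption made here for comparability).
    """
    targets = {t.lower() for t in targets}
    counts = []
    for start in range(0, len(tokens) - size + 1, size):
        segment = tokens[start:start + size]
        counts.append(sum(1 for tok in segment if tok.lower() in targets))
    return counts


def mann_whitney_u(xs, ys):
    """Return (U_x, U_y) for two samples, assigning midranks to tied values."""
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j; give each the average rank.
        midrank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = midrank
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_x, n_y = len(xs), len(ys)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2.0
    u_y = n_x * n_y - u_x
    return u_x, u_y
```

    With per-segment counts for the single-author collection and for the
    Brown fiction segments, the smaller of the two U values would then be
    compared against a table of critical values (or a normal approximation
    for larger numbers of segments), which is exactly the step whose
    appropriateness is in question here.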

    Many thanks,

    Don

    -- 
    

    Don.Hardy@Colostate.edu http://textant.colostate.edu


