Re: [Corpora-List] Statistical tests for corpus studies

From: Adam Kilgarriff (adam.kilgarriff@itri.brighton.ac.uk)
Date: Thu May 08 2003 - 17:19:28 MET DST


    Rayson, Paul wrote:

    >But there is a problem with the Mann-Whitney test: too many zeros in the slices, as your IJCL paper points out, Adam. For example, in the LOB and Brown comparison, only words with a frequency of 30 or more (in the joint corpus) had few enough zeros for the test to be applicable. This means that 92% of the word types in the joint corpus were omitted from the comparison.
    >
    But if there isn't enough data, we shouldn't be drawing any inferences,
    so that seems right. A name or technical term that is used many
    times, but in only one or two documents, is not a good basis for
    inference. (Some thought has to be given to slice size and to how
    the corpus is sliced up, since both affect the number of
    non-zero values available to the test.)
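    To make the slice-and-zeros point concrete, here is a rough sketch,
    not the procedure from the paper. It cuts a token list into
    equal-sized slices, counts one word per slice, reports the share of
    zero slices, and computes a Mann-Whitney U statistic with mid-ranks
    for ties. The function names and the slice size are hypothetical,
    chosen only for illustration.

```python
def slice_counts(tokens, word, slice_size):
    """Count `word` in consecutive, equal-sized slices of `tokens`.

    Assumption: an incomplete final slice is simply dropped.
    """
    n_slices = len(tokens) // slice_size
    return [
        tokens[i * slice_size:(i + 1) * slice_size].count(word)
        for i in range(n_slices)
    ]


def zero_fraction(counts):
    """Share of slices in which the word does not occur at all."""
    if not counts:
        raise ValueError("no slices to inspect")
    return sum(1 for c in counts if c == 0) / len(counts)


def mann_whitney_u(xs, ys):
    """U statistic for sample xs against ys, using mid-ranks for ties."""
    values = sorted(list(xs) + list(ys))
    ranks = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        # Tied values share the mean of the ranks i+1 .. j.
        ranks[values[i]] = (i + 1 + j) / 2
        i = j
    rank_sum_x = sum(ranks[v] for v in xs)
    return rank_sum_x - len(xs) * (len(xs) + 1) / 2


# Toy illustration: a word concentrated in one stretch of corpus A
# yields mostly zero slices, so most ranks are tied.
corpus_a = ["x"] * 20 + ["term"] * 5 + ["x"] * 75
corpus_b = ["x"] * 100
counts_a = slice_counts(corpus_a, "term", 10)
counts_b = slice_counts(corpus_b, "term", 10)
print(zero_fraction(counts_a), mann_whitney_u(counts_a, counts_b))
```

    When nearly every slice count is zero, almost all ranks are tied and
    the statistic carries very little information, which is why a
    frequency threshold ends up being applied. For p-values, a library
    routine such as scipy.stats.mannwhitneyu would be the usual choice.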

    A couple of people asked for an e-version of the 'Comparing Corpora' paper; see

    http://www.itri.bton.ac.uk/~Adam.Kilgarriff/publications.html#2001

    Adam

    -- 
    

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    Adam Kilgarriff
    ITRI, University of Brighton              tel: (44) 1273 642919
    Lewes Road, Brighton BN2 4GJ, UK          fax: (44) 1273 642908
    adam@itri.bton.ac.uk    http://www.itri.bton.ac.uk/~Adam.Kilgarriff
    and
    Lexicography MasterClass Ltd.
    71 Freshfield Road, Brighton BN2 0BL, UK  tel: (44) 1273 705773
    adam@lexmasterclass.com    http://www.lexmasterclass.com
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


