RE: Corpora: statistics in learner English

From: Rayson, Paul (rayson@exchange.lancs.ac.uk)
Date: Fri Jan 18 2002 - 13:03:19 MET


    Dear Guo, Przemek,

    I would suggest using the log-likelihood statistic (sometimes called the
    likelihood ratio) as an alternative to chi-squared. It can be calculated even
    for words with low observed or expected frequencies.

    You can calculate the LL value for the lemma as well as for the variants. You
    don't mention the size of the corpora, so I've assumed each one is a million
    words. It's the ratio of the two corpus sizes that is important, I think.

    > >
    > > Learner corpus
    > > keep 348 (88.5%)
    > > keeps 15 (3.8%)
    > > keeping 9 (2.3%)
    > > kept 21 (5.4%)
    > > Total 392 (100%)
    > >
    > > Native speaker corpus
    > > keep 99 (58.2%)
    > > keeps 14 (8.2%)
    > > keeping 32 (18.8%)
    > > kept 25 (14.7%)
    > > Total 170 (99.9%)
    >
    LL values, rounded to zero decimal places and calculated relative to the
    corpus size rather than the lemma total:

    Lemma KEEP: LL = 90
    keep 147
    keeps 0
    keeping 14
    kept 0

    This shows that "keep" is significantly overused and "keeping" is significantly
    underused in the learner corpus relative to the native speaker corpus. But of
    course the fact that the lemma as a whole is overused is an important factor to
    consider in your studies.
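
    As a rough sketch of how the figures above can be obtained, here is the
    two-corpus LL calculation in Python, following the formula in the paper
    cited below: the expected frequency in each corpus is the corpus size times
    the combined frequency divided by the combined size, and LL is twice the sum
    of observed * ln(observed / expected). The function name log_likelihood and
    the constant CORPUS_SIZE are only illustrative, and the one-million-word
    size is my assumption, not something from your data.

```python
import math


def log_likelihood(freq1, freq2, size1, size2):
    """Two-corpus log-likelihood for one word or lemma.

    freq1, freq2: observed frequencies in corpus 1 and corpus 2.
    size1, size2: total number of words in each corpus.
    """
    total = freq1 + freq2
    # Expected frequencies if the item were equally common in both corpora.
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)

    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Assumption: both corpora contain one million words.
CORPUS_SIZE = 1_000_000

# (learner frequency, native speaker frequency) from the quoted table.
counts = {
    "KEEP (lemma)": (392, 170),
    "keep": (348, 99),
    "keeps": (15, 14),
    "keeping": (9, 32),
    "kept": (21, 25),
}

results = {
    item: round(log_likelihood(learner, native, CORPUS_SIZE, CORPUS_SIZE))
    for item, (learner, native) in counts.items()
}

for item, ll in results.items():
    print(f"{item}: LL = {ll}")
```

    With equal corpus sizes this should reproduce the rounded values listed
    above. Because the expected frequencies depend only on each corpus's share
    of the combined size, multiplying both sizes by the same factor leaves LL
    unchanged. Note that LL itself does not say which corpus has the higher
    rate; for that, compare the relative frequencies.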

    For more details on log-likelihood, see:

    Rayson, P. and Garside, R. (2000). Comparing corpora using frequency profiling.
    In Proceedings of the Workshop on Comparing Corpora, held in conjunction with
    the 38th Annual Meeting of the Association for Computational Linguistics (ACL
    2000), 1-8 October 2000, Hong Kong, pp. 1-6.
    http://www.comp.lancs.ac.uk/computing/users/paul/publications/rg_acl2000.pdf

    I also have an online LL calculator:
    http://lingo.lancs.ac.uk/llwizard.html

    Regards,
    Paul.


