Re: Corpora: statistics in learner English

From: P. Kaszubski (przemka@amu.edu.pl)
Date: Fri Jan 18 2002 - 11:24:44 MET

    Hello,

    I'm not a statistician but ...

    On 17 Jan 2002 at 23:06, xiaotian guo wrote:

    > Currently, I am comparing the frequency data of a learner corpus and a native
    > speaker corpus (they have approximately the same size) and have some
    > statistical queries. For example: for the verb KEEP, I have got the frequency
    > of each verb form in the two corpora as follows:
    >
    > Learner corpus
    > keep 348 (88.5%)
    > keeps 15 (3.8%)
    > keeping 9 (2.3%)
    > kept 21 (5.4%)
    > Total 393 (100%)
    >
    > Native speaker corpus
    > keep 99 (58.2%)
    > keeps 14 (8.2%)
    > keeping 32 (18.8%)
    > kept 25 (14.7%)
    > Total 170 (99.9%)
    >
    > According to the percentage each form takes in its respective corpus, I can
    > easily see a large difference between the use of "keep" in the learner corpus and
    > that in the native speaker corpus (88.5%:58.2%). But one problem with my
    > interpretation is "Why do you think this difference (88.5%:58.2%) is
    > significant and other differences are not?"

    It is easy to see that these frequencies are the highest, so any statistical testing will be
    more reliable for them than for the other forms. BTW - what are you comparing: (normalized)
    frequencies or within-corpus percentages? To my mind you really should do both.
    Also, do you take account of multi-word expressions with KEEP (of which there are
    quite a few)? And what kind of ultimate answer do you expect to get from comparing one
    lemma's frequency profiles in two corpora? Often your methodology will be linked to
    what and how much you want to be able to conclude at the end...
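
    To make the "do both" point concrete, here is a minimal sketch that computes
    within-corpus percentages and normalized (per-million-word) frequencies side by
    side. The counts are the ones quoted above; the corpus size is NOT given in the
    message, so ASSUMED_CORPUS_SIZE is a purely hypothetical placeholder, and the
    function names are mine.

```python
# Counts per verb form of KEEP, copied from the quoted message.
learner = {"keep": 348, "keeps": 15, "keeping": 9, "kept": 21}
native = {"keep": 99, "keeps": 14, "keeping": 32, "kept": 25}


def within_corpus_percentages(counts):
    """Share of each form among all tokens of the lemma, in percent."""
    total = sum(counts.values())
    return {form: 100.0 * n / total for form, n in counts.items()}


def per_million(count, corpus_size):
    """Frequency normalized to occurrences per million corpus tokens."""
    return 1_000_000 * count / corpus_size


learner_pct = within_corpus_percentages(learner)
native_pct = within_corpus_percentages(native)

# HYPOTHETICAL corpus size: the message only says the two corpora are of
# roughly equal size, so this figure is a placeholder, not real data.
ASSUMED_CORPUS_SIZE = 500_000

for form in learner:
    l_pmw = per_million(learner[form], ASSUMED_CORPUS_SIZE)
    n_pmw = per_million(native[form], ASSUMED_CORPUS_SIZE)
    print(
        f"{form:8s} learner {learner_pct[form]:5.1f}% ({l_pmw:6.1f} pmw)   "
        f"native {native_pct[form]:5.1f}% ({n_pmw:6.1f} pmw)"
    )
```

    The two views can tell different stories: a form may take a smaller share of the
    lemma in one corpus and still occur about as often per million words there,
    simply because the lemma as a whole is more frequent in that corpus.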

    > I would think there is no way to
    > answer this question by means of some statistical help because it really
    > depends on individual circumstances and it will be difficult if not impossible
    > to give a demarcation for this kind of comparison. But to make sure about
    > this point, I would like to raise this question to the list members.
    >
    > Someone suggested "chi-square" to me. But after some initial reading, I found
    > it can only review the relationship between the observed frequency and
    > expected frequency and it is based on the null hypothesis. It can only tell me
    > whether there is a significant difference as a whole rather than
    > individually concerning the use of the different forms of KEEP in the two
    > corpora. It seems it cannot answer the question I have: why do you think the
    > use of the base form "keep" is significantly different?
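
    On the "as a whole rather than individually" point, one common follow-up (my
    suggestion, not something proposed in the quoted message) is to look at the
    adjusted standardized residual of each cell after the overall Pearson chi-square
    test. A minimal sketch on the 2 x 4 table of counts quoted above, in plain
    Python; the function name is mine, 7.815 is the usual 5% critical value for
    3 degrees of freedom, and residuals beyond roughly +/-1.96 are conventionally
    read as cells that depart from the expected frequency.

```python
import math

FORMS = ["keep", "keeps", "keeping", "kept"]
learner = [348, 15, 9, 21]   # counts from the quoted message
native = [99, 14, 32, 25]


def chi_square_with_residuals(row_a, row_b):
    """Pearson chi-square for a 2 x k table plus adjusted residuals per cell."""
    table = [row_a, row_b]
    row_totals = [sum(row) for row in table]
    col_totals = [a + b for a, b in zip(row_a, row_b)]
    grand = sum(row_totals)

    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of identical profiles.
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
            # Adjusted standardized residual for this cell.
            variance = (
                expected
                * (1 - row_totals[i] / grand)
                * (1 - col_totals[j] / grand)
            )
            res_row.append((observed - expected) / math.sqrt(variance))
        residuals.append(res_row)
    return chi2, residuals


chi2, residuals = chi_square_with_residuals(learner, native)
CRITICAL_05_DF3 = 7.815  # chi-square critical value, alpha = 0.05, df = 3

print(f"chi-square = {chi2:.2f} (df = 3), significant: {chi2 > CRITICAL_05_DF3}")
for form, res in zip(FORMS, residuals[0]):
    print(f"{form:8s} learner adjusted residual = {res:+.2f}")
```

    The sign of a residual shows the direction (positive means the learner corpus
    has more of that form than expected). The usual caveats about chi-square on
    corpus data still apply to these per-cell figures.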

    Chi-square is often fallible, that is true, especially when you compare high-frequency
    words, which almost always display significant differences. Years ago, I tried to follow
    Adam Kilgarriff's suggestion to use a variation of the Mann-Whitney ranks test after
    slicing my corpora into same-sized 'subcorpora' and then calculating frequencies from
    all of them, ordering them by rank and conducting the test. However, in order to be
    able to do this one needs sizeable learner & native corpora in the first place. More
    details in:

    Kilgarriff, A. 1996. "Comparing word frequencies across corpora: Why chi-square
    doesn't work, and an improved LOB-Brown comparison" In Proceedings from ALLC-
    ACH'96: 169-172.
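
    For illustration only, the slicing-and-ranking idea can be sketched as below.
    This is the plain Mann-Whitney U test with a normal approximation (no tie
    correction in the variance), not necessarily the exact variation described in
    the paper, and the per-subcorpus counts are INVENTED placeholders: real values
    would come from counting the word in each same-sized slice of each corpus.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U, z): the smaller U statistic and its normal-approximation z."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    u = min(u_a, u_b)
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return u, (u - mean_u) / sd_u


# HYPOTHETICAL counts of one word form in ten same-sized slices per corpus.
learner_slices = [36, 41, 33, 38, 35, 40, 37, 34, 39, 42]
native_slices = [9, 12, 8, 11, 10, 7, 13, 10, 9, 11]

u, z = mann_whitney_u(learner_slices, native_slices)
print(f"U = {u}, z = {z:.2f}")
```

    Because the test works on ranks of per-slice counts, a word that is frequent in
    only one or two slices does not dominate the result the way it can with a
    chi-square on pooled totals.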

    > [...] problem I just raised and try to
    > detect differences in two corpora as a whole, what is the best statistical
    > method to use? Oakes pointed out the weakness of Chi-square in Statistics in
    > Corpus Linguistics:
    >
    > The Chi-square test is used for the comparison of frequency data. Kilgarriff
    > has shown that this test should be modified when working with corpus data,
    > since the null hypothesis is always rejected when working with
    > high-frequency words.
    >
    As above.

    BTW, at least in applied linguistics many scholars give up the idea of using precise
    statistical metrics because the reliability of the "significance" of the results can often
    be called into question - there are just so many variables involved... (sample size,
    topic comparability, author age etc.). Many of us simply take the percentages and
    frequencies and comment upon them.

    Hope you find this helpful enough.

    Przemek

    =======================================
    Dr Przemyslaw Kaszubski
    t: +48 61 8293515
    e: przemka@amu.edu.pl
    w: http://elex.amu.edu.pl/ifa/staff/kaszubski.html

    (ENGLISH) LEARNER CORPORA PAGE:
    http://main.amu.edu.pl/~przemka

    COMPREHENSIVE CORPORA BIBLIOGRAPHY:
    http://main.amu.edu.pl/~przemka/welcome.html#Corpbibl

    School of English
    Adam Mickiewicz University
    Al. Niepodleglosci 4
    61-874 Poznan
    t: +48 61 8293506
    f: +48 61 8523103
    w: http://elex.amu.edu.pl/ifa
    =======================================


