Corpora: statistics in learner English

From: xiaotian guo (
Date: Thu Jan 17 2002 - 16:06:10 MET

    Dear All

    First, let me thank all those who replied to my request about
    "overuse and underuse of learner English" a couple of weeks ago.

    Currently, I am comparing the frequency data of a learner corpus and a native
    speaker corpus (they are approximately the same size) and have some
    statistical queries. For example, for the verb KEEP, I have the frequency
    of each verb form in the two corpora as follows:

    Learner corpus
    keep 348 (88.5%)
    keeps 15 (3.8%)
    keeping 9 (2.3%)
    kept 21 (5.4%)
    Total 392 (100%)

    Native speaker corpus
    keep 99 (58.2%)
    keeps 14 (8.2%)
    keeping 32 (18.8%)
    kept 25 (14.7%)
    Total 170 (99.9%)
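
    As a quick check of the arithmetic, the shares can be recomputed from the
    raw counts. This is only a sketch: the dictionary layout and the function
    name are illustrative assumptions. Note that the four learner counts as
    listed sum to 393, whereas the stated total is 392; the sketch uses the
    listed counts as given.

```python
# Counts copied from the two tables in the post; the keys are just labels.
counts = {
    "learner": {"keep": 348, "keeps": 15, "keeping": 9, "kept": 21},
    "native": {"keep": 99, "keeps": 14, "keeping": 32, "kept": 25},
}


def shares(forms):
    """Return each form's percentage of the lemma total, rounded to 1 decimal."""
    total = sum(forms.values())
    return {form: round(100 * n / total, 1) for form, n in forms.items()}


# Totals and percentages are derived from the listed counts, not from the
# totals stated in the post.
totals = {corpus: sum(forms.values()) for corpus, forms in counts.items()}
percentages = {corpus: shares(forms) for corpus, forms in counts.items()}

for corpus in counts:
    print(corpus, totals[corpus], percentages[corpus])
```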

    From the percentage each form accounts for in its respective corpus, I can
    easily see a large difference between the use of "keep" in the learner corpus
    and that in the native speaker corpus (88.5% vs. 58.2%). But one objection to
    my interpretation is: "Why do you think this difference (88.5% vs. 58.2%) is
    significant and other differences are not?" I would think there is no way to
    answer this question with the help of statistics, because it really
    depends on individual circumstances and it would be difficult, if not
    impossible, to draw a dividing line for this kind of comparison. But to make
    sure about this point, I would like to put the question to the list members.

    Someone suggested "chi-square" to me. But after some initial reading, I found
    that it can only reveal the relationship between the observed frequencies and
    the expected frequencies, and that it is based on a null hypothesis. It can
    only tell me whether there is a significant difference overall, rather than
    for each individual form of KEEP in the two corpora. It seems it cannot
    answer the question I have: why do you think the use of the base form "keep"
    is significantly different?
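
    To make the overall-versus-individual distinction concrete, here is a
    minimal sketch of a Pearson chi-square computed on the 2 x 4 table above,
    together with the per-cell Pearson residuals (observed minus expected,
    divided by the square root of expected). Looking at residuals is one
    possible way to see which forms contribute most to the overall statistic;
    it is offered as an assumption about how the question might be approached,
    not as a settled answer. The function name is hypothetical, and 7.815 is
    the usual tabulated critical value for 3 degrees of freedom at p = 0.05.

```python
import math

forms = ["keep", "keeps", "keeping", "kept"]
observed = {
    "learner": [348, 15, 9, 21],
    "native": [99, 14, 32, 25],
}


def chi_square_with_residuals(table):
    """Return (chi-square statistic, Pearson residual per cell) for a count table."""
    row_totals = {row: sum(vals) for row, vals in table.items()}
    n_cols = len(next(iter(table.values())))
    col_totals = [sum(vals[j] for vals in table.values()) for j in range(n_cols)]
    grand_total = sum(row_totals.values())

    chi2 = 0.0
    residuals = {}
    for row, vals in table.items():
        residuals[row] = []
        for j, obs in enumerate(vals):
            # Expected count under the null hypothesis of equal proportions.
            exp = row_totals[row] * col_totals[j] / grand_total
            chi2 += (obs - exp) ** 2 / exp
            residuals[row].append((obs - exp) / math.sqrt(exp))
    return chi2, residuals


chi2, residuals = chi_square_with_residuals(observed)

CRITICAL_DF3_P05 = 7.815  # tabulated chi-square critical value, df = 3
print(f"chi-square = {chi2:.2f}, exceeds critical value: {chi2 > CRITICAL_DF3_P05}")
for corpus, row in residuals.items():
    for form, r in zip(forms, row):
        print(f"{corpus:8s} {form:8s} residual = {r:+.2f}")
```

    A positive residual means the form occurs more often than the null
    hypothesis predicts in that corpus, a negative one less often. The
    residuals are descriptive only; any cell-by-cell significance claim would
    still need a correction for multiple comparisons.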

    My other query is this: if I set aside the problem I just raised and try to
    detect differences between the two corpora as a whole, what is the best
    statistical method to use? Oakes points out a weakness of chi-square in
    Statistics in Corpus Linguistics:

    The Chi-square test is used for the comparison of frequency data. Kilgarriff
    has shown that this test should be modified when working with corpus data,
    since the null hypothesis is always rejected when working with
    high-frequency words.

    I wonder whether there is another test which could help with corpora.
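
    One candidate that is often mentioned for corpus frequency comparisons is
    the log-likelihood ratio statistic (G2). The sketch below applies it to
    the same 2 x 4 table, purely as an illustration; the function name is
    hypothetical. G2 is compared against the same chi-square distribution, so
    it does not by itself remove the sensitivity to large counts described in
    the quotation above.

```python
import math

observed = {
    "learner": [348, 15, 9, 21],
    "native": [99, 14, 32, 25],
}


def log_likelihood_g2(table):
    """Return G2 = 2 * sum(O * ln(O / E)) over all cells of a count table."""
    row_totals = {row: sum(vals) for row, vals in table.items()}
    n_cols = len(next(iter(table.values())))
    col_totals = [sum(vals[j] for vals in table.values()) for j in range(n_cols)]
    grand_total = sum(row_totals.values())

    g2 = 0.0
    for row, vals in table.items():
        for j, obs in enumerate(vals):
            if obs == 0:
                continue  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
            exp = row_totals[row] * col_totals[j] / grand_total
            g2 += obs * math.log(obs / exp)
    return 2 * g2


g2 = log_likelihood_g2(observed)
print(f"G2 = {g2:.2f} (df = 3)")
```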

    With thanks

    Guo Xiaotian

    This archive was generated by hypermail 2b29 : Thu Jan 17 2002 - 16:07:52 MET