Corpora: type/token ratio

From: De Cock Sylvie (decock@lige.ucl.ac.be)
Date: Wed Jan 16 2002 - 14:04:18 MET

    Dear List members,

    I'm working on recurrent sequences of words in learner and native speaker
    writing (NS corpus: 106,112 words, NNS corpus: 100,575) and have a question
    regarding the use of the type/token ratio to measure word combination
    variation. As the 'standard' type/token ratio is not reliable when
    comparing corpora of different sizes, I have used the log type/token ratio
    (log types / log tokens), as it is thought to remain constant for samples
    of different sizes (Herdan? 1960: 26).
    I have a niggling worry ... I calculated both the 'standard' type/token
    ratio and the log type/token ratio (for NS and learner 2-, 3-, 4- and
    5-word combinations) and found that the results for 5-word combinations
    didn't go in the same 'direction' (see below). Should I trust the log
    type/token ratio? Any help or suggestions would be welcome.

    Results for 5-word combinations:
    NS types: 46
    NS tokens: 161

    NNS types: 79
    NNS tokens: 289

    NS standard type/token ratio: 0.285714
    NS log type/token ratio: 0.753461
    NNS standard type/token ratio: 0.273356
    NNS log type/token ratio: 0.771111
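
    For anyone wishing to check the figures above, here is a minimal sketch
    in Python. It assumes the log type/token ratio is computed as
    log(types) / log(tokens), which is what the reported values correspond
    to; the function names are illustrative only and not taken from any
    particular tool.

```python
import math


def standard_ttr(types: int, tokens: int) -> float:
    """Standard type/token ratio: types divided by tokens."""
    return types / tokens


def log_ttr(types: int, tokens: int) -> float:
    """Log type/token ratio, assumed here to be log(types) / log(tokens).

    The base of the logarithm cancels out in the ratio.
    Requires tokens > 1 so that the denominator is non-zero.
    """
    return math.log(types) / math.log(tokens)


# Counts for 5-word combinations, as given in the message: (types, tokens).
counts = {"NS": (46, 161), "NNS": (79, 289)}

results = {}
for label, (types, tokens) in counts.items():
    results[label] = (standard_ttr(types, tokens), log_ttr(types, tokens))
    std, log_ratio = results[label]
    print(f"{label}: standard = {std:.6f}, log = {log_ratio:.6f}")
```

    Run on these counts, the sketch gives a higher standard ratio for NS
    than for NNS but a lower log ratio, which is the difference in
    'direction' described above.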

    Thank you very much in advance.
    Best wishes

    Sylvie De Cock
    Université catholique de Louvain
    Collège Erasme
    1, Place Blaise Pascal
    B-1348 Louvain-la-Neuve
    BELGIUM


