Corpora: size of training corpus and tagset size

From: Sofie J K
Date: Wed Jan 09 2002 - 12:14:50 MET

    Dear readers of the corpora list,

    1) I have a question concerning the relation between the required
    minimum size of a training corpus and the performance of the tagger
    being trained. In an article I found a reference to:

    J. M. Baker, 1982, The Performing Arts - How to measure up! Proceedings
    of the NBS Workshop on Standardisation for Speech I/O Technology, pp

    In this paper Baker describes the minimal number of test tokens using
    the following formula:

    n = 4 x 10^4 + log 1/x

    where "x" is the error rate of the tagger.
    My question is whether this is the only such formula, or whether there
    are other formulas for computing the required size of a training corpus.

    2) My second question concerns studies of the relation between a
    tagger's performance and the size of its tagset. I have found two
    articles so far: (Zavrel and Daelemans, 1999, Recent Advances in
    Memory-Based Part-of-Speech Tagging, Tilburg University) and
    (Elworthy, 1995, Tagset Design and Inflected Languages, Sharp
    Laboratories of Europe Ltd, Oxford).
    Does anyone know of any other studies?

    Best regards,

    Sofie Johansson Kokkinakis

    * Sofie Johansson Kokkinakis*
    * Systemanalyst/ Ph.D Student    *
    * Språkdata, Inst. för svenska språket  Tel: +46 (0)31 773 5281        *
    * (Dept. of Swedish Language)           Fax: +46 (0)31 773 4455        *
    * Göteborgs universitet, Box 200        SE 405 30 GÖTEBORG, Sweden     *
    *       Computers are not intelligent. They just think they are.       *
