Corpora: size of training corpus and tagset size

From: Sofie J K (Sofie.Johansson.Kokkinakis@svenska.gu.se)
Date: Wed Jan 09 2002 - 12:14:50 MET

Dear readers of the corpora list,

1) I have a question concerning the relation between the required
minimum size of a training corpus and the performance of the tagger
being trained. In an article I found a reference to:

J. M. Baker, 1982, The Performing Arts - How to measure up! Proceedings
of the NBS Workshop on Standardisation for Speech I/O Technology, pp
25-33.

In this paper Baker describes the minimal number of test tokens using
the following formula:

n = 4 x 10^(4 + log(1/x))

where "x" is the error rate of the tagger.
My question is whether this is the only such formula, or whether there
are other formulas for computing the required size of a training corpus.
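
To make the question concrete, here is a minimal sketch of how the formula
could be evaluated. It assumes the formula is meant as
n = 4 x 10^(4 + log(1/x)) with a base-10 logarithm, which reduces to
4 x 10^4 / x. Both the placement of the exponent and the base of the
logarithm are assumptions, since the plain-text rendering is ambiguous.
The function name min_test_tokens is hypothetical, not taken from Baker's
paper.

```python
import math


def min_test_tokens(error_rate: float) -> int:
    """Evaluate n = 4 * 10**(4 + log10(1/x)) for an error rate x.

    Assumption: the exponent is (4 + log10(1/x)) and the log is base 10,
    so the result equals 4 * 10**4 / x, rounded to the nearest integer.
    """
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    exponent = 4 + math.log10(1.0 / error_rate)
    return round(4 * 10 ** exponent)


if __name__ == "__main__":
    # Print the estimate for a few illustrative error rates.
    for x in (0.10, 0.05, 0.01):
        print(f"error rate {x:.2f}: about {min_test_tokens(x):,} test tokens")
```

Under this reading the required number of test tokens grows in inverse
proportion to the error rate: halving the error rate doubles n.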

2) I also have a second question, concerning studies of the relation
between a tagger's performance and the size of its tagset.
I have found two articles so far (Zavrel and Daelemans, 1999, Recent
Advances in Memory-Based Part-of-Speech Tagging, Tilburg University) and
(Elworthy, 1995, Tagset Design and Inflected Languages, Sharp
Laboratories of Europe Ltd, Oxford).
Does anyone know of any other studies?

Best regards,

Sofie Johansson Kokkinakis

--
************************************************************************
* Sofie Johansson Kokkinakis   sofie.johansson.kokkinakis@svenska.gu.se*
* Systemanalyst/ Ph.D Student           http://svenska.gu.se/~svesj/   *
* Språkdata, Inst. för svenska språket  Tel: +46 (0)31 773 5281        *
* (Dept. of Swedish Language)           Fax: +46 (0)31 773 4455        *
* Göteborgs universitet, Box 200        SE 405 30 GÖTEBORG, Sweden     *
*       Computers are not intelligent. They just think they are.       *
************************************************************************

This archive was generated by hypermail 2b29 : Fri Jan 11 2002 - 16:00:16 MET