Dear readers of the corpora list,
1) I have a question concerning the relation between the required
minimum size of a training corpus and the performance of the tagger
being trained. In an article I found a reference to:
J. M. Baker, 1982, The Performing Arts - How to measure up! Proceedings
of the NBS Workshop on Standardisation for Speech I/O Technology, pp
25-33.
In this paper Baker describes the minimal number of test tokens using
the following formula:
n=4 x 104+log 1/x
where "x" is the error rate of the tagger.
My question is whether this is the only such formula, or whether there
are other formulas for computing the size of training corpora.
2) I also have a second question, concerning studies of the relation
between a tagger's performance and the size of its tagset.
I have found two articles so far (Zavrel and Daelemans, 1999, Recent Advances
in Memory-Based Part-of-Speech Tagging, Tilburg University) and
(Elworthy, 1995, Tagset Design and Inflected Languages, Sharp
Laboratories of Europe Ltd, Oxford.)
Does anyone know of any other studies?
Best regards,
Sofie Johansson Kokkinakis
--
Sofie Johansson Kokkinakis
sofie.johansson.kokkinakis@svenska.gu.se
Systemanalyst/ Ph.D Student
http://svenska.gu.se/~svesj/
Språkdata, Inst. för svenska språket (Dept. of Swedish Language)
Göteborgs universitet, Box 200, SE 405 30 GÖTEBORG, Sweden
Tel: +46 (0)31 773 5281
Fax: +46 (0)31 773 4455
Computers are not intelligent. They just think they are.