Re: Corpora: Statistical significance of tagging differences

Chris Brew (Chris.Brew@edinburgh.ac.uk)
Wed, 17 Mar 1999 12:10:04 +0000

> Hi,
>
> I've been experimenting with PoS taggers operating under different
> conditions so that I have several different taggings of the same corpus.
> I also have a "gold standard" annotation for that text, so I can work out
> the percentage correct for each tagging.
>
> I was wondering if anyone knows of the appropriate statistical tests
> which could be applied to determine whether the differences in tagging
> performance are statistically significant?
>
> Any pointers would be appreciated.
>
> Thanks in advance,
> Mark Stevenson
>
>
> -----------------------------------------------------------------------------
> Mark Stevenson
> Research Assistant marks@dcs.shef.ac.uk
> Natural Language Processing Group http://www.dcs.shef.ac.uk/~marks
> Sheffield University (0114) 222 1899
> -----------------------------------------------------------------------------

cf. van Halteren, Zavrel and Daelemans, Proceedings of Coling-98, vol. 1,
pp. 491ff., footnote 7, which uses McNemar's chi-square. Since in POS
tagging we are typically dealing with large corpora, even numerically
small differences in error rate are likely to be statistically
significant. Statistical significance is of course not the only relevant
criterion.
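
For concreteness, here is a minimal sketch of McNemar's chi-square applied
to two taggings of the same gold-standard text. It is not taken from the
paper cited above: the function name `mcnemar`, its arguments, and the use
of the continuity-corrected form of the statistic are all assumptions made
for illustration. The test looks only at the tokens where exactly one of
the two taggers is correct, and compares the two discordant counts.

```python
import math


def mcnemar(gold, tags_a, tags_b):
    """Continuity-corrected McNemar chi-square for two taggings.

    Hypothetical helper for illustration. Each argument is a sequence
    of tags, one per token, aligned with the others.
    Returns (statistic, p_value), with 1 degree of freedom.
    """
    if not (len(gold) == len(tags_a) == len(tags_b)):
        raise ValueError("all three tag sequences must have the same length")

    # Count discordant tokens: exactly one tagger agrees with the gold tag.
    a_only = 0  # tagger A right, tagger B wrong
    b_only = 0  # tagger A wrong, tagger B right
    for g, a, b in zip(gold, tags_a, tags_b):
        if a == g and b != g:
            a_only += 1
        elif a != g and b == g:
            b_only += 1

    n = a_only + b_only
    if n == 0:
        # The taggers never disagree in correctness: no evidence of a difference.
        return 0.0, 1.0

    # Continuity-corrected statistic, floored at zero.
    stat = max(abs(a_only - b_only) - 1, 0) ** 2 / n
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

As a purely hypothetical illustration of the point about corpus size: in a
corpus of 1,000,000 tokens, discordant counts of 6,000 and 5,000 correspond
to an accuracy difference of only 0.1 percentage points, yet the statistic
is 999^2 / 11,000, roughly 90.7, which is far beyond the usual thresholds
for one degree of freedom.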

C