Re: Corpora: Statistical significance of tagging differences

Jean Veronis (Jean.Veronis@lpl.univ-aix.fr)
Wed, 24 Mar 1999 16:36:04 +0100

There is a coefficient that I find very useful to evaluate intertagger
agreement, Cohen's kappa, which "subtracts" from the observed agreement
the agreement that would be obtained by pure chance, given the marginal
probabilities of tags. Such a correction is very important, since a figure
such as 95% agreement is very good if chance agreement is only 20%, whereas
it is not very impressive if chance agreement is 90%!

More exactly, k is defined as:

    ObservedAgreement - ExpectedAgreement
k = -------------------------------------
            1 - ExpectedAgreement

The coefficient is equal to 0 when there is no more agreement than chance,
and equal to 1 when there is perfect agreement.

In the examples above, k would be 94% and 50%, respectively.
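
For those who want to try this on their own data, here is a minimal sketch
in Python (my own illustration, not code from any of the papers cited below;
the function names and the toy tag sequences are made up). One helper takes
pre-computed observed and chance agreement rates, the other derives both from
two annotators' tag sequences, estimating chance agreement from the product
of each annotator's marginal tag probabilities, as described above.

from collections import Counter

def kappa_from_agreement(observed, expected):
    """Cohen's kappa given observed and chance (expected) agreement rates."""
    return (observed - expected) / (1.0 - expected)

def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two annotators' tag sequences of equal length."""
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    marg_a = Counter(tags_a)
    marg_b = Counter(tags_b)
    # Chance agreement: probability that both annotators choose the same tag
    # if each tags independently according to their own marginal distribution.
    expected = sum((marg_a[t] / n) * (marg_b[t] / n) for t in marg_a)
    return kappa_from_agreement(observed, expected)

# The two examples above: 95% observed agreement against 20% and 90%
# chance agreement give kappa of about 0.94 and 0.50.
print(kappa_from_agreement(0.95, 0.20))   # ~0.9375
print(kappa_from_agreement(0.95, 0.90))   # ~0.50

# A made-up toy example with raw tag sequences.
tags_a = ["N", "V", "N", "N", "ADJ", "N"]
tags_b = ["N", "V", "N", "V", "ADJ", "N"]
print(cohen_kappa(tags_a, tags_b))        # ~0.71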

* Original article:

Cohen, J. (1960). A coefficient of agreement for nominal scales.
Educational and Psychological Measurement, 20, 37-46.

* Extension to more than 2 annotators:

Davies, M., Fleiss, J. L. (1982). Measuring agreement for multinomial data.
Biometrics, 38, 1047-1051.

* Extension for partial agreement:

Cohen, J. (1968). Weighted kappa: nominal scale agreement with provision
for scaled disagreement or partial credit. Psychological Bulletin, 70(4),
213-220.

* Recent articles using k in CL:

Bruce, R., Wiebe, J. (1998). Word sense distinguishability and inter-coder
agreement. Proceedings of the 3rd Conference on Empirical Methods in
Natural Language Processing (EMNLP-98). Association for Computational
Linguistics SIGDAT, Granada, Spain, June 1998.

Carletta, J. (1996). Assessing agreement on classification tasks: the kappa
statistic. Computational Linguistics, 22(2), 249-254.

Véronis, J. (1998a). A study of polysemy judgements and inter-annotator
agreement. Programme and advanced papers of the Senseval workshop, 2-4
September 1998. Herstmonceux Castle, England.
[http://www.up.univ-mrs.fr/~veronis/pdf/1998senseval.pdf]