Re: Corpora: Collaborative effort

From: Ng Hwee Tou (nhweetou@dso.org.sg)
Date: Wed Jun 14 2000 - 16:09:29 MET DST


    > In the case of Semcor and DSO, the sense inventory was the same (WordNet).
    > The rate of agreement I mentioned was the agreement we would get by
    > tagging all instances with the most frequent sense for the word in the
    > corpus.

    As reported in our ACL SIGLEX99 workshop paper ("A Case Study on
    Inter-Annotator Agreement for Word Sense Disambiguation", by Hwee Tou Ng,
    Chung Yong Lim, and Shou King Foo), for the 30,315 sentences that are common
    to both Semcor and the DSO corpus, the rate of inter-annotator agreement is
    56.7%. Our calculation indicates that the most frequent senses (of the 191
    words) in the intersection corpus of 30,315 Semcor sentences account for
    53.2% of the occurrences of these words.
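
To make the two figures concrete, here is a minimal sketch of how an agreement rate and a most-frequent-sense share could be computed over a shared set of instances. The function names and the toy sense tags are hypothetical and are not taken from the paper; the sketch assumes both annotations are aligned instance by instance.

```python
from collections import Counter


def agreement_rate(tags_a, tags_b):
    """Fraction of aligned instances on which two annotations assign the same sense."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("annotations must be non-empty and aligned")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def most_frequent_sense_share(instances):
    """Fraction of (word, sense) instances tagged with their word's most frequent sense."""
    if not instances:
        raise ValueError("instances must be non-empty")
    per_word = {}
    for word, sense in instances:
        per_word.setdefault(word, Counter())[sense] += 1
    # For each word, count the occurrences of its single most frequent sense.
    covered = sum(max(counts.values()) for counts in per_word.values())
    return covered / len(instances)


# Hypothetical toy data: the same six instances tagged by two annotation efforts.
words = ["bank", "bank", "bank", "run", "run", "run"]
senses_a = ["bank#1", "bank#1", "bank#2", "run#1", "run#3", "run#1"]
senses_b = ["bank#1", "bank#2", "bank#2", "run#1", "run#3", "run#2"]

print(agreement_rate(senses_a, senses_b))
print(most_frequent_sense_share(list(zip(words, senses_a))))
```

Comparing the two printed numbers on real data corresponds to comparing the 56.7% and 53.2% figures above.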

    However, part of the reason the two figures are so close is that many of
    these 191 words have a very skewed sense distribution, such that the most
    frequent sense of a word accounts for a large share of that word's
    occurrences. If we restrict our attention to half of these 191 words (61
    nouns and 35 verbs) where the most frequent sense occurs comparatively less
    often, then the Semcor-DSO agreement rate for the 61 nouns is 10% higher
    than the most frequent sense occurrence rate, and for the 35 verbs it is
    16% higher.

    Another point to note is that the inter-annotator agreement rate has a lot
    to do with the very refined sense distinction used in WordNet. As reported
    in our SIGLEX99 paper, if we allow coarser sense classes, then the
    inter-annotator agreement for a subset of 53 nouns and 42 verbs can be
    higher than 93%.
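
As a rough illustration of why coarser classes raise agreement, the sketch below maps fine-grained sense tags to coarse classes before comparing them. The mapping, the class names, and the tags are hypothetical examples; this is not the procedure used in the SIGLEX99 paper to derive its coarser classes.

```python
def coarse_agreement(tags_a, tags_b, coarse_class):
    """Agreement rate after mapping each fine sense to its coarse class.

    Senses missing from `coarse_class` are kept as their own class.
    """
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("annotations must be non-empty and aligned")
    matches = sum(
        coarse_class.get(a, a) == coarse_class.get(b, b)
        for a, b in zip(tags_a, tags_b)
    )
    return matches / len(tags_a)


# Hypothetical mapping from fine senses to coarse classes.
coarse_class = {"bank#1": "FINANCE", "bank#2": "FINANCE", "bank#3": "RIVER"}

fine_a = ["bank#1", "bank#2", "bank#3", "bank#1"]
fine_b = ["bank#2", "bank#2", "bank#1", "bank#3"]

fine_rate = coarse_agreement(fine_a, fine_b, {})  # empty mapping: fine-grained comparison
coarse_rate = coarse_agreement(fine_a, fine_b, coarse_class)
print(fine_rate, coarse_rate)
```

Disagreements between senses that fall into the same coarse class stop counting as disagreements, so the coarse rate can never be lower than the fine-grained one.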

    Hwee Tou
    DSO National Laboratories, Singapore


