Re: [Corpora-List] Developing and testing new similarity measures for word clustering

From: Eric Atwell (eric@comp.leeds.ac.uk)
Date: Sun Oct 10 2004 - 23:42:36 MET DST

    Normand,
    You could empirically evaluate the output of a word-clustering program
    by comparing its results with an established tagset. For example,
    word-clusters learnt on an English corpus can be evaluated by seeing
    whether the words in a cluster share the same PoS-tag in an established
    English corpus-based tagset, such as that used in the tagged LOB corpus;
    see:

    Hughes J and Atwell E. 1994. The automated evaluation of inferred word
    classifications, in Cohn A G, (editor), Proceedings of ECAI'94: 11th
    European Conference on Artificial Intelligence, pages 535-540, John
    Wiley, Chichester.
    http://www.comp.leeds.ac.uk/nlp/papers/hughes+atwell94ecai.ps.Z

      - clustering of English word-types into grammatical classes, based on
    similarity of contexts in a corpus. Several alternative metrics are
    evaluated by comparing the clusters produced with the LOB Corpus tagset.
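
    As a rough illustration of this kind of evaluation, the sketch below
    scores a set of clusters by how many words in each cluster carry that
    cluster's most frequent tag. It is only a minimal sketch, not the
    scoring scheme of the paper above: the function name, the toy tags
    (ART, VERB, NOUN, which are not actual LOB tags) and the assumption of
    exactly one tag per word-type are all made up for the example.

```python
from collections import Counter


def cluster_tag_agreement(clusters, gold_tags):
    """Score clusters against a reference tagset (hypothetical helper).

    clusters  -- list of clusters, each a list of word-types
    gold_tags -- dict mapping a word-type to a single reference tag
                 (assumption: one tag per word-type; real corpora have
                 ambiguous words, which this sketch ignores)

    Returns (score, per_cluster), where score is the fraction of tagged
    words that carry the majority tag of their own cluster, and
    per_cluster lists (majority_tag, matching_words, tagged_words).
    """
    matched = 0
    total = 0
    per_cluster = []
    for words in clusters:
        # Words missing from the reference tagset are skipped.
        tags = [gold_tags[w] for w in words if w in gold_tags]
        if not tags:
            continue
        majority_tag, count = Counter(tags).most_common(1)[0]
        per_cluster.append((majority_tag, count, len(tags)))
        matched += count
        total += len(tags)
    score = matched / total if total else 0.0
    return score, per_cluster


# Toy data, invented for illustration only.
clusters = [["the", "a", "an"], ["run", "walk", "dog"], ["cat", "house"]]
gold_tags = {
    "the": "ART", "a": "ART", "an": "ART",
    "run": "VERB", "walk": "VERB",
    "dog": "NOUN", "cat": "NOUN", "house": "NOUN",
}

score, per_cluster = cluster_tag_agreement(clusters, gold_tags)
print(score)        # 0.875: 7 of the 8 words match their cluster's majority tag
print(per_cluster)
```

    Running the same scoring over the clusters produced by each candidate
    similarity measure gives one comparable number per measure. Note that a
    score of this kind rewards many small clusters, so the number of
    clusters should be held fixed when comparing measures.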

    More recently, Leeds PhD student Andy Roberts has used this
    word-clustering evaluation technique (comparison with LOB corpus
    tagging) to evaluate a different word-clustering approach, based on
    function-word-collocation profile patterns; see:

    Roberts, Andrew. 2002. Automatic Acquisition of Word Classification using
    Distributional Analysis of Content Words with Respect to Function Words.
    Unpublished Research Report, School of Computing, University of Leeds
    http://www.comp.leeds.ac.uk/andyr/research/abstracts/roberts01autoacquire.html

    Eric Atwell, School of Computing, Leeds University

    On Fri, 8 Oct 2004, Normand Peladeau wrote:

    > I have been reviewing some of the similarity measures used to perform word
    > clustering (Jaccard, Dice, Simple Matching, correlation, etc.) and I came to
    > the conclusion that many of those measures had some metric problems that
    > probably make them non-optimal for word clustering.
    >
    > I am working now on some modified versions of those indices and I need some
    > ways to benchmark those new similarity measures. I would like to have a
    > series of benchmarks for several kinds of application (dimension reduction,
    > automatic identification of themes, automatic taxonomy development, etc.).
    >
    > I would like suggestions for ways to benchmark those new measures and compare
    > their performance with the more traditional ones. Any idea, reference, data
    > set would be welcome.
    >
    > I am also looking for existing articles where those measures have been
    > compared (either empirically or theoretically)
    >
    >
    > Thanks,
    >
    > Normand Peladeau
    > Provalis Research

    -- 
    Eric Atwell, Senior Lecturer, Computer Vision and Language research group,
    School of Computing, University of Leeds, LEEDS LS2 9JT, England
    TEL: +44-113-2335430  FAX: +44-113-2335468  http://www.comp.leeds.ac.uk/eric
    


