[Corpora-List] Developing and testing new similarity measures for word clustering

From: Normand Peladeau (peladeau@simstat.com)
Date: Tue Oct 12 2004 - 18:00:34 MET DST


    Many thanks to all those who answered my question about methods for
    comparing similarity measures. I am overwhelmed with new articles, new
    perspectives and will need several weeks (or months) in my busy schedule to
    assimilate all this information.

    Many of the papers suggested to me were quite relevant, but I was
    especially impressed by the Julie Wards thesis on similarity measures. I
    share with her the view that some similarity measures may be better for
    some applications, while others may be appropriate for other types of
    applications.

    One type of application that I didn't see mentioned was knowledge discovery,
    and I believe that it may require very different similarity measures than
    those used for automatic thesaurus construction, text retrieval, etc.

    In a project I am working on right now, we try to identify abnormally
    strong associations between otherwise unrelated words (it is a project
    related to ergonomic problems and human errors in airplane flights). We
    found the following measure to be very sensitive to the discovery of
    unexpected relationships:

            Inclusion index = a / min (a+b, a+c)

    This inclusion index, which varies between 0 and 1, has been used in
    library science to identify hierarchical relationships between words. One
    interesting property is that it reaches its maximum value of 1 if word #1
    is always associated with word #2, even if word #2 is not always
    associated with word #1. For example, "baseball" may appear 10 times,
    always together with "sport", while "sport" appears 100 times, only 1/10
    of them in the presence of "baseball". The inclusion index then takes a
    value of 1 because one word is considered to be included in the other
    (it seems to measure a kind of hyponymy relationship).
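
    The "baseball"/"sport" example can be worked through in a few lines of
    Python. This is only a sketch: it assumes the usual 2x2 co-occurrence
    notation, which the formula does not spell out (a = contexts containing
    both words, b = contexts with only word #1, c = contexts with only
    word #2). The function name and the zero-count guard are illustrative
    choices.

```python
def inclusion_index(a, b, c):
    """Inclusion index = a / min(a + b, a + c).

    Assumed notation (not defined in the message itself):
      a = contexts containing both words
      b = contexts containing only word #1
      c = contexts containing only word #2
    """
    smaller_total = min(a + b, a + c)
    if smaller_total == 0:
        # Neither word ever occurs; returning 0.0 is an arbitrary convention.
        return 0.0
    return a / smaller_total


# "baseball" occurs 10 times, always with "sport";
# "sport" occurs 100 times, so 90 of them are without "baseball".
a, b, c = 10, 0, 90
print(inclusion_index(a, b, c))  # 1.0
# A symmetric measure such as Jaccard, a / (a + b + c), gives a much lower value.
print(a / (a + b + c))           # 0.1
```

    The asymmetry comes entirely from the min() in the denominator: the index
    is normalised by the rarer word only, so the frequent word's other
    occurrences do not lower it.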

    We were able to identify with this specific index ergonomic problems that
    were real, but that would have remained undetected if we had used other
    similarity measures. I wonder whether such a measure has been used for
    other types of application.

    There seem to be a lot of empirical studies on those indices, but I have
    not seen a lot of theoretical evaluation (but I am not an expert in this
    area). I am under the impression that many basic theoretical questions
    remain unanswered when we choose a similarity measure. Here are a few of
    those questions:

            1) Should we consider a joint absence (both words are absent from a
    context) as an indication of their similarity?
            2) Should we consider negative correlation (one word occurs but not the
    other) as an indication of their dissimilarity or lack of similarity? But
    what about synonyms?
            3) Should we consider the probabilistic nature of co-occurrences?
            etc.

    Many of the measures we use make different assumptions about those questions.

    For example, from what I know, it seems that these indices make the
    following assumptions about those 3 questions:

            Simple matching    1) Yes    2) Partially    3) No
            Jaccard & Dice     1) No     2) Partially    3) No
            Correlation        1) Yes    2) Yes          3) Yes
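
    The difference on question 1 can be checked numerically. The sketch below
    uses the standard textbook definitions of these coefficients on a 2x2
    table, with assumed notation: a = both words present, b = only word #1,
    c = only word #2, d = both absent. "Correlation" is taken here to be the
    phi coefficient (Pearson correlation on binary data), which is an
    assumption about which correlation is meant.

```python
import math


def simple_matching(a, b, c, d):
    # Joint absences (d) count as agreement.
    return (a + d) / (a + b + c + d)


def jaccard(a, b, c, d=0):
    # d is accepted for a uniform signature but deliberately ignored.
    return a / (a + b + c)


def dice(a, b, c, d=0):
    # d is ignored here as well.
    return 2 * a / (2 * a + b + c)


def phi(a, b, c, d):
    # Phi coefficient; mismatches (b, c) push the value towards -1.
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        return 0.0  # undefined for a degenerate table; 0.0 is a convention
    return (a * d - b * c) / denom


# Same co-occurrence counts, with few versus many joint absences.
for d in (10, 1000):
    table = (10, 5, 5, d)
    print(d, simple_matching(*table), jaccard(*table), dice(*table), phi(*table))
```

    With these counts, Jaccard and Dice stay the same when d grows, while
    simple matching and phi both increase, which is what a "Yes" on
    question 1 amounts to in practice.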

    I have seen this kind of discussion in biology and ecology, but I don't
    remember seeing a paper discussing those basic questions for the analysis
    of textual data. Does anyone know of a good discussion of those theoretical
    issues?

    Best regards,

    Normand Peladeau
    Provalis Research
    www.simstat.com


