Re: [Corpora-List] token clustering tool

From: Normand Peladeau (peladeau@simstat.com)
Date: Wed May 12 2004 - 01:07:14 MET DST


    At 2004-05-11 03:24, you wrote:
    >Dear all,
    >
    >Does anyone know of a tool (or algorithm), preferably available freely
    >for research purposes, that takes as its input a corpus only and
    >produces as its output clusters of tokens that occur close to each other
    >relatively often?

    I have written such a program, but it is a commercial product. You have
    already received suggestions for links to free clustering routines. If you
    can't find something that suits you among the public domain software, take
    a look at ours (www.simstat.com/wordstat.htm).

    However, I hope you won't mind my asking a few questions about the
    usefulness of clustering a corpus. There are hundreds of ways of clustering
    the same text corpus. To give you just a few examples of variations:
        * You can define proximity as two words occurring in the same sentence,
    the same paragraph, the same text, or a small window of words. Each
    definition will give you a different picture and different information.
        * You can use many kinds of similarity indices (some based on mere
    occurrence, like the Jaccard coefficient, often used in the clustering of
    text, others based on frequencies, like the cosine coefficient). I
    personally like to use an "inclusion index" that was developed in library
    science but that I have not seen applied anywhere else (though I have not
    looked very hard).
        * You can use one of the many hierarchical clustering algorithms or use
    something like a K-means or J-means clustering method.
        * You may also apply various feature selection methods.
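
    As a rough illustration of how much the first three choices matter, here
    is a minimal Python sketch. It is not how WordStat or any other package
    works. The function names, the toy text, the crude tokenizer, and the
    naive average-link clustering are all assumptions made for the example.
    In particular, the inclusion index is taken here to be the number of
    co-occurrences divided by the smaller of the two occurrence counts, which
    is one common definition in co-word analysis and may not be the exact
    variant meant above.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def sentence_units(text):
    """Proximity = same sentence; sentences are split naively on . ! ?"""
    units = (tokenize(s) for s in re.split(r"[.!?]+", text))
    return [u for u in units if u]


def window_units(text, size=2):
    """Proximity = same sliding window of `size` consecutive tokens."""
    tokens = tokenize(text)
    return [tokens[i:i + size] for i in range(max(1, len(tokens) - size + 1))]


def profiles(units):
    """Map each word to a Counter {unit index: frequency of the word in that unit}."""
    prof = {}
    for i, unit in enumerate(units):
        for word, n in Counter(unit).items():
            prof.setdefault(word, Counter())[i] = n
    return prof


def jaccard(p, q):
    """Occurrence-based: shared units / units containing either word."""
    a, b = set(p), set(q)
    return len(a & b) / len(a | b)


def cosine(p, q):
    """Frequency-based: cosine of the angle between the two frequency vectors."""
    dot = sum(p[i] * q[i] for i in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)


def inclusion(p, q):
    """Assumed definition: shared units / units containing the rarer word."""
    a, b = set(p), set(q)
    return len(a & b) / min(len(a), len(b))


def agglomerate(prof, sim, k):
    """Naive average-link agglomerative clustering, merging until k clusters remain."""
    clusters = [[w] for w in sorted(prof)]

    def link(c1, c2):
        total = sum(sim(prof[x], prof[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > max(k, 1):
        # Merge the pair of clusters with the highest average similarity.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # j > i, so index i is unaffected
    return clusters


if __name__ == "__main__":
    text = "The cat sat. The cat ran. The dog ran."  # toy corpus, for illustration
    for name, units in [("sentence", sentence_units(text)),
                        ("window", window_units(text, size=2))]:
        prof = profiles(units)
        for sim in (jaccard, cosine, inclusion):
            print(name, sim.__name__, round(sim(prof["cat"], prof["the"]), 3))
        print(name, "clusters:", agglomerate(prof, jaccard, 2))
```

    Running it on the same toy text with sentence units and then with
    two-word windows gives different similarity values for the same pair of
    words, and swapping the index passed to agglomerate() can change the
    resulting groups, which is the point of the list above.
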
    I wonder whether someone has written a paper (or a book) on how those
    various ways of clustering textual data differ and how each may reveal
    different realities. I have seen some articles comparing a few of those
    features for the analysis of textual data, and I am also familiar with
    books devoted to the clustering of numerical data that discuss all those
    aspects in that context, but what I am looking for is a more comprehensive
    discussion of all those aspects of clustering when applied to the analysis
    of textual data. Any suggested reading?

    Normand Peladeau
    Provalis Research
    www.simstat.com


