Re: Re: [Corpora-List] How to word presentation for word clustering?

From: chen wenliang (chenwl@mail.neu.edu.cn)
Date: Thu Jul 08 2004 - 03:45:40 MET DST


    Thanks for your reply!

    Did you try using TF*IDF values as the word representation for word clustering?

    Here is how I define when two words should be in the same cluster.

    For example, "football" and "basketball" should be in the same cluster because they always appear in the "sports" category.

    So I want two words to be in the same cluster when they always appear in the same categories.

    But I don't have a large labeled document corpus (with category labels) from which to compute the class distribution of words for clustering (as Baker 98 does).

    I want to cluster words using only a large unlabeled document corpus.
    Regards,

    Chen Wenliang chenwl@mail.neu.edu.cn 2004-07-08
    ======= 2004-07-07 Original Message=======

    >Dear Chen Wenliang,
    >
    >I am using TF*IDF values as my representation for words.
    >vector w = {tf(1)*IDF(1), tf(2)*IDF(2), ..., tf(n)*IDF(n)}, where the IDF is
    >computed from a large corpus. This seems to give better results than just
    >the raw frequency counts.
    >The representations I investigated were: TF, TF*IDF and simple binary counts
    >(1 if the word occurs, 0 if it doesn't).
    >
    >Regards,
    >
    >Clive De Silva
    >University of Cambridge
    >----- Original Message -----
    >From: "chen wenliang" <chenwl@mail.neu.edu.cn>
    >To: <corpora@hd.uib.no>
    >Sent: Wednesday, July 07, 2004 10:17 AM
    >Subject: [Corpora-List] How to word presentation for word clustering?
    >
    >
    >Dear all,
    >
    >I am looking for a word representation for word clustering.
    >
    >I am doing a project on word clustering. At the moment each word is represented as
    >a vector w = {tf(1),tf(2),...,tf(n)}, where tf(i) is the frequency of the word in
    >document i. I then use k-means as the clustering algorithm.
    >
    >Thanks all.
    >
    >regards,
    >
    >Chen Wenliang chenwl@mail.neu.edu.cn
    >
    >Nlplab, Northeastern University, China.
    >
    >2004-07-07
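
The three word representations compared in the quoted messages (TF, TF*IDF, binary), followed by k-means, could be sketched roughly as below. This is only an illustration under stated assumptions, not the code either poster used: the function names `word_vectors` and `kmeans` are made up here, the IDF is taken as log(n/df) over the same toy corpus rather than a separate large corpus, tokenization is a plain whitespace split, and k-means is a bare Lloyd's iteration with a seeded random start.

```python
import math
import random
from collections import Counter


def word_vectors(docs, scheme="tfidf"):
    """Map each word to a vector with one component per document.

    scheme: "tf" (raw counts), "binary" (1.0/0.0), or "tfidf".
    Assumption: IDF = log(n_docs / document_frequency), from `docs` itself.
    """
    n = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    vectors = {}
    for word in vocab:
        tf = [c[word] for c in counts]  # tf(i): frequency in document i
        if scheme == "tf":
            vec = [float(t) for t in tf]
        elif scheme == "binary":
            vec = [1.0 if t else 0.0 for t in tf]
        elif scheme == "tfidf":
            df = sum(1 for t in tf if t)  # documents containing the word
            idf = math.log(n / df)
            vec = [t * idf for t in tf]
        else:
            raise ValueError("unknown scheme: " + scheme)
        vectors[word] = vec
    return vectors


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    docs = [
        "football match football",
        "basketball match",
        "stock market",
    ]
    vecs = word_vectors(docs, "tfidf")
    words = list(vecs)
    labels = kmeans([vecs[w] for w in words], k=2)
    for w, lab in zip(words, labels):
        print(lab, w)
```

Note that with this representation two words only end up close together if they occur in the same documents, so on a small corpus the vectors are very sparse. A word that occurs in every document gets IDF 0 and therefore an all-zero TF*IDF vector.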

    This archive was generated by hypermail 2b29 : Thu Jul 08 2004 - 03:49:44 MET DST