Re: [Corpora-List] How to word presentation for word clustering?

From: Clive De Silva (cd334@cam.ac.uk)
Date: Wed Jul 07 2004 - 17:16:29 MET DST

    Yes, sorry if that wasn't clear.
    n and w refer to the same 'word', but the count for n comes from the
    document and IDF(w) from the large corpus.

    Clive
    ----- Original Message -----
    From: "Gaël Dias" <ddg@di.ubi.pt>
    To: "Clive De Silva" <cd334@cam.ac.uk>
    Cc: <chenwl@mail.neu.edu.cn>; <corpora@hd.uib.no>
    Sent: Wednesday, July 07, 2004 4:12 PM
    Subject: Re: [Corpora-List] How to word presentation for word clustering?

    Be careful,

    IDF is unique for a word and does not depend on the document
    so that you have:

    vector w = { tf(1)*IDF(w), tf(2)*IDF(w), ..., tf(n)*IDF(w) }
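
    As a minimal sketch of the formula above (not code from this thread):
    tf(i) is taken to be the raw count of w in document i, and IDF(w) is
    assumed to be log(N / df(w)) over a separate large corpus. The thread
    does not say which IDF variant is used, and the names idf, word_vector
    and reference_docs are hypothetical.

```python
import math


def idf(word, reference_docs):
    """IDF(w), assumed here to be log(N / df(w)) over the reference corpus."""
    n_docs = len(reference_docs)
    # df(w): number of reference documents that contain the word at least once.
    df = sum(1 for doc in reference_docs if word in doc)
    if df == 0:
        raise ValueError(f"{word!r} does not occur in the reference corpus")
    return math.log(n_docs / df)


def word_vector(word, docs, reference_docs):
    """Return [tf(1)*IDF(w), ..., tf(n)*IDF(w)] for one word.

    IDF(w) is computed once and reused for every component; only the
    term frequency tf(i) changes from document to document.
    """
    weight = idf(word, reference_docs)
    return [doc.count(word) * weight for doc in docs]


# Documents are lists of tokens; both corpora here are toy data.
reference_docs = [["a", "b"], ["a"], ["c"], ["a", "c"]]
docs = [["c", "c", "x"], ["x"], ["c"]]
vector = word_vector("c", docs, reference_docs)
```

    Here "c" occurs in 2 of the 4 reference documents, so every component
    of the vector is the count of "c" in that document times log(2).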

    Gaël.

    Clive De Silva wrote:
    > Dear Chen Wenliang,
    >
    > I am using TF*IDF values as my representation for words.
    > vector w = { tf(1)*IDF(1), tf(2)*IDF(2), ..., tf(n)*IDF(n) } where the
    > IDF is computed from a large corpus. This seems to give better results
    > than just the raw frequency counts.
    > The representations I investigated were: TF, TF*IDF and simple binary
    > counts (1 if the word is present in the vector, 0 if it is not).
    >
    > Regards,
    >
    > Clive De Silva
    > University of Cambridge

    -- 
    ---------------------------------------------------------
    Gaël Harry Dias, PhD            | Assistant Professor
    Human Language Technology Group | [www.di.ubi.pt/~ddg]
    Computer Science Department     | [ddg@di.ubi.pt]
    Beira Interior University       | [Tel: +351 275 319 700]
    6201-001 - Covilhã - PORTUGAL   | [Fax: +351 275 319 732]
    ---------------------------------------------------------
    


