Re: [Corpora-List] semantic similarity

From: Dominic Widdows (widdows@maya.com)
Date: Thu Jan 20 2005 - 20:04:22 MET


    Dear Jana,

    Some of the infomap project's tools and methods may also help you -
    links to demos, software and many papers are available from
    http://infomap.stanford.edu

    The main piece of software available performs latent semantic analysis
    (there's a demo at infomap.stanford.edu/webdemo). While the current
    demo requires that you input an initial set of query terms, the
    software does build a dictionary file and it would be very easy to
    iterate through this and output pairs of terms whose latent semantic
    similarity is above a given threshold. (We have done this in the past
    for harvesting translation pairs from parallel corpora). We have also
    found LSA to be a very useful filter for relationships extracted by
    other means (for example, if you have two strings with similar
    orthography you can check using LSA to see if they are also
    contextually similar).
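
    As a rough illustration of the idea above, here is a minimal sketch in
    Python. It is not the Infomap software or its API: it assumes a plain
    term-document count matrix reduced with NumPy's SVD, and the names
    lsa_vectors, similar_pairs and confirm_candidates, the toy documents and
    the 0.9 threshold are all hypothetical choices made for the example.

```python
import itertools

import numpy as np


def lsa_vectors(docs, k=2):
    """Return (vocab, unit-length term vectors) from a truncated SVD.

    Assumption: whitespace tokenisation and raw term-document counts.
    """
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenised for w in doc})
    index = {w: i for i, w in enumerate(vocab)}

    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(tokenised):
        for w in doc:
            counts[index[w], j] += 1

    # Keep only the k strongest latent dimensions.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    vecs = u[:, :k] * s[:k]

    # Normalise rows so that a dot product is a cosine similarity.
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return vocab, vecs / norms


def similar_pairs(vocab, vecs, threshold=0.9):
    """Iterate through the dictionary and yield pairs above the threshold."""
    sims = vecs @ vecs.T
    for i, j in itertools.combinations(range(len(vocab)), 2):
        if sims[i, j] >= threshold:
            yield vocab[i], vocab[j], float(sims[i, j])


def confirm_candidates(candidates, vocab, vecs, threshold=0.9):
    """Keep candidate pairs (found by other means) that LSA also supports."""
    index = {w: i for i, w in enumerate(vocab)}
    kept = []
    for a, b in candidates:
        if a in index and b in index:
            if float(vecs[index[a]] @ vecs[index[b]]) >= threshold:
                kept.append((a, b))
    return kept


if __name__ == "__main__":
    toy_docs = [
        "cat feline pet",
        "cat feline purr",
        "stock market trade",
        "stock market price",
    ]
    vocab, vecs = lsa_vectors(toy_docs, k=2)
    for a, b, sim in similar_pairs(vocab, vecs):
        print(f"{a}\t{b}\t{sim:.3f}")
```

    The pairwise loop is quadratic in the dictionary size, so for a large
    vocabulary one would probably restrict it to frequent terms or use a
    nearest-neighbour index instead. confirm_candidates shows the filtering
    use: pairs proposed by another method, such as orthographic similarity,
    are kept only if their latent vectors are also close.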

    If any of the above material sounds useful to you, let me know and I
    may be able to help with more details.
    I too am in Pittsburgh - must be a large part of a small world :)
    Best wishes,
    Dominic

    > Hi Jana,
    >
    > have you looked at Latent Dirichlet Allocation, developed by Blei,
    > Jordan and Ng? Take a look at Blei's homepage:
    > http://www.cs.berkeley.edu/~blei/
    >
    > in particular,
    > Latent Dirichlet allocation. D. Blei, A. Ng, and M. Jordan. Journal of
    > Machine Learning Research, 3:993-1022, January 2003.
    >
    > Dave Blei is now a postdoc at CMU, and I'm a grad student here -- so
    > feel free to stop by.
    >
    > Best,
    > -Leo
    >
    > On Thu, 20 Jan 2005, Jana Diesner wrote:
    >
    >> Dear list members,
    >>
    >> We are looking for strategies, algorithms or code to automatically find
    >> single terms or multiple adjacent terms that are semantically similar
    >> within and across documents. The approach must not require POS tagging
    >> or an initial input of a reference term to the system. The resulting
    >> clusters of semantically similar terms suggested by the system do not
    >> need to be exclusive. We are familiar with secondstring, the software
    >> developed by William Cohen, and semantic similarity based on string-edit
    >> distances.
    >>
    >>
    >>
    >> Thank you very much.
    >>
    >> Jana
    >>
    >>
    >>
    >> ____________________
    >>
    >> Jana Diesner
    >> Carnegie Mellon University
    >>
    >> jdiesner@andrew.cmu.edu
    >
    >


