Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Tom Emerson (tree@basistech.com)
Date: Wed Dec 22 2004 - 19:15:32 MET


    Rolf,

    The work of Broder et al. published at WWW6 is a common root for many
    duplicate document detection algorithms:

    Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey
    Zweig. 1997. "Syntactic Clustering of the Web". In Proceedings of the
    6th World Wide Web Conference (WWW6).
    http://decweb.ethz.ch/WWW6/Technical/Paper205/Paper205.html

    There has been quite a bit of work following on from the shingle
    fingerprinting proposed in that original paper: there are 113
    citations listed in CiteSeer.
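
    As a rough illustration of the shingling idea (a sketch only, not code
    from the paper): a document is reduced to its set of contiguous w-word
    "shingles", and two documents are compared by the overlap of those sets.
    The function names and the default w=4 are assumptions made for this
    example, and the fingerprinting and sampling steps that make the
    approach practical on large collections are left out.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word shingles in text."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Documents shorter than w words yield a single short shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard overlap of the two shingle sets, a value in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumps over the sleepy dog"
    print(resemblance(doc1, doc2))
```

    Pairs whose resemblance exceeds some chosen threshold would then be
    treated as near-duplicate candidates; the threshold and the whitespace
    tokenisation used here are likewise illustrative choices.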

    We have been experimenting with various techniques for identifying
    similar content in large, multilingual document collections harvested
    from the Web, but are not ready to present any results.

        -tree

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
    


