Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Wed Dec 22 2004 - 18:23:34 MET


    I found several papers on this topic by working backwards and sideways
    from:

    On the Evolution of Clusters of Near-Duplicate Web Pages
    Dennis Fetterly; Mark Manasse; Marc Najork
    http://research.microsoft.com/research/pubs/view.aspx?type=Publication&id=1096

    However, I am curious whether anybody on this list has actually
    implemented a method such as the one described in this paper (based on
    fingerprints of fingerprints of "shingles", as they call word
    sequences), and could offer more concrete advice on this important
    issue.
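
To make the "fingerprints of fingerprints of shingles" idea concrete, here is a rough Python sketch of the general scheme, not the paper's implementation. It makes several assumptions: BLAKE2b from hashlib stands in for the fingerprint function, and the shingle length (5 words), the number of hash functions (12) and the group size (4) are illustrative values chosen for this sketch. All function names are hypothetical.

```python
import hashlib


def fingerprint(data: bytes, seed: int = 0) -> int:
    """Return a 64-bit fingerprint of data.

    Assumption: salted BLAKE2b is used here purely for convenience;
    any well-mixed 64-bit fingerprint function could take its place.
    """
    h = hashlib.blake2b(data, digest_size=8, salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")


def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word sequences ("shingles") in text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def sketch(text: str, k: int = 5, num_hashes: int = 12) -> list:
    """For each seeded hash function, keep the smallest shingle fingerprint."""
    sh = shingles(text, k)
    if not sh:
        return []
    return [
        min(fingerprint(s.encode("utf-8"), seed) for s in sh)
        for seed in range(num_hashes)
    ]


def supershingles(sk: list, group_size: int = 4) -> list:
    """Fingerprint consecutive groups of minima: fingerprints of fingerprints."""
    out = []
    for i in range(0, len(sk), group_size):
        group = sk[i:i + group_size]
        out.append(fingerprint(b"".join(v.to_bytes(8, "little") for v in group)))
    return out


def shared_supershingles(a: str, b: str) -> int:
    """Count positions at which two documents have the same supershingle."""
    sa = supershingles(sketch(a))
    sb = supershingles(sketch(b))
    return sum(x == y for x, y in zip(sa, sb))
```

Two documents that agree on several supershingles are candidate near-duplicates. For a large collection, one would presumably index documents by supershingle value and compare only those that collide, rather than comparing every pair.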

    Regards,

    Marco

    On Wed, 22 Dec 2004, Ralf Steinberger wrote:

    > We are facing the task of having to find duplicate and near-duplicate
    > documents in a collection of about 1 million texts. Can anyone give us
    > advice on how to approach this challenge?


