Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Wed Dec 22 2004 - 18:23:34 MET

Next message: Alexander Clark: "Re: [Corpora-List] Q: How to identify duplicates in a large document collection"

Previous message: Bruce L. Lambert, Ph.D.: "Re: [Corpora-List] Q: How to identify duplicates in a large document collection"
In reply to: Ralf Steinberger: "[Corpora-List] Q: How to identify duplicates in a large document collection"
Next in thread: Alexander Clark: "Re: [Corpora-List] Q: How to identify duplicates in a large document collection"

I found several papers on this topic by working backwards and sideways
from:

On the Evolution of Clusters of Near-Duplicate Web Pages
Dennis Fetterly; Mark Manasse; Marc Najork
http://research.microsoft.com/research/pubs/view.aspx?type=Publication&id=1096

However, I am curious whether anybody on this list has actually
implemented a method such as the one described in this paper (based on
fingerprints of fingerprints of ``shingles'', as they call word
sequences...), and could provide more concrete advice on this important
issue.
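
To make the idea concrete, here is a minimal sketch of how I read
"fingerprints of fingerprints of shingles". It is not the implementation
from the paper: the function names, the use of BLAKE2 as the fingerprint
function, and the parameters (5-word shingles, 12 hash seeds, groups of 4)
are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, k=5):
    """Return the set of k-word sequences ("shingles") in the text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(data, seed=0):
    """64-bit fingerprint of a string; BLAKE2 is a stand-in (assumption)."""
    h = hashlib.blake2b(data.encode("utf-8"), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "big")


def sketch(text, k=5, num_hashes=12, group_size=4):
    """Summarise a document as a short tuple of group fingerprints.

    For each seed, keep the smallest shingle fingerprint (a min-hash
    sample), then fingerprint each run of group_size samples again.
    """
    sh = shingles(text, k)
    if not sh:
        return ()
    mins = [min(fingerprint(s, seed) for s in sh)
            for seed in range(num_hashes)]
    return tuple(
        fingerprint(",".join(map(str, mins[i:i + group_size])),
                    seed=num_hashes)
        for i in range(0, num_hashes, group_size)
    )


def shared_groups(a, b):
    """Count positions at which two sketches agree."""
    return sum(x == y for x, y in zip(a, b))
```

Two documents whose sketches agree in one or more positions would be
treated as near-duplicate candidates; the threshold is a tuning choice.
With a million texts one would presumably index the documents by group
fingerprint and compare only those that collide, rather than comparing
all pairs.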

Regards,

Marco

On Wed, 22 Dec 2004, Ralf Steinberger wrote:

> We are facing the task of having to find duplicate and near-duplicate
> documents in a collection of about 1 million texts. Can anyone give us
> advice on how to approach this challenge?


This archive was generated by hypermail 2b29 : Wed Dec 22 2004 - 18:38:21 MET