I found several papers on this topic by working backwards and sideways
from:
On the Evolution of Clusters of Near-Duplicate Web Pages
Dennis Fetterly; Mark Manasse; Marc Najork
http://research.microsoft.com/research/pubs/view.aspx?type=Publication&id=1096
However, I am curious whether anybody on this list has actually
implemented a method such as the one described in this paper (based on
fingerprints of fingerprints of "shingles", as they call word
sequences) and could provide more concrete advice on this important
issue.
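
To make the "fingerprints of fingerprints of shingles" idea concrete,
here is a minimal sketch of how I read the general scheme, not the
paper's actual implementation. All function names (fingerprint,
shingles, sketch, near_duplicate_pairs) are hypothetical, BLAKE2b is
swapped in for a proper fingerprinting function purely for convenience,
and the parameters (5-word shingles, 6 groups of 14 min-hash values, at
least 2 shared groups) are illustrative assumptions that would need
tuning.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def fingerprint(data: bytes, salt: int = 0) -> int:
    """64-bit fingerprint of data; BLAKE2b is an assumed stand-in hash."""
    h = hashlib.blake2b(data, digest_size=8, salt=salt.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "big")


def shingles(text: str, w: int = 5) -> set:
    """Set of all w-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < w:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def sketch(text: str, w: int = 5, groups: int = 6, group_size: int = 14) -> list:
    """Reduce a document to `groups` fingerprints of fingerprints."""
    encoded = [s.encode("utf-8") for s in shingles(text, w)]
    if not encoded:
        return []
    # Min-hashing: for each salted hash function keep the smallest
    # shingle fingerprint, giving groups * group_size features.
    mins = [min(fingerprint(s, salt) for s in encoded)
            for salt in range(groups * group_size)]
    result = []
    for g in range(groups):
        chunk = mins[g * group_size:(g + 1) * group_size]
        blob = b"".join(m.to_bytes(8, "big") for m in chunk)
        # Second-level fingerprint over one group of first-level ones.
        result.append((g, fingerprint(blob)))
    return result


def near_duplicate_pairs(docs: dict, min_shared: int = 2, **kw) -> set:
    """Pairs of document ids sharing at least min_shared group fingerprints."""
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        for key in sketch(text, **kw):
            buckets[key].add(doc_id)
    shared = defaultdict(int)
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair for pair, n in shared.items() if n >= min_shared}
```

Calling near_duplicate_pairs({"a": text_a, "b": text_b}) returns the
set of id pairs judged near-duplicate. The point of the second-level
fingerprints is that candidate pairs come from a lookup on a few keys
per document rather than from comparing every document with every
other one. For a collection of a million texts, the pairwise counting
inside large buckets and the pure-Python hashing would both need to be
replaced by something more scalable.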
Regards,
Marco
On Wed, 22 Dec 2004, Ralf Steinberger wrote:
> We are facing the task of having to find duplicate and near-duplicate
> documents in a collection of about 1 million texts. Can anyone give us
> advice on how to approach this challenge?