Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Tom Emerson (tree@basistech.com)
Date: Wed Dec 22 2004 - 19:15:32 MET

Next message: Shlomo Argamon: "Re: [Corpora-List] Q: How to identify duplicates in a large document collection"

Previous message: Mike Maxwell: "Re: [Corpora-List] Q: How to identify duplicates in a large document collection"
In reply to: Ralf Steinberger: "[Corpora-List] Q: How to identify duplicates in a large document collection"
Next in thread: Marian Olteanu: "Re: [Corpora-List] Q: How to identify duplicates in a large document collection"

Ralf,

The work of Broder et al., published at WWW6, is a common root for many
duplicate document detection algorithms:

Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey
Zweig. 1997. "Syntactic Clustering of the Web". In Proceedings of the
6th World Wide Web Conference (WWW6).
http://decweb.ethz.ch/WWW6/Technical/Paper205/Paper205.html

There has been quite a bit of work following on from the shingle
fingerprinting proposed in that original paper: there are 113
citations listed in CiteSeer.
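
For anyone who has not read the paper, here is a rough Python sketch of
the basic idea as I understand it: break each document into overlapping
w-word "shingles", then measure resemblance as the overlap between the
two shingle sets. The function names, the whitespace tokenizer, the
defaults (w=4, s=100) and the use of truncated SHA-1 in place of the
paper's Rabin fingerprints are all my own assumptions for illustration,
not the authors' implementation.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-token sequences (shingles) in text."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    if len(tokens) < w:
        # Documents shorter than w tokens yield one short shingle, or none.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard overlap of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def fingerprint(shingle):
    """64-bit fingerprint of one shingle (truncated SHA-1 as a stand-in)."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(text, w=4, s=100):
    """Keep the s smallest fingerprints as a fixed-size document sketch."""
    return sorted(fingerprint(sh) for sh in shingles(text, w))[:s]
```

Comparing the fixed-size sketches instead of the full shingle sets is
what makes this workable on a large collection, since each document is
reduced to at most s integers regardless of its length.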

We have been experimenting with various techniques for identifying
similar content on large, multilingual document collections harvested
from the Web, but are not ready to present any results.

-tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"

