[Corpora-List] Q: How to identify duplicates in a large document collection

From: Ralf Steinberger (ralf.steinberger@jrc.it)
Date: Wed Dec 22 2004 - 17:45:38 MET

We need to find duplicate and near-duplicate documents in a collection
of about 1 million texts. Can anyone advise us on how to approach this
challenge?

The documents are in various formats (HTML, PDF, MS Word, plain text, ...),
so we intend to convert them to plain text first. The same text may be
present in the collection in several formats.
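
To illustrate the same-text-in-several-formats case: once everything is
plain text, exact duplicates could in principle be grouped by hashing a
normalized form of each text. The following is only a minimal sketch under
assumptions not stated above: the helper names (fingerprint,
group_exact_duplicates) and the sample documents are hypothetical, and the
normalization (Unicode NFKC, lowercasing, collapsing whitespace) is one
arbitrary choice. It finds exact matches only and does nothing for
near-duplicates.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Hash the text after Unicode, case and whitespace normalization."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def group_exact_duplicates(docs: Dict[str, str]) -> List[List[str]]:
    """Return groups of document ids whose normalized text is identical."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[fingerprint(text)].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical sample: the same text as it might come out of two converters.
docs = {
    "report.html": "Annual  Report\n2004",
    "report.pdf": "annual report 2004",
    "other.txt": "Something else entirely",
}
print(group_exact_duplicates(docs))  # [['report.html', 'report.pdf']]
```

This takes a single pass over the collection, so it would scale to a
million texts, but any difference that survives normalization (a page
header, a hyphenated line break) produces a different hash.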

For smaller collections, we identify (near-)duplicates with hierarchical
clustering techniques, but this approach limits us to a few thousand
documents.
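
One likely reason for this limit (an assumption, since it depends on the
implementation) is that hierarchical clustering in its standard form
starts from the similarity of every pair of documents, and the number of
pairs grows quadratically. A small sketch of the arithmetic, with a
hypothetical helper name:

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered document pairs among n documents: n(n-1)/2."""
    return n * (n - 1) // 2


for n in (5_000, 1_000_000):
    print(f"{n:>9,} documents -> {pairwise_comparisons(n):>15,} pairs")
# 5,000 documents     ->      12,497,500 pairs
# 1,000,000 documents -> 499,999,500,000 pairs
```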

Any pointers are welcome. Thank you.

Ralf Steinberger
European Commission - Joint Research Centre
http://www.jrc.it/langtech
