Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Bruce L. Lambert, Ph.D. (lambertb@uic.edu)
Date: Wed Dec 22 2004 - 18:46:21 MET

Next message: Marco Baroni: "Re: [Corpora-List] Q: How to identify duplicates in a large document collection"

Previous message: Ralf Steinberger: "[Corpora-List] Q: How to identify duplicates in a large document collection"
In reply to: Ralf Steinberger: "[Corpora-List] Q: How to identify duplicates in a large document collection"
Next in thread: Mike Maxwell: "Re: [Corpora-List] Q: How to identify duplicates in a large document collection"
Next in thread: Marco Baroni: "Re: [Corpora-List] Q: How to identify duplicates in a large document collection"
Reply: Mike Maxwell: "Re: [Corpora-List] Q: How to identify duplicates in a large document collection"

Ralf,

There are non-hierarchical clustering methods that might work. Look for
papers on the "scatter/gather" method. You might also try contacting the
people at Vivisimo.com who have experience clustering very large collections.

There is no quick way to do this. At some point you will have to consider
500 billion or so pairwise similarities. Using an inverted index, you can
avoid computing the zero-valued similarities, but that will still leave a
lot of non-zero similarities to deal with. Good luck.
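
For illustration only, here is a minimal Python sketch of the inverted-index
idea above. It is not from the original thread: the function name
candidate_pairs, the whitespace tokenization, and the three toy documents
are all assumptions. For 1 million documents the number of unordered pairs
is n(n-1)/2, which is where the figure of about 500 billion comes from. The
index only ever enumerates pairs that share at least one term, so
zero-valued similarities are never computed.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

# Number of unordered document pairs for a collection of 1 million texts.
TOTAL_PAIRS = comb(1_000_000, 2)  # 499,999,500,000, i.e. about 500 billion


def candidate_pairs(docs):
    """Map each document pair sharing at least one term to its shared-term count.

    Hypothetical helper: tokenization is a plain lowercase whitespace split,
    which a real system would replace with something more careful.
    """
    # Inverted index: term -> set of ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].add(doc_id)

    # Walk each posting list; only documents that co-occur in some list
    # can have a non-zero term overlap, so all other pairs are skipped.
    shared = defaultdict(int)
    for postings in index.values():
        for i, j in combinations(sorted(postings), 2):
            shared[(i, j)] += 1
    return dict(shared)


docs = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "completely unrelated words here",
]
pairs = candidate_pairs(docs)  # only the pair (0, 1) is ever generated
```

Note that very frequent terms produce very long posting lists, so on real
text a large number of non-zero pairs remains, as the message says. Dropping
stopwords or indexing longer word sequences instead of single words is one
common way to shrink the candidate set, though that goes beyond what this
sketch does.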

-bruce

At 10:45 AM 12/22/2004, Ralf Steinberger wrote:
>We are facing the task of having to find duplicate and near-duplicate
>documents in a collection of about 1 million texts. Can anyone give us
>advice on how to approach this challenge?
>
>The documents are in various formats (html, PDF, MS-Word, plain text, ...)
>so that we intend to first convert them to plain text. It is possible that
>the same text is present in the document collection in different formats.
>
>For smaller collections, we identify (near)-duplicates by applying
>hierarchical clustering techniques, but with this approach, we are limited
>to a few thousand documents.
>
>Any pointers are welcome. Thank you.
>
>Ralf Steinberger
>European Commission - Joint Research Centre
>http://www.jrc.it/langtech
>

This archive was generated by hypermail 2b29 : Wed Dec 22 2004 - 18:38:18 MET