Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Bruce L. Lambert, Ph.D. (lambertb@uic.edu)
Date: Wed Dec 22 2004 - 18:46:21 MET

    Ralf,

    There are non-hierarchical clustering methods that might work. Look for
    papers on the "scatter/gather" method. You might also try contacting the
    people at Vivisimo.com who have experience clustering very large collections.

    There is no quick way to do this. At some point you will have to consider
    500 billion or so pairwise similarities (1 million documents give
    n(n-1)/2, or roughly 5 x 10^11, pairs). Using an inverted index, you can
    avoid computing the zero-valued similarities, but that will still leave a
    lot of non-zero similarities to deal with. Good luck.
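
    To make the inverted-index point concrete, here is a minimal Python
    sketch. It is only an illustration under stated assumptions: the
    function names, the raw term-frequency weighting and the cosine measure
    are all hypothetical choices, not taken from any particular system. It
    accumulates dot products only for document pairs that share at least
    one term, so pairs with zero similarity are never touched.

```python
from collections import defaultdict
from itertools import combinations
import math


def total_pairs(n):
    """Number of unordered document pairs among n documents."""
    return n * (n - 1) // 2


def nonzero_cosine_pairs(docs):
    """Return {(i, j): cosine} for every pair i < j sharing at least one term.

    docs is a list of token lists. Weights are raw term frequencies
    (an assumption for illustration; tf-idf would work the same way).
    """
    # Build one sparse term-frequency vector per document.
    vectors = []
    for tokens in docs:
        tf = defaultdict(int)
        for token in tokens:
            tf[token] += 1
        vectors.append(tf)

    # Euclidean norm of each vector, needed for the cosine.
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vectors]

    # Inverted index: term -> list of (doc_id, weight), in doc_id order.
    index = defaultdict(list)
    for doc_id, vector in enumerate(vectors):
        for term, weight in vector.items():
            index[term].append((doc_id, weight))

    # Walk each postings list; only documents that co-occur in some
    # list ever get an entry, so zero-valued pairs are skipped entirely.
    dots = defaultdict(float)
    for postings in index.values():
        for (i, wi), (j, wj) in combinations(postings, 2):
            dots[(i, j)] += wi * wj

    return {
        (i, j): dot / (norms[i] * norms[j])
        for (i, j), dot in dots.items()
    }


if __name__ == "__main__":
    docs = [["a", "b"], ["a", "b"], ["c"], ["a", "c"]]
    print(total_pairs(1_000_000))
    print(nonzero_cosine_pairs(docs))
```

    Note that the work per term grows with the square of its postings list,
    so very frequent terms dominate the cost; dropping stop words before
    indexing is one common way to keep that in check.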

    -bruce

    At 10:45 AM 12/22/2004, Ralf Steinberger wrote:
    >We are facing the task of having to find duplicate and near-duplicate
    >documents in a collection of about 1 million texts. Can anyone give us
    >advice on how to approach this challenge?
    >
    >The documents are in various formats (html, PDF, MS-Word, plain text, ...)
    >so that we intend to first convert them to plain text. It is possible that
    >the same text is present in the document collection in different formats.
    >
    >For smaller collections, we identify (near)-duplicates by applying
    >hierarchical clustering techniques, but with this approach, we are limited
    >to a few thousand documents.
    >
    >Any pointers are welcome. Thank you.
    >
    >Ralf Steinberger
    >European Commission - Joint Research Centre
    ><http://www.jrc.it/langtech>
    >


