Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Marian Olteanu (mou_softwin@yahoo.com)
Date: Thu Dec 23 2004 - 06:58:10 MET


    Sorry, I don't have time to read the recommended papers, but if I were in your shoes and were
    looking for perfect matches (detecting identical documents, not merely similar ones), I would
    compute an MD5 hash for each document and then look for duplicate hash values. Whenever two
    hashes match, I would compare the two documents directly. This algorithm is practically
    O(n) + O(m*m) (m = number of duplicate documents in the collection of n documents), because the
    probability of getting the same MD5 value for two different documents is very, very low (with
    extremely high probability, you will encounter no more than one false positive in the MD5
    comparison).
    Because you have different document types, I would convert them all to a common format before
    computing the MD5 value (i.e., extract the text, keep only letters and digits (ignoring
    punctuation and spaces), and uppercase everything).
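
    The approach described above could be sketched roughly as follows in Python. This is only an
    illustration under some assumptions: the documents have already been converted to plain text,
    they are passed in as a dictionary mapping a document id to its text, and the function names
    (normalize, find_exact_duplicates) are made up for this example.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def normalize(text: str) -> str:
    """Keep only letters and digits and uppercase them (hypothetical helper)."""
    return "".join(ch for ch in text if ch.isalnum()).upper()


def find_exact_duplicates(docs: Dict[str, str]) -> List[List[str]]:
    """Group document ids whose normalized text is identical.

    `docs` maps a document id to its plain-text content (an assumed input shape).
    """
    # Pass 1: bucket documents by the MD5 digest of their normalized text.
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        norm = normalize(text)
        digest = hashlib.md5(norm.encode("utf-8")).hexdigest()
        buckets[digest].append((doc_id, norm))

    # Pass 2: only buckets with more than one document need a direct comparison,
    # which guards against the (very unlikely) case of an MD5 collision.
    groups = []
    for candidates in buckets.values():
        if len(candidates) < 2:
            continue
        by_text = defaultdict(list)
        for doc_id, norm in candidates:
            by_text[norm].append(doc_id)
        groups.extend(sorted(ids) for ids in by_text.values() if len(ids) > 1)
    return sorted(groups)


if __name__ == "__main__":
    sample = {
        "a.html": "Hello, World!",
        "b.txt": "hello world",
        "c.pdf": "Something else entirely",
    }
    print(find_exact_duplicates(sample))
```

    For simplicity the sketch keeps every normalized text in memory. With a collection of about
    1 million texts it would probably be better to store only the digests in the first pass and
    re-read the candidate documents for the direct comparison.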

    --- Ralf Steinberger <ralf.steinberger@jrc.it> wrote:

    > We are facing the task of having to find duplicate and near-duplicate
    > documents in a collection of about 1 million texts. Can anyone give us
    > advice on how to approach this challenge?
    >
    > The documents are in various formats (html, PDF, MS-Word, plain text, ...)
    > so that we intend to first convert them to plain text. It is possible that
    > the same text is present in the document collection in different formats.
    >
    > For smaller collections, we identify (near)-duplicates by applying
    > hierarchical clustering techniques, but with this approach, we are limited
    > to a few thousand documents.
    >
    > Any pointers are welcome. Thank you.
    >
    > Ralf Steinberger
    > European Commission - Joint Research Centre
    > http://www.jrc.it/langtech
    >
    >

    =====
    Marian
    http://www.utdallas.edu/~mgo031000/

                    


