[Corpora-List] Q: How to identify duplicates in a large document collection

From: Ralf Steinberger (ralf.steinberger@jrc.it)
Date: Wed Dec 22 2004 - 17:45:38 MET


    We need to find duplicate and near-duplicate documents in a collection
    of about 1 million texts. Can anyone give us advice on how to approach
    this challenge?
     
    The documents are in various formats (HTML, PDF, MS Word, plain text, ...),
    so we intend to convert them to plain text first. The same text may be
    present in the collection in several different formats.
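
    One way to catch the case where the same text arrives in several formats
    is to fingerprint each document after conversion. The following is a
    minimal sketch, assuming the plain-text conversion has already been done;
    the normalization steps and the names fingerprint and
    exact_duplicate_groups are illustrative assumptions, not part of any
    existing tool.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Hash a text after removing differences that format conversion tends to introduce."""
    text = unicodedata.normalize("NFKC", text)  # unify ligatures, full-width forms, etc.
    text = text.casefold()                      # ignore case differences
    text = re.sub(r"\s+", " ", text).strip()    # collapse line breaks and runs of spaces
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def exact_duplicate_groups(docs: Dict[str, str]) -> List[List[str]]:
    """Group document ids whose normalized text is identical (assumed input: id -> plain text)."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[fingerprint(text)].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

    This only finds texts that are identical after normalization, and it
    needs a single pass over the collection. Near-duplicates still need a
    similarity measure.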
     
    For smaller collections, we identify (near-)duplicates by applying
    hierarchical clustering techniques, but this approach limits us to a
    few thousand documents.
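
    To illustrate why clustering of this kind stops scaling, here is a
    minimal sketch of single-link clustering cut at a fixed similarity
    threshold. The word-shingle representation, the Jaccard measure, the
    threshold of 0.8 and all function names are assumptions made for the
    example; the setup described above does not specify them.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word sequences in a text (the whole text if it is shorter than k)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def single_link_clusters(texts, threshold=0.8, k=3):
    """Merge documents linked by a chain of pairs with similarity >= threshold.

    This is equivalent to cutting a single-link dendrogram at that threshold.
    """
    sets = [shingles(t, k) for t in texts]
    parent = list(range(len(texts)))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Every pair is compared: n * (n - 1) / 2 similarity computations.
    for i, j in combinations(range(len(texts)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

    The all-pairs loop is the bottleneck: a few thousand documents mean a
    few million comparisons, whereas 1 million documents would mean roughly
    5 x 10^11.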
     
    Any pointers are welcome. Thank you.
     
    Ralf Steinberger
    European Commission - Joint Research Centre
    http://www.jrc.it/langtech
     


