Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Gregor Erbach (gor@acm.org)
Date: Wed Dec 22 2004 - 20:59:45 MET

    I know of two publications on the efficient detection of duplicates
    and near-duplicates in large document collections:

    Andrei Z. Broder et al.
    Syntactic Clustering of the Web
    http://gatekeeper.research.compaq.com/pub/DEC/SRC/technical-notes/SRC-1997-015-html/

    US Patent 6658423
    William Pugh and Monika H. Henzinger
    Google Inc.
    Detecting duplicate and near-duplicate files
    http://v3.espacenet.com/textdoc?DB=EPODOC&IDX=US6658423&F=0
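
As a rough illustration of the shingling idea described in the first reference, here is a minimal Python sketch. It is not code from either publication. The function names, the shingle width w=4, the sketch size of 64, and the use of SHA-1 as a stand-in for the fingerprint function are all assumptions made for this example. Each plain-text document is reduced to its set of w-word shingles, the resemblance of two documents is the overlap of those sets, and a small fixed-size sketch allows the resemblance to be estimated without keeping the full sets.

```python
import hashlib
import re


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Very short documents: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Exact overlap |S(a) & S(b)| / |S(a) | S(b)| of the shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def sketch(text, w=4, size=64):
    """Fixed-size sketch: the `size` smallest 64-bit shingle hashes."""
    hashes = {
        int.from_bytes(
            hashlib.sha1(" ".join(s).encode("utf-8")).digest()[:8], "big"
        )
        for s in shingles(text, w)
    }
    return sorted(hashes)[:size]


def estimated_resemblance(sk_a, sk_b, size=64):
    """Estimate resemblance from two sketches alone.

    Take the `size` smallest hashes of the combined sketches and count
    the fraction of them that occur in both sketches.
    """
    smallest = sorted(set(sk_a) | set(sk_b))[:size]
    if not smallest:
        return 1.0
    common = set(sk_a) & set(sk_b)
    return sum(1 for h in smallest if h in common) / len(smallest)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
print(resemblance(doc_a, doc_b))
print(estimated_resemblance(sketch(doc_a), sketch(doc_b)))
```

For a large collection, the sketches would be computed once per document after conversion to plain text, and only documents whose sketches share hashes would be compared. Pairs whose estimated resemblance exceeds some chosen threshold (for instance 0.9, a value picked here purely for illustration) would then be treated as near-duplicates.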

    regards,

           Gregor

    Ralf Steinberger wrote:

    > We are facing the task of having to find duplicate and near-duplicate
    > documents in a collection of about 1 million texts. Can anyone give us
    > advice on how to approach this challenge?
    >
    > The documents are in various formats (HTML, PDF, MS Word, plain text,
    > ...) so that we intend to first convert them to plain text. It is
    > possible that the same text is present in the document collection in
    > different formats.
    >
    > For smaller collections, we identify (near-)duplicates by applying
    > hierarchical clustering techniques, but with this approach, we are
    > limited to a few thousand documents.
    >
    > Any pointers are welcome. Thank you.
    >
    > Ralf Steinberger
    > European Commission - Joint Research Centre
    > http://www.jrc.it/langtech
    >

    -- 
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Dr. Gregor Erbach                     http://purl.org/net/gregor/
    DFKI GmbH, Language Technology Lab    http://www.dfki.de/
    Tel. +49 (681) 302-5354               mailto:erbach@dfki.de
    


