Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Alexander Clark (asc@aclark.demon.co.uk)
Date: Wed Dec 22 2004 - 19:00:56 MET

  • Next message: Mike Maxwell: "Re: [Corpora-List] Q: How to identify duplicates in a large document collection"

    One way that I have used on a collection of about 50K documents is to
    construct a suffix array + longest common prefix table. This will allow
    you to identify every "exactly" repeated string in the corpus.
    This can be done in (where n is total length of all documents) O(n log
    n) easily or, or O(n) with a bit of work.
    This assumes that the duplicated portions of the documents are exact.
    If all you are doing is looking for exact duplicates then just compute a
    good hash function for each file and sort them.

    This is related to the problem of finding all maximal repeats in a
    string of length n which can be done in linear time.

    "Algorithms on Strings, Trees and Sequences: Computer Science and
    Computational Biology" by
    Gusfield (CUP) 1997 is the standard textbook for this stuff.

     

    On Wed, 2004-12-22 at 16:45, Ralf Steinberger wrote:
    > We are facing the task of having to find duplicate and near-duplicate
    > documents in a collection of about 1 million texts. Can anyone give us
    > advice on how to approach this challenge?
    >
    > The documents are in various formats (html, PDF, MS-Word, plain text,
    > ...) so that we intend to first convert them to plain text. It is
    > possible that the same text is present in the document collection in
    > different formats.
    >
    > For smaller collections, we identify (near)-duplicates by applying
    > hierarchical clustering techniques, but with this approach, we are
    > limited to a few thousand documents.
    >
    > Any pointers are welcome. Thank you.
    >
    > Ralf Steinberger
    > European Commission - Joint Research Centre
    > http://www.jrc.it/langtech
    >

    -- 
    Alexander Clark asc@aclark.demon.co.uk alexc@cs.rhul.ac.uk
    http://www.cs.rhul.ac.uk/home/alexc/
    Lecturer, Department of Computer Science,
    Royal Holloway, University of London, Egham, Surrey TW20 0EX 
    Direct 01784 443430 Department 01784 434455 Fax 01784 439786
    



    This archive was generated by hypermail 2b29 : Wed Dec 22 2004 - 18:58:26 MET