Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: William Fletcher (fletcher@usna.edu)
Date: Wed Jan 05 2005 - 12:33:43 MET

  • Next message: Normand Peladeau: "[Corpora-List] corpora@hd.uib.no"

    Hi Marc and Normand,

    How about sharing your scripts? I am sure everyone would be grateful for an off-the-shelf solution that could be easily adapted to one's own needs or serve as inspiration for other applications.

    Regards,
    Bill

    >>> Marc Kupietz <kupietz@ids-mannheim.de> 1/5/2005 5:59:20 AM >>>
    On Wednesday, 12.01.2005 at 14:40 -0500, Normand Peladeau wrote:
    > Sorry if my suggestion is irrelevant or inadequate, but what about creating
    > an inverted index of this document collection and using this inverted index
    > to retrieve the most similar documents? I just implemented such an
    > algorithm, and without much effort spent on speed optimization I was
    > able to compare the similarity of a document to a collection of 100,000
    > documents indexed on about 3,000 index terms, and it took less than 0.4
    > seconds to retrieve the most similar documents. Increasing the spread of
    > the index or the size of the document collection would certainly
    > increase the computing time, but it would probably take no more than a
    > minute or two to retrieve the duplicate documents in your collection.
    >
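    [Editor's note: a minimal sketch of the term-based inverted-index retrieval described above. The function names and the similarity measure (a simple count of shared index terms) are illustrative assumptions, not the actual implementation mentioned in the mail.]

    ```python
    from collections import defaultdict, Counter

    def build_inverted_index(docs):
        """Map each index term to the set of document ids containing it."""
        index = defaultdict(set)
        for doc_id, terms in docs.items():
            for term in set(terms):
                index[term].add(doc_id)
        return index

    def most_similar(query_terms, index, top_n=5):
        """Score each candidate document by how many index terms it shares
        with the query document, and return the best matches."""
        scores = Counter()
        for term in set(query_terms):
            for doc_id in index.get(term, ()):
                scores[doc_id] += 1
        return scores.most_common(top_n)
    ```

    Because the index maps each term directly to the documents containing it, only documents sharing at least one term with the query are ever touched, which is why a single lookup against 100,000 documents can run in well under a second.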

    What you describe (in better terms than I did...) is indeed
    approximately part of what we do: we do construct an index, but
    the index keys are not a selection of terms; they are hash keys for all (or
    most - depending on the normalization function, which may delete some
    frequent, uncharacteristic words) occurring n-grams (e.g.
    5-word sequences). Once the index construction is complete, looking up
    the (near) duplicates of a single document takes almost no time.
    What actually takes 2 hours for 1,000,000 documents is constructing
    the index and computing a complete similarity matrix (the
    output is, of course, constrained by some minimum overlap ratio...) for
    all documents.
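    [Editor's note: a hedged sketch of the n-gram hashing scheme described above, in Python rather than the Perl mentioned below. The choice of CRC32 as the hash, the overlap-ratio definition (shared shingles over the smaller shingle set), and all names are illustrative assumptions.]

    ```python
    from collections import defaultdict
    import zlib

    def shingles(tokens, n=5):
        """Hash keys for all n-word sequences of a tokenized document
        (CRC32 stands in for whatever hash function is actually used)."""
        return {zlib.crc32(" ".join(tokens[i:i + n]).encode())
                for i in range(len(tokens) - n + 1)}

    def near_duplicates(docs, n=5, min_overlap=0.8):
        """Return all document pairs whose n-gram sets overlap by at
        least min_overlap (the 'minimum overlap ratio' constraint)."""
        sets = {d: shingles(t, n) for d, t in docs.items()}
        # inverted index: hash key -> documents containing that n-gram
        index = defaultdict(set)
        for d, s in sets.items():
            for h in s:
                index[h].add(d)
        # count shared n-grams per candidate pair via the index,
        # so pairs with no n-gram in common are never compared
        shared = defaultdict(int)
        for h, ds in index.items():
            ds = sorted(ds)
            for i in range(len(ds)):
                for j in range(i + 1, len(ds)):
                    shared[(ds[i], ds[j])] += 1
        pairs = []
        for (a, b), count in shared.items():
            ratio = count / min(len(sets[a]), len(sets[b]))
            if ratio >= min_overlap:
                pairs.append((a, b, ratio))
        return pairs
    ```

    The inverted index does the pruning: only document pairs that share at least one hashed n-gram generate a counter entry, so the full similarity matrix never has to be computed pair by pair.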

    As you said, this is indeed extremely simple: without optimizations
    and a sophisticated hash function, my initial Perl source code was less
    than a screen long. I was all the more surprised that apparently (please
    correct me...) nobody has done this before.

    Best regards
     Marc

    -- 
    Marc Kupietz                                      Tel. (+49) 621/1581-409
    Institut für Deutsche Sprache, Dept. of Lexical Studies/Corpus Technology
    PO Box 101621, 68016 Mannheim, Germany        http://www.ids-mannheim.de/ 
    



    This archive was generated by hypermail 2b29 : Wed Jan 05 2005 - 12:33:12 MET