Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Alex Murzaku (lissus@gmail.com)
Date: Wed Jan 05 2005 - 15:21:00 MET


    I would suggest using Lucene (http://jakarta.apache.org/lucene), which
    is a transparent, scalable, open-source search engine library written
    in Java. Not only can you find duplicates (everything with close
    to 100% similarity), but you could also use it for other corpus search
    endeavors. I have actually used it for clustering documents.
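    A minimal sketch of this kind of use, assuming a recent Lucene release
    (the class names below differ from the 2005 Jakarta API); the "id" and
    "body" field names and the toy term thresholds are illustrative only:

    // Sketch: index a few documents with Lucene and use MoreLikeThis to pull
    // back the ones most similar to a given text.  Class names follow a recent
    // Lucene release, not the 2005 Jakarta API; field names are made up here.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;
    import java.io.StringReader;
    import java.util.List;

    public class NearDuplicateSearch {
        public static void main(String[] args) throws Exception {
            List<String> texts = List.of(
                "the quick brown fox jumps over the lazy dog",
                "the quick brown fox jumped over the lazy dog",   // near-duplicate
                "an unrelated sentence about corpus linguistics");

            Directory dir = new ByteBuffersDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                for (int i = 0; i < texts.size(); i++) {
                    Document doc = new Document();
                    doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
                    doc.add(new TextField("body", texts.get(i), Field.Store.YES));
                    writer.addDocument(doc);
                }
            }

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                MoreLikeThis mlt = new MoreLikeThis(reader);
                mlt.setAnalyzer(analyzer);
                mlt.setFieldNames(new String[] {"body"});
                mlt.setMinTermFreq(1);   // toy corpus: keep every term
                mlt.setMinDocFreq(1);

                // Query with the text of document 0; high-scoring hits are
                // candidate (near-)duplicates, lower-scoring ones merely similar.
                Query query = mlt.like("body", new StringReader(texts.get(0)));
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.printf("doc %s  score %.3f%n",
                            searcher.doc(hit.doc).get("id"), hit.score);
                }
            }
        }
    }

    Hits that score nearly as high as the query document itself are the
    near-duplicate candidates; the lower-scoring tail is what makes the same
    index usable for clustering and other corpus search tasks.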

    On Wed, 05 Jan 2005 06:33:43 -0500, William Fletcher <fletcher@usna.edu> wrote:
    > Hi Marc and Normand,
    >
    > How about sharing your scripts? I am sure everyone would be grateful for an off-the-shelf solution that could easily be adapted to one's own needs or serve as inspiration for other applications.
    >
    > Regards,
    > Bill
    >
    > >>> Marc Kupietz <kupietz@ids-mannheim.de> 1/5/2005 5:59:20 AM >>>
    > On Wednesday, 12.01.2005, at 14:40 -0500, Normand Peladeau wrote:
    > > Sorry if my suggestion is irrelevant or inadequate, but what about creating
    > > an inverted index of this document collection and using this inverted index
    > > to retrieve the most similar documents? I just implemented such an
    > > algorithm, and without a lot of effort spent on speed optimization, I was
    > > able to compare the similarity of a document to a collection of 100,000
    > > documents indexed on about 3,000 index terms, and it took less than 0.4
    > > seconds to retrieve the most similar documents. Increasing the spread of
    > > the index or the size of the collection of documents would definitely
    > > increase the computing time, but it would probably take no more than a
    > > minute or two to retrieve duplicate documents in your collection.
    > >
    >
    > What you describe (in better terms than I did...) is indeed
    > approximately part of what we do: we do construct an index, but the
    > index keys are not a selection of terms; they are hash keys for all (or
    > most, depending on the normalization function, which may delete some
    > frequent uncharacteristic words) occurring n-grams (e.g.
    > 5-word sequences). Once the index construction is complete, looking up
    > the (near) duplicates of a single document takes almost no time.
    > What actually takes 2 hours for 1,000,000 documents is the construction
    > of the index and the computation of a complete similarity matrix (the
    > output is, of course, constrained by some minimum overlap ratio...) for
    > all documents.
    >
    > As you said, this is indeed extremely simple: without optimizations
    > or a sophisticated hash function, my initial Perl source code was less
    > than a screen long. I was all the more surprised that apparently
    > (please correct me...) nobody has done this before.
    >
    > Best regards
    > Marc
    >
    > --
    > Marc Kupietz Tel. (+49) 621/1581-409
    > Institut für Deutsche Sprache, Dept. of Lexical Studies/Corpus Technology
    > PO Box 101621, 68016 Mannheim, Germany http://www.ids-mannheim.de/
    >
    >
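
    A minimal sketch of the inverted-index retrieval Peladeau describes
    above: each document is indexed on its terms, and candidates are scored
    by counting shared terms through the postings lists, so only documents
    sharing at least one term with the query document are ever touched. The
    whitespace tokenizer and the Dice overlap score below are assumptions
    made for illustration; the post does not say how his roughly 3,000 index
    terms were selected or how similarity was scored.

    // Sketch of inverted-index similarity retrieval: postings lists map each
    // term to the documents containing it, and shared-term counts are turned
    // into a Dice overlap score.  Tokenization and scoring are placeholders.
    import java.util.*;

    public class InvertedIndexSimilarity {
        private final Map<String, List<Integer>> postings = new HashMap<>();
        private final List<Set<String>> docTerms = new ArrayList<>();

        /** Naive tokenizer; a real system would normalize and select index terms. */
        private static Set<String> terms(String text) {
            return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        }

        public int add(String text) {
            int id = docTerms.size();
            Set<String> ts = terms(text);
            docTerms.add(ts);
            for (String t : ts) {
                postings.computeIfAbsent(t, k -> new ArrayList<>()).add(id);
            }
            return id;
        }

        /** Return (docId -> overlap score) for all documents sharing terms with the query. */
        public Map<Integer, Double> mostSimilar(String text) {
            Set<String> ts = terms(text);
            Map<Integer, Integer> shared = new HashMap<>();
            for (String t : ts) {
                for (int docId : postings.getOrDefault(t, List.of())) {
                    shared.merge(docId, 1, Integer::sum);
                }
            }
            // Dice coefficient: 2*|A∩B| / (|A|+|B|); values near 1.0 mean near-duplicate.
            Map<Integer, Double> scores = new HashMap<>();
            shared.forEach((docId, common) ->
                scores.put(docId, 2.0 * common / (ts.size() + docTerms.get(docId).size())));
            return scores;
        }

        public static void main(String[] args) {
            InvertedIndexSimilarity idx = new InvertedIndexSimilarity();
            idx.add("the quick brown fox jumps over the lazy dog");
            idx.add("the quick brown fox jumped over a lazy dog");
            idx.add("an entirely different sentence about corpora");
            System.out.println(idx.mostSimilar("the quick brown fox jumps over the lazy dog"));
        }
    }

    Replacing the Dice score with cosine similarity over tf-idf weights, or
    restricting terms() to a fixed vocabulary of index terms, fits the same
    skeleton.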

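    And a sketch of the n-gram hash index Kupietz describes: every 5-word
    sequence of every normalized document is hashed, the hashes serve as
    index keys, and a pair of documents is reported as a (near) duplicate
    when its shared-shingle ratio exceeds a minimum overlap ratio. The crude
    normalization, Java's built-in string hash, and the 0.8 threshold are
    placeholders; his Perl implementation is not shown in the thread.

    // Sketch of n-gram (shingle) hashing for near-duplicate detection: hash
    // every 5-word sequence, index shingle hash -> documents, then count
    // shared shingles per pair and report pairs above a minimum overlap ratio.
    import java.util.*;

    public class ShingleDuplicates {
        static final int N = 5;                 // 5-word sequences, as in the post
        static final double MIN_OVERLAP = 0.8;  // assumed minimum overlap ratio

        static Set<Long> shingleHashes(String text) {
            String[] words = text.toLowerCase().split("\\W+");  // crude normalization
            Set<Long> hashes = new HashSet<>();
            for (int i = 0; i + N <= words.length; i++) {
                hashes.add((long) String.join(" ", Arrays.copyOfRange(words, i, i + N)).hashCode());
            }
            return hashes;
        }

        public static void main(String[] args) {
            List<String> docs = List.of(
                "one two three four five six seven eight nine ten",
                "one two three four five six seven eight nine eleven",
                "a completely different document with other words in it entirely");

            // Index: shingle hash -> documents containing that shingle.
            List<Set<Long>> perDoc = new ArrayList<>();
            Map<Long, List<Integer>> index = new HashMap<>();
            for (int d = 0; d < docs.size(); d++) {
                Set<Long> hs = shingleHashes(docs.get(d));
                perDoc.add(hs);
                for (long h : hs) index.computeIfAbsent(h, k -> new ArrayList<>()).add(d);
            }

            // Count shared shingles per document pair via the index; only pairs
            // that actually share a shingle are ever considered.
            Map<String, Integer> sharedCount = new HashMap<>();
            for (List<Integer> ds : index.values()) {
                for (int i = 0; i < ds.size(); i++)
                    for (int j = i + 1; j < ds.size(); j++)
                        sharedCount.merge(ds.get(i) + "-" + ds.get(j), 1, Integer::sum);
            }

            // Report pairs whose overlap ratio exceeds the threshold.
            sharedCount.forEach((pair, common) -> {
                String[] p = pair.split("-");
                int a = Integer.parseInt(p[0]), b = Integer.parseInt(p[1]);
                double overlap = (double) common / Math.min(perDoc.get(a).size(), perDoc.get(b).size());
                if (overlap >= MIN_OVERLAP)
                    System.out.printf("docs %d and %d are near-duplicates (overlap %.2f)%n", a, b, overlap);
            });
        }
    }

    On a million documents the pair counting dominates, which is presumably
    where the two hours Kupietz mentions are spent; the in-memory maps above
    would of course have to give way to something disk-backed at that scale.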

