Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Alex Murzaku (lissus@gmail.com)
Date: Wed Jan 05 2005 - 15:21:00 MET


    I would suggest using Lucene (http://jakarta.apache.org/lucene), which
    is a transparent, scalable, open-source search engine library written
    in Java. Not only can you find duplicates (everything with close
    to 100% similarity), but you could also use it for other corpus search
    endeavors. I have actually used it for clustering documents.
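    A minimal sketch of this kind of use, assuming a recent Lucene release
    (the class names below differ from the 2005 Jakarta API); the "id" and
    "body" field names and the toy term thresholds are illustrative only:

    // Sketch: index a few documents with Lucene and use MoreLikeThis to pull
    // back the ones most similar to a given text.  Class names follow a recent
    // Lucene release, not the 2005 Jakarta API; field names are made up here.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;
    import java.io.StringReader;
    import java.util.List;

    public class NearDuplicateSearch {
        public static void main(String[] args) throws Exception {
            List<String> texts = List.of(
                "the quick brown fox jumps over the lazy dog",
                "the quick brown fox jumped over the lazy dog",   // near-duplicate
                "an unrelated sentence about corpus linguistics");

            Directory dir = new ByteBuffersDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                for (int i = 0; i < texts.size(); i++) {
                    Document doc = new Document();
                    doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
                    doc.add(new TextField("body", texts.get(i), Field.Store.YES));
                    writer.addDocument(doc);
                }
            }

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                MoreLikeThis mlt = new MoreLikeThis(reader);
                mlt.setAnalyzer(analyzer);
                mlt.setFieldNames(new String[] {"body"});
                mlt.setMinTermFreq(1);   // toy corpus: keep every term
                mlt.setMinDocFreq(1);

                // Query with the text of document 0; high-scoring hits are
                // candidate (near-)duplicates, lower-scoring ones merely similar.
                Query query = mlt.like("body", new StringReader(texts.get(0)));
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.printf("doc %s  score %.3f%n",
                            searcher.doc(hit.doc).get("id"), hit.score);
                }
            }
        }
    }

    Hits that score nearly as high as the query document itself are the
    near-duplicate candidates; the lower-scoring tail is what makes the same
    index usable for clustering and other corpus search tasks.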

    On Wed, 05 Jan 2005 06:33:43 -0500, William Fletcher <fletcher@usna.edu> wrote:
    > Hi Marc and Normand,
    >
    > How about sharing your scripts? I am sure everyone would be grateful for an off-the-shelf solution that could easily be adapted to one's own needs or serve as inspiration for other applications.
    >
    > Regards,
    > Bill
    >
    > >>> Marc Kupietz <kupietz@ids-mannheim.de> 1/5/2005 5:59:20 AM >>>
    > On Wednesday, 12.01.2005, at 14:40 -0500, Normand Peladeau wrote:
    > > Sorry if my suggestion is irrelevant or inadequate, but what about creating
    > > an inverted index of this document collection and using this inverted index
    > > to retrieve the most similar documents? I just implemented such an
    > > algorithm, and without a lot of effort spent on speed optimization, I was
    > > able to compare the similarity of a document to a collection of 100,000
    > > documents indexed on about 3,000 index terms, and it took less than 0.4
    > > seconds to retrieve the most similar documents. Increasing the spread of
    > > the index or the size of the collection of documents would definitely
    > > increase the computing time, but it would probably take no more than a
    > > minute or two to retrieve duplicate documents in your collection.
    > >
    >
    > What you describe (in better terms than I did...) is indeed
    > approximately part of what we do: we do construct an index, but the
    > index keys are not a selection of terms; they are hash keys for all (or
    > most, depending on the normalization function, which may delete some
    > frequent uncharacteristic words) occurring n-grams (e.g.
    > 5-word sequences). Once the index construction is complete, looking up
    > the (near) duplicates of a single document takes almost no time.
    > What actually takes 2 hours for 1,000,000 documents is the construction
    > of the index and the computation of a complete similarity matrix (the
    > output is, of course, constrained by some minimum overlap ratio...) for
    > all documents.
    >
    > As you said, this is indeed extremely simple: without optimizations
    > or a sophisticated hash function, my initial Perl source code was less
    > than a screen long. I was all the more surprised that apparently
    > (please correct me...) nobody has done this before.
    >
    > Best regards
    > Marc
    >
    > --
    > Marc Kupietz Tel. (+49) 621/1581-409
    > Institut für Deutsche Sprache, Dept. of Lexical Studies/Corpus Technology
    > PO Box 101621, 68016 Mannheim, Germany http://www.ids-mannheim.de/
    >
    >
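
    A minimal sketch of the inverted-index retrieval Peladeau describes
    above: each document is indexed on its terms, and candidates are scored
    by counting shared terms through the postings lists, so only documents
    sharing at least one term with the query document are ever touched. The
    whitespace tokenizer and the Dice overlap score below are assumptions
    made for illustration; the post does not say how his roughly 3,000 index
    terms were selected or how similarity was scored.

    // Sketch of inverted-index similarity retrieval: postings lists map each
    // term to the documents containing it, and shared-term counts are turned
    // into a Dice overlap score.  Tokenization and scoring are placeholders.
    import java.util.*;

    public class InvertedIndexSimilarity {
        private final Map<String, List<Integer>> postings = new HashMap<>();
        private final List<Set<String>> docTerms = new ArrayList<>();

        /** Naive tokenizer; a real system would normalize and select index terms. */
        private static Set<String> terms(String text) {
            return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        }

        public int add(String text) {
            int id = docTerms.size();
            Set<String> ts = terms(text);
            docTerms.add(ts);
            for (String t : ts) {
                postings.computeIfAbsent(t, k -> new ArrayList<>()).add(id);
            }
            return id;
        }

        /** Return (docId -> overlap score) for all documents sharing terms with the query. */
        public Map<Integer, Double> mostSimilar(String text) {
            Set<String> ts = terms(text);
            Map<Integer, Integer> shared = new HashMap<>();
            for (String t : ts) {
                for (int docId : postings.getOrDefault(t, List.of())) {
                    shared.merge(docId, 1, Integer::sum);
                }
            }
            // Dice coefficient: 2*|A∩B| / (|A|+|B|); values near 1.0 mean near-duplicate.
            Map<Integer, Double> scores = new HashMap<>();
            shared.forEach((docId, common) ->
                scores.put(docId, 2.0 * common / (ts.size() + docTerms.get(docId).size())));
            return scores;
        }

        public static void main(String[] args) {
            InvertedIndexSimilarity idx = new InvertedIndexSimilarity();
            idx.add("the quick brown fox jumps over the lazy dog");
            idx.add("the quick brown fox jumped over a lazy dog");
            idx.add("an entirely different sentence about corpora");
            System.out.println(idx.mostSimilar("the quick brown fox jumps over the lazy dog"));
        }
    }

    Replacing the Dice score with cosine similarity over tf-idf weights, or
    restricting terms() to a fixed vocabulary of index terms, fits the same
    skeleton.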

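    And a sketch of the n-gram hash index Kupietz describes: every 5-word
    sequence of every normalized document is hashed, the hashes serve as
    index keys, and a pair of documents is reported as a (near) duplicate
    when its shared-shingle ratio exceeds a minimum overlap ratio. The crude
    normalization, Java's built-in string hash, and the 0.8 threshold are
    placeholders; his Perl implementation is not shown in the thread.

    // Sketch of n-gram (shingle) hashing for near-duplicate detection: hash
    // every 5-word sequence, index shingle hash -> documents, then count
    // shared shingles per pair and report pairs above a minimum overlap ratio.
    import java.util.*;

    public class ShingleDuplicates {
        static final int N = 5;                 // 5-word sequences, as in the post
        static final double MIN_OVERLAP = 0.8;  // assumed minimum overlap ratio

        static Set<Long> shingleHashes(String text) {
            String[] words = text.toLowerCase().split("\\W+");  // crude normalization
            Set<Long> hashes = new HashSet<>();
            for (int i = 0; i + N <= words.length; i++) {
                hashes.add((long) String.join(" ", Arrays.copyOfRange(words, i, i + N)).hashCode());
            }
            return hashes;
        }

        public static void main(String[] args) {
            List<String> docs = List.of(
                "one two three four five six seven eight nine ten",
                "one two three four five six seven eight nine eleven",
                "a completely different document with other words in it entirely");

            // Index: shingle hash -> documents containing that shingle.
            List<Set<Long>> perDoc = new ArrayList<>();
            Map<Long, List<Integer>> index = new HashMap<>();
            for (int d = 0; d < docs.size(); d++) {
                Set<Long> hs = shingleHashes(docs.get(d));
                perDoc.add(hs);
                for (long h : hs) index.computeIfAbsent(h, k -> new ArrayList<>()).add(d);
            }

            // Count shared shingles per document pair via the index; only pairs
            // that actually share a shingle are ever considered.
            Map<String, Integer> sharedCount = new HashMap<>();
            for (List<Integer> ds : index.values()) {
                for (int i = 0; i < ds.size(); i++)
                    for (int j = i + 1; j < ds.size(); j++)
                        sharedCount.merge(ds.get(i) + "-" + ds.get(j), 1, Integer::sum);
            }

            // Report pairs whose overlap ratio exceeds the threshold.
            sharedCount.forEach((pair, common) -> {
                String[] p = pair.split("-");
                int a = Integer.parseInt(p[0]), b = Integer.parseInt(p[1]);
                double overlap = (double) common / Math.min(perDoc.get(a).size(), perDoc.get(b).size());
                if (overlap >= MIN_OVERLAP)
                    System.out.printf("docs %d and %d are near-duplicates (overlap %.2f)%n", a, b, overlap);
            });
        }
    }

    On a million documents the pair counting dominates, which is presumably
    where the two hours Kupietz mentions are spent; the in-memory maps above
    would of course have to give way to something disk-backed at that scale.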

