Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Normand Peladeau (peladeau@simstat.com)
Date: Wed Jan 05 2005 - 14:00:35 MET

At 1/5/2005 05:59 AM, you wrote:
>Once the index construction is complete the lookup of
>(near) duplicates of a single document certainly takes almost no time.
>What actually takes 2 hours for 1.000.000 documents is the construction
>of the index and the computation of a complete similarity matrix (the
>output is certainly constrained by some minimum overlap ratio...) for
>all documents.

Sorry! I thought you meant that it took 2 hours to find documents similar
to a single one once the index was created. Indeed, creating the initial
index can take several hours. Once it has been built, computing similarities
for a given document should be pretty fast.
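
To illustrate the distinction between the two steps, here is a minimal
sketch in Python. It is not the method used by either of us: it assumes
word 3-gram "shingles", a plain in-memory inverted index, and Jaccard
overlap as the similarity measure. The names shingles, build_index,
near_duplicates and min_overlap are made up for this example.

```python
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles in a text (assumed tokenisation: whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def build_index(docs, k=3):
    """Expensive one-time step: map every shingle to the ids of the documents containing it."""
    index = defaultdict(set)
    doc_shingles = {}
    for doc_id, text in docs.items():
        sh = shingles(text, k)
        doc_shingles[doc_id] = sh
        for s in sh:
            index[s].add(doc_id)
    return index, doc_shingles


def near_duplicates(doc_id, index, doc_shingles, min_overlap=0.5):
    """Cheap step: score only the documents that share at least one shingle with doc_id."""
    query = doc_shingles[doc_id]
    shared = defaultdict(int)
    for s in query:
        for other in index[s]:
            if other != doc_id:
                shared[other] += 1
    results = []
    for other, n in shared.items():
        # Jaccard overlap: shared shingles divided by the size of the union.
        union = len(query) + len(doc_shingles[other]) - n
        ratio = n / union
        if ratio >= min_overlap:
            results.append((other, ratio))
    return sorted(results, key=lambda r: -r[1])
```

The index is built once over the whole collection; after that, a lookup
for one document touches only the posting lists of its own shingles,
which is why it takes almost no time compared with the construction.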

Normand Peladeau
Provalis Research
www.simstat.com

