Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Normand Peladeau (peladeau@simstat.com)
Date: Wed Jan 05 2005 - 14:00:35 MET


    At 1/5/2005 05:59 AM, you wrote:
    >Once the index construction is complete the lookup of
    >(near) duplicates of a single document certainly takes almost no time.
    >What actually takes 2 hours for 1.000.000 documents is the construction
    >of the index and the computation of a complete similarity matrix (the
    >output is certainly constrained by some minimum overlap ratio...) for
    >all documents.

    Sorry! I thought you meant that it took 2 hours to find documents similar
    to a single one once the index was created. Indeed, creating the initial
    index can take several hours. Once it is built, computing similarities
    should be pretty fast.
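
To illustrate the split between the slow index construction and the fast lookup discussed above, here is a minimal sketch in Python. It is not the method used by either poster: the word-shingle representation, the Jaccard overlap measure, and all names (`shingles`, `build_index`, `near_duplicates`, `min_overlap`, `k`) are assumptions chosen for illustration only.

```python
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles in text (k=3 is an illustrative choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def build_index(docs, k=3):
    """Expensive one-off step: map each shingle to the ids of the documents containing it."""
    index = defaultdict(set)
    doc_shingles = {}
    for doc_id, text in docs.items():
        sh = shingles(text, k)
        doc_shingles[doc_id] = sh
        for s in sh:
            index[s].add(doc_id)
    return index, doc_shingles


def near_duplicates(doc_id, index, doc_shingles, min_overlap=0.5):
    """Cheap step: return (other_id, ratio) pairs whose Jaccard overlap is >= min_overlap."""
    query = doc_shingles[doc_id]
    shared = defaultdict(int)
    # Only documents sharing at least one shingle with the query are ever touched.
    for s in query:
        for other in index[s]:
            if other != doc_id:
                shared[other] += 1
    results = []
    for other, n in shared.items():
        union = len(query) + len(doc_shingles[other]) - n
        ratio = n / union
        if ratio >= min_overlap:
            results.append((other, ratio))
    return sorted(results, key=lambda r: -r[1])


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely unrelated text about corpora and indexing",
}
index, doc_shingles = build_index(docs)
matches = near_duplicates("a", index, doc_shingles)
```

In this sketch the lookup cost depends on the number of shingles in the query document and the length of their posting lists, not on the size of the whole collection, which is why a single query can be fast even when building the index is not. Computing the complete similarity matrix amounts to running the lookup once per document.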

    Normand Peladeau
    Provalis Research
    www.simstat.com


