Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Shlomo Argamon (argamon@iit.edu)
Date: Wed Dec 22 2004 - 19:41:19 MET


    The people in the IIT IR lab have a recent paper on the topic:
    http://ir.iit.edu/publications/downloads/p171-chowdhury.pdf

    You might contact the authors directly to see if any software is available.

            -Shlomo-

    Mike Maxwell wrote:
    >> At 10:45 AM 12/22/2004, Ralf Steinberger wrote:
    >>
    >>> We are facing the task of having to find duplicate and near-duplicate
    >>> documents in a collection of about 1 million texts. Can anyone give
    >>> us advice on how to approach this challenge?
    >
    >
    > We thought about this a while back, when it turned out we had paid for
    > translation of several pairs of articles where the members of the pair
    > each had different filenames. We didn't implement a solution, but here
    > are some thoughts:
    >
    > Do pairs of similar papers contain basically the same number of words? I
    > would imagine they do, or you wouldn't be calling them "similar".
    >
    > I would then use file size as a heuristic, and only compare each article
    > with a few of its neighbors in size. That might reduce the complexity
    > from N*N to kN, where 'k' is some (hopefully small) constant (and
    > assuming that sorting them by size is not time-consuming, which it
    > certainly shouldn't be).
    >
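A minimal sketch of the size heuristic described above, in Python. The function name `near_duplicate_pairs`, the window size `k`, the 0.9 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration; the post does not name a comparison method. Text length stands in for file size here.

```python
from difflib import SequenceMatcher


def near_duplicate_pairs(docs, k=3, threshold=0.9):
    """Return (name_a, name_b, ratio) for likely near-duplicates.

    docs maps a document name to its text. k and threshold are
    illustrative values, not recommendations from the post.
    """
    # Sort names by text length (a stand-in for file size).
    by_size = sorted(docs, key=lambda name: len(docs[name]))
    pairs = []
    for i, a in enumerate(by_size):
        # Compare each document only with the next k documents in
        # size order, giving roughly k*N comparisons instead of N*N.
        for b in by_size[i + 1 : i + 1 + k]:
            ratio = SequenceMatcher(None, docs[a], docs[b]).ratio()
            if ratio >= threshold:
                pairs.append((a, b, ratio))
    return pairs
```

Note that `SequenceMatcher` is itself slow on long texts, so for a collection of a million documents a cheaper comparison would probably be needed; the point here is only the sort-then-compare-neighbors structure.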
    > If there is variation in the way paragraphs are indicated (e.g. whether
    > there is a blank line inserted) and inter-sentential spacing (one space
    > character vs. two, maybe), then after converting them to plain text, you
    > might find it necessary to go one stage further and convert them into
    > some kind of canonical format, such as a tokenized form. There are other
    > obvious normalizations you might want to apply, too.
    >
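One possible reading of the "canonical format" idea, sketched in Python. The function name `canonicalize` and the specific choices (lower-casing, splitting punctuation into separate tokens) are assumptions; the post only says "tokenized" and leaves the other normalizations open.

```python
import re


def canonicalize(text):
    """Reduce text to lower-case tokens joined by single spaces.

    Differences in paragraph breaks and in the number of spaces
    between sentences disappear, because all whitespace is discarded
    during tokenization.
    """
    # A token is either a run of word characters or a single
    # non-space, non-word character (punctuation).
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return " ".join(tokens)
```

Two copies of an article that differ only in blank lines or in one vs. two spaces after a full stop then produce identical strings, so exact duplicates could be found by comparing (or hashing) the canonical forms before any near-duplicate comparison is attempted.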


