Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Wed Dec 22 2004 - 19:15:34 MET

    > At 10:45 AM 12/22/2004, Ralf Steinberger wrote:
    >
    >> We are facing the task of having to find duplicate and near-duplicate
    >> documents in a collection of about 1 million texts. Can anyone give us
    >> advice on how to approach this challenge?

    We thought about this a while back, when it turned out we had paid for
    translation of several pairs of articles where the two members of each
    pair had different filenames. We didn't implement a solution, but here
    are some thoughts:

    Do pairs of similar papers contain basically the same number of words?
    I would imagine they do, or you wouldn't be calling them "similar".

    I would then use file size as a heuristic, and only compare each article
    with a few of its neighbors in size. That might reduce the complexity
    from N*N to kN, where 'k' is some (hopefully small) constant (and
    assuming that sorting the articles by size is not time-consuming, which
    it certainly shouldn't be, since that's only N log N).
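
    A minimal sketch of that size-window idea in Python (difflib's ratio
    is just a placeholder similarity test, and the file pattern, window
    size, and threshold are assumptions you'd tune for your collection):

        import difflib
        from pathlib import Path

        def near_duplicate_pairs(directory, k=5, threshold=0.9):
            # Sort once by file size (O(N log N)), then compare each
            # document only with its k nearest neighbors in size,
            # giving roughly k*N comparisons instead of N*N.
            files = sorted(Path(directory).glob("*.txt"),
                           key=lambda p: p.stat().st_size)
            texts = [f.read_text(errors="ignore") for f in files]
            pairs = []
            for i in range(len(files)):
                for j in range(i + 1, min(i + 1 + k, len(files))):
                    ratio = difflib.SequenceMatcher(
                        None, texts[i], texts[j]).ratio()
                    if ratio >= threshold:
                        pairs.append((files[i].name, files[j].name))
            return pairs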

    If there is variation in the way paragraphs are indicated (e.g. whether
    a blank line is inserted) or in inter-sentential spacing (one space
    character vs. two, maybe), then after converting the documents to plain
    text, you might find it necessary to take an additional step and convert
    them into some kind of canonical format, such as tokenized text. There
    are other obvious normalizations you might want to apply, too.
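
    For instance, a simple canonicalizer along these lines (the
    lowercasing and regex-based tokenization here are just one choice of
    normal form, not a recommendation):

        import re

        def canonicalize(text):
            # Collapse paragraph breaks, runs of spaces, tabs, etc.
            # into single spaces, so layout differences disappear.
            text = re.sub(r"\s+", " ", text.lower()).strip()
            # Crude tokenization: split punctuation off from words.
            return " ".join(re.findall(r"\w+|[^\w\s]", text))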

    -- 
    	Mike Maxwell
    	Linguistic Data Consortium
    	maxwell@ldc.upenn.edu
    


