Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Wed Dec 22 2004 - 19:15:34 MET

    > At 10:45 AM 12/22/2004, Ralf Steinberger wrote:
    >
    >> We are facing the task of having to find duplicate and near-duplicate
    >> documents in a collection of about 1 million texts. Can anyone give us
    >> advice on how to approach this challenge?

    We thought about this a while back, when it turned out we had paid for
    translation of several pairs of articles where the two members of each
    pair had different filenames. We didn't implement a solution, but here
    are some thoughts:

    Do pairs of similar papers contain basically the same number of words?
    I would imagine they do, or you wouldn't be calling them "similar".

    I would then use file size as a heuristic, and only compare each article
    with a few of its neighbors in size. That might reduce the complexity
    from N*N to kN, where 'k' is some (hopefully small) constant (and
    assuming that sorting the articles by size is not time-consuming, which
    it certainly shouldn't be, since that's only N log N).
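
    A minimal sketch of that size-window idea in Python (difflib's ratio
    is just a placeholder similarity test, and the file pattern, window
    size, and threshold are assumptions you'd tune for your collection):

        import difflib
        from pathlib import Path

        def near_duplicate_pairs(directory, k=5, threshold=0.9):
            # Sort once by file size (O(N log N)), then compare each
            # document only with its k nearest neighbors in size,
            # giving roughly k*N comparisons instead of N*N.
            files = sorted(Path(directory).glob("*.txt"),
                           key=lambda p: p.stat().st_size)
            texts = [f.read_text(errors="ignore") for f in files]
            pairs = []
            for i in range(len(files)):
                for j in range(i + 1, min(i + 1 + k, len(files))):
                    ratio = difflib.SequenceMatcher(
                        None, texts[i], texts[j]).ratio()
                    if ratio >= threshold:
                        pairs.append((files[i].name, files[j].name))
            return pairs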

    If there is variation in the way paragraphs are indicated (e.g. whether
    a blank line is inserted) or in inter-sentential spacing (one space
    character vs. two, maybe), then after converting the documents to plain
    text, you might find it necessary to take an additional step and convert
    them into some kind of canonical format, such as tokenized text. There
    are other obvious normalizations you might want to apply, too.
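
    For instance, a simple canonicalizer along these lines (the
    lowercasing and regex-based tokenization here are just one choice of
    normal form, not a recommendation):

        import re

        def canonicalize(text):
            # Collapse paragraph breaks, runs of spaces, tabs, etc.
            # into single spaces, so layout differences disappear.
            text = re.sub(r"\s+", " ", text.lower()).strip()
            # Crude tokenization: split punctuation off from words.
            return " ".join(re.findall(r"\w+|[^\w\s]", text))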

    -- 
    	Mike Maxwell
    	Linguistic Data Consortium
    	maxwell@ldc.upenn.edu
    


