Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Normand Peladeau (peladeau@simstat.com)
Date: Wed Jan 12 2005 - 20:40:00 MET


    Sorry if my suggestion is irrelevant or inadequate, but what about creating
    an inverted index of this document collection and using that index to
    retrieve the most similar documents? I just implemented such an algorithm,
    and without much effort spent on speed optimization it took less than 0.4
    seconds to compare a document against a collection of 100,000 documents
    indexed on about 3,000 index terms and retrieve the most similar ones.
    Increasing the spread of the index or the size of the document collection
    would certainly increase the computing time, but it would probably still
    take no more than a minute or two to retrieve the duplicate documents in
    your collection.
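
    For concreteness, here is a toy sketch of that kind of inverted-index
    lookup. It is written in Python rather than taken from my actual
    implementation, and the names and the simple term-frequency scoring are
    only illustrative:

        from collections import defaultdict, Counter

        def build_inverted_index(docs):
            """docs: dict doc_id -> list of index terms.
            Returns term -> {doc_id: term frequency}."""
            index = defaultdict(dict)
            for doc_id, terms in docs.items():
                for term, freq in Counter(terms).items():
                    index[term][doc_id] = freq
            return index

        def most_similar(query_terms, index, top_n=10):
            """Score only the documents that share at least one term with
            the query; touching just those postings lists is what keeps
            the lookup fast."""
            scores = Counter()
            for term, q_freq in Counter(query_terms).items():
                for doc_id, d_freq in index.get(term, {}).items():
                    scores[doc_id] += min(q_freq, d_freq)
            return scores.most_common(top_n)

    Near-duplicates then show up as the documents whose score is close to the
    query document's own term count.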

    Normand Peladeau
    Provalis Research
    www.simstat.com

    At 1/3/2005 07:37 AM, Marc Kupietz wrote:
    >Hi Ralf and William,
    >
    >we use roughly the same techniques as described by William to identify
    >near duplicates (we call them duplicates, "versions", and "variations")
    >and garbage in newspaper articles (i.e., DMS dumps from newspaper
    >publishers).
    >
    >In addition, we use an (RKR-)hash table to store and look up the positions
    >of n-grams (usually pentagrams). This makes the problem computationally
    >tractable for huge amounts of text (for the DMS dumps: typically
    >log-linear in time and linear in space).
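    >
    >For illustration only, a toy version of the idea might look like this
    >(Python rather than our Perl/C, with an ordinary dict standing in for
    >the rolling-hash table):
    >
    >    from collections import defaultdict
    >
    >    def pentagram_positions(texts, n=5):
    >        """texts: dict text_id -> list of tokens.
    >        Returns n-gram -> list of (text_id, position) occurrences."""
    >        table = defaultdict(list)
    >        for text_id, tokens in texts.items():
    >            for i in range(len(tokens) - n + 1):
    >                table[tuple(tokens[i:i + n])].append((text_id, i))
    >        return table
    >
    >Texts that share many pentagrams end up in the same buckets, so candidate
    >pairs fall out of the table directly instead of comparing every text with
    >every other one.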
    >
    >To calculate similarities (i.e., pentagram-based overlap ratios) across
    >about 1 million texts, our current Perl/C scripts take less than 2 hours
    >on a SunFire V880 (using 2 processors and ca. 10 GB of memory).
    >
    >In a second stage we verify the hash-key-based similarity matrix against
    >the actual normalized texts and compute some further measures, such as
    >the overlap of the remaining tokens, the number of coherent sequences,
    >and the number of transpositions. Depending on the texts and the size of
    >the hash table, this verification is not always strictly necessary.
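    >
    >As a rough illustration of that second stage (again only a Python toy,
    >not our actual code), the standard difflib machinery already yields an
    >overlap ratio and the matching coherent sequences:
    >
    >    from difflib import SequenceMatcher
    >
    >    def verify(tokens_a, tokens_b):
    >        """Compare two normalized token lists directly: overlap ratio
    >        plus the number of coherent (contiguous) matching sequences."""
    >        sm = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    >        blocks = [m for m in sm.get_matching_blocks() if m.size > 0]
    >        return sm.ratio(), len(blocks)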
    >
    >Based on the similarity matrix of the second stage we do some clustering
    >and classification, but that is another topic...
    >
    >If anyone is interested in the flex/Perl/C sources and wants to
    >volunteer as a (beta) tester, please mail me...
    >
    >Best regards and a happy new year!
    >Marc
    >
    >On Thu, 2004-12-23 at 11:01 +0100, William Fletcher wrote:
    > > Hi Ralf,
    > >
    > > You have already received a lot of useful advice. Let me add my own
    > > experience to the pile of suggestions.
    > >
    > > In my paper
    > > "Making the Web More Useful as a Source for Linguistic Corpora"
    > > http://kwicfinder.com/AAACL2002whf.pdf
    > > I discuss some techniques for identifying Identical Documents,
    > > Virtually Identical Documents (VIDs) and Highly Repetitive Documents
    > > (HRDs, e.g. newsgroup discussions like this). The methods are
    > > unsophisticated but effective. Using fingerprinting of entire
    > > documents with the MD5 algorithm and a binary comparison tree,
    > > normalization and comparison of 11,201 document files took only
    > > 33 seconds on a 2.4 GHz PC.
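    > >
    > > A bare-bones version of that fingerprinting step, sketched here in
    > > Python (with an ordinary dictionary doing the job of the binary
    > > comparison tree), might look like:
    > >
    > >     import hashlib
    > >     import re
    > >
    > >     def fingerprint(text):
    > >         """Normalize (lower-case, keep only alphanumerics and
    > >         spaces), then hash the result with MD5."""
    > >         normalized = re.sub(r'[^a-z0-9 ]+', ' ', text.lower())
    > >         normalized = ' '.join(normalized.split())
    > >         return hashlib.md5(normalized.encode('utf-8')).hexdigest()
    > >
    > >     def find_identical(docs):
    > >         """docs: dict doc_id -> text. Group documents whose
    > >         normalized fingerprints coincide."""
    > >         groups = {}
    > >         for doc_id, text in docs.items():
    > >             groups.setdefault(fingerprint(text), []).append(doc_id)
    > >         return [ids for ids in groups.values() if len(ids) > 1]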
    > >
    > > A single byte of difference remaining in the normalized text (I
    > > stripped all non-alphanumeric characters except spaces and converted
    > > all letters to lower case) will yield different fingerprints. N-gram
    > > patterns ("shingles") help identify both VIDs and HRDs (i.e.
    > > document-external vs. document-internal repetition). I was unaware of
    > > others' research in this area and approached the problem in a manner
    > > that is not readily scalable to huge document collections. As I
    > > recall, low values of N (3-5) sufficed for VIDs (= less processing),
    > > while a range of 8-10 was more powerful for HRDs. My colleague Hans
    > > van Halteren has found patterns of 3-grams useful for author
    > > identification and plagiarism detection.
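    > >
    > > For the shingle comparison itself, a small illustrative sketch (again
    > > Python, not the code I actually used) of the document-external case:
    > >
    > >     def shingles(tokens, n=4):
    > >         """Set of n-grams ("shingles") over a normalized token list."""
    > >         return {tuple(tokens[i:i + n])
    > >                 for i in range(len(tokens) - n + 1)}
    > >
    > >     def resemblance(tokens_a, tokens_b, n=4):
    > >         """Jaccard-style overlap of the two shingle sets; values
    > >         near 1 flag virtually identical documents."""
    > >         a, b = shingles(tokens_a, n), shingles(tokens_b, n)
    > >         if not (a | b):
    > >             return 0.0
    > >         return len(a & b) / len(a | b)
    > >
    > > For the document-internal case (HRDs), the ratio of distinct shingles
    > > to total shingles within a single document gives a comparable signal.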
    > >
    > > Subsequently I discovered an excellent survey of relevant techniques:
    > > "Mining the Web: Analysis of Hypertext and Semi-Structured Data" by
    > > Soumen Chakrabarti.
    > >
    > > It's clearly written and provides a wealth of useful information and
    > > techniques, including a description of shingling. Wish I had had this
    > > book 10 years ago! ;-)
    > >
    > > Good luck sorting it out,
    > > Bill
    >
    >--
    >Marc Kupietz Tel. (+49)621-1581-409
    >Institut für Deutsche Sprache, Dept. of Lexical Studies/Corpus Technology
    >PO Box 101621, 68016 Mannheim, Germany http://www.ids-mannheim.de/


