Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Marc Kupietz (kupietz@ids-mannheim.de)
Date: Mon Jan 03 2005 - 13:37:15 MET


    Hi Ralf and William,

    We use roughly the same techniques as those William describes to
    identify near-duplicates (we call them duplicates, "versions", and
    "variations") and garbage in newspaper articles (i.e. DMS dumps from
    newspaper publishers).

    In addition, we use an RKR hash table to store and look up the
    positions of n-grams (usually pentagrams). This makes the problem
    computationally tractable for huge amounts of text (for the DMS dumps:
    typically log-linear in time and linear in space).

    To calculate similarities (i.e. pentagram-based overlap ratios) across
    about 1 million texts, our current Perl/C scripts take less than 2
    hours on a SunFire V880 (using 2 processors and ca. 10 GB of memory).
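
    Schematically, such an overlap ratio boils down to something like the
    following (a minimal sketch only; dividing by the smaller set, as here,
    is just one of several possible choices):

      # Pentagram-based overlap ratio between two documents, given the set
      # of pentagram hash values occurring in each of them.  Normalizing by
      # the smaller set means 1.0 = one text is contained in the other; a
      # Jaccard-style union denominator would also work.
      sub overlap_ratio {
          my ($set_a, $set_b) = @_;    # hash refs: pentagram hash -> 1
          my $shared  = grep { exists $set_b->{$_} } keys %$set_a;
          my $smaller = keys(%$set_a) < keys(%$set_b)
                      ? keys(%$set_a) : keys(%$set_b);
          return $smaller ? $shared / $smaller : 0;
      }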

    In a second stage we verify the hash-key-based similarity matrix
    against the actual normalized texts and compute some additional
    information, such as the overlap of the remaining tokens, the number
    of coherent sequences, and the number of transpositions. Depending on
    the texts and the size of the hash table, this verification is not
    strictly necessary.
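
    As a rough illustration of one such check (coherent sequences and
    transpositions require a proper alignment, which I leave out here):

      # Rough second-stage check: the share of tokens of the shorter
      # normalized text that also occur in the other one (multiset overlap).
      sub token_overlap {
          my ($text_a, $text_b) = @_;
          my @a = split ' ', lc $text_a;
          my @b = split ' ', lc $text_b;
          my %count;
          $count{$_}++ for @a;
          my $shared = 0;
          for my $tok (@b) {
              if ($count{$tok}) { $shared++; $count{$tok}--; }
          }
          my $min = @a < @b ? scalar @a : scalar @b;
          return $min ? $shared / $min : 0;
      }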

    Based on the similarity matrix of the second stage we do some clustering
    and classification, but that is another topic...

    If anyone is interested in the flex/Perl/C sources and wants to
    volunteer as a (beta) tester, please mail me...

    Best regards and a happy new year!
    Marc
     
    On Thu, 2004-12-23 at 11:01 +0100, William Fletcher wrote:
    > Hi Ralf,
    >
    > You have already received a lot of useful advice. Let me add my own
    > experience to the pile of suggestions.
    >
    > In my paper
    > "Making the Web More Useful as a Source for Linguistic Corpora"
    > http://kwicfinder.com/AAACL2002whf.pdf
    > I discuss some techniques for identifying Identical Documents,
    > Virtually Identical Documents (VIDs) and Highly Repetitive Documents
    > (HRDs, e.g. newsgroup discussions like this). The methods are
    > unsophisticated but effective. Using fingerprinting of entire
    > documents with the MD5 algorithm and a binary comparison tree,
    > normalization and comparison of 11,201 document files took only 33
    > seconds on a 2.4 GHz PC.
    >
    > A single byte difference remaining in the normalized text (I
    > stripped all non-alphanumeric chars except spaces and converted all
    > letters to lower case) will yield different fingerprints. N-gram
    > patterns ("shingles") help identify both VIDs and HRDs (i.e.
    > document-external vs. document-internal repetition). I was unaware
    > of others' research in this area and approached the problem in a
    > manner that is not readily scalable to huge document collections.
    > As I recall, low values of N (3-5) sufficed for VIDs (= less
    > processing), while a range of 8-10 was more powerful for HRDs. My
    > colleague Hans van Halteren has found patterns of 3-grams useful
    > for author identification and plagiarism detection.
    >
    > Subsequently I discovered an excellent survey of relevant techniques:
    > "Mining the Web: Analysis of Hypertext and Semi Structured Data"
    > by Soumen Chakrabarti
    >
    > It's clearly written and provides a wealth of useful information
    > and techniques, including a description of shingling. Wish I had
    > had this book 10 years ago ;-) !
    >
    > Good luck sorting it out,
    > Bill
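
    (For anyone who hasn't tried it: the whole-document fingerprinting
    Bill describes above fits in a few lines of Perl. This is only a
    sketch; the normalization is an approximation of his, and a plain
    Perl hash stands in for the binary comparison tree.)

      #!/usr/bin/perl
      # Whole-document fingerprinting: normalize each file, take an MD5
      # digest of the result, and report files whose digest was already
      # seen.  A hash lookup stands in for the explicit comparison tree.
      use strict;
      use warnings;
      use Digest::MD5 qw(md5_hex);

      sub fingerprint {
          my ($text) = @_;
          my $norm = lc $text;
          $norm =~ s/[^a-z0-9]+/ /g;   # keep letters/digits, squeeze the rest
          return md5_hex($norm);
      }

      my %seen;                        # fingerprint -> first file that had it
      for my $file (@ARGV) {
          open my $fh, '<', $file or next;    # skip unreadable files
          my $text = do { local $/; <$fh> };  # slurp the whole file
          my $fp   = fingerprint($text);
          if   (exists $seen{$fp}) { print "$file duplicates $seen{$fp}\n" }
          else                     { $seen{$fp} = $file }
      }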

    -- 
    Marc Kupietz                                       Tel. (+49)621-1581-409
    Institut für Deutsche Sprache, Dept. of Lexical Studies/Corpus Technology
    PO Box 101621, 68016 Mannheim, Germany        http://www.ids-mannheim.de/
    


