Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: William Fletcher (fletcher@usna.edu)
Date: Thu Dec 23 2004 - 11:01:17 MET

  • Next message: William Fletcher: "Shingling / Chakrabati Re: [Corpora-List] Q: How to identify duplicates in a large..."

    Hi Ralf,

    You have already received a lot of useful advice. Let me add my own
    experience to the pile of suggestions.

    In my paper
    "Making the Web More Useful as a Source for Linguistic Corpora"
    http://kwicfinder.com/AAACL2002whf.pdf
    I discuss some techniques for identifying Identical Documents, Virtually
    Identical Documents (VIDs) and Highly Repetitive Documents (HRDs, e.g.
    newsgroup discussions like this). The methods are unsophisticated but
    effective. Using fingerprinting of entire documents with the MD5
    algorithm and a binary comparison tree, normalization and comparison of
    11,201 document files took only 33 seconds on a 2.4 GB PC.

    A single byte difference remaining in normalized (I stripped all
    non-alphanumeric chars except spaces and converted all letters to lower
    case) text will yield different fingerprints. N-gram patterns
    ("shingles") help identify both VIDs and HRDs (i.e. document-external
    vs. document-external repetition). I was unaware of others' research in
    this area and approached the problem in a manner that is not readily
    scalable to huge document collections. As I recall, low values of N
    (3-5) sufficed for VIDs (= less processing), while a range of 8-10 was
    more powerful for HRDs. My colleague Hans van Halteren has found
    patterns of 3-grams useful for author identification and plagiarism
    detection.

    Subsequently I discovered an excellent survey of relevant techniques:
    "Mining the Web: Analysis of Hypertext and Semi Structured Data" by
    Soumen Chakrabarti

    It's clearly written and provides a wealth of useful information and
    techniques, including a description of shingling. Wish I had had this
    book 10 years ago ;-) !

    Good luck sorting it out,
    Bill

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Sending an attachment? See below.
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    AssocProf William H. Fletcher
    Language Studies Department
    United States Naval Academy

    On sabbatical AcYear 2004-2005 with the
    Language and Speech research group at the
    Radboud University of Nijmegen (NL).

    Address
    Groenestraat 105
    6679EC Oosterhout gem. Nijmegen
    Netherlands

    Telephone
    +31-24-3430960 [home]
    +31-24-3615521 [office]
    +31-24-3612907 [fax]
    [6 hours later than US Eastern Time]
     
    - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Department
       http://usna.edu/LangStudy/
    Phrases in English
       http://pie.usna.edu/
    KWiCFinder
       http://kwicfinder.com/
    - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Our mail server deletes messages with
      certain kinds of attachments without
      notifying the sender or recipient.

      If sending a .doc, .exe or .zip file, please
      rename it to delete the extension before
      sending and let me know in the body
      of the message what kind of file it is.
    >>> "Ralf Steinberger" <ralf.steinberger@jrc.it> 12/22/04 11:45 AM >>>
    We are facing the task of having to find duplicate and near-duplicate
    documents in a collection of about 1 million texts. Can anyone give us
    advice on how to approach this challenge?
     
    The documents are in various formats (html, PDF, MS-Word, plain text,
    ...)
    so that we intend to first convert them to plain text. It is possible
    that
    the same text is present in the document collection in different
    formats.
     
    For smaller collections, we identify (near)-duplicates by applying
    hierarchical clustering techniques, but with this approach, we are
    limited
    to a few thousand documents.
     
    Any pointers are welcome. Thank you.
     
    Ralf Steinberger
    European Commission - Joint Research Centre
    http://www.jrc.it/langtech
     



    This archive was generated by hypermail 2b29 : Thu Dec 23 2004 - 11:10:56 MET