Re: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Scott Sadowsky (lists@spanishtranslator.org)
Date: Thu Dec 23 2004 - 09:01:22 MET


    On 12/22/2004 12:45 PM, Ralf Steinberger wrote the following:

    >We are facing the task of having to find duplicate and near-duplicate
    >documents in a collection of about 1 million texts. Can anyone give us
    >advice on how to approach this challenge?

    I was facing the same problem a couple years ago, with a corpus of just
    about the same size. The closest off-the-shelf solution I found, a program
    called ABC-View, wasn't ideal because it was designed for multimedia files
    and not text. But on a lark I contacted the developer, Nils Haeck, and
    explained the problem to him.

    After asking me a series of questions about what I needed, he sent me a
    new, custom-built plug-in for his program that implemented a fuzzy text
    comparison algorithm with user-configurable parameters, which he continued
    to refine according to my specifications.
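
    For readers unfamiliar with fuzzy text comparison: the internals of the
    plug-in are not described in this message, so the following is only a
    generic sketch of one common approach (word shingles compared by Jaccard
    similarity), not the plug-in's actual algorithm. The function names and
    the parameters k and threshold are illustrative assumptions, standing in
    for the kind of "user-configurable parameters" mentioned above.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (case and punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, k=3, threshold=0.8):
    """True if the shingle sets of the two documents overlap by at least threshold.

    k (shingle length) and threshold are hypothetical tuning parameters.
    """
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold
```

    Comparing every pair this way is quadratic in the number of documents, so
    for a collection of a million texts one would normally add a candidate
    selection step (for example hashing the shingles and comparing only
    documents that share hashes) before applying a pairwise test like this.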

    I have been using this plug-in ever since, and have eliminated several
    hundred thousand duplicate files --both plain text and HTML-- from a corpus
    that now has about 1.3 million documents. And amazingly, it can process the
    entire collection in around a day on a clunky dual PIII 500MHz with 512 MB
    of RAM.

    Besides being a top-notch programmer, Nils is also an extremely altruistic
    soul -- he not only created the plug-in for me without even mentioning
    compensation, but he also gave me a free copy of the program that runs it,
    as I need it for academic purposes. I suggest that anyone who has need of
    such a tool contact him at <n.haeck@simdesign.nl>.

    Cheers,
    Scott

    __________________________________________________________________
    Scott Sadowsky · sadowsky@spanishtranslator.org
    http://www.spanishtranslator.org
    __________________________________________________________________
    "Happiness is a signal that our brains use to motivate us to do certain
    things. And in the same way that our eye adapts to different levels of
    illumination, we're designed to kind of go back to the happiness set point.
    Our brains are not trying to be happy. Our brains are trying to regulate us".
      -- George Loewenstein


