RE: [Corpora-List] Q: How to identify duplicates in a large document collection

From: Adam Kilgarriff (adam@lexmasterclass.com)
Date: Thu Dec 23 2004 - 07:44:14 MET

    We recently encountered the problem with the LDC’s English Gigaword corpus:
    many of the stories in this newswire corpus occur repeatedly, with changing
    datelines, often in updated and revised forms. We have also hit the
    question when producing corpora for dictionary-making from the web.

     

    A crucial question in these situations is: what are the objects which might
    be considered duplicates? If two stories share two paragraphs, but each
    has two further paragraphs that are not shared, it is not obvious what
    should be done. Our solution (working with Infogistics Ltd, from Edinburgh)
    heuristically identified ‘paragraphs’ and treated them as the objects which
    might be duplicates. It also looked at successions of paragraphs because,
    firstly, identical short paragraphs may have been produced independently on
    two or more occasions, and secondly, stripping out paragraphs destroys the
    integrity of the text, so we did not want to do it lightly.
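    The paragraph-level approach above can be sketched roughly as follows. This
    is only an illustration, assuming plain-text input with blank-line paragraph
    breaks; the function names and the min_run threshold are my own, not
    Infogistics' actual implementation:

```python
import hashlib

def paragraphs(text):
    # Heuristic paragraph split on blank lines; a production system
    # would use richer layout cues to identify 'paragraphs'.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def fingerprint(para):
    # Normalise case and whitespace so trivially edited copies still match.
    return hashlib.md5(" ".join(para.lower().split()).encode()).hexdigest()

def duplicate_runs(doc, seen, min_run=2):
    """Return (start, end) index pairs of runs of at least min_run
    consecutive paragraphs already seen elsewhere.  Runs matter because
    a single short paragraph may recur independently by chance."""
    paras = paragraphs(doc)
    hits = [fingerprint(p) in seen for p in paras]
    runs, start = [], None
    for i, hit in enumerate(hits + [False]):  # sentinel flushes the final run
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    seen.update(fingerprint(p) for p in paras)
    return runs
```

    Flagged runs could then be inspected or stripped; as noted above, stripping
    paragraphs damages the integrity of the text, so one would not want to do
    it automatically.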

     

    I think one set of papers mentioned in earlier responses to the query, which
    used document similarity, won’t help in our scenario, but another, which
    looks for longest common substrings (see Alexander Clark’s mail), will.
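    The longest-common-substring idea can be sketched with a standard
    dynamic-programming routine (illustrative only; at corpus scale the work
    referred to would use suffix trees or suffix arrays instead):

```python
def longest_common_substring(a, b):
    # O(len(a) * len(b)) time, O(len(b)) space; finds the longest
    # contiguous stretch of characters shared by strings a and b.
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return a[best_end - best:best_end]
```

    A long shared substring relative to document length suggests duplication; a
    short one is more likely a shared formula, cliché or quotation of the kind
    discussed below.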

     

    The interesting theoretical question lurking here is: when does a
    common expression (essential subject matter for corpus linguistics) turn
    into duplication (which is not wanted)? Repetition of the former kind is
    the fabric of language. If I speak in formulae and clichés, as so many of
    us do so much of the time, it is likely that my speaker turns will exactly
    match others’. Quotations are another intermediate case – if someone quotes
    half a sentence from a text that is also in the corpus, you want to leave it
    in. If it is a couple of sentences – maybe. If it is a couple of
    paragraphs or more you may well want to throw it out as duplication. My
    suspicion is, it will always depend on what you want to do with the corpus.

     

    Adam Kilgarriff

    Lexical Computing Ltd

     

     

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Ralf Steinberger
    Sent: 22 December 2004 16:46
    To: List Corpora (Corpora list)
    Subject: [Corpora-List] Q: How to identify duplicates in a large document
    collection

     

    We are facing the task of having to find duplicate and near-duplicate
    documents in a collection of about 1 million texts. Can anyone give us
    advice on how to approach this challenge?

     

    The documents are in various formats (html, PDF, MS-Word, plain text, ...),
    so we intend to convert them to plain text first. It is possible that
    the same text is present in the document collection in different formats.

     

    For smaller collections, we identify (near)-duplicates by applying
    hierarchical clustering techniques, but with this approach, we are limited
    to a few thousand documents.

     

    Any pointers are welcome. Thank you.

     

    Ralf Steinberger

    European Commission - Joint Research Centre

    http://www.jrc.it/langtech

     



    This archive was generated by hypermail 2b29 : Thu Dec 23 2004 - 07:40:17 MET