Corpora: Relative text length ...

From: Paul Clough
Date: Wed May 01 2002 - 13:05:10 MET DST


    Dear all,

    I was interested in the discussion regarding relative text length
    and wondered whether this article about text compression
    was related in any way:,1282,50192,00.html

    "In the Jan. 28 issue of the journal Physical Review Letters, three Italian
    scientists used the Unix compression program gzip on text files to address
    such pattern-matching issues as language of composition and authorship."

    "Since data compression entails recognizing and tagging repeated strings,
    the more repeated internal patterns that a file or collection of files has,
    the more it can be compressed. Thus, if one wants to know the language in which
    file X was written, just compress it with files whose language is known and
    compare how efficiently each operation is carried out."

    "If, by comparing raw and compressed file sizes, one finds that X plus an
    Italian text file zips tighter than X plus a French text or X plus an
    English text or X plus one's other linguistic reference texts, then
    congratulazioni! You've likely just found the language of X without even
    opening it."

    "The scientists -- Dario Benedetto, Emanuele Caglioti and Vittorio Loreto of
    La Sapienza University -- used this technique to discern the language of
    mystery texts as small as 20 characters. Furthermore, using a database of
    90 texts from 11 authors, they found their method could even pick out
    individual authors with a success rate of 93 percent."

    It might be worth trying whether a simple technique like this could work
    (at byte level)?
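
    The comparison described in the quoted article could be sketched roughly
    as below. This is only a minimal illustration, not the authors' actual
    procedure: it assumes Python's zlib module (DEFLATE, the algorithm gzip
    uses) as the compressor, and the names compressed_size, extra_cost and
    guess_language are made up for this example. Texts are compared as raw
    UTF-8 bytes.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of data after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_cost(reference: bytes, unknown: bytes) -> int:
    """Extra compressed bytes needed when unknown is appended to reference.

    A small value suggests the compressor could reuse patterns it had
    already seen in the reference text.
    """
    return compressed_size(reference + unknown) - compressed_size(reference)


def guess_language(unknown: str, references: Mapping[str, str]) -> str:
    """Return the label of the reference text that 'zips tightest' with unknown.

    references maps a hypothetical label (e.g. "english") to a sample text.
    """
    x = unknown.encode("utf-8")
    return min(
        references,
        key=lambda label: extra_cost(references[label].encode("utf-8"), x),
    )
```

    One would call guess_language(mystery_text, {"english": ..., "italian": ...})
    with reference samples of one's own. Note that DEFLATE only looks back
    over a 32 KB window, so with this compressor the reference samples
    should stay shorter than that, otherwise the start of the reference is
    no longer available when the unknown text is compressed.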


    Paul Clough

    Natural Language Processing Group,
    Department of Computer Science,
    University of Sheffield,
    G35 Regent Court,
    211 Portobello Street,
    S1 4DP.
