Re: Corpora: Relative text length ...

From: Jean Veronis (
Date: Wed May 01 2002 - 18:14:36 MET DST


    The original article is at

    It has been known for a long time that language is anything but random.
    Take any two texts or corpora and you will find huge deviations in
    frequencies from what would be expected if words (or letters or any unit)
    were drawn at random.

    There is therefore no surprise in the "discovery" that zippers, which
    encode frequent sequences in few bytes and spend more bytes only on rare
    sequences, achieve different compression rates on different texts, and
    that this fact can be used as a (rough) measure of distance between texts.
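
    To make the idea concrete, here is a minimal sketch in Python. It is an
    illustration only: it assumes zlib as the "zipper" and uses the
    normalized compression distance as the distance formula, which is not
    necessarily the procedure followed in the article under discussion. The
    function names are made up for this example.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs for the UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def compression_distance(a: str, b: str) -> float:
    """Normalized compression distance between two texts (rough measure).

    If `b` shares many sequences with `a`, compressing the concatenation
    costs little more than compressing one text alone, so the value is
    small. Unrelated texts give a value closer to 1.
    """
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)
```

    With texts of reasonable length, one would expect two samples of the
    same language to come out closer to each other than samples of
    different languages, but the measure is crude and sensitive to text
    length and to the compressor's window size.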

    What is really surprising is not so much that some scientists reinvent
    the wheel (badly), but that so much publicity is given to these
    rediscoveries (I have seen this "discovery" announced on several
    lists, letters, web sites, etc.) and that such a prestigious journal
    (Physical Review Letters) would publish them.

    And why in a physics journal, of all places? Will the next issue of
    Computational Linguistics include our latest papers on Positron Annihilation
    in Molecules or Magnetic-Field Generation in Plasmas? I suppose that we
    would say stupid things.

