Re: Corpora: Relative text length ...

From: Jean Veronis (Jean.Veronis@newsup.univ-mrs.fr)
Date: Wed May 01 2002 - 18:14:36 MET DST

Next message: Priscilla Rasmussen: "Corpora: Final Reminder: Summer School in Human Language Technologies (May 10 Deadline)"

Previous message: Alexander Clark: "Re: Corpora: Relative text length ..."
In reply to: Alexander Clark: "Re: Corpora: Relative text length ..."

The original article is at

http://ojps.aip.org/journal_cgi/dbt?KEY=PRLTAO&Volume=88&Issue=4&jsessionid=1026771020268694431

It has been known for quite a long time that language is anything but
random. Take any two texts or corpora and you will find huge deviations in
frequencies from what would be expected if words (or letters or any other
unit) were drawn at random.

There is therefore no surprise in the "discovery" that zippers, which encode
frequent sequences with few bytes and spend more bytes only on rare
sequences, will have different compression rates on different texts, and
that this fact could be used as a (rough) measure of distance among texts.
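
To make the idea concrete, here is a minimal sketch in Python, assuming
zlib as the zipper. The function names (compressed_size, zip_distance) are
made up for illustration, and this is not claimed to be the procedure used
in the article: it simply measures how many extra compressed bytes a sample
costs once the zipper has already seen a reference text.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Number of bytes zlib needs to encode `data` at its highest level."""
    return len(zlib.compress(data, 9))


def zip_distance(reference: str, sample: str) -> float:
    """Rough distance: extra compressed bytes per byte of `sample`
    when `sample` is appended to `reference`.

    Illustrative assumption, not the article's exact measure. A sample
    that reuses sequences from the reference is encoded mostly as short
    back-references, so it costs few extra bytes; an unrelated sample
    costs more.
    """
    ref_bytes = reference.encode("utf-8")
    sample_bytes = sample.encode("utf-8")
    if not sample_bytes:
        raise ValueError("sample must not be empty")
    extra = compressed_size(ref_bytes + sample_bytes) - compressed_size(ref_bytes)
    return extra / len(sample_bytes)


if __name__ == "__main__":
    english = "the quick brown fox jumps over the lazy dog. " * 20
    similar = "the lazy dog jumps over the quick brown fox. "
    different = "Portez ce vieux whisky au juge blond qui fume. "
    print("similar sample:  ", zip_distance(english, similar))
    print("different sample:", zip_distance(english, different))
```

The sample should be kept short relative to the reference, and the
reference itself should fit in zlib's 32 KB window, otherwise the zipper
can no longer point back at the reference text.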

What is really surprising, actually, is not so much that some scientists
reinvent the wheel (badly), but that so much publicity is given to these
rediscoveries (I have seen this one announced on several lists,
newsletters, web sites, etc.) and that such a prestigious journal
(Physical Review Letters) could publish them.

And why in a physics journal, of all places? Will the next issue of
Computational Linguistics include our latest papers on Positron
Annihilation in Molecules or Magnetic-Field Generation in Plasmas? I
suppose that we would say stupid things.

--jv


This archive was generated by hypermail 2b29 : Wed May 01 2002 - 18:35:47 MET DST