RE: Corpora: Number of distinct words

From: Alexander Gelbukh (gelbukh@cic.ipn.mx)
Date: Sat Oct 27 2001 - 02:16:41 MET DST

    Dear colleagues,

    The following paper may be relevant:

    http://www.cic.ipn.mx/~gelbukh/CV/Publications/2001/CICLing-2001-Zipf.htm

    Thank you!
    Alexander

    > -----Original Message-----
    > From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no]
    > On Behalf Of Granger Sylviane
    > Sent: Thursday, October 25, 2001 1:17 AM
    > To: CORPORA@HD.UIB.NO
    > Subject: Corpora: Number of distinct words
    >
    >
    > Dear list members,
    >
    > Could anyone help me answer the following message which I've just
    > received from a colleague of mine in the Computer Science Department?
    >
    > Many thanks.
    >
    > Have a good day!
    > Sylviane Granger
    >
    > >For about 1.5 years, a colleague and I have been writing a textbook
    > >on computer programming. I have kept numerous drafts of the book
    > >during this period. Today I was curious to see how these drafts
    > >evolved. I graphed the number of distinct 'words' (character
    > >sequences delimited by noncharacters) as a function of file size.
    > >I found that a good fit is given by the square root function:
    > >
    > > (number of distinct words) = 6 * sqrt(file size)
    > >
    > >Is this an example of a general law? I.e., if the text just repeated
    > >the same thing over and over, the exponent would be zero. If the text
    > >were a long catalogue of facts, the exponent would be one. The
    > >exponent is exactly halfway in between. Is it because of the
    > >structure of the book (the effort to make it coherent)? I don't know.
    > >Any comments or reactions welcome!
    > >
    > >I know of 'Zipf's Law': word frequency is (supposedly) inversely
    > >proportional to the word's rank (1st, 2nd, 3rd most frequent, etc.).
    > >Is the square root a consequence of Zipf's Law? Or is there more
    > >going on?
    > >
    > >Peter Van Roy
    >
    >
    > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    > Professor Sylviane Granger
    > Université Catholique de Louvain
    > Centre for English Corpus Linguistics
    > Collège Erasme
    > Place Blaise Pascal 1
    > B-1348 Louvain-la-Neuve
    > Belgium
    > Fax: + 3210474942
    > http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html
    >
    >
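
The measurement described in the quoted message (distinct words as a function of file size, fitted by a power law) could be reproduced roughly as follows. This is only a minimal sketch under stated assumptions: a 'word' is taken to be a run of ASCII letters, case is left as-is, file size is counted in characters, and the function names distinct_word_growth and fit_power_law are hypothetical. The exponent is estimated by ordinary least squares on log-log data, which may differ from whatever fitting method was actually used for the textbook drafts.

```python
import math
import re


def distinct_word_growth(text, step=1000):
    """Return (size_in_chars, distinct_words_so_far) pairs along the text.

    Assumption: a 'word' is a run of ASCII letters; case is not folded.
    """
    seen = set()
    points = []
    next_mark = step
    for match in re.finditer(r"[A-Za-z]+", text):
        seen.add(match.group())
        if match.end() >= next_mark:
            # Record one sample each time the text crosses a size mark.
            points.append((match.end(), len(seen)))
            next_mark = (match.end() // step + 1) * step
    return points


def fit_power_law(points):
    """Fit V = k * N**beta by least squares on (log N, log V).

    Returns (k, beta). Needs at least two points with distinct sizes.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    m = len(points)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

Calling fit_power_law(distinct_word_growth(text)) on a single long draft gives an estimate of the coefficient and exponent for that draft. An exponent near 0.5 with a coefficient near 6 would correspond to the square root fit reported above. Note that the original experiment compared separate drafts of different sizes, whereas this sketch samples growth within one text, so the two need not agree.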


