Re: Corpora: Number of distinct words

From: COMP staff (csrluk@comp.polyu.edu.hk)
Date: Mon Oct 29 2001 - 02:42:40 MET

  • Next message: Jing-Shin Chang: "Corpora: [NLPRS-01]: 3rd Call For Participation"

    > > Dear list members,

    There is a paper in the journal Quantitative Linguistics which
    looks at the distribution of distinct word lengths in a number of
    Indo-European languages. Sorry, I don't remember the
    exact volume and number (but it's near the initial ones).
    You can then relate the word-length
    distribution to the file size as:

    File size F = SUM_k [ #(k) * (k+1) ]        (1)
               ~ (mean word length) * N         (1.1)

    where #(k) is the number of distinct words of length k (the +1
    counts a delimiter after each word) and N is the total number of
    distinct words.

    If the given relation

    N = 6 sqrt(F)  =>  F = N^2 / 36

    is substituted into Eq 1.1, then

    mean length of a distinct word = F / N = N / 36

    which does not sound right.
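The substitution above can be checked numerically (a quick sketch, not part of the original argument; the function name is mine):

```python
# Take F = N^2 / 36 from the proposed fit, divide by N as in Eq 1.1,
# and confirm the implied mean word length equals N / 36.
def implied_mean_word_length(n_distinct: int) -> float:
    """Mean word length implied by F = N^2/36 together with F ~ mean * N."""
    file_size = n_distinct ** 2 / 36   # F = N^2 / 36
    return file_size / n_distinct      # mean length = F / N

for n in (600, 6_000, 60_000):
    # Mean word length grows linearly with vocabulary size -- the
    # implausible consequence noted above.
    print(n, implied_mean_word_length(n))
```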

    Suppose we have 6,000 distinct words (i.e. N = 6,000);
    then

    F = 36,000,000 / 36 = 1 million bytes.

    This sounds too big compared with the file sizes of word
    lists that I know of. The average word length in English is around
    8, so the list should be about 8 * 6,000 ~ 48k bytes. Maybe I am
    missing something.
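The mismatch is easy to see on a degenerate text. A minimal sketch (the sample string is illustrative, not from the original book drafts): for a text that just repeats the same few words, N stays constant while 6*sqrt(F) keeps growing, so the fit cannot hold for word lists or highly repetitive files.

```python
import math
import re

def distinct_words(text: str) -> int:
    """Count distinct 'words': maximal runs of letters, case-folded."""
    return len(set(re.findall(r"[a-z]+", text.lower())))

# Illustrative repetitive text; a real check would use actual drafts.
sample = "the quick brown fox jumps over the lazy dog " * 100

f = len(sample)             # file size in bytes (ASCII assumed)
n = distinct_words(sample)  # N stays at 8 no matter how long F grows
print(f"F = {f}, N = {n}, 6*sqrt(F) = {6 * math.sqrt(f):.0f}")
```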

    Best,

    Robert Luk

    > > Could anyone help me answer the following message which I've just received
    > > from a colleague of mine in the Computer Science Department?
    > >
    > > Many thanks.
    > >
    > > Have a good day!
    > > Sylviane Granger
    > >
    > > >For about 1.5 years now, a colleague and I have been writing a textbook
    > > >on computer programming. I have kept numerous drafts of the book during
    > > >this period. Today I was curious to see how these drafts evolved. I
    > > >graphed the number of distinct 'words' (character sequences delimited
    > > >by non-word characters) as a function of file size. I found that a good fit
    > > >is given by the square root function:
    > > >
    > > > (number of distinct words) = 6 * sqrt(file size)
    > > >
    > > >Is this an example of a general law? I.e., if the text just repeated
    > > >the same words over and over, the exponent would be zero. If the text
    > > >were a long catalogue of facts, the exponent would be one. The exponent
    > > >is exactly halfway in between. Is it because of the structure of the
    > > >book (the effort to make it coherent)? I don't know. Any comments or
    > > >reactions welcome!
    > > >
    > > >I know of 'Zipf's Law' : word frequency is (supposedly) inversely
    > > >proportional to the word's rank (1st, 2nd, 3rd most frequent, etc.).
    > > >Is the square root a consequence of Zipf's Law? Or is there more going
    > > >on?
    > > >
    > > >Peter Van Roy
    > >
    > >
    > > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    > > Professor Sylviane Granger
    > > Université Catholique de Louvain
    > > Centre for English Corpus Linguistics
    > > Collège Erasme
    > > Place Blaise Pascal 1
    > > B-1348 Louvain-la-Neuve
    > > Belgium
    > > Fax: + 3210474942
    > > http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html
    > >
    > >
    > A strict application of Zipf's Law implies that the number of
    > distinct words is proportional to the log of the file size.
    > My impression is that this is what happens if you take novels.
    > Technical books may behave in a different way.
    > Best regards
    >
    > Giorgio
    > -------------------------------------------------------------------------
    > Dipartimento di Fisica Fax +39-06-4463158
    > Universita' di Roma "La Sapienza" giorgio.parisi@roma1.infn.it
    > P.le A. Moro 2 Tel +39-06-49913481
    > Roma, Italy, I-00185 http://chimera.roma1.infn.it/GIORGIO/giorgio.html
    > ------------------------------------------------------------------------
    >
    >
    >



    This archive was generated by hypermail 2b29 : Mon Oct 29 2001 - 03:09:56 MET