Re: Corpora: Number of distinct words

From: COMP staff (csrluk@comp.polyu.edu.hk)
Date: Mon Oct 29 2001 - 11:13:16 MET


    Hi Peter,

    > Thank you for your analysis. I have just a few remarks.
    >
    > >You can then relate the word length
    > >distribution with the file size as:
    > >
    > >File Size = SUM_k [#(k) * (k+1)] = F        (1)
    > >          ~ mean word length * N            (1.1)
    > >
    > >where #(.) is the number of times the argument has appeared
    > >and N is the total number of distinct words.
    > >
    > >If the given relation:
    > >
    > >N = 6 sqrt(F) => N^2 / 36 = F
    >
    > This relation holds between different drafts of the
    > same file (study of a text during its composition).
    > Another particularity is that the text measured is
    > a textbook, which likely has a structure very
    > different from a novel. Does your formula take these
    > two considerations into account?

    Sorry for the misunderstanding. What follows
    should still be relevant, though.

    > FYI, the text has 8084 distinct words for a file size
    > of 1835191 characters.

    For naturally occurring text, Heaps' law has the following
    form:

    N = A F^B

    where N and F are as defined above, B is
    between 0 and 1, and A is another constant. I am not
    sure whether A has to lie between 0 and 1 or can be larger.
    If A can be larger than 1, then I guess what you have is basically
    Heaps' law with A = 6 and B = 1/2.
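
    As a rough check, here is a minimal Python sketch that plugs the
    figures quoted above (N = 8084, F = 1835191) into N = A F^B. It
    assumes B = 0.5, as in the relation N = 6 sqrt(F). The function
    names are hypothetical and only for illustration. A single data
    point cannot determine both A and B, so this only shows that the
    numbers are consistent with that form, not that the law holds.

```python
def heaps_vocabulary(file_size, a, b):
    """Predicted number of distinct words, N = A * F**B."""
    return a * file_size ** b


def solve_a(distinct_words, file_size, b):
    """Value of A for which N = A * F**B passes through one data point."""
    return distinct_words / file_size ** b


# Figures quoted in this thread.
F = 1835191  # file size in characters
N = 8084     # number of distinct words

# Assumption: B = 0.5, taken from the relation N = 6 * sqrt(F).
predicted = heaps_vocabulary(F, a=6.0, b=0.5)
implied_a = solve_a(N, F, b=0.5)

print(f"N predicted by 6 * sqrt(F): {predicted:.0f}")
print(f"A implied by the data with B = 0.5: {implied_a:.3f}")
```

    With these inputs the prediction comes out near 8128 against the
    observed 8084, and the implied A is about 5.97, which is larger
    than 1.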

    Best,

    Robert Luk

    > Peter
    >
    > --
    > Peter Van Roy
    > Département d'Ingénierie Informatique
    > (Department of Computing Science and Engineering)
    > Université catholique de Louvain
    > B-1348 Louvain-la-Neuve, Belgium
    >
    > Email: pvr@info.ucl.ac.be
    > Tel: (+32) (10) 47.83.74
    > Web: http://www.info.ucl.ac.be/people/cvvanroy.html
    > Mozart: http://www.mozart-oz.org
    >
    >



    This archive was generated by hypermail 2b29 : Mon Oct 29 2001 - 11:30:31 MET