Re: [Corpora-List] entropy of text

From: J R Elliott (jre@comp.leeds.ac.uk)
Date: Wed Feb 19 2003 - 18:51:51 MET


    Dinoj,

    I have now published a number of papers that include the entropy values and
    characteristics of many languages, representing most of the world's
    language families.

    I assume you are referring to the binomial equation for the maximum
    sample size appropriate for a given entropic order.

    My web site lists recent publications; the most recent relevant paper,
    which includes this excerpt, is:
    Elliott, John. "Detecting Languageness." In: Proceedings of the 6th World
    Multi-Conference on Systemics, Cybernetics and Informatics (SCI 2002),
    vol. IX, pp. 323-328. Orlando, Florida, USA, 2002.

    To calculate the sample size required for a given entropic order, use
    the binomial coefficient: N(r) = n! / (r! (n - r)!),
    where n is the number of symbols or patterns
    and r is the entropic order.
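    As a minimal sketch of the formula above (using Python's standard-library
    math.comb; the loop range and the printed table are illustrative, not from
    the original message):

    ```python
    from math import comb

    # Number of distinct symbols or patterns (Dinoj's example: C = 40
    # character types).
    n = 40

    # N(r) = n! / (r! (n - r)!), the binomial coefficient: the number of
    # distinct r-element combinations of n symbols, taken here as a guide
    # to the sample size needed to support entropic order r.
    for r in range(1, 6):
        print(f"order r = {r}: N(r) = {comb(n, r)}")
    ```

    By this criterion, with Dinoj's figures (n = 40, N = 80 000 tokens),
    N(r) first exceeds the corpus size at r = 4 (N(4) = 91 390), which would
    suggest order 3 as the practical limit for that sample.
    
    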

    Hope this helps,

    John
    *********************************************************
    John Elliott
    Centre for Computer Analysis of Language and Speech
    University of Leeds. http://www.comp.leeds.ac.uk/jre/
    and Computational Intelligence Group, School of Computing
    Leeds Metropolitan University
    email: jre@comp.leeds.ac.uk or J.Elliott@lmu.ac.uk
    Home: 0113 286 6517 john.elliott@leedsalumni.org.uk
    *********************************************************

    On Tue, 18 Feb 2003, Dinoj Surendran wrote:

    > Hello everyone,
    >
    > Suppose you have a text involving C character types and N character tokens
    > (so for a large book C would be under 50 and N several thousands/millions)
    > and you want to compute the entropy of the text. Suppose further that you're
    > doing this by finding the limit of H_k/k for large k, where H_k is the
    > entropy of k-grams of the text. Naturally you can't take k very large if N
    > is small.
    >
    > Can anyone point me to some good references on how large one can take k to
    > be for a given C and N (and possibly other factors)? I'm looking at C=40
    > and N=80 000.
    >
    > Thanks,
    >
    > Dinoj Surendran
    > Graduate Student
    > Computer Science Dept
    > University of Chicago
    >
    > PS - while I'm here, does anyone know of any online, freely available,
    > large (>50 000) corpora of phoneme-transcribed spontaneous conversation?
    >
    > I've got the switchboard one for American English.
    > http://www.isip.msstate.edu/projects/switchboard/
    > which has 80 000 phonemes syllabified into about 30 000 syllables.
    >
    > Similar corpora for any language would be useful.
    >
    >
    >

    This archive was generated by hypermail 2b29 : Wed Feb 19 2003 - 19:12:24 MET