[Corpora-List] Re: enquiry

From: Timothy Baldwin (tbaldwin@csli.stanford.edu)
Date: Fri Nov 14 2003 - 21:50:39 MET

    Hi,

    > My research area is information theory.
    > I have some questions that fall within your field and would appreciate
    > your help.
    > I know that the frequency of the <<<WORDS>>> in natural languages can be
    > modeled by Zipf's law and there is a lot of work in this regard.
    > But I am looking for a counterpart for <<<LETTER FREQUENCY>>> in natural
    > languages.
    >
    > Does Zipf's law hold for letter frequency as well?
    >
    > Is there any universal model for letter frequency in natural
    > languages (something like Zipf's law)?
    >
    > If so, what are the basic references for this matter?
    >
    > How can I find the letter frequency for natural languages?

    I'm not familiar with any work on letter frequencies, but I think you would
    be likely to observe Zipfian effects in ideogram-based languages such as
    Chinese, where the boundary between characters and words is pretty fuzzy to
    begin with. Certainly, looking briefly at English character distributions in
    the WSJ and Brown corpora, the single-letter distribution is pretty linear,
    but if you then go on to look at character N-grams of different orders,
    Zipfian effects become more and more pronounced for higher values of N
    (unsurprisingly). I can send on the graphs if you are interested in having a
    look.
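
    For anyone who wants to try this kind of comparison on their own text, here
    is a minimal sketch in Python. It is not the script behind the graphs
    mentioned above; the function names (ngram_counts, loglog_slope) and the
    choice of a least-squares line on the log-log rank/frequency plot are
    assumptions made purely for illustration, and the fit is only a rough
    diagnostic.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count overlapping character n-grams of order n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def loglog_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Ranks start at 1 for the most frequent n-gram. A distribution of the
    form frequency = C / rank gives a slope of -1.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct n-grams to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slopes_by_order(text, max_n=4):
    """Return {n: fitted log-log slope} for character n-grams, n = 1..max_n."""
    return {n: loglog_slope(ngram_counts(text, n)) for n in range(1, max_n + 1)}
```

    Calling slopes_by_order on the contents of a corpus file gives one fitted
    slope per N-gram order, and plotting log(frequency) against log(rank) for
    each order shows how close to a straight line each distribution is.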

    I have taken the liberty of forwarding this message to the CORPORA mailing
    list to see if anyone in the wider community has anything to say on the
    subject. I recommend that you subscribe to the list
    (http://helmer.aksis.uib.no/corpora/welcome.txt) and have discussion of the
    matter take place via the list.

    Tim


