[Corpora-List] n-grams (follow-up question)

From: Dirk Ludtke (dludtke@pine.kuee.kyoto-u.ac.jp)
Date: Wed Aug 28 2002 - 08:13:15 MET DST

    A slightly related question:

    I am wondering if anyone could point me to work on n-gram reoccurrence.

    Suppose a word (or n-gram) occurs k times in a corpus of n words. What
    is the probability that it occurs again in further text?

    Especially for small k, this probability seems to depend not only on k
    and n, but also on the ratio of words with low and high frequency.
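
    To make the question concrete, here is a rough sketch of how one might
    measure this empirically. It assumes that "occurs again" means "occurs at
    least once in a further stretch of text of the same length", which is my
    own reading, and it simply splits a tokenized corpus in half. The function
    name reoccurrence_rate and the toy token list are made up for illustration.

```python
from collections import Counter

def reoccurrence_rate(tokens, k):
    """Held-out estimate of reoccurrence.

    Among the word types seen exactly k times in the first half of
    `tokens`, return the fraction that occur at least once in the
    second half. Returns None if no type has frequency k.
    """
    half = len(tokens) // 2
    first_counts = Counter(tokens[:half])   # frequencies in the first half
    second_types = set(tokens[half:])       # types present in the second half

    types_k = [w for w, c in first_counts.items() if c == k]
    if not types_k:
        return None
    return sum(w in second_types for w in types_k) / len(types_k)

# Toy example (hypothetical data, far too small to mean anything):
tokens = "a b a c d e a c f g h i".split()
print(reoccurrence_rate(tokens, 1))  # 0.25: of b, c, d, e only c comes back
print(reoccurrence_rate(tokens, 2))  # 1.0: a is the only type seen twice
```

    Running this for each small k on corpora with different proportions of
    rare and frequent words would show whether the rate really varies with
    that ratio, and not only with k and n.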

    Is there a nice way to approximate these probabilities, perhaps with
    probability distributions? Is there a mathematical theory for this?

    Thank you.


