[Corpora-List] summary n-grams (follow-up question)

From: Dirk Ludtke (dludtke@pine.kuee.kyoto-u.ac.jp)
Date: Thu Aug 29 2002 - 10:55:26 MET DST


    Thanks to the people who answered my question from yesterday. I was
    extremely surprised by how quickly the first answers came in (I couldn't
    even type that fast). Since some of the replies were not sent to the
    list, I would like to post a summary.

    The question was:
    > A word (or n-gram) occurs k times in a corpus of n words.
    > What is the probability that this word occurs again?

    The replies are listed in the order in which I received them. At the end
    of this email I write a bit more about what I want to do with these
    probabilities.

    ----------------

    Sven C. Martin suggested using Maximum Likelihood estimates with
    discounting (smoothing) to give probability mass to unobserved events:

    > The so-called Maximum Likelihood estimations of
    > probabilities p(w|h), where w is a word and h is the
    > (n-1)-tuple of predecessor words, are in fact relative
    > frequencies p(w|h) = N(h,w)/N(h), where N(h,w) is the
    > frequency of the n-tuple in some training corpus.

    and pointed to

    > Chapter 4 of F. Jelinek: "Statistical methods for
    > speech recognition", MIT Press, Cambridge, MA, 1997

    and

    > H. Ney et al.: "Statistical language modeling
    > using leaving-one-out" in S. Young and G.
    > Bloothooft: "Corpus-based methods in language and
    > speech processing", Kluwer, Dordrecht, 1997

    It seems to me that the methods mentioned solve a more general problem.
    My question would be answered by getting p(w|h), where w is the n-gram
    I am interested in and h is empty.
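
    To make this concrete for my case (h empty), here is a small Python
    sketch of the plain relative-frequency estimate, together with a very
    crude absolute-discounting step that holds back some probability mass
    for unseen n-grams. The function names and the discount value d=0.5 are
    just my own illustration, not taken from Jelinek or Ney:

    from collections import Counter

    def ngrams(tokens, n):
        """All n-grams (as tuples) in a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def mle_probability(tokens, n, target):
        """Plain relative frequency N(w)/N for the n-gram 'target' (history h empty)."""
        counts = Counter(ngrams(tokens, n))
        total = sum(counts.values())
        return counts[target] / total if total else 0.0

    def discounted_probability(tokens, n, target, d=0.5):
        """Absolute discounting: subtract d from every observed count, so the
        removed mass is left over for n-grams that never occurred.
        d=0.5 is an arbitrary placeholder value."""
        counts = Counter(ngrams(tokens, n))
        total = sum(counts.values())
        if target not in counts:
            return None  # the reserved mass would have to be spread over unseen n-grams
        return (counts[target] - d) / total

    tokens = "the cat sat on the mat because the cat was tired".split()
    print(mle_probability(tokens, 2, ("the", "cat")))         # 2/10 = 0.2
    print(discounted_probability(tokens, 2, ("the", "cat")))  # (2 - 0.5)/10 = 0.15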

    ----------------

    Stefan Th. Gries wrote

    > Kenneth W. Church's paper called "Empirical estimates
    > of adaptation: The chance of two Noriegas is closer to
    > p/2 than p^2"

    (Kenneth W. Church also posted later; see below.)

    ----------------

    Oliver wrote

    > It sounds a bit like something I read recently in Geoff
    > Sampson's "Empirical Linguistics", where he describes
    > the Good-Turing method for estimating the probabilities
    > of events that haven't occurred yet. Along the way this
    > also gives corrected probabilities for things that occurred
    > only once, since their true probability might correspond
    > to only a fraction of one occurrence, which, however,
    > is not observable.
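
    If I understand Sampson's description correctly, the core of Good-Turing
    is to replace an observed count r by r* = (r+1) * N_{r+1} / N_r, where
    N_r is the number of types seen exactly r times, and to reserve N_1 / N
    of the probability mass for unseen events. A toy sketch of my own (real
    data would need the N_r values smoothed first):

    from collections import Counter

    def good_turing(counts):
        """Unsmoothed Good-Turing: adjusted count r* = (r+1) * N_{r+1} / N_r.
        Returns the adjusted counts and the probability mass P_0 = N_1 / N
        reserved for unseen types.  N_{r+1} is often 0 for large r, so the
        frequency-of-frequencies would normally be smoothed; ignored here."""
        n_total = sum(counts.values())
        freq_of_freq = Counter(counts.values())           # N_r
        adjusted = {}
        for w, r in counts.items():
            n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
            adjusted[w] = (r + 1) * n_r1 / n_r if n_r1 else r
        p_unseen = freq_of_freq.get(1, 0) / n_total       # P_0 = N_1 / N
        return adjusted, p_unseen

    counts = Counter("a b a c d e b a f g h".split())
    adjusted, p0 = good_turing(counts)
    print(adjusted["c"], p0)   # the hapax 'c' drops to 2 * N_2/N_1 = 1/3; P_0 = 6/11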

    ----------------

    Ken Church pointed to two of his papers, which are available on his web
    page http://www.research.att.com/~kwc/

    > (1) Poisson Mixtures, and
    > (2) Empirical Estimates of Adaptation: The chance
    > of two Noriegas is closer to p/2 than p^2

    He also linked to a paper by Ronald Rosenfeld about adaptation:
    > http://citeseer.nj.nec.com/rosenfeld96maximum.html
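
    As I read the "two Noriegas" idea, it can be checked empirically by
    comparing how often a word shows up at least twice in a document with
    how often it shows up at least once. A rough sketch in my own words
    (assuming the corpus is already split into documents):

    def adaptation_estimate(documents, word):
        """prior      = df1 / D    (documents containing 'word' at least once)
           adaptation = df2 / df1  (of those, documents containing it at least twice)
        Church's observation: adaptation tends to be closer to prior/2 than prior^2."""
        D = len(documents)
        df1 = sum(1 for doc in documents if doc.count(word) >= 1)
        df2 = sum(1 for doc in documents if doc.count(word) >= 2)
        prior = df1 / D if D else 0.0
        adaptation = df2 / df1 if df1 else 0.0
        return prior, adaptation

    documents = [
        "noriega was mentioned and then noriega again".split(),
        "nothing about the general here".split(),
        "one mention of noriega only".split(),
        "no mention at all".split(),
    ]
    print(adaptation_estimate(documents, "noriega"))   # prior = 2/4, adaptation = 1/2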

    -----------------

    Thank you very much again. I will have enough material to read for at
    least the next few days :)

    Maybe I should also write a bit about the application. I want to use
    these probabilities for examining the quality of different language
    patterns in classification problems. A pattern could be an n-gram, but
    also a combination of words with POS tags or text format information.

    A simple example: I want to decide whether a word is a noun or not. I
    have a POS-tagged corpus and extract how often different patterns (like
    n-grams with different n) led to nouns and how often they did not. The
    question is which particular patterns are better than others.

    As a score for the patterns, I am using information gain (entropy
    reduction). But I have the feeling that it is not enough to estimate the
    probabilities of the classes; I also need the probability of the pattern
    itself.

    Big patterns (like 5-grams) tend to look very predictive, even if we
    have seen them only once or twice. This leads to a good information
    gain, completely ignoring the fact that we are unlikely to see these
    patterns again. A 5-gram which occurred 2 times has a much lower
    probability than a 2-gram which also occurred 2 times.
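
    To make the scoring step explicit, this is how I compute the information
    gain of a single binary pattern from a 2x2 table of counts (pattern
    present/absent versus noun/not noun). The numbers below are made up, and
    they show exactly my problem: the second pattern was seen only twice, so
    the weight P(pattern) in its score rests on a very unreliable estimate.

    import math

    def entropy(counts):
        """Entropy in bits of a class distribution given as raw counts."""
        total = sum(counts)
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    def information_gain(pat_noun, pat_other, rest_noun, rest_other):
        """IG = H(class) - sum over v in {pattern, no pattern} of P(v) * H(class | v)."""
        n_pat, n_rest = pat_noun + pat_other, rest_noun + rest_other
        total = n_pat + n_rest
        h_class = entropy([pat_noun + rest_noun, pat_other + rest_other])
        h_cond = (n_pat / total) * entropy([pat_noun, pat_other]) \
               + (n_rest / total) * entropy([rest_noun, rest_other])
        return h_class - h_cond

    print(information_gain(60, 140, 440, 360))   # frequent pattern, mildly predictive
    print(information_gain(2, 0, 498, 500))      # seen only twice, looks perfectly predictive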

    Dirk Ludtke


