Re: Corpora: statistics in CL question

From: Alexander S. Yeh (asy@mitre.org)
Date: Fri Mar 31 2000 - 22:35:23 MET DST

    "Alexander S. Yeh" wrote:

    > >In most studies of z-scores and t-scores in computational linguistics,
    > >you tend to find that scores are too high. When you compute scores
    > >for bigrams, for example, you would expect 5% of the scores would be
    > >greater than 1.65, but you tend to find more than that.

    Thanks to Kenneth Church, Ted Dunning, Wessel Kraaij and Mitch Marcus for
    responding to my query on and outside of this list.

    The two basic types of explanation that I received were:

    1. In natural language, rare events often occur much more frequently than
    a Gaussian (normal) distribution predicts: the tails of the distribution
    carry much more mass than those of a Gaussian.

    2. The tests assume independent samples, which often does not hold for
    natural language. For example, a content word that appears in a document
    increases the chance of finding that same word later in that document.
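
The second explanation can be illustrated with a small simulation. This is only a sketch under stated assumptions, not something taken from the replies: the function name `fraction_above`, the document length, the word rate of 0.04, and the gamma-distributed per-document rate used to mimic "burstiness" are all hypothetical choices. It computes a z-score for one word's count in each simulated document, assuming independent tokens, and reports how often that score exceeds 1.65.

```python
import math
import random

def fraction_above(threshold, n_docs, doc_len, p, bursty, rng):
    """Fraction of simulated documents whose z-score exceeds `threshold`.

    The z-score is always computed as if tokens were independent draws
    with rate p, which is the assumption the significance test makes.
    """
    sd = math.sqrt(doc_len * p * (1 - p))  # binomial s.d. under independence
    hits = 0
    for _ in range(n_docs):
        if bursty:
            # Assumed burstiness model: each document gets its own rate,
            # drawn from a gamma distribution with mean 1, so the average
            # rate is still p but some documents use the word far more often.
            rate = min(1.0, p * rng.gammavariate(4.0, 0.25))
        else:
            rate = p
        count = sum(rng.random() < rate for _ in range(doc_len))
        z = (count - doc_len * p) / sd
        hits += z > threshold
    return hits / n_docs

rng = random.Random(0)  # fixed seed so the run is repeatable
independent = fraction_above(1.65, 2000, 500, 0.04, False, rng)
bursty = fraction_above(1.65, 2000, 500, 0.04, True, rng)
print(f"independent: {independent:.3f}  bursty: {bursty:.3f}")
```

With independent tokens the fraction should land near the nominal 5%, while the bursty case should come out well above it, even though the average word rate is the same in both runs.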

    -Alex Yeh


