Re: scaling/norming

Adam Kilgarriff (ak28@it-research-institute.brighton.ac.uk)
Tue, 5 Dec 95 16:00:30 GMT

Can I jump on this discussion to ask a couple of "mathematical
beginner" questions of my own:

Ted Dunning's 1993 critique shows we should be very wary of making
assumptions of normality where events are rare, which is, Ted says,
"where np(1-p)<5, and dramatically [so] where np(1-p)<1."

Am I right in thinking, then, that there is no problem where we are
estimating parameters on the basis of more than 5 instances in the
corpus? (So, it is legitimate to use Mutual Information if I'm only
going to be looking at items where the frequency of the collocate is
5 or more.)
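
To make the rule of thumb concrete, here is a minimal sketch that
computes np(1-p) for a few items and flags where it falls under the
two thresholds quoted above. The corpus size, the counts and the
labels ("very poor", "doubtful", "acceptable") are illustrative
assumptions, not figures or terms from Ted's paper.

```python
def binomial_variance(n, p):
    """Return n*p*(1-p), the variance of a binomial count."""
    return n * p * (1 - p)


def normal_approx_status(n, p):
    """Classify np(1-p) against the quoted thresholds of 1 and 5.

    The labels are hypothetical shorthand, chosen for this sketch only.
    """
    v = binomial_variance(n, p)
    if v < 1:
        return "very poor"
    if v < 5:
        return "doubtful"
    return "acceptable"


# Hypothetical figures: a 500,000-word corpus and items seen 1, 4 and 50 times.
n = 500_000
for count in (1, 4, 50):
    p = count / n  # estimated probability of the item at any one position
    print(count, round(binomial_variance(n, p), 3), normal_approx_status(n, p))
```

For rare items p is tiny, so np(1-p) is almost exactly the raw count
np, which is why the threshold can be read as a threshold on the
number of instances.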

Isn't this likely to be the case for John Milton's work? He says:

> I want to compare features (wds, n-grams, POS tags etc) from a corpus of .5
> mil words of the writing of NS speakers of English to a 750,000 wd corpus of
> the writing of NNS speakers. I've been told that proportional or scaled

But if he restricts himself to looking at items that occur more than 5
times in each corpus, a straight comparison of
occurrences-per-thousand-words in the NS and NNS corpora is
legitimate. Yes?
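
As a sketch of the comparison in question, the following normalises
raw counts to occurrences per thousand words and skips items that do
not occur more than 5 times in each corpus. The corpus sizes are the
ones John gives; the items and their counts are invented for
illustration.

```python
def per_thousand(count, corpus_size):
    """Return occurrences per thousand words."""
    return 1000 * count / corpus_size


NS_SIZE = 500_000   # native-speaker corpus (size from the quoted message)
NNS_SIZE = 750_000  # non-native-speaker corpus (size from the quoted message)
MIN_COUNT = 5       # items must occur more than this often in each corpus

# Hypothetical raw counts: item -> (NS count, NNS count)
counts = {
    "however": (120, 300),
    "moreover": (15, 90),
    "thusly": (2, 7),  # too rare in the NS corpus, so it is skipped
}

for item, (ns, nns) in counts.items():
    if ns <= MIN_COUNT or nns <= MIN_COUNT:
        continue
    print(item, per_thousand(ns, NS_SIZE), per_thousand(nns, NNS_SIZE))
```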

Next question: does this form of objection to using normal
approximations relate in any way to "deviations from Poisson" (as in
Church and Gale's recent work) and the general question of how
word-frequencies are distributed - eg, 'burstily', with different
documents in any corpus having different parameters for most words?

(I don't think so, because the log-likelihood test is based on the
binomial distribution, a one-parameter distribution, whereas the
Church-Gale work (as well as my own) is looking at the inadequacy, for
various purposes, of one-parameter distributions. But I'm getting out
of my depth, and seriously need help...)
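
For reference, a minimal sketch of a binomial log-likelihood ratio
statistic of the kind mentioned above, comparing the rate of one item
in two corpora. The function names and the counts in the usage lines
are assumptions made for this sketch. Note that it gives each corpus a
single probability p, which is exactly the one-parameter assumption at
issue.

```python
import math


def _log_l(k, n, p):
    """Binomial log-likelihood of k hits in n trials (constant term dropped).

    Terms with a zero count are skipped so that 0 * log(0) is treated as 0.
    """
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1 - p)
    return total


def log_likelihood_ratio(k1, n1, k2, n2):
    """Return -2 log(lambda) for 'same rate in both corpora' vs 'different rates'."""
    p1 = k1 / n1                # rate estimated from corpus 1 alone
    p2 = k2 / n2                # rate estimated from corpus 2 alone
    p = (k1 + k2) / (n1 + n2)   # pooled rate under the null hypothesis
    return 2 * (_log_l(k1, n1, p1) + _log_l(k2, n2, p2)
                - _log_l(k1, n1, p) - _log_l(k2, n2, p))


# Hypothetical counts: an item seen 5 times in 500,000 words
# and 50 times in 750,000 words.
print(log_likelihood_ratio(5, 500_000, 50, 750_000))
```

Under the null hypothesis the statistic is conventionally compared
against a chi-squared distribution with one degree of freedom.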

Adam Kilgarriff

refs:

@InProceedings{ChurchGale:95,
  author =       "Kenneth Church and William Gale",
  title =        "Inverse Document Frequency {(IDF)}: a measure of deviations from {P}oisson",
  booktitle =    "Third Workshop on Very Large Corpora",
  year =         "1995",
  editor =       "David Yarowsky and Kenneth Church",
  organization = "ACL",
  address =      "{MIT}",
  pages =        "121--130"
}

@Article{ChurchGale:95b,
  author =       "Kenneth Church and William Gale",
  title =        "{P}oisson Mixtures",
  year =         "1995",
  journal =      "Journal of Natural Language Engineering",
  volume =       "1",
  number =       "2",
  pages =        "163--190"
}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Research Fellow
Information Technology Research Institute
University of Brighton
Lewes Road
Brighton BN2 4AT
tel: (44) 1273 642919, (44) 1273 642900
fax: (44) 1273 606653
email: ak28@itri.bton.ac.uk
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%