scaling/norming

lcjohn@usthk.ust.hk
Thu, 30 Nov 1995 23:11:00 +0800

What's the doctrine on comparing corpora of different sizes?

I want to compare features (words, n-grams, POS tags, etc.) from a corpus of
0.5 million words of writing by native speakers (NS) of English with a
750,000-word corpus of writing by non-native speakers (NNS). I've been told
that proportional or scaled comparisons are inadvisable, presumably because
word frequencies can't be predicted proportionally (because of Zipf's curve?).
Am I left with no alternative but to throw away material from the larger
corpus?
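
To be concrete about what I mean by a proportional or scaled comparison: each raw count is divided by the size of its corpus and expressed as a rate per million words, so that the two corpora sit on the same scale. A minimal sketch in Python follows. The corpus sizes are the ones above; the helper name per_million and the two raw counts are made-up placeholders for illustration, not figures from my data.

```python
def per_million(count, corpus_size):
    """Scale a raw count to a rate per million words of running text."""
    return count * 1_000_000 / corpus_size

# Corpus sizes as given in the question.
NS_SIZE = 500_000    # native-speaker corpus
NNS_SIZE = 750_000   # non-native-speaker corpus

# Hypothetical raw counts of one feature (placeholders, not real data).
ns_count = 120
nns_count = 180

ns_rate = per_million(ns_count, NS_SIZE)
nns_rate = per_million(nns_count, NNS_SIZE)

print(f"NS:  {ns_rate:.1f} per million words")
print(f"NNS: {nns_rate:.1f} per million words")
```

With these placeholder counts the two rates come out equal even though the raw counts differ. Whether rates scaled this way are safe to compare across corpora of different sizes is exactly what I am unsure about.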

John Milton
HKUST