I want to compare features (words, n-grams, POS tags, etc.) from a
500,000-word corpus of writing by native speakers (NS) of English with a
750,000-word corpus of writing by non-native speakers (NNS). I've been told
that proportional or scaled comparisons are inadvisable, presumably because
word frequencies don't scale proportionally with corpus size (because of
Zipf's curve?). Am I left with no alternative
but to throw away material from the larger corpus?
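
To make the question concrete, here is a minimal sketch of the kind of scaled comparison I mean: raw counts converted to a rate per million words, alongside a simple two-term log-likelihood (G2) score that takes the two corpus sizes into account. The counts for the example feature are made up for illustration, and the function names are hypothetical; only the two corpus sizes come from the description above. The G2 statistic is one commonly used option, not something anyone has recommended to me for this data.

```python
import math

NS_SIZE = 500_000    # native-speaker corpus, in words
NNS_SIZE = 750_000   # non-native-speaker corpus, in words


def per_million(count, corpus_size):
    """Scale a raw count to a rate per million words."""
    return count * 1_000_000 / corpus_size


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-term log-likelihood (G2) for one feature in two corpora.

    Expected counts assume the feature occurs at the same rate in both
    corpora, so unequal corpus sizes are built into the calculation.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts for a single feature, invented for illustration only.
ns_count, nns_count = 120, 300

print(per_million(ns_count, NS_SIZE), per_million(nns_count, NNS_SIZE))
print(log_likelihood(ns_count, NS_SIZE, nns_count, NNS_SIZE))
```

A feature occurring at exactly the same rate in both corpora (say 100 and 150 occurrences) gets a score of zero. The per-million figures are the part I have been warned about, since type counts and low-frequency items do not grow linearly with corpus size.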
John Milton
HKUST