I would like to be able to take a sample of, say, 100,000 words and
see how many different word forms there are within it. Also, I would
like to see how the word frequencies within the sample match up with
a control list of frequencies taken from a larger mixed-text corpus.
It would also be useful to have a list of the words that occur
significantly more frequently within the sample than they do
within the language as a whole.
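
To make the request concrete, here is a rough sketch in Python of the kind of computation I have in mind. It is only an illustration under several assumptions: the tokenizer is deliberately crude, the function names (tokenize, log_likelihood, overused_words) are made up for this example, and the log-likelihood statistic is just one commonly used measure of "significantly more frequent", not necessarily the best choice. The threshold of 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase everything and keep
    # runs of letters, allowing internal apostrophes as in "don't".
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def log_likelihood(a, b, n1, n2):
    # a, b: occurrences of one word in the sample and in the control corpus.
    # n1, n2: total number of tokens in the sample and in the control corpus.
    e1 = n1 * (a + b) / (n1 + n2)  # expected count in the sample
    e2 = n2 * (a + b) / (n1 + n2)  # expected count in the control corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def overused_words(sample_counts, control_counts, threshold=3.84):
    # Return words that are relatively more frequent in the sample than in
    # the control list and whose log-likelihood score reaches the threshold.
    n1 = sum(sample_counts.values())
    n2 = sum(control_counts.values())
    results = []
    for word, a in sample_counts.items():
        b = control_counts.get(word, 0)
        if a / n1 > b / n2:
            score = log_likelihood(a, b, n1, n2)
            if score >= threshold:
                results.append((word, a, b, score))
    # Highest-scoring (most distinctive) words first.
    return sorted(results, key=lambda r: r[3], reverse=True)


# Number of different word forms in a toy sample.
sample_counts = Counter(tokenize("the cat sat on the mat the cat"))
distinct_forms = len(sample_counts)
```

With real data, sample_counts would come from the 100,000-word sample and control_counts from the frequency list of the larger mixed-text corpus.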
I also have two practical questions:
- Is there any point in ignoring proper nouns for the purposes of this
kind of analysis (so as not to discriminate too much against text
from sources such as newspapers)?
- What is the minimum sample size that would give usable information?
Is 100,000 too big, too small, or about right?
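
One way I could imagine probing the sample-size question empirically is to count the number of different word forms at increasing cut-off points and see how quickly the count keeps growing. The sketch below is only an assumption about how that might be done; the function name type_growth and the default step of 10,000 tokens are invented for illustration.

```python
def type_growth(tokens, step=10000):
    # Record (tokens read so far, distinct word forms so far) every
    # `step` tokens, plus once more at the very end of the sample.
    seen = set()
    growth = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            growth.append((i, len(seen)))
    return growth


# Toy usage with a tiny step so the checkpoints are visible.
toy_growth = type_growth(["a", "b", "a", "c", "a"], step=2)
```

Each pair in the result gives a sample size and the number of distinct forms found up to that point, so the figures for different sample sizes can be compared directly.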
Jean
----------------------
Ms Jean Hudson
Research Editor
Cambridge University Press & University of Nottingham
Spoken English Corpus Project
email: jhudson@cup.cam.ac.uk
phone: +44-1223-325123
mail address:
Cambridge University Press
Publishing Division
The Edinburgh Building
Shaftesbury Road
Cambridge CB2 2RU
UK