Corpora: software for sampling and analysing corpora

Jean Hudson (jhudson@cup.cam.ac.uk)
Wed, 8 Oct 1997 15:23:42 +0100

Can anyone recommend software (apart from Wordsmith) or computational
methods for doing vocabulary analysis of samples of text, to control
the balance within a large text corpus?

I would like to be able to take a sample of, say, 100,000 words and
see how many different word forms there are within it. Also, I would
like to see how the word frequencies within the sample match up with
a control list of frequencies taken from a larger mixed-text corpus.
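
To make the question concrete, here is a minimal sketch (in Python) of
the kind of analysis I mean; the filenames and the one word-count pair
per line format of the control list are placeholders only:

import re
from collections import Counter

def word_forms(text):
    # Tokenise on letter sequences (with optional internal apostrophe)
    # and lower-case, so that "The" and "the" count as one word form.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

with open("sample.txt") as f:          # placeholder filename
    sample = Counter(word_forms(f.read()))

print(sum(sample.values()), "tokens,", len(sample), "distinct word forms")

control = Counter()
with open("control_freqs.txt") as f:   # assumed format: "word count" per line
    for line in f:
        word, count = line.split()
        control[word] = int(count)

n_sample = sum(sample.values())
n_control = sum(control.values())

# Relative-frequency ratio for the commonest sample words (+1 smoothing
# so that words absent from the control list do not divide by zero).
for word, count in sample.most_common(20):
    ratio = (count / n_sample) / ((control[word] + 1) / n_control)
    print("%-15s %6d  ratio %6.2f" % (word, count, ratio))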

It would also be useful to have a list of the words that occur
significantly more frequently within the sample than they do
within the language as a whole.
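
One obvious candidate for ranking such words is a log-likelihood
measure of the kind described by Dunning (1993); again only a sketch,
reusing the sample and control counts from above:

import math

def log_likelihood(a, b, n1, n2):
    # Two-cell G-squared: the word occurs a times in a sample of n1
    # tokens and b times in a reference corpus of n2 tokens.
    e1 = n1 * (a + b) / (n1 + n2)   # expected count in the sample
    e2 = n2 * (a + b) / (n1 + n2)   # expected count in the reference
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2.0 * ll

# Rank words over-used in the sample relative to the control corpus
# (sample, control, n_sample, n_control come from the sketch above).
keywords = sorted(
    ((log_likelihood(sample[w], control[w], n_sample, n_control), w)
     for w in sample
     if sample[w] / n_sample > (control[w] + 1) / n_control),
    reverse=True)

for ll, w in keywords[:25]:
    print("%-15s LL %7.1f" % (w, ll))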

I also have two practical questions:

-Is there any point in ignoring proper nouns for the purposes of this
kind of analysis (so as not to discriminate too much against text
from sources such as newspapers)? The sort of crude, tagger-free
heuristic I have in mind is sketched after these two questions.

-What is the minimum sample size that would give usable information?
Is 100,000 too big? too small? just right?
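
On the first question, the heuristic I have in mind (assuming no
part-of-speech tagging is available) would be something like the
following; sentence-initial proper nouns will of course slip through:

import re

def strip_proper_nouns(text):
    # Crude heuristic: drop capitalised words that do not start a
    # sentence, on the theory that mid-sentence capitals are mostly
    # proper nouns. A POS tagger would do this far more reliably.
    kept = []
    sentence_start = True
    for tok in re.findall(r"\S+", text):
        word = tok.strip(".,;:!?\"'()")
        if word and not (word[0].isupper() and not sentence_start):
            kept.append(word.lower())
        sentence_start = tok.endswith((".", "!", "?"))
    return kept

print(strip_proper_nouns("Jean works in Cambridge. The corpus grows."))
# -> ['jean', 'works', 'in', 'the', 'corpus', 'grows']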

Jean
----------------------
Ms Jean Hudson
Research Editor
Cambridge University Press & University of Nottingham
Spoken English Corpus Project

email: jhudson@cup.cam.ac.uk
phone: +44-1223-325123

mail address:
Cambridge University Press
Publishing Division
The Edinburgh Building
Shaftesbury Road
Cambridge CB2 2RU
UK