Re: Corpora: software for sampling and analysing corpus

Ted Pedersen (pedersen@seas.smu.edu)
Wed, 8 Oct 1997 11:26:39 -0500 (CDT)

> Can anyone recommend software (apart from Wordsmith) or computational
> methods for doing vocabulary analysis of samples of text to control
> the balance within a large text corpus?
>
> I would like to be able to take a sample of, say, 100,000 words and
> see how many different word forms there are within it. Also, I would
> like to see how the word frequencies within the sample match up with
> a control list of frequencies taken from a larger mixed-text corpus.
>
> It would also be useful to have a list of the words that occur
> significantly more frequently within the sample than they do
> within the language as a whole.

If you think you might expand your study beyond what you describe above
into more general areas/issues of statistics, then it might not be a bad
idea to consider using a fairly general-purpose statistical package such
as SAS, SPSS, or S-Plus. (These are all commercial products, but it seems
common for universities to have at least one of them available.)

What you describe above can be very easily done in SAS. In addition to
many statistical gadgets, SAS has very convenient features for reading,
merging, and updating data samples. It's also very good at taking random
samples of data and doing all sorts of significance tests. I have SAS
code that does some of what you would like to do and would be happy to
send it if you go that route.
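
For what it's worth, the counting side of this takes very little
machinery. Here is a rough sketch in Python rather than SAS (just because
it is easier to show in a few lines); the file names are only placeholders
for your sample and your control corpus:

    import re
    from collections import Counter

    def word_counts(filename):
        # Lowercase alphabetic tokens; a real study would want a more
        # careful tokenizer than this.
        counts = Counter()
        with open(filename) as f:
            for line in f:
                counts.update(re.findall(r"[a-z]+", line.lower()))
        return counts

    sample = word_counts("sample.txt")    # the 100,000-word sample
    control = word_counts("control.txt")  # the larger mixed-text corpus

    n_sample = sum(sample.values())
    n_control = sum(control.values())
    print("distinct word forms in sample:", len(sample))

    # Line the sample's most common words up against their relative
    # frequencies in the control corpus.
    for word, count in sample.most_common(20):
        print(word, count / n_sample, control[word] / n_control)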

I should also mention Ted Dunning's suite of tools. I've used those for
frequency counts of words and bigrams, as well as a number of other
lexically oriented tasks, and they work very nicely. Unfortunately I don't
have a URL for them, but I'm sure someone else will.
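
The statistic at the heart of Dunning's work is the log-likelihood ratio
(G^2), which behaves much better than chi-square on the low counts you get
with words. A minimal sketch of it, again in Python rather than his actual
code and reusing the counts from the sketch above, for ranking the words
that are over-represented in the sample relative to the control corpus:

    import math

    def g_squared(k1, n1, k2, n2):
        # 2x2 contingency table: occurrences of one word versus all other
        # tokens, in the sample (n1 tokens) and the control (n2 tokens).
        total = n1 + n2
        k = k1 + k2
        cells = [(k1, n1 * k / total),
                 (n1 - k1, n1 * (total - k) / total),
                 (k2, n2 * k / total),
                 (n2 - k2, n2 * (total - k) / total)]
        return 2 * sum(o * math.log(o / e) for o, e in cells if o > 0)

    scored = [(g_squared(sample[w], n_sample, control[w], n_control), w)
              for w in sample
              if sample[w] / n_sample > control[w] / n_control]
    for score, word in sorted(scored, reverse=True)[:20]:
        print(word, round(score, 1))

A large G^2 for a word whose relative frequency is higher in the sample
than in the control is exactly your "occurs significantly more frequently"
list.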

>
> I also have two practical questions:
>
> -Is there any point in ignoring Proper Nouns for the purposes of this
> kind of analysis (so as not to discriminate too much against text
> from sources such as newspapers)?
>
> -What is the minimum sample size that would give useable information?
> Is 100,000 too big? too small? just right?
>

I'm not sure about the first point. As for the second, the
"best" sample size probably depends a great deal on what you regard as
usable information. I'd suggest taking a fairly large number of
100,000-word samples and seeing how much whatever you are interested in
varies from sample to sample. If it varies quite a bit, then you may need
to contemplate a larger sample. If it doesn't vary at all, you might be
able to reduce the sample size.
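
The resampling check itself is only a few lines. A sketch, assuming you
have the corpus as one flat list of word tokens and that the number of
distinct word forms is the quantity you care about (substitute whatever
statistic you are actually after):

    import random
    import statistics

    def type_counts(tokens, sample_size=100000, n_samples=50):
        # Draw repeated random samples (without replacement) and record
        # how many distinct word forms (types) each one contains.
        return [len(set(random.sample(tokens, sample_size)))
                for _ in range(n_samples)]

    # tokens = the corpus as one big list of word tokens
    # counts = type_counts(tokens)
    # print(statistics.mean(counts), statistics.stdev(counts))

If the standard deviation is small relative to the mean, 100,000 words is
probably plenty; if not, try the same experiment at a larger sample size.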

Good luck!
Ted

-- 
* Ted Pedersen                     pedersen@seas.smu.edu              * 
*                                  http://www.seas.smu.edu/~pedersen/ *
* Department of Computer Science and Engineering,                     *
* Southern Methodist University, Dallas, TX 75275      (214) 768-3712 *