Re: Corpora: corpus equilibrium?

John Aitchison (jaitchison@acm.org)
Fri, 10 Oct 1997 09:22:07 +0000

>
> I'm not suggesting for a moment that balancing corpora on the basis of word
> frequency in texts is the way to God's truth about language. I agree with
> the sea analogy - to the extent that, yes, it's a bucketful here, a
> bucketful there, and measuring the magnesium content tells me nothing about
> the depth to the sea bottom or what life forms are supported there. But you
> have to agree that taking odd bucketfuls from different parts of the globe
> on the basis of known facts about local conditions makes laboratory
> investigations a squidge more informative than they would be if they were
> based on what could be pumped through a direct pipeline in the Thames estuary.

I am rather interested in the sampling issues here... perhaps this is
not germane to the original problem (which I understood to be: GIVEN
that I have two or more "samples" [datasets, perhaps more accurately],
what can I reasonably do with them?) but I am curious anyway.

The ocean and buckets analogy is interesting. If we were to cast the
problem in a traditional sampling framework we might say that the
universe we wish to sample is "the totality of molecules in liquid
form at some defined instant in some defined container", and that we
propose to undertake a stratified multi-stage clustered sample to
estimate the proportion of those molecules that are magnesium.
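
Just to make that design concrete, here is a rough sketch in Python of
the bucket analogy as a stratified, multi-stage, clustered sample (the
strata, weights and numbers are all invented, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Each stratum (region of the ocean) has its own hypothetical true
# proportion of the molecule of interest, and a share of the universe.
strata = {"north": 0.010, "tropics": 0.014, "south": 0.008}
weight = {"north": 0.3, "tropics": 0.4, "south": 0.3}

estimate = 0.0
for region, p in strata.items():
    # Stage 1: a few sites per region; local proportions vary around p.
    site_p = rng.normal(p, 0.002, size=5).clip(0, 1)
    # Stage 2: one bucket of 10,000 "molecules" drawn at each site.
    bucket_p = rng.binomial(10_000, site_p) / 10_000
    estimate += weight[region] * bucket_p.mean()

print(f"stratified multi-stage estimate of the proportion: {estimate:.4f}")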

Corpora clearly represent clustered samples of some sort, and
estimates of the errors associated with estimates from those samples
are presumably subject to the usual inflation factors. But what
intrigues me more is the definition of the universe, how it is divided
into corpora, and how the corpora are selected. If the universe is
defined as "all words in use in English at x time etc" then imho
selecting corpora is not analogous to selecting buckets from oceans:
buckets can reasonably be treated as independent samples (of water
molecules). And all this presupposes that there exists some method of
selecting corpora at random from all possible corpora.
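
To put a number on those "usual inflation factors": for clusters of
equal size the design effect is roughly 1 + (m-1)*rho, where m is the
cluster size and rho the within-cluster correlation, so even a modest
rho inflates the variance considerably. A rough simulation sketch
(Python again; the beta-binomial "documents" and all the parameter
values are invented stand-ins, not a claim about real corpora):

import numpy as np

rng = np.random.default_rng(1)

def clustered_prop(n_docs=50, m=200, p=0.01, rho=0.2):
    # Each "document" has its own underlying rate for the word of
    # interest, which induces within-document correlation.
    a, b = p * (1 - rho) / rho, (1 - p) * (1 - rho) / rho
    doc_rates = rng.beta(a, b, n_docs)
    counts = rng.binomial(m, doc_rates)
    return counts.sum() / (n_docs * m)

reps = np.array([clustered_prop() for _ in range(2000)])
srs_var = 0.01 * 0.99 / (50 * 200)   # variance under simple random sampling
print("approximate design effect:", reps.var() / srs_var)   # about 1 + (m-1)*rho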

Since corpora are overlapping sets of words, what is the unit of
interest here? Is the sampling a sampling of corpora in order to make
statements about corpora, or is it a sampling of words (in which the
corpora are just sampling devices)?

I suppose that the sampling theory of all this has been studied at
length, but the statistical foundations (what is being sampled from
what and how, what is being estimated, what is the nature of the
estimator, what are its errors) are unclear to me.

Intuition compels us to agree with

> But you have to agree that taking odd bucketfuls from different
> parts of the globe on the basis of known facts about local
> conditions makes laboratory investigations a squidge more
> informative than they would be if they were based on what could be
> pumped through a direct pipeline in the Thames estuary.

which is an informal statement to the effect that stratification is
good, independent samples are good, and large samples are good. But I
am not sure that intuition can be trusted here, at least not without a
more formal definition of the-thing-that-is-to-be-estimated and
how-it-is-to-be-estimated. I am not sure that the analogy carries
through, not sure that it might not be misleading or dangerous, and
not sure that gut feel is good enough when talking about the sampling
properties of extrema, or ranks, or of test statistics.

My gut feel is that there are complex statistical problems here and
that a thorough working through of the theory and/or some simulation
work is indicated.
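
By way of illustrating the extrema point, and the sort of simulation
work I mean, a toy example in Python (the heavy-tailed distribution is
invented and has nothing corpus-specific about it): the sample maximum
bounces around far more than the sample mean, and intuitions trained
on means transfer badly to it.

import numpy as np

rng = np.random.default_rng(2)
draws = rng.pareto(2.5, size=(1000, 500))   # 1,000 samples of 500 draws each

print("spread (sd) of the sample means :", draws.mean(axis=1).std())
print("spread (sd) of the sample maxima:", draws.max(axis=1).std())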

Just my $0.02, fwiw

Jean Hudson's original question, in part, and with comments
-------------------------------------------------------------------
jh > Can anyone recommend software (apart from Wordsmith) or computational
jh > methods for doing vocabulary analysis of samples of text to control
jh >the balance within a large text corpus?

jh >I would like to be able to take a sample of, say, 100,000 words and
jh >see how many different word forms there are within it

This is some sort of order statistic. Ignoring how the sample is
drawn, and assuming it is just a simple random sample (which it
probably is not), then of course you can get a point estimate of the
quantity of interest (Number_Of_Different_WordForms). The sampling
distribution of this estimator is another matter altogether: it is
quite probable that the variance of the estimate is very high.
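
As a sketch of how one might put a number on that variance under the
(optimistic) assumption of simple random sampling, draw repeated
100,000-word samples from an invented Zipf-like population and look at
the spread of the type counts (Python; the vocabulary size and word
probabilities are made up, and real text sampled in whole-document
chunks could be expected to do considerably worse, per the inflation
point above):

import numpy as np

rng = np.random.default_rng(3)

vocab = 20_000                          # hypothetical population vocabulary
probs = 1.0 / np.arange(1, vocab + 1)   # Zipf-ish word probabilities
probs /= probs.sum()

type_counts = []
for _ in range(100):
    sample = rng.choice(vocab, size=100_000, replace=True, p=probs)
    type_counts.append(len(np.unique(sample)))

type_counts = np.array(type_counts)
print("mean number of different word forms:", type_counts.mean())
print("sd across repeated samples         :", type_counts.std())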

jh > Also, I would
jh >like to see how the word frequencies within the sample match up with a
jh >control list of frequencies taken from a larger mixed-text corpus.

Again, using chi-squared or some other rubric, there is no problem in
making an estimate. The distribution of that estimate, and what you
can say about it (from theory or whatever) and about its 'quality', is
something else again.
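
The mechanics really are trivial, which is rather the point. A minimal
sketch (Python; the counts and reference proportions are invented) of
a chi-squared comparison of sample word frequencies against
expectations scaled from a control list:

import numpy as np
from scipy.stats import chi2

sample_counts = np.array([120, 80, 40, 15, 5])        # observed in the sample
ref_props = np.array([0.45, 0.30, 0.15, 0.07, 0.03])  # from the control corpus

expected = ref_props * sample_counts.sum()
stat = ((sample_counts - expected) ** 2 / expected).sum()
df = len(sample_counts) - 1
print("chi-squared:", round(stat, 2), " p-value:", chi2.sf(stat, df))

What that p-value is worth, given that the sample was not drawn at
random in the first place, is the open question.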

jh >It would also be useful to have a list of the words that occur
jh >significantly more frequently within the sample than they do
jh >within the language as a whole.

Again, it is obviously mechanically straightforward, but the outcome
is of unknown 'quality'. Even on the assumption of simple random
sampling, you have a process of doing some very large number of t
tests or whatever (at least one for each distinct word form), with all
the well-known attendant problems of multiple comparisons. This 'list
of different words' could be expected to be VERY unstable, simply as a
result of sampling and of the opportunistic search that was undertaken
to build it. An analogy might be with the out-of-sample performance of
classifiers: reductions in performance from 90% correct on the
training data to just better than chance on a new sample are not
uncommon. So it would be well to take such a list with a degree of
scepticism.
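
A sketch of the instability I have in mind (Python; an invented
Zipf-like "language", and a crude per-word z test standing in for the
t tests): build the list of "significantly over-used" words twice,
from two samples of the same size drawn from the reference
distribution itself, and see how little the two lists agree. Every
word flagged here is a false positive, and the overlap between the two
lists is essentially chance.

import numpy as np

rng = np.random.default_rng(4)

vocab = 5_000
ref = 1.0 / np.arange(1, vocab + 1)
ref /= ref.sum()                        # reference proportions

def flagged_words(n_tokens=50_000, z_cut=2.0):
    counts = rng.multinomial(n_tokens, ref)
    p_hat = counts / n_tokens
    z = (p_hat - ref) / np.sqrt(ref * (1 - ref) / n_tokens)
    return set(np.where(z > z_cut)[0])  # word ids flagged as over-used

list_a, list_b = flagged_words(), flagged_words()
jaccard = len(list_a & list_b) / max(len(list_a | list_b), 1)
print(len(list_a), "and", len(list_b), "words flagged;",
      "overlap (Jaccard):", round(jaccard, 2))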

John Aitchison <jaitchison@acm.org>
Data Sciences Pty Ltd
Sydney, AUSTRALIA.