Re: Corpora: corpus equilibrium?

Ted E. Dunning (ted@aptex.com)
Thu, 9 Oct 1997 17:26:54 -0700

see efron87 below for a nice description of a species sampling problem
as applied to language. this is a good examination of the simplest form
of the problem.

the problem that you described (given a random variable, estimate
something) is really a subset of the real problem.

the real problem is given a random variable and a number of different
experiments you can do at various specified costs, estimate
something within a fixed budget. stratified sampling is a sort of
subset of this framework in which the cost of each sample is equal.

to make this concrete, suppose that you can get a certain amount of
usenet news text per day, and that you can get a certain amount of
newswire text per dollar (pound, ecu...) and that you can get a
certain much smaller amount of spoken language per dollar and that you
can digitize some amount of out of copyright literary text per dollar.

now further suppose that you have $100,000 and 100 days to build a
corpus. how do you allocate your resources to optimally estimate some
quantity (e.g. the frequency of the word "bank")? and how do you
allocate your resources to optimally estimate 10^7 parameters (a
speech recognition language model)?
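
for the single-quantity case, a rough sketch of the textbook answer
may help. it assumes things not stated above: only the dollar budget
binds (the 100 day limit is ignored), every source has a per-sample
dollar cost, and you already have a guess at each source's weight in
the target population and its within-source standard deviation. under
those assumptions the classical cost-aware allocation for a stratified
mean puts n_h proportional to W_h * S_h / sqrt(c_h). the function name
and every number below are hypothetical, chosen only for illustration.

```python
import math


def cost_aware_allocation(strata, budget):
    """Split a fixed budget across strata to minimise the variance of a
    weighted-mean estimate: n_h proportional to W_h * S_h / sqrt(c_h).

    strata: list of (name, weight W_h, std dev S_h, cost per sample c_h)
    Returns a dict of (fractional) sample counts per stratum.
    """
    # Relative sample size for each stratum, before scaling to the budget.
    scores = {name: w * s / math.sqrt(c) for name, w, s, c in strata}
    # Money spent if every stratum took exactly `score` samples.
    spend_per_unit = sum(c * scores[name] for name, w, s, c in strata)
    # Scale so that total spending equals the budget.
    scale = budget / spend_per_unit
    return {name: scale * score for name, score in scores.items()}


# Hypothetical inputs: weights, spreads and costs are made up.
example_strata = [
    ("usenet",   0.40, 0.02, 0.001),
    ("newswire", 0.30, 0.03, 0.010),
    ("spoken",   0.20, 0.05, 1.000),
    ("literary", 0.10, 0.01, 0.100),
]

for name, n in cost_aware_allocation(example_strata, 100_000).items():
    print(f"{name:10s} {n:14.1f} samples")
```

with equal costs this reduces to ordinary optimal allocation for
stratified sampling. the 10^7 parameter case does not reduce to
anything this tidy.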

and finally, given multiple competing goals with specified value
(political value, mostly), how do you come out smelling like a rose?

the answer to this last problem is clearly the easiest.

(SPOILER: you can't win).

@article{efron87,
author={Bradley Efron and Ronald Thisted},
year=1987,
title={Did Shakespeare write a newly discovered poem?},
journal={Biometrika},
volume=74,
pages={445-455}
}

>>>>> "ja" == John Aitchison <jaitchison@acm.org> writes:

[... to balance corpora or not ...]

ja> I am rather interested in the sampling issues here... perhaps
ja> this is not germane to the original problem (which I
ja> understood to be that GIVEN that I have two or more "samples"
ja> [datasets, perhaps more accurately] what can I reasonably do
ja> with them?) but I am curious anyway.