the problem that you described (given a random variable, estimate
something) is really a subset of the real problem.
the real problem is given a random variable and a number of different
experiments you can do at various specified costs, estimate
something within a fixed budget. stratified sampling is the special
case of this framework in which every sample costs the same.
to make this concrete, suppose that you can get a certain amount of
usenet news text per day, and that you can get a certain amount of
newswire text per dollar (pound, ecu...) and that you can get a
certain much smaller amount of spoken language per dollar and that you
can digitize some amount of out of copyright literary text per dollar.
now further suppose that you have $100,000 and 100 days to build a
corpus. how do you allocate your resources to optimally estimate some
quantity (e.g. the frequency of the word "bank")? and how do you
allocate your resources to optimally estimate 10^7 parameters (a
speech recognition language model)?
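for the single-quantity case, one textbook answer is cost-aware
neyman allocation: sample stratum h at a rate proportional to
W_h * sigma_h / sqrt(c_h), scaled so the whole budget is spent.
a minimal sketch; the source names, weights, sigmas and per-sample
costs below are made up for illustration, not real corpus figures:

```python
import math

def allocate(strata, budget):
    """strata: list of (name, weight, sigma, unit_cost).
    returns {name: sample_count} minimizing the variance of a
    stratified estimator subject to sum(n_h * c_h) == budget."""
    # cost-aware neyman rates: n_h proportional to W_h * sigma_h / sqrt(c_h)
    rate = [w * s / math.sqrt(c) for _, w, s, c in strata]
    # scale factor k so that total spend k * sum(rate_h * c_h) equals budget
    k = budget / sum(r * c for r, (_, _, _, c) in zip(rate, strata))
    return {name: k * r for (name, _, _, _), r in zip(strata, rate)}

# hypothetical sources: (name, population weight, variability, $ per sample)
sources = [
    ("usenet",   0.40, 1.0, 0.01),   # cheap, plentiful
    ("newswire", 0.35, 0.8, 0.10),
    ("speech",   0.05, 2.0, 5.00),   # expensive, high-variance
    ("ocr_lit",  0.20, 1.2, 0.50),
]
plan = allocate(sources, budget=100_000)
```

with these made-up numbers the cheap, plentiful source soaks up most
of the sample, which is the intuition the allocation formalizes;
estimating 10^7 parameters at once is a genuinely harder problem,
since each parameter would prefer a different allocation.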
and finally, given multiple competing goals with specified value
(political value, mostly), how do you come out smelling like a rose?
the answer to this last problem is clearly the easiest.
(SPOILER: you can't win).
@article{efron87,
author={Ronald Thisted and Bradley Efron},
year=1987,
title={Did Shakespeare write a newly-discovered poem?},
journal={Biometrika},
volume=74,
pages={445--455}
}
>>>>> "ja" == John Aitchison <jaitchison@acm.org> writes:
[... to balance corpora or not ...]
ja> I am rather interested in the sampling issues here... perhaps
ja> this is not germane to the original problem (which I
ja> understood to be that GIVEN that I have two or more "samples"
ja> [datasets, perhaps more accurately] what can I reasonably do
ja> with them?) but I am curious anyway.