Re: Sampling

Adam Kilgarriff (ak28@it-research-institute.brighton.ac.uk)
Thu, 7 Dec 95 10:21:59 GMT

I do random samples like this:

put each object you want to sample (eg, a sentence, a paragraph, or an
identifier for one) on a line of its own. (This will be the difficult bit,
but depends entirely on the format/markup of the corpus and the types
of units you want to sample, so it's not possible to give general
help.)
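
For one common case only, here is a rough sketch: a plain-text corpus
whose paragraphs are separated by blank lines. The function name
paras_to_lines is made up for illustration, and the blank-line
convention is an assumption about your corpus, not something that holds
in general. It relies on awk's "paragraph mode" (RS set to the empty
string).

```shell
# paras_to_lines FILE: print each blank-line-separated paragraph of FILE
# on a single line. (Hypothetical helper; assumes paragraphs are
# separated by one or more blank lines.)
paras_to_lines() {
    # RS = "" makes awk read one paragraph per record and also split
    # fields on newlines; assigning $1 = $1 rebuilds the record with
    # single spaces between fields, which removes the internal newlines.
    awk 'BEGIN { RS = "" } { $1 = $1; print }' "$1"
}
```

Something like "paras_to_lines corpus.txt > infile" would then give a
file in the one-unit-per-line shape, where corpus.txt is a placeholder
name. Note that this also collapses runs of spaces and tabs inside a
paragraph.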

then, in unix (mks-awk, gawk or nawk will do this, though the basic-grade
awk on my system won't; all Unixes come armed with nawk, I think):

gawk 'BEGIN{srand()} {print rand(),$0}' infile | sort | gawk '{sub($1 " ", "");print}' > randomfile

and the sorted file is now randomly ordered so, eg,

head -50 randomfile > outfile2

gives you a random sample, size 50, and

head -100 randomfile | tail -50 > outfile3

gives you another.
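
The same recipe can be wrapped up as two small shell functions. This is
only a sketch: the names shuffle_lines and nth_sample are invented here,
and the details (a tab between the random key and the line, cut to strip
the key, srand() to seed from the clock) are my assumptions rather than
part of the one-liner above.

```shell
# shuffle_lines FILE: print the lines of FILE in random order.
# (Hypothetical helper name.)
shuffle_lines() {
    # Prefix each line with a random key and a tab, sort on the key,
    # then cut the key off again. srand() seeds from the time of day,
    # so two runs within the same second produce the same order.
    awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' "$1" |
        LC_ALL=C sort -n |
        cut -f2-
}

# nth_sample FILE SIZE K: print the K-th block of SIZE lines from an
# already shuffled FILE, counting K from 1. Blocks do not overlap, so
# different K give disjoint samples. (Hypothetical helper name.)
nth_sample() {
    start=$(( ($3 - 1) * $2 + 1 ))
    end=$(( $3 * $2 ))
    sed -n "${start},${end}p" "$1"
}
```

With these, "shuffle_lines infile > randomfile" followed by
"nth_sample randomfile 50 1" and "nth_sample randomfile 50 2" should
give the same two samples as the head and tail commands above.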

adam

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff tel: (44) 1273 642919
Research Fellow (44) 1273 642900
Information Technology Research Institute fax: (44) 1273 606653
University of Brighton
Lewes Road email:
Brighton BN2 4AT ak28@itri.bton.ac.uk
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%