On Thu, 21 Mar 2002, Tony Berber Sardinha wrote:
> I wonder if anyone has a bit of perl or java code (or unix utilities)
> for drawing an x number of lines at random from a concordance?
[...]
I was going to post this as a private reply, then remembered that I began
programming in perl after someone sent a snippet of perl for word-counting
to the corpora list a number of years ago, and thought someone else might
benefit in the same way...
Assuming the concordance is small enough to fit in memory, the following
code should work (though admittedly not tested with DOS line-breaks):
--- begin perl code
#!/usr/bin/perl
$numlinestoselect = shift; # get the number of lines from the command line
$myfile = "myconcordance.txt"; # could also get this from the command line
open(CONCORDANCE, $myfile) || die "Cannot open concordance file $myfile: $!\n";
@lines = <CONCORDANCE>; # read ALL lines into memory;
close(CONCORDANCE); # just to be tidy
shift @lines; # get rid of the first line
$totallines = scalar(@lines); # find out how many lines there are
if ($totallines < $numlinestoselect) { die "Can't select more lines than there are\n" };
$linessampled = 0;
srand; # seed the random number generator
while ($linessampled < $numlinestoselect) {
    $rand = int(rand($totallines)); # pick a line index (0 .. $totallines-1) with uniform probability
    if (! $seen[$rand]) { # don't want to select the same line twice
        print $lines[$rand];
        $seen[$rand] = 1;
        $linessampled++; # one more line towards our goal
    }
}
# end of perl code
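
If the concordance is too large to read into memory, one possible alternative is reservoir sampling, which reads the file once and never holds more than the requested number of lines. What follows is only a sketch of that idea and is not part of the script above: the sub name reservoir_sample and the in-memory demo data are made up for illustration, and unlike the script above it does not skip the first line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): return up to $k lines chosen
# uniformly at random from filehandle $fh, keeping at most $k lines in memory.
sub reservoir_sample {
    my ($fh, $k) = @_;
    my @sample;
    return @sample if $k <= 0;
    my $n = 0;                              # number of lines read so far
    while (my $line = <$fh>) {
        $n++;
        if (@sample < $k) {
            push @sample, $line;            # fill the reservoir first
        } else {
            my $j = int(rand($n));          # uniform index in 0 .. $n-1
            $sample[$j] = $line if $j < $k; # keep the new line with probability k/n
        }
    }
    return @sample;
}

# Demo on an in-memory "file" of ten made-up lines
my $text = join '', map { "concordance line $_\n" } 1 .. 10;
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!\n";
print reservoir_sample($fh, 3);
close($fh);
```

To use it on a real file, open the concordance in place of the in-memory string and pass that filehandle instead. If the file has fewer lines than requested, the sketch simply returns all of them rather than dying.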
---
Rosie Jones
PhD student
School of Computer Science
Carnegie Mellon University
rosie@cs.cmu.edu
http://www.cs.cmu.edu/~rosie/