Re: Corpora: code for random selection of concordance lines

From: Rosie Jones
Date: Thu Mar 21 2002 - 20:56:10 MET

    On Thu, 21 Mar 2002, Tony Berber Sardinha wrote:
    > I wonder if anyone has a bit of perl or java code (or unix utilities)
    > for drawing an x number of lines at random from a concordance?

    I was going to post this as a private reply, then remembered that I began
    programming in perl after someone sent a snippet of perl for word-counting
    to the corpora list a number of years ago, and thought someone else might
    benefit in the same way...

    Assuming the concordance is small enough to fit in memory, the following
    code should work (though admittedly not tested with DOS line-breaks):

    --- begin perl code
    $numlinestoselect = shift; # get the number of lines from the command line
    $myfile = "myconcordance.txt"; # could also get this from the command line
    open(CONCORDANCE, $myfile) || die "Cannot open concordance file $myfile: $!\n";
    @lines = <CONCORDANCE>; # read ALL lines into memory;
    close(CONCORDANCE); # just to be tidy
    shift @lines; # get rid of the first line
    $totallines = scalar(@lines); # find out how many lines there are
    if ($totallines < $numlinestoselect) {
      die "Can't select more lines than there are\n";
    }
    $linessampled = 0;
    srand; # seed the random number generator
    while ($linessampled < $numlinestoselect) {
      $rand = int(rand($totallines)); # pick a line index with uniform probability
      if (! $seen[$rand]) { # don't want to select the same line twice
        print $lines[$rand];
        $seen[$rand] = 1;
        $linessampled++; # one more line towards our goal
      }
    }
    # end of perl code
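
    If the concordance is too large to fit in memory, one possible
    alternative is reservoir sampling, which reads the file one line at a
    time and keeps only the requested number of lines. The following is a
    rough sketch rather than a tested replacement: the sub name
    sample_lines, the script name sample.pl, and the command-line
    convention are all made up for illustration. Unlike the code above, it
    does not discard the first line, and it returns fewer lines instead of
    dying when the file is shorter than the number requested.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): return up to $k lines
# chosen uniformly at random from filehandle $fh, reading one line at a time.
sub sample_lines {
    my ($fh, $k) = @_;
    my @reservoir;
    my $seen = 0;                        # number of lines read so far
    while (my $line = <$fh>) {
        $seen++;
        if (@reservoir < $k) {
            push @reservoir, $line;      # fill the reservoir first
        } else {
            my $j = int(rand($seen));    # uniform integer in 0 .. $seen-1
            $reservoir[$j] = $line if $j < $k;   # replace with probability k/seen
        }
    }
    return @reservoir;
}

# Assumed usage: perl sample.pl 20 myconcordance.txt
if (@ARGV == 2) {
    my ($k, $file) = @ARGV;
    open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
    print sample_lines($fh, $k);
    close($fh);
}
```

    The selected lines come out in reservoir order rather than file order,
    so sort them afterwards if the original ordering matters.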


    Rosie Jones
    PhD student
    School of Computer Science
    Carnegie Mellon University

