Re: Corpora: code for random selection of concordance lines

From: Radu Florian (rflorian@cs.jhu.edu)
Date: Fri Mar 22 2002 - 23:51:52 MET

  • Next message: edwards@ICSI.Berkeley.EDU: "Re: Corpora: code for random selection of concordance lines"

    Rosie Jones wrote:
    > Interestingly, in practice this method is not appreciably slower. For the cited example,
    > selecting 999 lines from a 1000 line concordance, (and doing it 100 times over) doing the
    > fisher yates method on my machine takes 1 second, while the sampling-and-checking method takes
    > 3 seconds. When selecting 999 lines from a 10,000 line concordance (100 times), the
    > sampling-and-checking method takes 1 second, while the fisher yates algorithm (which after all
    > has to shuffle 10,000 lines) takes 14 seconds. Samples of up to 75% of the total concordance
    > size are faster with the sampling-and-checking method.
    >
    > So for large concordance size, with sample size of 10%-75% of concordance size, the apparently
    > more naive algorithm appears to be more efficient. For small concordance size, you probably
    > won't notice the difference in time anyway. For large concordance size and large sample size,
    > the fisher yates method is probably the best way.
    >
    > Rosie Jones.
    >
    > "There's more than one way to do it"
    >
    > On Fri, 22 Mar 2002, Tolkin, Steve wrote:
    >
    >
    >> Do NOT use this approach! It can be pathologiclaly slow. Consider what happens in the while
    >> loop if e.g. you ask for 999 samples from a 1000 line file. Getting the last few samples can
    >> take a very long time, as you repeatedly hit lines that have already used chosen.
    >>
    >> Instead use the Fisher-Yates algorithm, described in the recent post by Alexander Clark
    >> [asc@aclark.demon.co.uk] with this same subject.
    >>
    >
    > [...]

    I've used the following method, which is very close to Rosie Jones', only that it avoids the little
    problem that was mentioned: (I just adapted the previous Perl script for convenience)

    #!/usr/bin/perl
    $numlinestoselect = shift; # get the number of lines from the command line
    $myfile = "myconcordance.txt"; # could also get this from the command line
    open(CONCORDANCE, $myfile) || die "Cannot open concordance file
    $myfile\n";
    @lines = <CONCORDANCE>; # read ALL lines into memory;
    close(CONCORDANCE); # just to be tidy
    shift @lines; # get rid of the first line
    $totallines = scalar(@lines); # find out how many lines there are
    if ($totallines < $numlinestoselect) { die "Can't select more lines than
    there are\n" };
    $linessampled = 0;
    srand; # seed the random number generator
    while ($linessampled < $numlinestoselect) {
       $rand = rand($totallines); # pick a line with uniform probability
       # *** here's the difference
       # *** switch the selected position with the last one in the list
       print $lines[$rand];
       ($lines[$rand], $lines[$totallines]) = ($lines[$totallines], $lines[$rand]);
       $totallines--;
       $linesampled++;
    # if (! $seen[$rand]) { # don't want to select the same line twice
    # print $lines[$rand];
    # $seen[$rand] = 1;
    # $linessampled++; # one more line towards our goal
    # }
    }
    # end of perl code

    -- 
    

    --Radu Florian



    This archive was generated by hypermail 2b29 : Fri Mar 22 2002 - 23:48:27 MET