Corpora: Summary - code for random selection of concordance lines

From: Tony Berber Sardinha (tony4@uol.com.br)
Date: Fri Mar 22 2002 - 21:32:30 MET

    Dear list members

    Thanks to everyone who so kindly responded to my query:

    Alexander Clark
    David Graff
    Rosie Jones
    Adam Kilgarriff
    Bruce Lambert
    Steve Tolkin

    Summary of replies follows:

    ================

    Alexander Clark:

    shuffle.pl < file | head -n N

    #!/usr/bin/perl -w
    # shuffle the lines at random
    # Using Fisher-Yates algorithm

    use strict;
    my @lines = (<>);
    # walk down from the last index; also safe when the input is empty
    for (my $i = $#lines; $i > 0; $i--) {
        my $j = int rand($i+1);   # random index in 0..$i
        ($lines[$i], $lines[$j]) = ($lines[$j], $lines[$i]);
    }
    print @lines;

    =========

    David Graff

    (number of lines set to 20 :)

        $ tail -n +1 conc | perl -pe '$r=rand(); s/^/$r /;' | sort -n |
            head -n 20 | cut -f2- "-d "

    ===========

    Rosie Jones

    #!/usr/bin/perl
    $numlinestoselect = shift; # get the number of lines from the command line
    $myfile = "myconcordance.txt"; # could also get this from the command line
    open(CONCORDANCE, $myfile) || die "Cannot open concordance file $myfile\n";
    @lines = <CONCORDANCE>; # read ALL lines into memory;
    close(CONCORDANCE); # just to be tidy
    shift @lines; # get rid of the first line
    $totallines = scalar(@lines); # find out how many lines there are
    if ($totallines < $numlinestoselect) { die "Can't select more lines than there are\n" };
    $linessampled = 0;
    srand; # seed the random number generator
    while ($linessampled < $numlinestoselect) {
      $rand = int(rand($totallines)); # pick a line index with uniform probability
      if (! $seen[$rand]) { # don't want to select the same line twice
        print $lines[$rand];
        $seen[$rand] = 1;
        $linessampled++; # one more line towards our goal
      }
    }
    # end of perl code

    =============

    Adam Kilgarriff

    (number of lines set to 100 :)

    #!/usr/local/bin/perl

    $numwanted=100;
    @rand = sort map(rand(1)." $_", <>);
    for (@rand){
        $x++;
        s/^\S+ //;   # strip the random sort key (it may be in exponent form)
        print;
        exit if $x==$numwanted;
    }

    ============

    Bruce Lambert

    #!/bin/sh

    IFILE="$1"
    N="$2"

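    # prefix each line with a random key, sort on it, then strip the key;
    # sub() removes only the key, so spacing inside the line is preserved
    gawk 'BEGIN {srand()} {print rand(),$0}' "$IFILE" | sort |
        gawk '{sub(/^[^ ]+ /,""); print}' | head -n "$N"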

    On a Unix system that has gawk: Copy this into a file called 'randomize'.
    At the prompt (~>) type:

    ~> chmod +x randomize

    then

    ~> randomize some_input_file N > some_output_file

    N is the number of lines desired in the output. If your system does not
    have gawk, you can download and install it or try awk (you'll need to
    change gawk to awk in the script).

    ============

    Steve Tolkin

    (with reference to Rosie Jones's reply)

    Do NOT use this approach! It can be pathologically slow.
    Consider what happens in the while loop
    if e.g. you ask for 999 samples from a 1000 line file.
    Getting the last few samples can take a very long time,
    as you repeatedly hit lines that have already been chosen.
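
To make the slowdown concrete: once 998 of the 1000 lines have been
seen, only 2 in 1000 draws hit a new line, so the final sample alone
needs about 500 attempts on average. The following is a rough
simulation of that rejection loop, not code from any of the replies.
The name count_draws is hypothetical, and plain awk is assumed.

```shell
# count_draws TOTAL WANTED: simulate the rejection loop and print
# how many random draws it took (hypothetical helper, for illustration).
count_draws() {
    awk -v total="$1" -v wanted="$2" '
        BEGIN {
            srand()                           # seed from the time of day
            if (wanted > total) wanted = total  # otherwise the loop never ends
            draws = 0; sampled = 0
            while (sampled < wanted) {
                r = int(rand() * total)       # candidate line index, 0..total-1
                draws++
                if (!(r in seen)) {           # only unseen lines count
                    seen[r] = 1
                    sampled++
                }
            }
            print draws
        }
    '
}
```

Running "count_draws 1000 999" a few times should print totals that
vary from run to run but average roughly 6,500 draws for 999 lines.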

    Instead use the Fisher-Yates algorithm, described in the
    recent post by Alexander Clark [asc@aclark.demon.co.uk]
    with this same subject.

    Note that Fisher-Yates is not "complete", in that there
    are many possible shuffles that are never returned.
    It is "fair", in that all the results have an equal
    probability of being chosen.
    Search for "fisher-yates perl abigail"
    for more details, and/or see
    http://www.bumppo.net/lists/fun-with-perl/2000/07/msg00016.html
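
When only N lines are wanted, the Fisher-Yates loop can stop after N
swaps instead of shuffling the whole file, so the work is fixed at N
draws however close N is to the file size. A minimal sketch under
that assumption follows; the name sample_lines is hypothetical, the
whole file is held in memory, and plain awk is assumed.

```shell
# sample_lines N [FILE...]: print N distinct lines chosen at random.
# Hypothetical helper; reads standard input when no FILE is given.
sample_lines() {
    n="$1"
    shift
    awk -v n="$n" '
        BEGIN { srand() }                 # seed from the time of day
        { line[NR] = $0 }                 # hold every line in memory
        END {
            if (n > NR) n = NR            # cannot return more lines than exist
            # Partial Fisher-Yates: settle positions 1..n one at a time,
            # swapping each with a random position from i..NR.
            for (i = 1; i <= n; i++) {
                j = i + int(rand() * (NR - i + 1))
                tmp = line[i]; line[i] = line[j]; line[j] = tmp
                print line[i]
            }
        }
    ' "$@"
}
```

For example, "sample_lines 20 conc" would print 20 distinct lines of
the file conc in random order. Because srand() seeds from the clock,
two runs within the same second may return the same selection.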

    ================

    Thanks again to all who took the time to reply.

    cheers
    tony.
    -------------------------------------
    Dr Tony Berber Sardinha
    LAEL, PUC/SP
    (Catholic University of Sao Paulo, Brazil)
    tony4@uol.com.br
    http://lael.pucsp.br/~tony
    [New website]