RE: Corpora: code for random selection of concordance lines

From: Tolkin, Steve (Steve.Tolkin@FMR.COM)
Date: Fri Mar 22 2002 - 14:52:34 MET


    Do NOT use this approach! It can be pathologically slow.
    Consider what happens in the while loop
    if e.g. you ask for 999 samples from a 1000-line file.
    Getting the last few samples can take a very long time,
    as you repeatedly hit lines that have already been chosen.
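
    To put a rough number on this, here is a small sketch that adds up
    the expected number of rand() calls made by a "pick, and retry if
    already seen" loop. The function name expected_draws is only
    illustrative, and the sketch assumes every line is equally likely
    on every draw.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expected number of rand() calls made by a "pick, and retry if already
# seen" loop when drawing $want distinct lines out of $total.
# Once $k lines have been chosen, a draw hits a fresh line with
# probability ($total - $k) / $total, so the next fresh line costs
# $total / ($total - $k) draws on average.
sub expected_draws {
    my ($total, $want) = @_;
    my $sum = 0;
    $sum += $total / ($total - $_) for 0 .. $want - 1;
    return $sum;
}

printf "%.0f draws expected for 999 of 1000\n", expected_draws(1000, 999);
printf "%.0f draws expected for the 999th line alone\n", 1000 / (1000 - 998);
```

    The cost per new line grows as fewer unseen lines remain, and the
    final line of a full-size sample is the most expensive of all.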

    Instead use the Fisher-Yates algorithm, described in the
    recent post by Alexander Clark [asc@aclark.demon.co.uk]
    with this same subject.
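
    For readers who do not have that post to hand, the following is a
    minimal sketch of a partial Fisher-Yates shuffle used for sampling.
    It is not the code from Alexander Clark's post; the names
    sample_lines and @pool, and the made-up input lines, are
    assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return $want distinct elements of @$items, chosen at random, using a
# partial Fisher-Yates shuffle: exactly one rand() call per selected
# element, with no retries.
sub sample_lines {
    my ($items, $want) = @_;
    my @pool = @$items;                 # work on a copy of the input
    die "Can't select more lines than there are\n" if $want > @pool;
    for my $i (0 .. $want - 1) {
        # pick a position from the not-yet-fixed part, $i .. $#pool
        my $j = $i + int rand(@pool - $i);
        @pool[$i, $j] = @pool[$j, $i];  # swap the pick into slot $i
    }
    return @pool[0 .. $want - 1];
}

my @lines  = map { "line $_\n" } 1 .. 1000;   # stand-in for the concordance
my @sample = sample_lines(\@lines, 999);
print scalar(@sample), " lines selected\n";
```

    Because each chosen line is swapped out of the remaining pool, no
    draw is ever wasted, however many samples are requested.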

    Note that Fisher-Yates is not "complete", in that there
    are many possible shuffles that are never returned.
    It is "fair", in that all the
    results have an equal probability of being chosen.
    Search for "fisher-yates perl abigail"
    for more details, and/or see
    http://www.bumppo.net/lists/fun-with-perl/2000/07/msg00016.html
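
    One possible reading of "not complete" is that a random number
    generator started from a fixed-size seed can only produce a limited
    number of sequences, and so only that many shuffles. The sketch
    below assumes a 32-bit seed, which is an assumption for the sake of
    the arithmetic and not something stated above; the function name is
    likewise illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ASSUMPTION: the random number generator is started from a 32-bit seed,
# so it can produce at most 2**32 different sequences of numbers.
my $sequences = 2**32;

# Find the smallest list length n whose number of orderings (n!)
# exceeds the given limit.
sub smallest_incomplete_length {
    my ($limit) = @_;
    my ($n, $orderings) = (1, 1);
    $orderings *= ++$n while $orderings <= $limit;
    return $n;
}

print smallest_incomplete_length($sequences), "\n";   # prints 13
```

    Under that assumption, a list of 13 or more items already has more
    orderings than there are seeds.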
     
    Hopefully helpfully yours,
    Steve

    -- 
    Steven Tolkin          steve.tolkin@fmr.com      617-563-0516 
    Fidelity Investments   82 Devonshire St. V8D     Boston MA 02109
    There is nothing so practical as a good theory.  Comments are by me, 
    not Fidelity Investments, its subsidiaries or affiliates.
    

    > -----Original Message-----
    > From: Rosie Jones [mailto:rosie+@cs.cmu.edu]
    > Sent: Thursday, March 21, 2002 2:56 PM
    > To: Tony Berber Sardinha
    > Cc: corpora list - messages to list
    > Subject: Re: Corpora: code for random selection of concordance lines
    >
    >
    > On Thu, 21 Mar 2002, Tony Berber Sardinha wrote:
    > > I wonder if anyone has a bit of perl or java code (or unix utilities)
    > > for drawing an x number of lines at random from a concordance?
    > [...]
    >
    > I was going to post this as a private reply, then remembered that I began
    > programming in perl after someone sent a snippet of perl for word-counting
    > to the corpora list a number of years ago, and thought someone else might
    > benefit in the same way...
    >
    > Assuming the concordance is small enough to fit in memory, the following
    > code should work (though admittedly not tested with DOS line-breaks):
    >
    > --- begin perl code
    > #!/usr/bin/perl
    > $numlinestoselect = shift; # get the number of lines from the command line
    > $myfile = "myconcordance.txt"; # could also get this from the command line
    > open(CONCORDANCE, $myfile) || die "Cannot open concordance file $myfile\n";
    > @lines = <CONCORDANCE>; # read ALL lines into memory;
    > close(CONCORDANCE); # just to be tidy
    > shift @lines; # get rid of the first line
    > $totallines = scalar(@lines); # find out how many lines there are
    > if ($totallines < $numlinestoselect) { die "Can't select more lines than there are\n" };
    > $linessampled = 0;
    > srand; # seed the random number generator
    > while ($linessampled < $numlinestoselect) {
    >     $rand = rand($totallines); # pick a line with uniform probability
    >     if (! $seen[$rand]) { # don't want to select the same line twice
    >         print $lines[$rand];
    >         $seen[$rand] = 1;
    >         $linessampled++; # one more line towards our goal
    >     }
    > }
    > # end of perl code
    > ---
    >
    > Rosie Jones
    > PhD student
    > School of Computer Science
    > Carnegie Mellon University
    > rosie@cs.cmu.edu http://www.cs.cmu.edu/~rosie/


