RE: Corpora: code for random selection of concordance lines

From: Rosie Jones (rosie+@cs.cmu.edu)
Date: Fri Mar 22 2002 - 23:00:19 MET

  • Next message: Bruce L. Lambert, Ph.D.: "Re: Corpora: code for random selection of concordance lines"

    Interestingly, in practice this method is not appreciably slower. For the
    cited example, selecting 999 lines from a 1000 line concordance, (and
    doing it 100 times over) doing the fisher yates method on my machine takes
    1 second, while the sampling-and-checking method takes 3 seconds. When
    selecting 999 lines from a 10,000 line concordance (100 times), the
    sampling-and-checking method takes 1 second, while the fisher yates
    algorithm (which after all has to shuffle 10,000 lines) takes 14 seconds.
    Samples of up to 75% of the total concordance size are faster with the
    sampling-and-checking method.

    So for large concordance size, with sample size of 10%-75% of concordance
    size, the apparently more naive algorithm appears to be more efficient.
    For small concordance size, you probably won't notice the difference in
    time anyway. For large concordance size and large sample size, the fisher
    yates method is probably the best way.

    Rosie Jones.

    "There's more than one way to do it"

     On Fri, 22 Mar 2002, Tolkin, Steve wrote:

    > Do NOT use this approach! It can be pathologiclaly slow.
    > Consider what happens in the while loop
    > if e.g. you ask for 999 samples from a 1000 line file.
    > Getting the last few samples can take a very long time,
    > as you repeatedly hit lines that have already used chosen.
    >
    > Instead use the Fisher-Yates algorithm, described in the
    > recent post by Alexander Clark [asc@aclark.demon.co.uk]
    > with this same subject.

    [...]



    This archive was generated by hypermail 2b29 : Fri Mar 22 2002 - 23:34:58 MET