RE: Corpora: code for random selection of concordance lines

From: Rosie Jones (rosie+@cs.cmu.edu)
Date: Fri Mar 22 2002 - 23:00:19 MET

Next message: Bruce L. Lambert, Ph.D.: "Re: Corpora: code for random selection of concordance lines"

Previous message: Rada Mihalcea: "Corpora: CFP: Deadline Extension for ACL Workshop on "Word Sense Disambiguation""
In reply to: Tolkin, Steve: "RE: Corpora: code for random selection of concordance lines"
Reply: Radu Florian: "Re: Corpora: code for random selection of concordance lines"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Interestingly, in practice this method is not appreciably slower. For the
cited example, selecting 999 lines from a 1000 line concordance, (and
doing it 100 times over) doing the fisher yates method on my machine takes
1 second, while the sampling-and-checking method takes 3 seconds. When
selecting 999 lines from a 10,000 line concordance (100 times), the
sampling-and-checking method takes 1 second, while the fisher yates
algorithm (which after all has to shuffle 10,000 lines) takes 14 seconds.
Samples of up to 75% of the total concordance size are faster with the
sampling-and-checking method.

So for large concordance size, with sample size of 10%-75% of concordance
size, the apparently more naive algorithm appears to be more efficient.
For small concordance size, you probably won't notice the difference in
time anyway. For large concordance size and large sample size, the fisher
yates method is probably the best way.

Rosie Jones.

"There's more than one way to do it"

On Fri, 22 Mar 2002, Tolkin, Steve wrote:

> Do NOT use this approach! It can be pathologiclaly slow.
> Consider what happens in the while loop
> if e.g. you ask for 999 samples from a 1000 line file.
> Getting the last few samples can take a very long time,
> as you repeatedly hit lines that have already used chosen.
>
> Instead use the Fisher-Yates algorithm, described in the
> recent post by Alexander Clark [asc@aclark.demon.co.uk]
> with this same subject.

[...]

Next message: Bruce L. Lambert, Ph.D.: "Re: Corpora: code for random selection of concordance lines"
Previous message: Rada Mihalcea: "Corpora: CFP: Deadline Extension for ACL Workshop on "Word Sense Disambiguation""
In reply to: Tolkin, Steve: "RE: Corpora: code for random selection of concordance lines"
Reply: Radu Florian: "Re: Corpora: code for random selection of concordance lines"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Fri Mar 22 2002 - 23:34:58 MET