random array elements and speed

Jacob JKW

I've got an array of 400 or so elements from which I want to select a
small, variable number of random elements with each pass of a loop
that runs 100 million times.

This is what I've been doing:

I cached the unmodified array, and then with each pass enumerate it as
a new array and take that reference. I started off by using
Fisher-Yates, but then found that because I was only interested in a
small number of elements I was better off simply splicing a random
element from the array. (I know that the FAQ cautions against this in
the general case, but in this instance splicing has proved faster.
Splicing is actually also faster than swapping the random element with
the last element and then "pop"-ing it off the end of the array --
that came as a bit of a surprise to me.) Nevertheless, it seems that
the operation is still pretty slow.

Here's a contrived snippet (I'm actually using the Mersenne Twister as
implemented in Math::Random::MT instead of perl's built-in rand
function as my PRNG, but that's of course beside the point):

# START code snippet

my $ARRAY_ra = [ (qw(1 2 3 4 5 6 7 8 9 10 11 12 13)) x 16 ];

for (1 .. 100_000) {
    my $copy_ra = [ @{ $ARRAY_ra } ];
    for (1 .. 10) {
        # select 10 random elements from the array
        # in reality all that is known about the actual number of
        # elements is that the number is small relative to the size
        # of the array
        &get_ele( $copy_ra );
    }
}

sub get_ele {
    my $ar = shift;

    # generate a random index of an element in the array
    my $r = int(rand($#{$ar} + 1));

    # return selected element, modifying array in place
    return splice @{$ar}, $r, 1;
}

## END SNIPPET

So, any thoughts on how to speed this up?


Thanks,
J.
 
Uri Guttman

JJ> # START code snippet

JJ> my $ARRAY_ra = [(qw(1 2 3 4 5 6 7 8 9 10 11 12 13)) x 16]

why the mixed case names? and ra (i assume it means ref to array) is a
poor (meaning uncommon) suffix. and redundant since you also have ARRAY
there.

JJ> for (1 .. 100_000) {
JJ> my $copy_ra = [ @{ $ARRAY_ra } ];

i bet that copy is what is killing you. i am sure there are ways to do
this without copying. if all you want is 10 elements each time, just
picking 10 indices and not allowing dups (a short hash or even a scan of
the short array will do) is all you need.

JJ> for (1 .. 10) {
JJ> # select 10 random elements from the array
JJ> # in reality all that is known about the actual number of
JJ> # elements is that the number is small relative to the size
JJ> # of the array
JJ> &get_ele( $copy_ra );

don't use & when calling subs. one, it isn't needed, and two, if used
without () it will do potentially nasty things. perldoc perlsub for
more on this.

JJ> }
JJ> };

JJ> sub get_ele {
JJ> my $ar = shift;

$aref or even just $arr or $array are fine. hard to have an array in
scalar unless it is a ref.

JJ> # generate a random index of an element in the array
JJ> my $r = int(rand($#{$ar} + 1));

the idiom for selecting a random index from an array is:

rand( @array )

yours is slower due to the + 1 and the unneeded int. why is int()
unneeded? because the expression in an array index is provided integer
context, which means it does the int for you.

JJ> # return selected element, modifying array in place;
JJ> return splice @{$ar}, $r, 1;}

that could be (and drop the whole $r thing):

return splice @{$ar}, rand( @{$ar} ), 1 ;

also you call this sub for each element of the selected array and that
is a lot of overhead. make this sub return all the selected elements and
you lower the call costs by a large amount.


so the sub could look like this (untested):

sub get_unique_rand_elements {

    my( $aref, $count ) = @_ ;

    my %seen ;
    my @indices ;

    # get the size here so we don't deref in the loop
    my $asize = @{$aref} ;

    while( $count > 0 ) {

        # need int here since we are checking for dups and floats
        # won't work as hash keys
        my $ind = int rand $asize ;

        next if $seen{ $ind } ;

        $seen{ $ind } = 1 ;
        push @indices, $ind ;

        $count-- ;
    }

    # array slice (not hash slice) of the selected indices
    return @{$aref}[@indices] ;
}

i would be very surprised if that isn't much faster than your
version. no copying, no splicing, fewer calls, etc.
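and you would call it once per pass, something like this (again
untested):

my @picks = get_unique_rand_elements( $ARRAY_ra, 10 ) ;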

uri
 
Jacob JKW

Hi Uri,

First off, thanks much for the reply.

Uri said:
JJ> # START code snippet

JJ> my $ARRAY_ra = [(qw(1 2 3 4 5 6 7 8 9 10 11 12 13)) x 16]

why the mixed case names? and ra (i assume it means ref to array) is a
poor (meaning uncommon) suffix. and redundant since you also have ARRAY
there.
I hear you. It's just the way I started doing things a while back. I
use capitals to represent a constant, and the _r?+ suffix to describe
the data structure. So _rhhah would be a reference to a hash of hashes
of arrays of hashes. I have no idea where I got that from. The name
"ARRAY" is used in this particular example only as an abstraction. ##
END BLATHER
JJ> for (1 .. 100_000) {
JJ> my $copy_ra = [ @{ $ARRAY_ra } ];

i bet that copy is what is killing you. i am sure there are ways to do
this without copying. if all you want is 10 elements each time, just
picking 10 indices and not allowing dups (a short hash or even a scan of
the short array will do) is all you need.
That's an excellent point in re the copying. Problem is I just don't
see an efficient way to avoid dupes without somehow "keeping score".
Remember also I don't necessarily know how many elements I'm going to
want until I actually have the last element.

don't use & when calling subs. one it isn't needed and two if used
without () it will do potentially nasty things. perldoc perlsub for more
on this.
Funnily enough, I was just browsing through perlsub. I do understand
the dangers of using & without parens as they relate to prototyping
(though as I don't understand prototyping at all I never use it) and
the @_ array, but I nevertheless still like to visually differentiate
between a builtin function and a user-defined one. Perhaps the better
solution, given my own peculiar wants, would not be to eliminate & but
rather to always use ().

JJ> sub get_ele {
JJ> my $ar = shift;

$aref or even just $arr or $array are fine. hard to have an array in
scalar unless it is a ref.
Well, maybe I was just using "ar" as an abbreviation for "array". ;-)
JJ> # generate a random index of an element in the array
JJ> my $r = int(rand($#{$ar} + 1));

the idiom for selecting a random index from an array is:

rand( @array )
Duly noted.
yours is slower due to the + 1 and the unneeded int. why is int()
unneeded? because the expression in an array index is provided integer
context, which means it does the int for you.

JJ> # return selected element, modifying array in place;
JJ> return splice @{$ar}, $r, 1;}

that could be (and drop the whole $r thing):

return splice @{$ar}, rand( @{$ar} ), 1 ;
Hmm, never thought of that. That's pretty cool. :)

also you call this sub for each element of the selected array and that
is a lot of overhead. make this sub return all the selected elements and
you lower the call costs by a large amount.
Problem is I don't know how many elements I need. I suppose I could
estimate and then just pull out additional elements later on if needed.
so the sub could look like this (untested):

sub get_unique_rand_elements {

    my( $aref, $count ) = @_ ;

    my %seen ;
    my @indices ;

    # get the size here so we don't deref in the loop
    my $asize = @{$aref} ;

    while( $count > 0 ) {

        # need int here since we are checking for dups and floats
        # won't work as hash keys
        my $ind = int rand $asize ;

        next if $seen{ $ind } ;

        $seen{ $ind } = 1 ;
        push @indices, $ind ;

        $count-- ;
    }

    # array slice (not hash slice) of the selected indices
    return @{$aref}[@indices] ;
}

i would be very surprised if that isn't much faster than your
version. no copying, no splicing, fewer calls, etc.
No doubt it would be.

However, I would actually need to maintain %seen outside the sub, in
case I were to need more elements than those in the returned array.
Also, I'm a little uncomfortable with the concept of running a process
that could theoretically never terminate. I know I initially posited
the problem by saying that I'd only need a small number of elements
from the original array, but small's a relative term, and I believe
the expected speed degradation would be exponential as the number of
required elements increased. Nevertheless, your points are certainly
very well taken.

Much appreciated,
J.
 
Jacob JKW

Jim said:
Jacob JKW said:
I've got an array of 400 or so elements from which I want to select a
small, variable number of random elements with each pass of a loop
that runs 100 million times.
[discussion of approach snipped]
# START code snippet

my $ARRAY_ra = [ (qw(1 2 3 4 5 6 7 8 9 10 11 12 13)) x 16 ];

for (1 .. 100_000) {
    my $copy_ra = [ @{ $ARRAY_ra } ];
    for (1 .. 10) {
        # select 10 random elements from the array
        # in reality all that is known about the actual number of
        # elements is that the number is small relative to the size
        # of the array
        &get_ele( $copy_ra );
    }
}

sub get_ele {
    my $ar = shift;

    # generate a random index of an element in the array
    my $r = int(rand($#{$ar} + 1));

    # return selected element, modifying array in place
    return splice @{$ar}, $r, 1;
}

## END SNIPPET

So, any thoughts on how to speed this up?

You can speed it up by not messing with the array and not having to
make a copy for each sample. Generate random indices in a separate
array and check for duplicates. Since your array is much bigger than
your sample, you won't have very many duplicates.
Hmm, I suppose the fact that both you and Uri gave rather similar
responses should suggest that the two of you are likely on to
something. I'm just uncomfortable using a method that is so incredibly
sensitive to the size of the desired array. I know I specifically
mentioned that said array would be small in relation to the master
array, but I just hate to think that speed would decay exponentially
if on a few particular iterations that condition breaks down. My fault.

Anyway, I'll need to think about this a bit more. Thank you very much
for the code idea. Appreciated.

J.
 
xhoster

Jacob JKW said:
No doubt it would be.

However, I would actually need to maintain %seen outside the array, in
case I were to need more elements than those in the returned array.

Doesn't it return the number of elements you asked for? If so, why would
you ask for a number that is less than the number you need?
Also, I'm a little uncomfortable with the concept of running a process
that could theoretically never terminate.

There is only one reality, but the nice thing about theories is that
you can pick and choose them. So pick a theory that includes things
like gamma rays from space randomly flipping a register somewhere,
throwing your program execution into complete chaos. Now *every*
process can theoretically never terminate. So get over that hang-up,
and then get on with life :)
I know I initially posited
the problem by saying that I'd only need a small number of elements
from the original array, but small's a relative term and the expected
speed degradation I believe would be exponential as the number of
required elements increased.

I'm pretty sure it wouldn't be exponential. But in any case, if the
number desired is more than half of the original size, then select
elements to be omitted rather than to be kept.
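A rough sketch of that inversion (untested; the sub name is made up).
Note the survivors come back in array order, so shuffle them if the
order of the picks matters:

sub rand_elements_by_omission {
    my ( $aref, $count ) = @_;

    my $n = @$aref;
    my %omit;

    # randomly choose the ($n - $count) indices to leave OUT; few
    # retries are needed since that set is now the smaller one
    $omit{ int rand $n } = 1 while keys %omit < $n - $count;

    # everything not omitted is the sample
    return map { $omit{$_} ? () : $aref->[$_] } 0 .. $n - 1;
}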

Xho
 
Uri Guttman

R> Uri Guttman wrote:
R> Though to be safe, you should use

R> $[ + rand( @array )

no, that isn't "safe" but IMNSHO very misguided. from perlvar:

As of release 5 of Perl, assignment to $[ is treated as a
compiler directive, and cannot influence the behavior of any
other file. (That's why you can only assign compile-time
constants to it.) Its use is highly discouraged.

so only if the OP changed $[ in that file would it be needed. and i
agree that it should be discouraged. have you actually ever seen $[
used in real production code? if i did, i would fire the coder for
coming up with a design so dumb that it needed it. 0 basing is fine as
long as you always do it. it is when you start changing $[ that you
can end up in trouble. it will affect ALL array accesses in that file
and on top of that it affects offsets into strings, so it changes
index() and substr(). you can localize it but that doesn't make it any
better an idea.

uri
 
Uri Guttman

JJ> Hmm, I suppose the fact that both you and Uri gave rather similar
JJ> responses should suggest that the two of you are likely on to
JJ> something. I'm just uncomfortable using a method that is so
JJ> incredibly sensitive to the size of the desired array. I know I
JJ> specifically mentioned that said array would be small in relation to
JJ> the master array, but I just hate to think that speed would decay
JJ> exponentially if on a few particular iterations that condition
JJ> breaks down. My fault.

so do what someone else said: track the omitted entries. and you can
do other things to make it more deterministic. try handling
'collisions' the same way some hash implementations do: scan
sequentially from the collision spot until you find a unique
index. this will always terminate, but it could run more slowly on
average than keep-trying-random-slots. this is the same tradeoff that
ethernet and token ring make. ethernet uses random backoffs upon
collisions so it isn't deterministic, but it works very well in the
real world on lightly loaded systems. token ring will always work, but
it wastes overhead under light loads and it shines under heavy
load. there are even some protocols that switch between them based on
load. you can do the same: if the selected set size is small compared
to the master set, use random with retry. if the selected set is
large, use a collision handling method. i would suspect that in no
case would a splice method be faster, especially when the selected
size gets very large.
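the scan could be dropped into the selection loop like this (a sketch,
untested -- and note, as comes up below, that it biases picks toward
the slots just after earlier picks):

my $ind = int rand $asize ;

# on a collision, probe forward (with wraparound) to the next
# free slot. assumes $count < $asize so this always terminates
$ind = ( $ind + 1 ) % $asize while $seen{ $ind } ;

$seen{ $ind } = 1 ;
push @indices, $ind ;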

uri
 
Jacob JKW

Doesn't it return the number of elements you asked for? If so, why would
you ask for a number that is less than the number you need?
In my zeal for simplicity I certainly didn't adequately explain the
issue. The problem is that I don't know exactly how many elements I
need until I've picked the last one. So in other words, while the
average number of elements I need is quite small, the actual number of
elements needed is in fact a function of the previous elements picked.
There is only one reality, but the nice thing about theories is that
you can pick and choose them. So pick a theory that includes things like
gamma rays from space randomly flipping a register somewhere, throwing
your program execution into complete chaos. Now *every* process can
theoretically never terminate. So get over that hang-up, and then get
on with life :)
Heh. Now that brought me a smile. :) Easier said than done, though. I
think the real issue is why am I so neurotic? (But that has nothing to
do with Perl.)

I'm pretty sure it wouldn't be exponential.
Actually, I believe the rate of change of the expected number of picks
to retrieve a nonduplicated element would increase *more* than
exponentially wrt the number of selections already made. If you have n
elements in the initial set and on the ith (i<=n) try you keep picking
until you've selected an element that has yet to be picked on any of
the i-1 tries, then on average it would take you i/(n-i) extra picks
to retrieve a unique element. (But that, too, has nothing to do with
Perl.)
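Here's a quick illustration of that blow-up for n = 400 (just a
sketch):

# expected wasted picks per selection, using the i/(n - i) figure
# above, for a 400-element master set
my $n = 400;
for my $i ( 10, 100, 200, 350, 399 ) {
    printf "after %3d picks: %6.2f wasted picks expected\n",
        $i, $i / ( $n - $i );
}
# after 399 picks you expect 399 retries for the one remaining element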

Anyway, xho, thanks for the reply. Certainly much appreciated. :)
 
Jacob JKW

Uri said:
JJ> Hmm, I suppose the fact that both you and Uri gave rather similar
JJ> responses should suggest that the two of you are likely on to
JJ> something. I'm just uncomfortable using a method that is so
JJ> incredibly sensitive to the size of the desired array. I know I
JJ> specifically mentioned that said array would be small in relation to
JJ> the master array, but I just hate to think that speed would decay
JJ> exponentially if on a few particular iterations that condition
JJ> breaks down. My fault.
so do what someone else said: track the omitted entries. and you can
do other things to make it more deterministic. try handling
'collisions' the same way some hash implementations do: scan
sequentially from the collision spot until you find a unique index.
An interesting notion, but I don't think that would work. Remember
that each element is supposed to be randomly selected. If I were to
simply scan sequentially from the point of collision, then obviously
my results would be biased towards those elements immediately
succeeding earlier picks.
this will always terminate, but it could run more slowly on average
than keep-trying-random-slots.
I honestly don't think I could sleep at night if I used the random
slots method. The way I look at it is that if at some point my data
changes and I'm picking a larger percentage of the initial elements,
then yeah, the splice method might well be two or three or four times
as slow as Fisher-Yates. But the random selection method could be a
thousand times slower. And yes, I freely admit that's rather neurotic.
this is the same tradeoff that ethernet and token ring make. ethernet
uses random backoffs upon collisions so it isn't deterministic, but it
works very well in the real world on lightly loaded systems. token
ring will always work, but it wastes overhead under light loads and it
shines under heavy load. there are even some protocols that switch
between them based on load.
Great analogy. I like that. :)
you can do the same: if the selected set size is small compared to the
master set, use random with retry. if the selected set is large, use a
collision handling method. i would suspect that in no case would a
splice method be faster, especially when the selected size gets very
large.
I think I've come up with a solution. It might not be the best
solution, but it's certainly much faster than my original
copy-and-splice. Even more importantly, it's going to let me sleep at
night. Anyway, here's how I've changed it:

# START code snippet

my $array_ra    = [ (qw(1 2 3 4 5 6 7 8 9 10 11 12 13)) x 16 ];
my $discards_ra = [];

for (1 .. 100_000) {
    for (1 .. 10) {
        # select 10 random elements from the array
        # in reality all that is known about the actual number of
        # elements is that the number is small relative to the size
        # of the array
        &get_ele( $array_ra );

        # do stuff -- possibly break out of the loop, possibly even
        # run &get_ele a few more times
    }
    push @$array_ra, @$discards_ra;
    @$discards_ra = [];
}

sub get_ele {
    my $ar = shift;

    # get a random element from the array, modifying the array in place
    my $ele = splice @{$ar}, rand( @{$ar} ), 1 ; # thank you Uri

    # save element for later
    push @$DISCARDS_ar, $ele;

    # return selected element
    return $ele;
}

## END SNIPPET
 
Uri Guttman

JJ> Hmm, I suppose the fact that both you and Uri gave rather similar
JJ> responses should suggest that the two of you are likely on to
JJ> something. I'm just uncomfortable using a method that is so
JJ> incredibly sensitive to the size of the desired array. I know I
JJ> specifically mentioned that said array would be small in relation to
JJ> the master array, but I just hate to think that speed would decay
JJ> exponentially if on a few particular iterations that condition
JJ> breaks down. My fault.

JJ> An interesting notion, but I don't think that would work. Remember
JJ> that each element is supposed to be randomly selected. If I were to
JJ> simply scan sequentially from the point of collision, then obviously
JJ> my results would be biased towards those elements immediately
JJ> succeeding earlier picks.

you can come up with another collision mechanism then. some combination
of retries and scanning may work.

JJ> I honestly don't think I could sleep at night if I used the random
JJ> slots method. The way I look at it is that if at some point my data
JJ> changes and I'm picking a larger percentage of the initial elements,
JJ> then yeah, the splice method might well be two or three or four times
JJ> as slow as Fisher-Yates. But the random selection method could be a
JJ> thousand times slower. And yes, I freely admit that's rather neurotic.

so go with the swap of last and picked elements and pop. splice is never
the way to go as it will have to move half the list on average for each
pick. swapping and pop will blow that away for large lists.

JJ> I think I've come up with a solution. It might not be the best
JJ> solution, but it's certainly much faster than my original
JJ> copy-and-splice. Even more importantly, it's going to let me sleep at
JJ> night. Anyway, here's how I've changed it:

JJ> my $array_ra = [ (qw(1 2 3 4 5 6 7 8 9 10 11 12 13)) x 16 ];
JJ> my $discards_ra = [];

JJ> for (1 .. 100_000) {
JJ> for (1 .. 10) {
JJ> # select 10 random elements from the array
JJ> # in reality all that is known about the actual number of
JJ> # elements is that the number is small relative to the size
JJ> # of the array
JJ> &get_ele( $array_ra );

JJ> # do stuff -- possibly break out of the loop, possibly even
JJ> # run &get_ele a few more times
JJ> }
JJ> push @$array_ra, @$discards_ra;
JJ> @$discards_ra = [];

that line doesn't do what you think it does.

JJ> }

JJ> sub get_ele {
JJ> my $ar = shift;

JJ> # get a random element from the array, modifying the array in place
JJ> my $ele = splice @{$ar}, rand( @{$ar} ), 1 ; # thank you Uri

JJ> # save element for later
JJ> push @$DISCARDS_ar, $ele;

JJ> # return selected element
JJ> return $ele;
JJ> }

that is not going to be much faster than your original code. you still
call the inner sub too often (you can pass in a count and then call it
more if you want).

case matters: $DISCARDS_ar vs $discards_ra

and swap/pop should be faster than splice as i said. i am not sure why
you think it is not. can you show a benchmark proving that?

as for sleeping, take some ambien!

uri
 
Jacob JKW

Uri said:
JJ> I honestly don't think I could sleep at night if I used the random
JJ> slots method. The way at look at it is that if at some point my data
JJ> changes and I'm picking a larger percentage of the initial elements
JJ> then yeah the splice method might well be two or three or four times as
JJ> slow as fisher-yates. But the random selection method could be a
JJ> thousand times slower. And yes, I freely admit that's rather neurotic.

so go with the swap of last and picked elements and pop. splice is never
the way to go as it will have to move half the list on average for each
pick. swapping and pop will blow that away for large lists.
Actually, what I've found is that in this context swap/pop is faster
than splice for large arrays but slower for small arrays. I have no
explanation.
JJ> I think I've come up with a solution. It might not be the best
JJ> solution, but it's certainly much faster than my original
JJ> copy-and-splice. Even more importantly, it's going to let me sleep at
JJ> night. Anyway, here's how I've changed it:

JJ> my $array_ra = [ (qw(1 2 3 4 5 6 7 8 9 10 11 12 13)) x 16 ];
JJ> my $discards_ra = [];

JJ> for (1 .. 100_000) {
JJ> for (1 .. 10) {
JJ> # select 10 random elements from the array
JJ> # in reality all that is known about the actual number of
JJ> # elements is that the number is small relative to the size
JJ> # of the array
JJ> &get_ele( $array_ra );

JJ> # do stuff -- possibly break out of the loop, possibly even
JJ> # run &get_ele a few more times
JJ> }
JJ> push @$array_ra, @$discards_ra;
JJ> @$discards_ra = [];

that line doesn't do what you think it does.
That line *should* have read $discards_ra = []; My bad.
JJ> }

JJ> sub get_ele {
JJ> my $ar = shift;

JJ> # get a random element from the array, modifying the array in place
JJ> my $ele = splice @{$ar}, rand( @{$ar} ), 1 ; # thank you Uri

JJ> # save element for later
JJ> push @$DISCARDS_ar, $ele;

JJ> # return selected element
JJ> return $ele;
JJ> }

that is not going to be much faster than your original code. you still
call the inner sub too often (you can pass in a count and then call it
more if you want).
The problem is just that I don't know how many times I need to call
the sub until I've examined the previous calls. This method actually
works out to be significantly faster than the method I had initially
proffered.
case matters: $DISCARDS_ar vs $discards_ra
Whoops. :(
and swap/pop should be faster than splice as i said. i am not sure why
you think it is not. can you show a benchmark proving that?
Sure:

testc.pl:

#!perl

use strict;
use warnings;
use Benchmark qw(cmpthese);

my @ARRAY = (0 .. 412);

cmpthese(100_000, {
    'swap_pop' => \&use_swap_pop,
    'splice'   => \&use_splice,
});

sub use_splice {
    my @ar = @ARRAY;
    for (0 .. 100) {
        splice @ar, rand( @ar ), 1 ;
    }
}

sub use_swap_pop {
    my @ar = @ARRAY;

    for (0 .. 100) {
        my $r = rand(@ar);
        my $n = $#ar;

        @ar[$n,$r] = @ar[$r,$n];
        pop @ar;
    }
}

perl testc.pl

             Rate swap_pop splice
swap_pop   3958/s       --   -47%
splice     7477/s      89%     --
as for sleeping, take some ambien!
Hey, that sounds good to me.
 
A. Sinan Unur

In my zeal for simplicity I certainly didn't adequately explain the
issue. The problem is that I don't know exactly how many elements I
need until I've picked the last one. So in other words, while the
average number of elements I need is quite small, the actual number of
elements needed is in fact a function of the previous elements picked.

Jon Bentley has a great example in "Programming Pearls" which shows
how to pick k integers randomly from [0 .. n) by scanning only once.
The procedure is unbiased in principle (although I would replace the
modulus with a division just to be on the safe side; see
http://c-faq.com/lib/randrange.html).

C code is available at:

http://netlib.bell-labs.com/cm/cs/pearls/bitsortgen.c

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 
Jacob JKW

A. Sinan Unur said:
In my zeal for simplicity I certainly didn't adequately explain the
issue. The problem is that I don't know exactly how many elements I
need until I've picked the last one. So in other words, while the
average number of elements I need is quite small, the actual number of
elements needed is in fact a function of the previous elements picked.

Jon Bentley has a great example in "Programming Pearls" which shows
how to pick k integers randomly from [0 .. n) by scanning only once.
The procedure is unbiased in principle (although I would replace the
modulus with a division just to be on the safe side; see
http://c-faq.com/lib/randrange.html).

C code is available at:

http://netlib.bell-labs.com/cm/cs/pearls/bitsortgen.c
Yep. Thanks for that. Looks to be a C implementation of the
Fisher-Yates shuffle referenced in perlfaq4 "How do I shuffle an array
randomly?"

One of the biggest problems, of course, is that the seed and the seed
alone uniquely determines the entire array sequence. If you're dealing
with a single deck of playing cards then the array has 52! = 8.07E+67
possible permutations, which is orders of magnitude greater than the
number of 32-bit seeds. As such, if one really wants all array states
to be possible, it's necessary to use a randomization process that can
accept more than 32 bits of seed.

Anyway, that's why I think it makes sense to use the Mersenne Twister
PRNG (I'm using it as implemented in Math::Random::MT). MT accepts up
to 624 32-bit integers as a seed, and has a period of 2**19937 - 1.
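For what it's worth, the seeding might look roughly like this (a
sketch, untested, assuming Math::Random::MT's new()/rand() interface
and a unix-ish /dev/urandom):

use Math::Random::MT;

# pull 624 32-bit words from the system entropy pool
open my $fh, '<', '/dev/urandom' or die "urandom: $!";
read $fh, my $bytes, 4 * 624;
my @seeds = unpack 'L624', $bytes;

my $mt = Math::Random::MT->new(@seeds);

my $i = int $mt->rand(400);  # random index, as with the built-in rand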
 
Jacob JKW

JJ> I honestly don't think I could sleep at night if I used the random
JJ> slots method. The way I look at it is that if at some point my data
JJ> changes and I'm picking a larger percentage of the initial elements,
JJ> then yeah, the splice method might well be two or three or four times
JJ> as slow as Fisher-Yates. But the random selection method could be a
JJ> thousand times slower. And yes, I freely admit that's rather neurotic.

so go with the swap of last and picked elements and pop. splice is never
the way to go as it will have to move half the list on average for each
pick. swapping and pop will blow that away for large lists.
Just found this in the CPLM archives:

http://groups.google.com/group/comp...thread/da705489201879ba?#doc_76f7615391573a3a

I think I'll have to give List::Util::shuffle() a serious look.
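The idea would be (a sketch, untested) to shuffle the index list once
per pass and then just take picks off the front as needed, with no
possibility of duplicates:

use List::Util qw(shuffle);

my @order = shuffle 0 .. $#{$array_ra};  # one shuffle per pass

# take as many picks as the pass turns out to need
my $next_pick = $array_ra->[ shift @order ];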
 
A. Sinan Unur

Jon Bentley has a great example in "Programming Pearls" which shows

Actually, I don't have the book with me, so I think I posted the wrong
URL before. I don't think the example I was referring to is available
from the book web site.

Anyway, here is a more Perlish implementation of the idea as I remember
it. The algorithm ensures that any given number in the range is chosen
with the same probability.

Here are some simulation results for choosing 10 out of 1000:

#!/usr/bin/perl

use strict;
use warnings;

my %dist;

for ( 1 .. 10_000 ) {
    my @r = randseq(10, 1000);
    $dist{$_}++ for @r;
}

for my $k ( sort { $a <=> $b } keys %dist ) {
    printf "Prob(%d) = %.3f\n", $k, $dist{$k}/10_000;
}

sub randseq {
    my ($k, $n) = @_;

    my @r;
    my $c = 0;

    while ( @r != $k ) {
        if ( rand(1) < ($k - @r)/($n - $c) ) {
            push @r, $c;
        }
        ++$c;
    }

    return @r;
}
__END__






--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 
A. Sinan Unur

A. Sinan Unur said:
In my zeal for simplicity I certainly didn't adequately explain
the issue. The problem is that I don't know exactly how many
elements I need until I've picked the last one. So in other words,
while the average number of elements I need is quite small, the
actual number of elements needed is in fact a function of the
previous elements picked.

Jon Bentley has a great example in "Programming Pearls" which shows
how to pick k integers randomly from [0 .. n) by scanning only once.
The procedure is unbiased in principle (although I would replace the
modulus with a division just to be on the safe side; see
http://c-faq.com/lib/randrange.html).

C code is available at:

http://netlib.bell-labs.com/cm/cs/pearls/bitsortgen.c
Yep. Thanks for that. Looks to be a C implementation of the
Fisher-Yates shuffle referenced in perlfaq4 "How do I shuffle an array
randomly?"

Please see my other post. I don't think the code I referenced is what
I was thinking of (up all night again!).

Now, your original problem was how to choose $k integers from $n
integers, not how to choose $n out of $n in random order.

Of course, the quality of the PRNG you are using will determine the
distribution of the subsequences chosen.

On the other hand, the following function will ensure distinct
numbers, where the probability of a specific number occurring in the
returned subsequence is $k/$n.

# cut
sub randseq {
    my ($k, $n) = @_;

    my @r;
    my $c = 0;

    while ( @r != $k ) {
        if ( rand(1) < ($k - @r)/($n - $c) ) {
            push @r, $c;
        }
        ++$c;
    }

    return @r;
}
# cut
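To map the returned indices back to array elements, one might write (a
sketch; note that the indices come back in ascending order, so shuffle
them if the order of the picks matters):

my @sample = @{$array_ra}[ randseq( 10, scalar @{$array_ra} ) ];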

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 
Dr.Ruud

Uri Guttman wrote:
Ruud:
Uri:
the idiom for selecting a random index from an array is:

rand( @array )

Though to be safe, you should use

$[ + rand( @array )

no, that isn't safe but IMNSHO very misguided. from perlvar:

As of release 5 of Perl, assignment to $[ is treated as a
compiler directive, and cannot influence the behavior of any
other file. (That's why you can only assign compile-time
constants to it.) Its use is highly discouraged.

Right, I completely overlooked the 'per file' attribute.

have you actually ever seen $[
used in real production code? if i did, i would fire the coder for
such a dumb design that it was needed.

I have used it in a test program that went something like this:

#!/usr/bin/perl
use strict;
use warnings;

my @ary = (3, 2, 1, 0, 2, 4, 6);

print "\n";

{ print "[GOOD]\n";
    my $i = 0;
    printf "%3d %3d\n", $i++, $_ for @ary;
    print "\n";
}

{ print "[SO-SO]\n";
    local $[ = 2;
    printf "%3d %3d\n", $_, $ary[$_]
        for $[ .. $#ary;
    print "\n";
}

{ print "[BAD]\n";
    local $[ = -3;
    printf "%3d %3d\n", $_, $ary[$_]
        for $[ .. $#ary;
    print "\n";
}

{ print "[GOOD]\n";
    my $ary_base = int -@ary/2;
    printf "%3d %3d\n", $ary_base + $_, $ary[$_]
        for $[ .. $#ary;
    print "\n";
}

0 basing is fine as long as
you always do it. it is when you start changing $[ that you can end
up in trouble.

This trouble makes me think of VB[0-7]'s "OPTION BASE", though that
only affects arrays. In VB you could even set the index limits in the
declaration, like "Dim ary(-5 To 5)", a bit like Pascal. VB.NET has
only 0-based arrays, but the first position in a string is still 1.
In Perl a negative index works from the end towards the start.

it will affect ALL array accesses in that file and on top of
that it affects offsets into strings so it changes index() and
substr(). you can localize it but that doesn't make it any better an
idea.

I agree.
 
xhoster

I believe that splice only moves 1/4 of the elements (or pointers,
really) on average. Still not great, but I am often surprised at how
fast splice is even on moderately large arrays.

Actually, what I've found is that in this context swap/pop

Don't swap. Overwrite the chosen element with the last element, then pop.
Why bother copying the chosen element to the last slot, only to discard it?
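Something like this (untested):

# overwrite-and-pop: no swap needed, since the picked element's old
# slot can simply be clobbered before the array shrinks
sub pick_one {
    my $aref = shift;

    my $i    = int rand @$aref;
    my $pick = $aref->[$i];

    $aref->[$i] = $aref->[-1];  # overwrite the pick with the last element
    pop @$aref;                 # drop the now-duplicated last slot

    return $pick;
}

This also works when $i happens to be the last index.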

Xho
 
