random array elements and speed


xhoster

Jacob JKW said:
Anyway, that's why I think it makes sense to use the Mersenne Twister
PRNG (I'm using it as implemented in Math::Random::MT). MT takes up to
623 32-bit integers as a seed, and as such has a periodicity of
2**19937-1.

Out of curiosity, how do you get those 623 (624, actually) 32-bit integers
to use as the seed in the first place?

Be warned that if you export rand and srand, Math::Random::MT doesn't
properly prototype them. So "rand(@ar)" will no longer do what you think
it does. You need "rand(scalar @ar)".
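
[Aside: a minimal illustration of the prototype pitfall, using a stand-in
sub rather than the module's actual export:]

use strict;
use warnings;

# CORE::rand has a (;$) prototype, so rand(@ar) sees @ar in scalar
# context (the element count).  An imported, unprototyped rand() sees
# a plain list instead.  my_rand() below just reports the bound it
# was handed.
sub my_rand { my $max = shift; return $max }

my @ar = (10, 20, 30, 40);

print scalar @ar, "\n";           # 4  -- what CORE::rand(@ar) would use
print my_rand(@ar), "\n";         # 10 -- first element, not the count
print my_rand(scalar @ar), "\n";  # 4  -- the explicit fix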

Anyway, Math::Random::MT is slower than the built-in rand, so if you are
going to use this you should benchmark with it, as it could substantially
change the results.

Xho
 

Uri Guttman

JJ> Actually, what I've found is that in this context swap/pop is faster
JJ> than splice for large arrays but slower for small arrays. I have no
JJ> explanation.

that isn't hard to explain at all. splice is a single internal perl op
and swap/pop is several perl ops. a very rough cost estimate of perl
speed is how many perl ops are executed since the op dispatch loop is
the big overhead compared to many of the actual operations. so for short
arrays, splice will be 1 perl op and move a small number of elements
(which is done in c). when the size gets larger the element move takes
longer. the swap/pop is a fixed amount of work for any size array but it
does need several perl ops to execute. so at some size, the splice will
become slower than the fixed overhead of swap/pop. this is a classic
example of O(N) > O(1) growth. you compare growth curves and not actual
times. even bubble sort which is O(N**2) can be faster than most fast
sorts (which are O(N log N)) when the data set size is tiny as bubble
sorts usually have less overhead to start.


JJ> my @ARRAY = (0 .. 412);

JJ> cmpthese(100_000, {

a good trick is to use negative numbers there as that will mean a total
wall clock time to run. setting the count can be annoying when comparing
some benchmarks as you don't know how long it will take in real
time. and this is another trick i use so i can set the max real time
from the command line (or default to 2 seconds total for each test):

cmpthese( shift || -2, {

JJ> 'swap_pop' => \&use_swap_pop,
JJ> 'splice' => \&use_splice,
JJ> });

JJ> sub use_splice {
JJ> my @ar = @ARRAY;
JJ> for (0 .. 100) {
JJ> splice @ar, rand( @ar ), 1 ;
JJ> }
JJ> }

JJ> sub use_swap_pop{
JJ> my @ar = @ARRAY;

JJ> for (0 .. 100) {
JJ> my $r = rand(@ar);
JJ> my $n = $#ar;

JJ> @ar[$n,$r] = @ar[$r,$n];

drop the $n and use:

@ar[-1,$r] = @ar[$r,-1];

JJ> pop @ar;
JJ> }
JJ> }
JJ> Rate swap_pop splice
JJ> swap_pop 3958/s -- -47%
JJ> splice 7477/s 89% --

you are running into the very issue i discussed above. you can't test
this kind of code with a single size set. the 413 elements is probably
too short to show the growth curve differences. i would run this with
varying set sizes like 500, 10000 and 1000000. you should notice the
differences then. just put the outer array init and benchmark call into
a loop with a list of sizes and use my code changes (you want to compare
the fastest version of each).
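
[Aside: a sketch of the size sweep uri is suggesting, with his tweaks folded
in -- the wall-clock count taken from the command line and the -1 index
instead of $n. The sizes are just placeholders; the million-element case
will take a while.]

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @ARRAY;
my $count = shift(@ARGV) || -2;   # -2: run each test for about 2 CPU seconds

for my $size (500, 10_000, 1_000_000) {
    @ARRAY = (0 .. $size - 1);
    print "\n*** array size $size\n";
    cmpthese( $count, {
        swap_pop => \&use_swap_pop,
        splice   => \&use_splice,
    });
}

sub use_splice {
    my @ar = @ARRAY;
    splice @ar, rand(@ar), 1 for 0 .. 100;
}

sub use_swap_pop {
    my @ar = @ARRAY;
    for (0 .. 100) {
        my $r = rand @ar;
        @ar[-1, $r] = @ar[$r, -1];
        pop @ar;
    }
}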

uri
 

Jacob JKW

Out of curiosity, how do you get those 623 (624, actually) 32-bit integers
to use as the seed in the first place?
624. Right.

use Math::Random::MT;
my $rand_gen;
BEGIN {
    my @seed;
    foreach (0 .. 623) {
        push @seed,
            ( (1 + rand(2**15)) * (1 + rand(2**15)) * (1 + rand(2**2)) - 1 );
    }
    $rand_gen = Math::Random::MT->new(@seed);
}
Be warned that if you export rand and srand, Math::Random::MT doesn't
properly prototype them. So "rand(@ar)" will no longer do what you think
it does. You need "rand(scalar @ar)".
Yeah I figured that one out the hard way.
Anyway, Math::Random::MT is slower than the built-in rand, so if you are
going to use this you should benchmark with it, as it could substantially
change the results.
It certainly does slow it all down, but it clearly does so uniformly,
only biasing the results against methods that make excess calls to the
rand function.
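
[Aside: a minimal sketch of the kind of comparison xhoster suggests. It
assumes the module's documented OO interface ($gen->rand($n)); absolute
numbers will of course vary.]

use strict;
use warnings;
use Benchmark qw(cmpthese);
use Math::Random::MT;

my $mt = Math::Random::MT->new(12345);   # any fixed seed is fine for timing

cmpthese(-2, {
    core_rand => sub { my $x; $x = rand(413)      for 1 .. 1_000 },
    mt_rand   => sub { my $x; $x = $mt->rand(413) for 1 .. 1_000 },
});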
 

Jacob JKW

A. Sinan Unur said:
Actually, I don't have the book with me, so I think I posted the wrong
URL before. I don't think the example I was referring to is available
from the book web site.

Anyway, here is a more Perlish implementation of the idea as I remember
it. The algorithm ensures that any given number in the range is chosen
with the same probability.
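
[Aside: a short argument for that claim, not from the post. In the notation
of randseq() below, candidate c (0-based) is kept with probability
(k-r)/(n-c) when r indices have already been selected, so for any fixed
k-subset S of {0, ..., n-1}:]

\[
  \Pr[S]
  = \prod_{c \in S} \frac{k - r_c}{n - c}
    \prod_{c \notin S} \frac{(n - c) - (k - r_c)}{n - c}
  = \frac{k!\,(n-k)!}{n!}
  = \binom{n}{k}^{-1},
\]

[since the "kept" numerators run through k, k-1, ..., 1, the "skipped"
numerators through n-k, ..., 1, and the denominators multiply out to n!.
Every k-subset is equally likely, so each index is chosen with marginal
probability k/n.]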

Here are some simulation results for choosing 10 out of 1000:

#!/usr/bin/perl

use strict;
use warnings;

my %dist;

for ( 1 .. 10_000 ) {
    my @r = randseq(10, 1000);
    $dist{$_}++ for @r;
}

for my $k ( sort { $a <=> $b } keys %dist ) {
    printf "Prob(%d) = %.3f\n", $k, $dist{$k}/10_000;
}

sub randseq {
    my ($k, $n) = @_;

    my @r;
    my $c = 0;

    while ( @r != $k ) {
        if ( rand(1) < ($k - @r)/($n - $c) ) {
            push @r, $c;
        }
        ++$c;
    }

    return @r;
}
__END__
Just as you point out. The results will only be as robust as the PRNG
used. Your method's probably a lot more clever than anything I'd ever
come up with on my own, but the only problem is that when requiring $k
integers from an $n-wide range, as many as $n calls might need to be
made.
 

xhoster

Jacob JKW said:
624. Right.

use Math::Random::MT;
my $rand_gen;
BEGIN {
    my @seed;
    foreach (0 .. 623) {
        push @seed,
            ( (1 + rand(2**15)) * (1 + rand(2**15)) * (1 + rand(2**2)) - 1 );
    }
    $rand_gen = Math::Random::MT->new(@seed);
}

This is no good. The seed you are feeding to MT is completely determined
by the seed used by CORE::srand (which is called implicitly the first
time you use CORE::rand). Since there are only ~4 billion possible
seeds for CORE::srand (last I checked, anyway), there are only ~4 billion
possible seeds that your method will use for MT.

So while it may have a periodicity of 2**19937-1, you are only accessing
2**32 (or maybe 2**31, or 2**64) different entry points into that stream,
which I'm pretty sure will defeat the purpose.
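
[Aside: one way to get 624 seed integers that don't depend on CORE::srand
at all is to read them from the OS entropy pool. This is an editorial
sketch, not xhoster's suggestion; it assumes a Unix-like system with
/dev/urandom and uses the same Math::Random::MT->new(@seed) call as the
quoted code.]

use strict;
use warnings;
use Math::Random::MT;

my $rand_gen;
BEGIN {
    open my $fh, '<:raw', '/dev/urandom'
        or die "can't open /dev/urandom: $!";
    read($fh, my $bytes, 4 * 624) == 4 * 624
        or die "short read from /dev/urandom";
    close $fh;
    my @seed = unpack 'L624', $bytes;   # 624 unsigned 32-bit integers
    $rand_gen = Math::Random::MT->new(@seed);
}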


Yeah I figured that one out the hard way.

Me too. The hash method wouldn't terminate, because the only number ever
returned was 0 and it was always a collision. But then the non-hash
methods were silently giving me non-random answers. Usually I'd rather
have a program that never gives an answer than one that promptly and
silently gives the wrong answer.

Xho
 

Dr.Ruud

Jacob JKW wrote:
I've got an array of 400 or so elements from which I want to select a small
variable number of random elements with each pass of a loop that runs
100 million times.

If you need speed and are OK with a not-much-of-a-random, you could use
a simple formula to select the next element.

perl -le '($s, $n, $i, $p, $c) = (1, 400, 0, 0, " ");
  $x = $c x $n;
  while ( $c < "9" ) {
    $s = (1103515245 * $s + 12345) % (2**31);
    $c = ++substr($x, $p, 1);
    printf "%3d [%3d] %s %s\n", $i++, $p, $c, "*" x ($p/6);
    $p = $s % $n
  }; print STDERR "\n$x\n"'

There are more and even better formulas like that.
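
[Aside: the same linear congruential formula packaged as a reusable picker,
as a sketch of the "simple formula" approach -- the constants are the
classic ANSI C ones from the one-liner above. Fast, but as Dr.Ruud says,
not much of a random: the low bits in particular are quite regular, and
the exact multiply assumes a 64-bit perl (as does the one-liner).]

use strict;
use warnings;

sub make_lcg_picker {
    my ($n, $seed) = @_;                 # $n = number of array elements
    my $s = defined $seed ? $seed : 1;
    return sub {
        $s = (1103515245 * $s + 12345) % 2**31;
        return $s % $n;                  # next pseudo-random index, 0 .. $n-1
    };
}

my $next_index = make_lcg_picker(400);
print $next_index->(), "\n" for 1 .. 5;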
 

Jacob JKW

Jim said:
Jacob JKW said:
Jim Gibson wrote:

[original code snipped]
Hmm, I suppose the fact that both you and Uri gave rather similar
responses should suggest that the two of you are likely on to
something. I'm just uncomfortable using a method that is so incredibly
sensitive to the size of the desired array. I know I specifically
mentioned that said array would be small in relation to the master
array, but I just hate to think that speed would decay exponentially
if on a few particular iterations that condition breaks down. My fault.

I don't understand why you say the proposed method is "incredibly
sensitive to the size of the desired array". While it is true that the
probability of picking an item already picked is proportional to the
number of items already picked, that is not an exponential dependency.
It is proportional to the number desired. So the time to do the picking
is something on the order of n-squared, but less because you start with
none picked. I do not have the statistical smarts to give you an exact
dependency, but I know that the probability of a collision ranges from
0 for the first pick to (m-1)/n for the m-th pick from n elements.
Just as you say, if the # of items already picked is m-1, and the total #
of items is n, then the probability of a collision on pick m is (m-1)/n.
Hence the probability of it taking exactly i picks to retrieve the m-th
unique element is given by ((m-1)/n)^(i-1) * (n-m+1)/n. This means the
expected number of picks to retrieve a unique m-th element is
E(picks) = (n-m+1)/n * SUM(i=1 to infinity) [ i * ((m-1)/n)^(i-1) ]
which converges to
E(picks) = n/(n-m+1), so you're very right when you say it does not
exhibit an exponential rate of change.

Mea Culpa!
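
[Aside: a quick empirical check of the E(picks) = n/(n-m+1) figure, not
from the thread. With m-1 of n indices already taken, it counts how many
rand() calls it takes to hit a fresh one, averaged over many trials; the
values of n, m and the trial count are arbitrary.]

#!/usr/bin/perl
use strict;
use warnings;

my ($n, $m, $trials) = (1000, 901, 50_000);

my %taken = map { $_ => 1 } 0 .. $m - 2;   # m-1 indices already picked

my $total = 0;
for (1 .. $trials) {
    my $picks = 0;
    my $r;
    do {
        ++$picks;
        $r = int rand $n;
    } while ( $taken{$r} );
    $total += $picks;
}

printf "observed %.2f picks, predicted %.2f\n",
    $total / $trials, $n / ($n - $m + 1);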
 

Jacob JKW

I believe that splice only moves 1/4 of the elements (or pointers really) on
average. Still not great, but I'm often surprised at how fast splice
is even on moderately large arrays.
Makes sense.
Don't swap. Overwrite the chosen element with the last element, then pop.
Why bother copying the chosen element to the last slot, only to discard it?
Well I need that element anyway and have to copy it somewhere. Why not
copy it to the last slot so popping the array returns the needed value?
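
[Aside: the two O(1) removal variants being discussed, side by side. A
sketch, not code from the thread; @ar and $r are placeholders.]

my @ar = (0 .. 412);
my $r  = int rand @ar;

# xhoster's overwrite-then-pop: grab the value first, then clobber its slot.
my $picked = $ar[$r];
$ar[$r]    = $ar[-1];
pop @ar;

# Jacob's swap-then-pop: the pop itself hands back the chosen value.
$r = int rand @ar;
@ar[-1, $r] = @ar[$r, -1];
$picked = pop @ar;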
 

A. Sinan Unur

A. Sinan Unur wrote:
....

Just as you point out. The results will only be as robust as the PRNG
used. Your method's probably a lot more clever than anything I'd
ever come up with on my own, but the only problem is that when
requiring $k integers from an $n-wide range, as many as $n calls
might need to be made.

True, definitely true.

It is not really my method, though. Now, if I could just find that book
;-)

Anyway, it does work beautifully for reasonable sets of numbers. It does
have bad worst case performance, but it has low memory overhead, and no
splicing or indexing, so good locality and low overhead may mask that
bad worst case performance.

Just my two cents with no benchmarks whatsoever.
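
[Aside: the worst case can be pinned down a bit (not from the post). The
randseq() loop runs until it reaches the largest selected index, and for a
uniformly random k-subset of {1, ..., n} the expected maximum is the
standard order-statistics value]

\[
  \mathbb{E}[\text{iterations}] = \frac{k\,(n+1)}{k+1},
\]

[so choosing 10 out of 1000 costs about 910 iterations (and rand calls) on
average -- close to n, exactly the concern raised above.]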

Sinan


--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 

Anno Siegel

[selecting random elements from an array]
Remember also I don't necessarily know how many elements I'm going to
want until I actually have the last element.

In that case, I'd combine a Fisher-Yates-like procedure with a criterion
that tells when enough elements have been selected. The procedure
select_enough() takes a coderef (the criterion) and an array. It swaps
random elements to the beginning of the array until the first n elements
satisfy the criterion function.

sub select_enough {
    my ( $crit, $ar ) = @_;
    my $n = 0;
    while ( 1 ) {
        # randomize element $n
        @$ar[ $n, $_ ] = @$ar[ $_, $n ] for $n + rand( @$ar - $n );
        last if $crit->( @$ar[ 0 .. $n ] );
        $n++;
    }
    @$ar[ 0 .. $n ];
}

So, to select elements from a set of random numbers until their sum is
larger than 1, this can be used:

use List::Util qw( sum);
my @array = map rand, 1 .. 400;
my @sel = select_enough( sub { sum( @_) > 1 }, \ @array);

The array can be re-used because it still contains all the original
values, though partially re-ordered.

Anno
 

John Bokma

Jacob JKW said:
I hear you. It's just the way I started doing things a while back. I
use capitals to represent a constant, and the _r?+ suffix to describe
the data structure. So _rhhah would be a reference to a hash of hashes
of arrays of hashes.

Look up articles on the Hungarian naming scheme (as it's called) and you'll
stop using it soon enough :)

If you need rhhah to make your code clear, your code in itself is not
clear enough. Fix that first.
 

Peter J. Holzer

John said:
Look up articles on the Hungarian naming scheme (as it's called) and
you'll stop using it soon enough :)

I think the original proposal for the Hungarian notation made a lot of
sense: It encoded the type of the variable as defined by the
application's model. So for example, if you had a length foo and an area
bar, you would write them as lFoo and aBar.

Then it is immediately obvious that an expression like

lFoo + aBar

cannot be right. You can't add a length and an area.

Hungarian notation as I've seen it used is rather useless, as it uses
the types of the programming language. So if foo and bar are both
floating point numbers the expression would read

fFoo + fBar

which doesn't give any hint that something is wrong. Even if foo was an
integer for some reason, that wouldn't matter as there is nothing wrong
with adding an integer and a floating point number.

Wikipedia tells me that the original proposal is now known as "Apps
Hungarian Notation" and the commonly used form as "System Hungarian
Notation".
If you need rhhah to make your code clear,

rhhah is a perfect example of why System Hungarian Notation isn't very
useful, at least in languages with very few types like Perl (and in
languages with lots of types and strong typing you don't need it because
the compiler will catch type errors): There are three "h" in the type
mnemonic, but which is which? You can't tell whether

$x_rhhah->{$foo_s}{$bar_s}[$baz_s]{$gazonk_s}

is correct or if you should have written

$x_rhhah->{$bar_s}{$foo_s}[$baz_s]{$gazonk_s}

(You can't even tell which variable needs to be used for the array
index, because all of them have a _s suffix)

hp
 

John Bokma

Peter J. Holzer said:
I think the original proposal for the Hungarian notation made a lot of
sense: It encoded the type of the variable as defined by the
application's model.

It didn't, IMO.
So for example, if you had a length foo and an area
bar, you would write them as lFoo and aBar.

Then it is immediately obvious that an expression like

lFoo + aBar

cannot be right. You can't add a length and an area.

Wouldn't:

length + area

be more readable? Moreover, imagine that you have an lFoo, an hFoo, a
wFoo and an aFoo, and ditto for Bar. Wouldn't just:

floor_length, floor_area

etc. be *more* readable?

To me, no matter how you use Hungarian notation, it's too strict a system
with severe limitations. It doesn't stop bad programmers from making
mistakes, since no system can do that [1], and it doesn't help experienced
programmers; it might even slow them down.


[1] In your example, a bad programmer might have just replaced the + with
a * since then the expression looks right.
 

Peter J. Holzer

John said:
It didn't, IMO.


Wouldn't:

length + area

be more readable?

Which length and which area? Your program may have lots of them.
Moreover, imagine that you have an lFoo, an hFoo, a wFoo and an aFoo,
and ditto for Bar. Wouldn't just:

floor_length, floor_area

etc. be *more* readable?

To you and me, maybe, but not to the linker. It would see two variables
called "floor_", so it would either treat them as the same variable or
throw a "duplicate symbol" error.

I can't find Simonyi's original paper at the moment, but it must have
been written in the 1970s (Wikipedia mentions it was first used for
BCPL, the predecessor of C). A limit of 6 characters on identifiers
wasn't uncommon in these days (see also the 9th commandment in Henry
Spencer's "Ten Commandments for C Programmers"), so you wanted to keep
your identifiers short.

Besides, long identifiers may be more readable in isolation, but they
make formulas messy. To make your whole program readable, you have to
find a compromise between single letter variable names and trying to
pack the whole documentation into the variable name.

Personally, I would probably use $floor_length and $floor_area in a perl
program. I prefer suffixes to prefixes for the type, and underscores to
camelcase, because it sounds more English and it puts the more
distinctive part first. If I had to do complicated computations with
them, I'd probably just use $fl and $fa, with a comment near their
declaration.

[1] In your example, a bad programmer might have just replaced the +
with a * since then the expression looks right.

Yeah. I've taught programming long enough (a long time ago). I was
always amazed how students would do totally non-sensical things to shut
up compiler warnings.

hp
 

John Bokma

Peter J. Holzer said:
John Bokma wrote:

Which length and which area? Your program may have lots of them.

program? I hope you mean function. And in that case, you add extra
information:

floor_length
floor_area

much clearer than lFloor or aFloor.
To you and me, maybe, but not to the linker. It would see two
variables called "floor_", so it would either treat them as the same
variable or throw a "duplicate symbol" error.

In which case you are using quite an outdated linker :-D

[Long ago]

:-D.
Besides, long identifiers may be more readable in isolation, but they
make formulas messy.

A formula doesn't have to fit on a single line. If you use short names
like a, b, c etc., you have to explain what they mean somewhere (unless
it's obvious, as in the trivial case), and hence you end up with a piece
of documentation and a formula, which must be kept in sync.
Personally, I would probably use $floor_length and $floor_area in a
perl program. I prefer suffixes to prefixes for the type, and
underscores to camelcase,

Yup, CamELCaSE is harder to read IMO, but on the other hand I try to
stick to the code conventions in a language.
because it sounds more English and it puts the more
distinctive part first. If I had to do complicated computations with
them, I'd probably just use $fl and $fa, with a comment near their
declaration.

See my remark above.
[1] In your example, a bad programmer might have just replaced the +
with a * since then the expression looks right.

Yeah. I've taught programming long enough (a long time ago). I was
always amazed how students would do totally non-sensical things to
shut up compiler warnings.

:-D Classic one: "I've turned warnings off, so it gives less messages on
the screen when I compile".

--
John Bokma    Freelance software developer &
Experienced Perl programmer: http://castleamber.com/
 

robic0

I've got an array of 400 or so elements from which I want to select a small
variable number of random elements with each pass of a loop that runs
100 million times.

This is what I've been doing:

I cached the unmodified array, and then with each pass enumerate it as
a new array and take a reference to that. I started off by using
Fisher-Yates, but then found that because I was only interested in a
small number of elements I was better off simply splicing a random
element from the array. (I know that the FAQ cautions against this in
the general case, but in this instance splicing has proved faster.
Actually splicing is also faster than swapping the random element with
the last element and then "pop"-ing it off the end of the array -- that
actually came as a bit of a surprise to me.) Nevertheless, it seems that
the operation is still pretty slow.

Here's a contrived snippet (I'm actually using the Mersenne Twister as
implemented in Math::Random::MT instead of perl's built-in rand
function as my PRNG, but that's of course beside the point):

# START code snippet

my $ARRAY_ra = [ (qw(1 2 3 4 5 6 7 8 9 10 11 12 13)) x 16 ];

for (1 .. 100_000) {
    my $copy_ra = [ @{ $ARRAY_ra } ];
    for (1 .. 10) {
        # select 10 random elements from the array
        # (in reality all that is known about the actual number of
        # elements is that it is small relative to the size of the array)
        &get_ele( $copy_ra );
    }
}

sub get_ele {
    my $ar = shift;

    # generate a random index of an element in the array
    my $r = int(rand($#{$ar} + 1));

    # return the selected element, modifying the array in place
    return splice @{$ar}, $r, 1;
}

## END SNIPPET

So any thoughts on how possibly to speed this up?


Thanks,
J.

Bringing this inner consideration to the outer level (against the rules!)
for readability. Actually so it can be seen I guess.

There is a third method that was overlooked. Indeed it's the most promising,
as it has the best performance over the complete range.
It's a flag method. Excuse me if it has been mentioned before. It is at the top of
the performance scale over a wide range. The determining factor is the sum of what
it does, as robic0 goes to assembly infinity...

I used a medium distribution where I didn't have to sit here forever waiting
for benchmarks. See below for details. I thought you would like to know this
before you bury your head in PhD books. I mean, if speed is all you want.
Extrapolate to the working model. If you don't see a clear path to that, or some
other obstruction, let me know and I will think about it.

#!perl

use strict;
use warnings;
use Benchmark qw(cmpthese);

my (@ARRAY, $ssize, $asize);
my @Set_sizes = (150, 300, 1000, 2000);
my @Ary_sizes = (2500, 3000, 3500, 5000);

for (@Ary_sizes)
{
    $asize = $_;
    @ARRAY = (0 .. $asize);

    for (@Set_sizes)
    {
        $ssize = $_;
        print "\n\n***>> $ssize/$asize\n";
        cmpthese(10000, {
            'swap_pop' => \&use_swap_pop,
            'splice'   => \&use_splice,
            'flag'     => \&use_flag,
        } );
    }
}

sub use_splice {
    my @ar = @ARRAY;
    for (0 .. $ssize) {
        splice @ar, rand( @ar ), 1;
    }
}

sub use_swap_pop {
    my @ar = @ARRAY;

    for (0 .. $ssize) {
        my $r = rand(@ar);
        my $n = $#ar;

        @ar[$n,$r] = @ar[$r,$n];
        pop @ar;
    }
}

sub use_flag {
    my $asize = @ARRAY;
    my $rnumb;
    my @flag_array;

    for (0 .. $ssize) {
        while (defined $flag_array[ ($rnumb = int(rand($asize))) ]) {}
        $flag_array[$rnumb] = 1;
    }
}

sub debug_use_flag {
    my $asize = @ARRAY;
    my $rnumb;
    my @flag_array;

    for (0 .. $ssize) {
        while (defined $flag_array[ ($rnumb = int(rand($asize))) ]) {
            print $rnumb, "\n";
        }
        $flag_array[$rnumb] = 1;
        print " $rnumb\n";
    }
    print "=================================\n";
    <>;
}

__END__



***>> 150/2500
Rate swap_pop splice flag
swap_pop 2105/s -- -1% -74%
splice 2119/s 1% -- -74%
flag 8203/s 290% 287% --


***>> 300/2500
Rate swap_pop splice flag
swap_pop 1410/s -- -3% -67%
splice 1455/s 3% -- -66%
flag 4266/s 203% 193% --


***>> 1000/2500
Rate swap_pop splice flag
swap_pop 560/s -- -13% -55%
splice 642/s 15% -- -48%
flag 1233/s 120% 92% --


***>> 2000/2500
Rate swap_pop splice flag
swap_pop 301/s -- -25% -37%
splice 403/s 34% -- -16%
flag 478/s 59% 19% --


***>> 150/3000
Rate splice swap_pop flag
splice 1783/s -- -4% -77%
swap_pop 1855/s 4% -- -77%
flag 7899/s 343% 326% --


***>> 300/3000
Rate splice swap_pop flag
splice 1253/s -- -4% -70%
swap_pop 1301/s 4% -- -69%
flag 4211/s 236% 224% --


***>> 1000/3000
Rate swap_pop splice flag
swap_pop 542/s -- -1% -56%
splice 548/s 1% -- -56%
flag 1243/s 129% 127% --


***>> 2000/3000
Rate swap_pop splice flag
swap_pop 295/s -- -14% -45%
splice 344/s 17% -- -36%
flag 536/s 82% 55% --


***>> 150/3500
Rate splice swap_pop flag
splice 1550/s -- -8% -80%
swap_pop 1689/s 9% -- -78%
flag 7710/s 398% 357% --


***>> 300/3500
Rate splice swap_pop flag
splice 1094/s -- -10% -74%
swap_pop 1214/s 11% -- -71%
flag 4155/s 280% 242% --


***>> 1000/3500
Rate splice swap_pop flag
splice 482/s -- -8% -62%
swap_pop 525/s 9% -- -58%
flag 1253/s 160% 139% --


***>> 2000/3500
Rate swap_pop splice flag
swap_pop 290/s -- -3% -49%
splice 299/s 3% -- -47%
flag 564/s 94% 88% --


***>> 150/5000
Rate splice swap_pop flag
splice 1081/s -- -15% -85%
swap_pop 1277/s 18% -- -82%
flag 7189/s 565% 463% --


***>> 300/5000
Rate splice swap_pop flag
splice 776/s -- -21% -81%
swap_pop 988/s 27% -- -75%
flag 4000/s 416% 305% --


***>> 1000/5000
Rate splice swap_pop flag
splice 354/s -- -26% -72%
swap_pop 477/s 35% -- -62%
flag 1262/s 257% 165% --


***>> 2000/5000
Rate splice swap_pop flag
splice 212/s -- -23% -65%
swap_pop 274/s 29% -- -54%
flag 602/s 184% 119% --
 

robic0

[robic0's previous post, quoted in full -- benchmark script and results snipped]

Noticing that splice and pop each produce the selected element (while the
flag loop doesn't), I thought I would be fair and re-run the flag method
with an extra assignment. It's still top dog, just a few percentage points
lower.

sub use_flag {
    my $asize = @ARRAY;
    my $rnumb;
    my @flag_array;

    for (0 .. $ssize) {
        while (defined $flag_array[ ($rnumb = int(rand($asize))) ]) {}
        $flag_array[$rnumb] = 1;
        $_ = $ARRAY[$rnumb];
    }
}


***>> 150/2500
Rate swap_pop splice flag
swap_pop 2119/s -- -1% -68%
splice 2140/s 1% -- -68%
flag 6667/s 215% 211% --


***>> 300/2500
Rate swap_pop splice flag
swap_pop 1432/s -- -3% -58%
splice 1471/s 3% -- -57%
flag 3441/s 140% 134% --


***>> 1000/2500
Rate swap_pop splice flag
swap_pop 573/s -- -11% -43%
splice 644/s 12% -- -36%
flag 1003/s 75% 56% --


***>> 2000/2500
Rate swap_pop splice flag
swap_pop 307/s -- -24% -25%
splice 406/s 32% -- -1%
flag 410/s 34% 1% --


***>> 150/3000
Rate splice swap_pop flag
splice 1773/s -- -5% -72%
swap_pop 1860/s 5% -- -71%
flag 6398/s 261% 244% --


***>> 300/3000
Rate splice swap_pop flag
splice 1238/s -- -5% -64%
swap_pop 1309/s 6% -- -62%
flag 3404/s 175% 160% --


***>> 1000/3000
Rate splice swap_pop flag
splice 545/s -- -1% -46%
swap_pop 548/s 1% -- -46%
flag 1013/s 86% 85% --


***>> 2000/3000
Rate swap_pop splice flag
swap_pop 303/s -- -10% -34%
splice 336/s 11% -- -27%
flag 462/s 52% 37% --


***>> 150/3500
Rate splice swap_pop flag
splice 1538/s -- -8% -76%
swap_pop 1675/s 9% -- -74%
flag 6337/s 312% 278% --


***>> 300/3500
Rate splice swap_pop flag
splice 1079/s -- -11% -68%
swap_pop 1210/s 12% -- -64%
flag 3386/s 214% 180% --


***>> 1000/3500
Rate splice swap_pop flag
splice 475/s -- -10% -54%
swap_pop 528/s 11% -- -48%
flag 1026/s 116% 94% --


***>> 2000/3500
Rate splice swap_pop flag
splice 291/s -- -1% -38%
swap_pop 293/s 1% -- -38%
flag 472/s 62% 61% --


***>> 150/5000
Rate splice swap_pop flag
splice 1088/s -- -15% -82%
swap_pop 1288/s 18% -- -78%
flag 5924/s 444% 360% --


***>> 300/5000
Rate splice swap_pop flag
splice 777/s -- -22% -76%
swap_pop 994/s 28% -- -70%
flag 3299/s 325% 232% --


***>> 1000/5000
Rate splice swap_pop flag
splice 350/s -- -28% -66%
swap_pop 484/s 38% -- -53%
flag 1026/s 193% 112% --


***>> 2000/5000
Rate splice swap_pop flag
splice 210/s -- -25% -58%
swap_pop 280/s 34% -- -43%
flag 495/s 136% 77% --
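
[Aside: to "extrapolate to the working model", here is a sketch (not
robic0's code) of the flag approach as a picker that actually hands back
the selected elements. The assumption, as throughout the thread, is that
$k stays small relative to the array, otherwise the retry loop gets
expensive.]

sub pick_k_flag {
    my ($k, $array) = @_;
    my $n = @$array;
    my (@picked, @flag);
    for (1 .. $k) {
        my $r;
        do { $r = int rand $n } while $flag[$r];
        $flag[$r] = 1;                 # mark the index as used
        push @picked, $array->[$r];
    }
    return @picked;
}

my @sample = pick_k_flag(10, [ 0 .. 412 ]);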
 

Peter J. Holzer

John said:
In which case you are using quite an outdated linker :-D

I'm sure that linker was quite up to date when Simonyi wrote his paper
:).

Do you have a problem with the past tense? I wrote "the original
proposal made a lot of sense".
A formula doesn't have to fit on a single line.

But it's easier to read when it does.
If you use short names like a, b, c etc. you have to explain what they
mean somewhere (unless in the trivial case it's obvious), and hence
you end up with a piece of documentation, and a formula, which must be
kept in sync.

In the non-trivial case I need documentation anyway. Code only shows how
something is done but not why it is done that way.

hp
 

Jacob JKW

This is no good. The seed you are feeding to MT is completely determined
by the seed used by CORE::srand (which is called implicitly the first
time you use CORE::rand). Since there are only ~4 billion possible
seeds for CORE::srand (last I checked, anyway), there are only ~4 billion
possible seeds that your method will use for MT.

So while it may have a periodicity of 2**19937-1, you are only accessing
2**32 (or maybe 2**31, or 2**64) different entry points into that stream,
which I'm pretty sure will defeat the purpose.
Well don't I feel silly?

Anyway, (even if it's cheating) this should work:

use Math::Random::MT;
my $rand_gen;
BEGIN {
    require LWP::Simple;
    my $RAND_URL =
      \("http://random.org/cgi-bin/randnum?num=1248&min=1&max=65536&col=2");
    my @seed;
    foreach ( split(/\n/, LWP::Simple::get($$RAND_URL)) ) {
        m/^([0-9]+)\s+([0-9]+)$/;
        push @seed, $1 + $2 * 2**16;
    }
    $rand_gen = Math::Random::MT->new(@seed);
}
 

Dr.Ruud

Peter J. Holzer wrote:
Code only shows
how something is done but not why it is done that way.


Often it doesn't, like:

if ( /^BEGIN/ ) {...}

if ( index( $_, 'BEGIN' ) == 0 ) {...}

if ( substr( $_, 0, 5 ) eq 'BEGIN' ) {...}

but in a more complex algorithm, you can make a lot of the why-s show
through by choosing good variable names etc., and of course by including
appropriate comments. (But maybe you meant code without comments?)
 
