efficient way to select a random value from a hash

viki

sub RandomElement { # select random element from the hash
    my ($hashref) = @_;
    my @keys = keys(%{$hashref});
    my $random = $hashref->{ $keys[ int(rand( @keys )) ]};
}

When the hash is large, the creation of @keys for every
invocation is inefficient. It's wasted work that's O(N) for every call to
RandomElement.

What would be the fastest version of RandomElement?

Thanks
Viki
 
xhoster

viki said:
sub RandomElement { # select random element from the hash
my ($hashref) = @_;
my @keys = keys(%{$hashref});
my $random = $hashref->{ $keys[ int(rand( @keys )) ]};
}

When the hash is large, the creation of @keys for every
invocation is inefficient. It's wasted work that's O(N) for every call to
RandomElement.

Use a different data structure.
What would be the fastest version of RandomElement ?

That depends on what is done with $hashref between calls to RandomElement.

And on how random the random element must be.
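
For illustration only (this sketch is mine, not Xho's, and the helper
name is made up): if keys are only ever added, never deleted or
overwritten, one such data structure is the hash plus a parallel array
of its values, which makes every pick O(1):

my (%hash, @values);

sub add_element {
    my ($key, $value) = @_;
    die "duplicate key: $key" if exists $hash{$key};
    $hash{$key} = $value;
    push @values, $value;            # keep the value list in sync with the hash
}

sub RandomElement {
    return $values[ rand @values ];  # array subscripts truncate the float
}

Deletions or overwrites would need extra bookkeeping (or a periodic
rebuild of @values), which is why it depends on what happens to the
hash between calls.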

Xho

 
Jürgen Exner

viki said:
sub RandomElement { # select random element from the hash
my ($hashref) = @_;
my @keys = keys(%{$hashref});
my $random = $hashref->{ $keys[ int(rand( @keys )) ]};
}

When the hash is large, the creation of @keys for every
invocation is inefficient. It's wasted work that's O(N) for every call to
RandomElement.

If you flatten the hash into a list then the keys will be every other
element of that list which you could easily pick by multiplying the
random number by 2 (after adjusting the range accordingly).
I don't know for sure, but treating $hashref as a reference to an array
might do the trick already.
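
For instance (a rough sketch of that flattening idea, mine rather than
jue's code): in the flattened list the keys sit at even indices and the
values at odd ones, so a random value would be

sub RandomElement {
    my ($hashref) = @_;
    my @flat = %$hashref;                          # (key1, val1, key2, val2, ...)
    return $flat[ 2 * int(rand(@flat / 2)) + 1 ];  # odd index = a value
}

Note this still flattens the hash on every call, so by itself it does
not remove the O(N) cost.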

jue
 
John W. Krahn

viki said:
sub RandomElement { # select random element from the hash
my ($hashref) = @_;
my @keys = keys(%{$hashref});
my $random = $hashref->{ $keys[ int(rand( @keys )) ]};
}

When the hash is large, the creation of @keys for every
invocation is inefficient. It's wasted work that's O(N) for every call to
RandomElement.

What would be the fastest version of RandomElement ?

This appears to be a bit faster:

sub RandomElement { # select random element from the hash
    my ( $hashref ) = @_;
    my $count;
    my $random = ( grep 1 > rand ++$count, values %$hashref )[ -1 ];
}
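
(In case the one-liner is opaque: each value is tested with
1 > rand ++$count, i.e. kept with probability 1/count, grep collects
every value that passed its own test, and the [-1] slice takes the last
of them -- the standard "reservoir" pick, which ends up uniform over
the whole hash.)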



John
 
John W. Krahn

Ben said:
Quoth "John W. Krahn said:
viki said:
sub RandomElement { # select random element from the hash
my ($hashref) = @_;
my @keys = keys(%{$hashref});
my $random = $hashref->{ $keys[ int(rand( @keys )) ]};
}

When the hash is large, the creation of @keys for every
invocation is inefficient. It's wasted work that's O(N) for every call to
RandomElement.

What would be the fastest version of RandomElement ?
This appears to be a bit faster:

sub RandomElement { # select random element from the hash
my ( $hashref ) = @_;
my $count;
my $random = ( grep 1 > rand ++$count, values %$hashref )[ -1 ];
}

Presumably

sub RandomElement {
    my ($hashref) = @_;
    my ($count, $got);

    scalar keys %$hashref; # reset iterator
    while (1) {
        my ($k, $v) = each %$hashref;
        $got = $v if 1 > rand ++$count;
    }
    return $got;
}

would be more efficient for a large hash, since it doesn't build a list
of the values. Since none of these are using the hash for lookup, it
would be better to keep the values in an array and just use
$ary[rand @ary] instead.

Your while (1) loop doesn't end so it would probably be less efficient.

ITYM:

while ( my ( $k, $v ) = each %$hashref ) {
    $got = $v if 1 > rand ++$count;
}

But under my Benchmark tests that is less efficient than the OP's version.



John
 
smallpond

sub RandomElement { # select random element from the hash
    my ($hashref) = @_;
    my @keys = keys(%{$hashref});
    my $random = $hashref->{ $keys[ int(rand( @keys )) ]};

}

When the hash is large, the creation of @keys for every
invocation is inefficient. It's wasted work that's O(N) for every call to
RandomElement.

What would be the fastest version of RandomElement ?

Thanks
Viki


sub ran3($) {
    my ($hashref) = @_;
    my @vals = values(%{$hashref});
    $vals[ int(rand( @vals )) ];
}

10,000 calls to this took 2.341 s

10,000 calls to yours took 3.947 s

I also tried flattening the whole hash into
key/value pairs and choosing an odd element.
That took 6.161 s.
 
John W. Krahn

Ben said:
Quoth "John W. Krahn said:
Ben said:
Presumably

sub RandomElement {
my ($hashref) = @_;
my ($count, $got);

scalar keys %$hashref; # reset iterator
while (1) {
my ($k, $v) = each %$hashref;
$got = $v if 1 > rand ++$count;
}
return $got;
}

would be more efficient for a large hash, since it doesn't build a list
of the values. Since none of these are using the hash for lookup, it
would be better to keep the values in an array and just use
$ary[rand @ary] instead.
Your while (1) loop doesn't end so it would probably be less efficient.

:)

I had initially forgotten that the algorithm needs to iterate over the
whole hash, and was planning something like

last if 1 > rand...;

to exit early.
ITYM:

while ( my ( $k, $v ) = each %$hashref ) {
$got = $v if 1 > rand ++$count;
}

But under my Benchmark tests that is less efficient than the OP's version.

Interesting. Presumably it's because there are more ops involved, and a
new BLOCK every time around the while loop.

If you exit early then it is not truly random. You have to iterate over
the complete list to get a random element.

Actually, this appears to be a lot more efficient than the others:

sub RandomElement3 {
    my ( $hashref ) = @_;
    my $index  = rand values %$hashref;
    my $random = ( values %$hashref )[ $index ];
}
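
(Presumably because rand is a named unary operator, so the first
values call is in scalar context and just returns the element count;
the second call still builds the value list, but the list slice avoids
copying it into a named array.)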



John
 
A. Sinan Unur

sub RandomElement { # select random element from the hash
    my ($hashref) = @_;
    my @keys = keys(%{$hashref});
    my $random = $hashref->{ $keys[ int(rand( @keys )) ]};

}

When the hash is large, the creation of @keys for every
invocation is inefficient. It's wasted work that's O(N) for every call
to RandomElement.

What would be the fastest version of RandomElement ?
....

sub ran3($) {
my ($hashref) = @_;
my @vals = values(%{$hashref});
$vals[ int(rand( @vals )) ];
}

10,000 calls to this took 2.341 s

Interesting because here is what I get:

C:\Temp> cat t.pl

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw( cmpthese );

my $h;

for (1 .. 10_000) {
    $h->{"key$_"} = rand;
}

cmpthese -5, {
    smallpond => sub { ran3($h) },
    sinan     => sub { ran42($h) },
};

sub ran3 {
    my ($h) = @_;
    my @v = values %$h;
    $v[rand @v];
}

sub ran42 {
    my ( $h ) = @_;
    (values %$h)[rand values %$h ];
}

__END__

C:\Temp> t
Rate smallpond sinan
smallpond 288/s -- -82%
sinan 1569/s 446% --


--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
John W. Krahn

darkon said:
Are you sure? The Perl FAQ has a code sample for selecting a random
line from a file without reading the entire file, and references Knuth
for a proof:

perldoc -q "random line"

perldoc -q "random line"

How do I select a random line from a file?

Here’s an algorithm from the Camel Book:

srand;
rand($.) < 1 && ($line = $_) while <>;

This has a significant advantage in space over reading the whole
file in. You can find a proof of this method in The Art of Computer
Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.


Please read the code carefully. At no point does it exit the while loop.

The text says "reading the whole file in" which the code does not do.
But it does read the whole file one line at a time.
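
(The reason the method is fair: line k replaces the current pick with
probability 1/k. If each earlier line held the pick with probability
1/(k-1), it keeps it with probability (1/(k-1)) * (1 - 1/k) = 1/k, so
after n lines every line has probability 1/n -- which is exactly why
the loop has to run over the entire input.)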




John
 
Eric Pozharski

On 2009-04-16, Jürgen Exner said:
I don't know for sure, but treating $hashref as a reference to an array
might do the trick already.

No, it doesn't

{2565:11} [0:0]$ perl -Mstrict -wle 'my $x= { a => "b" }; print scalar @$x'
Not an ARRAY reference at -e line 1.

However,

{2565:11} [0:0]$ perl -Mstrict -wle 'my $x= { a => "b" }; print scalar(() = %$x)'
2

And one more point of concern: serializing the hash into an array is
subject to Perl's own key reordering. Maybe just do it once?
 
Jürgen Exner

Eric Pozharski said:
On 2009-04-16, Jürgen Exner said:
I don't know for sure, but treating $hashref as a reference to an array
might do the trick already.

No, it doesn't

{2565:11} [0:0]$ perl -Mstrict -wle 'my $x= { a => "b" }; print scalar @$x'
Not an ARRAY reference at -e line 1.

However,

{2565:11} [0:0]$ perl -Mstrict -wle 'my $x= { a => "b" }; print scalar(() = %$x)'
2
Thanks!

And one more point of concern: serializing the hash into an array is
subject to Perl's own key reordering. Maybe just do it once?

Hmmm, when would perl reorder the keys? I would assume only when the
hash itself changes, i.e. when elements are added/removed. And in that
case probably you want to restart the random sequence anyway.

But the key question still remains unanswered: is
serialization/flattening of the hash faster than a keys()?

jue
 
Eric Pozharski

Hmmm, when would perl reorder the keys? I would assume only when the
hash itself changes, i.e. when elements are added/removed. And in that
case probably you want to restart the random sequence anyway.

My point was: when picking a random item from an already randomized
sequence, does the picked item become more or less random? I don't know.

But the key question still remains unanswered: is
serialization/flattening of the hash faster than a keys()?

That's easy

perl -MBenchmark=cmpthese,timethese -wle '
my %h = qw| a b c d e f g h i j k l m n |;
my @ph = %h;
my $z;
cmpthese timethese -5, {
keys => sub { $z = (keys %h)[2] },
flat => sub { $z = (() = %h)[4] },
pflt => sub { $z = $ph[4] },
};
'
Benchmark: running flat, keys, pflt for at least 5 CPU seconds...

flat: 5 wallclock secs ( 5.07 usr + 0.02 sys = 5.09 CPU) @ 203814.73/s (n=1037417)
keys: 5 wallclock secs ( 5.05 usr + 0.01 sys = 5.06 CPU) @ 187490.71/s (n=948703)
pflt: 6 wallclock secs ( 5.74 usr + 0.00 sys = 5.74 CPU) @ 1735072.47/s (n=9959316)

Rate keys flat pflt
keys 187491/s -- -8% -89%
flat 203815/s 9% -- -88%
pflt 1735072/s 825% 751% --

And, the sequence is the same:

perl -we '
my %h = qw| a b c d e f g h i j k l m n |;
my @ph = %h;
print "$_ " foreach keys %h;
print "\n";
print "$ph[$_ * 2] " for 0 .. $#ph / 2;
print "\n";
'
e c k a g m i
e c k a g m i
 
Xho Jingleheimerschmidt

Eric said:
My point was: when picking a random item from an already randomized
sequence, does the picked item become more or less random? I don't know.

The order in a hash is not random, it is arbitrary. (In some
implementations of perl, in some situations, at some times, it may be
close to randomized.)


Xho
 
Eric Pozharski

Xho Jingleheimerschmidt said:
Eric Pozharski wrote: *SKIP*

The order in a hash is not random, it is arbitrary. (In some
implementations of perl, in some situations, at some times, it may be
close to randomized.)

Oh, my god! It's even worse -- the order is arbitrary (good point,
BTW). In such circumstances, the only thing that could help (me?) is
B<sort>. Luckily, I'm not the OP.
 
smallpond

A. Sinan Unur said:
smallpond said:
sub RandomElement { # select random element from the hash
my ($hashref) = @_;
my @keys = keys(%{$hashref});
my $random = $hashref->{ $keys[ int(rand( @keys )) ]};
}
When the hash is large, the creation of @keys for every
invocation is inefficient. It's wasted work that's O(N) for every call
to RandomElement.
What would be the fastest version of RandomElement ?
...

sub ran3($) {
my ($hashref) = @_;
my @vals = values(%{$hashref});
$vals[ int(rand( @vals )) ];
}
10,000 calls to this took 2.341 s

Interesting because here is what I get:

C:\Temp> cat t.pl

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw( cmpthese );

my $h;

for (1 .. 10_000) {
$h->{"key$_"} = rand;

}

cmpthese -5, {
smallpond => sub { ran3($h) },
sinan => sub { ran42($h) },

};

sub ran3 {
my ($h) = @_;
my @v = values %$h;
$v[rand @v];

}

sub ran42 {
my ( $h ) = @_;
(values %$h)[rand values %$h ];

}

__END__

C:\Temp> t
Rate smallpond sinan
smallpond 288/s -- -82%
sinan 1569/s 446% --


Your code is clearly better since it avoids creating the
unneeded array. What I was trying to compare was
extracting values vs keys vs flattening the entire hash
and picking an odd element (a value). Using your
method, extracting the list of values still wins,
although maybe there is a better way to do ranf.


cmpthese -5, {
    flatten => sub { ranf($h) },
    values  => sub { ranv($h) },
    keys    => sub { rank($h) },
};

sub ranf {
    my ($h) = @_;
    # int() before the arithmetic keeps the index odd (a value, not a key)
    (%$h)[ 2 * int(rand keys %$h) + 1 ];
}

sub ranv {
    my ($h) = @_;
    (values %$h)[rand values %$h];
}

sub rank {
    my ($h) = @_;
    $h->{(keys %$h)[rand keys %$h]};
}


Rate flatten keys values
flatten 71.3/s -- -3% -87%
keys 73.5/s 3% -- -86%
values 542/s 659% 637% --
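
(That ordering is roughly what you'd expect: flattening builds a
2N-element list, and the keys version has to create fresh copies of
every key string and then do a hash lookup, while values only builds
the N-element list that is actually needed.)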
 
