Iterating hashes

D

Dave Saville

If I have

foreach (sort keys %hash)

Then I know that there is a posible performance hit as the keys are
all extracted and then sorted. But is this because of the sort or are
they always all extracted first? If the latter is true then how do you
iterate over a hash without taking the hit?

I am having some problems with a script that has two hashes upon which
I am trying to do inner and outer joins amongst other things. The
hashes are roughly the same size and have over 60,000 keys the
majority of the keys have a length of approx 70 characters. The hash
values are a three element array: A mixed case copy of the key and two
integers.
 
P

Peter Makholm

Dave Saville said:
Then I know that there is a posible performance hit as the keys are
all extracted and then sorted. But is this because of the sort or are
they always all extracted first? If the latter is true then how do you
iterate over a hash without taking the hit?

You kan use the each() function in a while loop:

while (($key, $value) = each %hash) {
print $key, "\n";
}

See the relevant part of the perlfunc manual page or
http://perldoc.perl.org/functions/each.html

//Makholm
 
R

Rainer Weikusat

Peter Makholm said:
You kan use the each() function in a while loop:

while (($key, $value) = each %hash) {
print $key, "\n";
}

There's actually a variety of options and depending on the number of
pairs in the hash, they perform differently ('values' came out first
everwhere for the tests I made).

----------------------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];

print("\n===\n$_[0] keys\n===\n");

cmpthese(-4,
{
keys => sub {
my $v;

$v = $h{$_} for keys(%h);
},

sorted_keys => sub {
my $v;

$v = $h{$_} for sort(keys(%h));
},

values => sub {
1 for values(%h);
},

scalar_each => sub {
my ($v, $k);

$v = $h{$k} while $k = each(%h);
},

list_each => sub {
my ($v, $k);

1 while ($k, $v) = each(%h);
}});
}

traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;
 
D

Dave Saville

Peter Makholm said:
You kan use the each() function in a while loop:

while (($key, $value) = each %hash) {
print $key, "\n";
}

There's actually a variety of options and depending on the number of
pairs in the hash, they perform differently ('values' came out first
everwhere for the tests I made).

----------------------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];

print("\n===\n$_[0] keys\n===\n");

cmpthese(-4,
{
keys => sub {
my $v;

$v = $h{$_} for keys(%h);
},

sorted_keys => sub {
my $v;

$v = $h{$_} for sort(keys(%h));
},

values => sub {
1 for values(%h);
},

scalar_each => sub {
my ($v, $k);

$v = $h{$k} while $k = each(%h);
},

list_each => sub {
my ($v, $k);

1 while ($k, $v) = each(%h);
}});
}

traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;

===
100 keys
===
Rate sorted_keys scalar_each list_each keys
values
sorted_keys 5715/s -- -28% -36% -46%
-85%
scalar_each 7956/s 39% -- -10% -25%
-80%
list_each 8863/s 55% 11% -- -16%
-77%
keys 10602/s 85% 33% 20% --
-73%
values 39161/s 585% 392% 342% 269%
 
R

Rainer Weikusat

Ben Morrow said:
Quoth Rainer Weikusat said:
Peter Makholm said:
Then I know that there is a posible performance hit as the keys are
all extracted and then sorted. But is this because of the sort or are
they always all extracted first? If the latter is true then how do you
iterate over a hash without taking the hit?

You kan use the each() function in a while loop:

while (($key, $value) = each %hash) {
print $key, "\n";
}

There's actually a variety of options and depending on the number of
pairs in the hash, they perform differently ('values' came out first
everwhere for the tests I made).

----------------------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];

print("\n===\n$_[0] keys\n===\n");


sorted_keys => sub {
my $v;

$v = $h{$_} for sort(keys(%h));
},

values => sub {
1 for values(%h);
},

That's hardly a fair comparison. In fact, 'values' coming out faster is
a red herring as well: it's only happening because the values are all 1
which is much faster to copy than a string.

With a fairer test like

use Benchmark qw/cmpthese/;

my %h = map +("$_", "$_"), 1..60_000;
my $x;

cmpthese -5, {
keys => sub { $x = $_ for keys %h },
sort => sub { $x = $_ for sort keys %h },
values => sub { $x = $_ for values %h },
keach => sub { 1 while $x = each %h },
veach => sub { 1 while $x = (each %h)[1] },
};

Why would this be 'a fair test' when the keys are copied for no
particular reason while no attempt is made to determine the values
except for the 'values' and 'each in list context' cases? Iterating
over the keys of a hash while not looking at the values associated
with those keys at all seems to be a rather bizarre idea of a use
case.
 
W

Willem

Rainer Weikusat wrote:
) Why would this be 'a fair test' when the keys are copied for no
) particular reason while no attempt is made to determine the values
) except for the 'values' and 'each in list context' cases? Iterating
) over the keys of a hash while not looking at the values associated
) with those keys at all seems to be a rather bizarre idea of a use
) case.

Perl doesn't have a 'set' type, and typically a hash is used for that, and
that is a perfectly legitimate use case for using only the keys of a hash.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
D

Dr.Ruud

I'm not sure why
'values' is cheaper than 'keys', but I suspect it has something to do
with the fact that hash keys are shared.

perldoc -f keys:

The returned values are copies of the original keys in the hash, so
modifying them will not affect the original hash.


perldoc -f values:

Note that the values are not copied, which means modifying them will
modify the contents of the hash
 
R

Rainer Weikusat

Willem said:
Rainer Weikusat wrote:
) Why would this be 'a fair test' when the keys are copied for no
) particular reason while no attempt is made to determine the values
) except for the 'values' and 'each in list context' cases? Iterating
) over the keys of a hash while not looking at the values associated
) with those keys at all seems to be a rather bizarre idea of a use
) case.

Perl doesn't have a 'set' type, and typically a hash is used for that, and
that is a perfectly legitimate use case for using only the keys of a
hash.

While you're of course right on that, I regard this is 'rather
bizarre' use case for a hash.
 
R

Rainer Weikusat

Dave Saville said:
]
use Benchmark qw(cmpthese);
[...]

[...]

===
100 keys
===
Rate sorted_keys scalar_each list_each keys
values
sorted_keys 5715/s -- -28% -36% -46%
-85%
scalar_each 7956/s 39% -- -10% -25%
-80%
list_each 8863/s 55% 11% -- -16%
-77%
keys 10602/s 85% 33% 20% --
-73%
values 39161/s 585% 392% 342% 269%

The first column is the 'speed' in terms of 'executions per second',
the other columns show the 'speed differences' of the code of the
current row relative to the others, expressed as percentage of the
speed of the 'column algorithm'. Eg, for 'keys', the sorted_keys
column says '85%'. This is calculated as

int((10602 - 5715) / 5715 * 100)
 
R

Rainer Weikusat

[...]
In any case, I thought the purpose here was to benchmark 'for (keys)'
and 'while (each)' as methods of iterating over a hash. The difference
between them is so small that a single extra assignment, or an
assignment that copies a string rather than a number, will completely
swamp the difference you are trying to measure, at least until you get
to the point where the extra memory allocated by keys causes the process
to start swapping.

Testing just this, the result is (as I already knew from a past
thread) that keys is (on the computer where I tested this)
significantly faster than each for hashes with up to 10,000 keys.

----------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];


print("\n===\n$_[0] keys\n===\n");

cmpthese(-5,
{
keys => sub {
1 for keys(%h);
},

each => sub {
my $k;

1 while $k = each(%h);
}});
}

traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;
-----------------

I'm somewhat uncertain what "it is possible to keep adding unrelated
code to each example until that dominates the execution time so
overwhelmingly that this difference can't be measured anymore" is
supposed to communicate in this context ...
 
G

George Mpouras

Στις 14/5/2013 16:37, ο/η Dave Saville έγÏαψε:
I am trying to do inner and outer joins amongst other things. The

if you explain what to you mean with "inner and outer joins" I could
provide a small piece of code.
 
R

Rainer Weikusat

Ben Morrow said:
Quoth Rainer Weikusat said:
Ben Morrow <[email protected]> writes:
[...]
Testing just this, the result is (as I already knew from a past
thread) that keys is (on the computer where I tested this)
significantly faster than each for hashes with up to 10,000 keys.

----------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];


print("\n===\n$_[0] keys\n===\n");

cmpthese(-5,
{
keys => sub {
1 for keys(%h);
},

each => sub {
my $k;

1 while $k = each(%h);

Did you actually read what I wrote? That 'each' benchmark does an extra
assignment, so of *course* it's slower.

The 'each' benchmark does not do 'an extra assignment': In order to
use the key (similar to the earlier one which purposely included an
operation extracting the value in every subroutine iterating over the
keys) in a loop, it has to be stored somewhere (aka 'assigned to
something') which is part of the cost of using each, whereas (term?)
keys can be used with for like any other 'term' (term?) returning a
list. Because of this

1. If iterating over the keys and (usually) examining the values is
desired, keys is the sensible choice in almost all cases and each
becomes the sensible choice once the hashes become 'large'.

2. If iterating over the values is sufficient, values is the way to
go, except possibly for extremly large hashes (> 1,000,000 keys for
this example).

3. If a predictable 'iteration order' is desired/ required, there's
not much which can be done except sorting the keylist beforehand (But
when comparing hashes, this can be avoided by iterating over the keys
of one hash, looking for each key in the other hash and deleteing it
if it was found. If the loop didn't terminate early because of a
mismatch, whatever is left in the second after it has was stuff not
contained in the first).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,836
Messages
2,569,748
Members
45,545
Latest member
rapter____0

Latest Threads

Top