Iterating hashes

Dave Saville · May 14, 2013

If I have

foreach (sort keys %hash)

Then I know that there is a posible performance hit as the keys are
all extracted and then sorted. But is this because of the sort or are
they always all extracted first? If the latter is true then how do you
iterate over a hash without taking the hit?

I am having some problems with a script that has two hashes upon which
I am trying to do inner and outer joins amongst other things. The
hashes are roughly the same size and have over 60,000 keys the
majority of the keys have a length of approx 70 characters. The hash
values are a three element array: A mixed case copy of the key and two
integers.

Peter Makholm · May 14, 2013

Dave Saville said:
Then I know that there is a posible performance hit as the keys are
all extracted and then sorted. But is this because of the sort or are
they always all extracted first? If the latter is true then how do you
iterate over a hash without taking the hit?

You kan use the each() function in a while loop:

while (($key, $value) = each %hash) {
print $key, "\n";
}

See the relevant part of the perlfunc manual page or
http://perldoc.perl.org/functions/each.html

//Makholm

Rainer Weikusat · May 14, 2013

Peter Makholm said:
You kan use the each() function in a while loop:

while (($key, $value) = each %hash) {
print $key, "\n";
}

There's actually a variety of options and depending on the number of
pairs in the hash, they perform differently ('values' came out first
everwhere for the tests I made).

----------------------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];

print("\n===\n$_[0] keys\n===\n");

cmpthese(-4,
{
keys => sub {
my $v;

$v = $h{$_} for keys(%h);
},

sorted_keys => sub {
my $v;

$v = $h{$_} for sort(keys(%h));
},

values => sub {
1 for values(%h);
},

scalar_each => sub {
my ($v, $k);

$v = $h{$k} while $k = each(%h);
},

list_each => sub {
my ($v, $k);

1 while ($k, $v) = each(%h);
}});
}

traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;

Dave Saville · May 14, 2013

Peter Makholm said:
Peter Makholm said:

You kan use the each() function in a while loop:

while (($key, $value) = each %hash) {
print $key, "\n";
}

Click to expand...

There's actually a variety of options and depending on the number of
pairs in the hash, they perform differently ('values' came out first
everwhere for the tests I made).

----------------------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];

print("\n===\n$_[0] keys\n===\n");

cmpthese(-4,
{
keys => sub {
my $v;

$v = $h{$_} for keys(%h);
},

sorted_keys => sub {
my $v;

$v = $h{$_} for sort(keys(%h));
},

values => sub {
1 for values(%h);
},

scalar_each => sub {
my ($v, $k);

$v = $h{$k} while $k = each(%h);
},

list_each => sub {
my ($v, $k);

1 while ($k, $v) = each(%h);
}});
}

traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;

===
100 keys
===
Rate sorted_keys scalar_each list_each keys
values
sorted_keys 5715/s -- -28% -36% -46%
-85%
scalar_each 7956/s 39% -- -10% -25%
-80%
list_each 8863/s 55% 11% -- -16%
-77%
keys 10602/s 85% 33% 20% --
-73%
values 39161/s 585% 392% 342% 269%

Rainer Weikusat · May 14, 2013

Ben Morrow said:
Quoth Rainer Weikusat said:

Peter Makholm said:

Then I know that there is a posible performance hit as the keys are
all extracted and then sorted. But is this because of the sort or are
they always all extracted first? If the latter is true then how do you
iterate over a hash without taking the hit?

You kan use the each() function in a while loop:

while (($key, $value) = each %hash) {
print $key, "\n";
}

Click to expand...

There's actually a variety of options and depending on the number of
pairs in the hash, they perform differently ('values' came out first
everwhere for the tests I made).

----------------------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];

print("\n===\n$_[0] keys\n===\n");

sorted_keys => sub {
my $v;

$v = $h{$_} for sort(keys(%h));
},

values => sub {
1 for values(%h);
},

Click to expand...

That's hardly a fair comparison. In fact, 'values' coming out faster is
a red herring as well: it's only happening because the values are all 1
which is much faster to copy than a string.

With a fairer test like

use Benchmark qw/cmpthese/;

my %h = map +("$_", "$_"), 1..60_000;
my $x;

cmpthese -5, {
keys => sub { $x = $_ for keys %h },
sort => sub { $x = $_ for sort keys %h },
values => sub { $x = $_ for values %h },
keach => sub { 1 while $x = each %h },
veach => sub { 1 while $x = (each %h)[1] },
};

Why would this be 'a fair test' when the keys are copied for no
particular reason while no attempt is made to determine the values
except for the 'values' and 'each in list context' cases? Iterating
over the keys of a hash while not looking at the values associated
with those keys at all seems to be a rather bizarre idea of a use
case.

Willem · May 14, 2013

Rainer Weikusat wrote:
) Why would this be 'a fair test' when the keys are copied for no
) particular reason while no attempt is made to determine the values
) except for the 'values' and 'each in list context' cases? Iterating
) over the keys of a hash while not looking at the values associated
) with those keys at all seems to be a rather bizarre idea of a use
) case.

Perl doesn't have a 'set' type, and typically a hash is used for that, and
that is a perfectly legitimate use case for using only the keys of a hash.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Dr.Ruud · May 14, 2013

I'm not sure why
'values' is cheaper than 'keys', but I suspect it has something to do
with the fact that hash keys are shared.

perldoc -f keys:

The returned values are copies of the original keys in the hash, so
modifying them will not affect the original hash.

perldoc -f values:

Note that the values are not copied, which means modifying them will
modify the contents of the hash

Rainer Weikusat · May 14, 2013

Willem said:
Rainer Weikusat wrote:
) Why would this be 'a fair test' when the keys are copied for no
) particular reason while no attempt is made to determine the values
) except for the 'values' and 'each in list context' cases? Iterating
) over the keys of a hash while not looking at the values associated
) with those keys at all seems to be a rather bizarre idea of a use
) case.

Perl doesn't have a 'set' type, and typically a hash is used for that, and
that is a perfectly legitimate use case for using only the keys of a
hash.

While you're of course right on that, I regard this is 'rather
bizarre' use case for a hash.

Rainer Weikusat · May 14, 2013

Dave Saville said:
]

use Benchmark qw(cmpthese);

Click to expand...

[...]

[...]

===
100 keys
===
Rate sorted_keys scalar_each list_each keys
values
sorted_keys 5715/s -- -28% -36% -46%
-85%
scalar_each 7956/s 39% -- -10% -25%
-80%
list_each 8863/s 55% 11% -- -16%
-77%
keys 10602/s 85% 33% 20% --
-73%
values 39161/s 585% 392% 342% 269%

The first column is the 'speed' in terms of 'executions per second',
the other columns show the 'speed differences' of the code of the
current row relative to the others, expressed as percentage of the
speed of the 'column algorithm'. Eg, for 'keys', the sorted_keys
column says '85%'. This is calculated as

int((10602 - 5715) / 5715 * 100)

Rainer Weikusat · May 14, 2013

[...]

In any case, I thought the purpose here was to benchmark 'for (keys)'
and 'while (each)' as methods of iterating over a hash. The difference
between them is so small that a single extra assignment, or an
assignment that copies a string rather than a number, will completely
swamp the difference you are trying to measure, at least until you get
to the point where the extra memory allocated by keys causes the process
to start swapping.

Testing just this, the result is (as I already knew from a past
thread) that keys is (on the computer where I tested this)
significantly faster than each for hashes with up to 10,000 keys.

----------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];

print("\n===\n$_[0] keys\n===\n");

cmpthese(-5,
{
keys => sub {
1 for keys(%h);
},

each => sub {
my $k;

1 while $k = each(%h);
}});
}

traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;
-----------------

I'm somewhat uncertain what "it is possible to keep adding unrelated
code to each example until that dominates the execution time so
overwhelmingly that this difference can't be measured anymore" is
supposed to communicate in this context ...

George Mpouras · May 15, 2013

Î£Ï„Î¹Ï‚ 14/5/2013 16:37, Î¿/Î· Dave Saville ÎÎ³ÏÎ±ÏˆÎµ:

I am trying to do inner and outer joins amongst other things. The

if you explain what to you mean with "inner and outer joins" I could
provide a small piece of code.

Rainer Weikusat · May 15, 2013

Ben Morrow said:
Quoth Rainer Weikusat said:

Ben Morrow <[email protected]> writes:

Click to expand...

[...]

Testing just this, the result is (as I already knew from a past
thread) that keys is (on the computer where I tested this)
significantly faster than each for hashes with up to 10,000 keys.

----------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];

print("\n===\n$_[0] keys\n===\n");

cmpthese(-5,
{
keys => sub {
1 for keys(%h);
},

each => sub {
my $k;

1 while $k = each(%h);

Click to expand...

Did you actually read what I wrote? That 'each' benchmark does an extra
assignment, so of *course* it's slower.

The 'each' benchmark does not do 'an extra assignment': In order to
use the key (similar to the earlier one which purposely included an
operation extracting the value in every subroutine iterating over the
keys) in a loop, it has to be stored somewhere (aka 'assigned to
something') which is part of the cost of using each, whereas (term?)
keys can be used with for like any other 'term' (term?) returning a
list. Because of this

1. If iterating over the keys and (usually) examining the values is
desired, keys is the sensible choice in almost all cases and each
becomes the sensible choice once the hashes become 'large'.

2. If iterating over the values is sufficient, values is the way to
go, except possibly for extremly large hashes (> 1,000,000 keys for
this example).

3. If a predictable 'iteration order' is desired/ required, there's
not much which can be done except sorting the keylist beforehand (But
when comparing hashes, this can be avoided by iterating over the keys
of one hash, looking for each key in the other hash and deleteing it
if it was found. If the loop didn't terminate early because of a
mismatch, whatever is left in the second after it has was stuff not
contained in the first).

Quick sort algorithm	1	Feb 22, 2023
Some sort questions - especially hashes	4	Oct 10, 2013
Sorting hash of hashes	3	Nov 21, 2011
FAQ 4.56 How do I merge two hashes?	2	Mar 9, 2011
ordered hashes	13	Dec 9, 2008
Iterating over an Array of Hashes	14	May 3, 2011
FAQ 4.62 What's the difference between "delete" and "undef" with hashes?	0	Feb 5, 2011
FAQ 4.65 How can I get the unique keys from two hashes?	0	Jan 6, 2011

Iterating hashes

Dave Saville

Peter Makholm

Rainer Weikusat

Dave Saville

Rainer Weikusat

Willem

Dr.Ruud

Rainer Weikusat

Rainer Weikusat

Rainer Weikusat

George Mpouras

Rainer Weikusat

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads