Iterating hashes

Discussion in 'Perl Misc' started by Dave Saville, May 14, 2013.

  1. Dave Saville

    Dave Saville Guest

    If I have

    foreach (sort keys %hash)

    Then I know that there is a posible performance hit as the keys are
    all extracted and then sorted. But is this because of the sort or are
    they always all extracted first? If the latter is true then how do you
    iterate over a hash without taking the hit?

    I am having some problems with a script that has two hashes upon which
    I am trying to do inner and outer joins amongst other things. The
    hashes are roughly the same size and have over 60,000 keys the
    majority of the keys have a length of approx 70 characters. The hash
    values are a three element array: A mixed case copy of the key and two
    integers.

    --
    Regards
    Dave Saville
     
    Dave Saville, May 14, 2013
    #1
    1. Advertising

  2. "Dave Saville" <> writes:

    > Then I know that there is a posible performance hit as the keys are
    > all extracted and then sorted. But is this because of the sort or are
    > they always all extracted first? If the latter is true then how do you
    > iterate over a hash without taking the hit?


    You kan use the each() function in a while loop:

    while (($key, $value) = each %hash) {
    print $key, "\n";
    }

    See the relevant part of the perlfunc manual page or
    http://perldoc.perl.org/functions/each.html

    //Makholm
     
    Peter Makholm, May 14, 2013
    #2
    1. Advertising

  3. Peter Makholm <> writes:
    > "Dave Saville" <> writes:
    >
    >> Then I know that there is a posible performance hit as the keys are
    >> all extracted and then sorted. But is this because of the sort or are
    >> they always all extracted first? If the latter is true then how do you
    >> iterate over a hash without taking the hit?

    >
    > You kan use the each() function in a while loop:
    >
    > while (($key, $value) = each %hash) {
    > print $key, "\n";
    > }


    There's actually a variety of options and depending on the number of
    pairs in the hash, they perform differently ('values' came out first
    everwhere for the tests I made).

    ----------------------------
    use Benchmark qw(cmpthese);

    sub traversal_bench
    {
    my %h = map { $_, 1; } 0 .. $_[0];

    print("\n===\n$_[0] keys\n===\n");

    cmpthese(-4,
    {
    keys => sub {
    my $v;

    $v = $h{$_} for keys(%h);
    },

    sorted_keys => sub {
    my $v;

    $v = $h{$_} for sort(keys(%h));
    },

    values => sub {
    1 for values(%h);
    },

    scalar_each => sub {
    my ($v, $k);

    $v = $h{$k} while $k = each(%h);
    },

    list_each => sub {
    my ($v, $k);

    1 while ($k, $v) = each(%h);
    }});
    }

    traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;
     
    Rainer Weikusat, May 14, 2013
    #3
  4. Dave Saville

    Dave Saville Guest

    On Tue, 14 May 2013 14:49:46 UTC, Rainer Weikusat
    <> wrote:

    > Peter Makholm <> writes:
    > > "Dave Saville" <> writes:
    > >
    > >> Then I know that there is a posible performance hit as the keys are
    > >> all extracted and then sorted. But is this because of the sort or are
    > >> they always all extracted first? If the latter is true then how do you
    > >> iterate over a hash without taking the hit?

    > >
    > > You kan use the each() function in a while loop:
    > >
    > > while (($key, $value) = each %hash) {
    > > print $key, "\n";
    > > }

    >
    > There's actually a variety of options and depending on the number of
    > pairs in the hash, they perform differently ('values' came out first
    > everwhere for the tests I made).
    >
    > ----------------------------
    > use Benchmark qw(cmpthese);
    >
    > sub traversal_bench
    > {
    > my %h = map { $_, 1; } 0 .. $_[0];
    >
    > print("\n===\n$_[0] keys\n===\n");
    >
    > cmpthese(-4,
    > {
    > keys => sub {
    > my $v;
    >
    > $v = $h{$_} for keys(%h);
    > },
    >
    > sorted_keys => sub {
    > my $v;
    >
    > $v = $h{$_} for sort(keys(%h));
    > },
    >
    > values => sub {
    > 1 for values(%h);
    > },
    >
    > scalar_each => sub {
    > my ($v, $k);
    >
    > $v = $h{$k} while $k = each(%h);
    > },
    >
    > list_each => sub {
    > my ($v, $k);
    >
    > 1 while ($k, $v) = each(%h);
    > }});
    > }
    >
    > traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;


    ===
    100 keys
    ===
    Rate sorted_keys scalar_each list_each keys
    values
    sorted_keys 5715/s -- -28% -36% -46%
    -85%
    scalar_each 7956/s 39% -- -10% -25%
    -80%
    list_each 8863/s 55% 11% -- -16%
    -77%
    keys 10602/s 85% 33% 20% --
    -73%
    values 39161/s 585% 392% 342% 269%
    --

    Care to explain the numbers please?
    --
    Regards
    Dave Saville
     
    Dave Saville, May 14, 2013
    #4
  5. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >> Peter Makholm <> writes:
    >> > "Dave Saville" <> writes:
    >> >
    >> >> Then I know that there is a posible performance hit as the keys are
    >> >> all extracted and then sorted. But is this because of the sort or are
    >> >> they always all extracted first? If the latter is true then how do you
    >> >> iterate over a hash without taking the hit?
    >> >
    >> > You kan use the each() function in a while loop:
    >> >
    >> > while (($key, $value) = each %hash) {
    >> > print $key, "\n";
    >> > }

    >>
    >> There's actually a variety of options and depending on the number of
    >> pairs in the hash, they perform differently ('values' came out first
    >> everwhere for the tests I made).
    >>
    >> ----------------------------
    >> use Benchmark qw(cmpthese);
    >>
    >> sub traversal_bench
    >> {
    >> my %h = map { $_, 1; } 0 .. $_[0];
    >>
    >> print("\n===\n$_[0] keys\n===\n");
    >>
    >>
    >> sorted_keys => sub {
    >> my $v;
    >>
    >> $v = $h{$_} for sort(keys(%h));
    >> },
    >>
    >> values => sub {
    >> 1 for values(%h);
    >> },

    >
    > That's hardly a fair comparison. In fact, 'values' coming out faster is
    > a red herring as well: it's only happening because the values are all 1
    > which is much faster to copy than a string.
    >
    > With a fairer test like
    >
    > use Benchmark qw/cmpthese/;
    >
    > my %h = map +("$_", "$_"), 1..60_000;
    > my $x;
    >
    > cmpthese -5, {
    > keys => sub { $x = $_ for keys %h },
    > sort => sub { $x = $_ for sort keys %h },
    > values => sub { $x = $_ for values %h },
    > keach => sub { 1 while $x = each %h },
    > veach => sub { 1 while $x = (each %h)[1] },
    > };


    Why would this be 'a fair test' when the keys are copied for no
    particular reason while no attempt is made to determine the values
    except for the 'values' and 'each in list context' cases? Iterating
    over the keys of a hash while not looking at the values associated
    with those keys at all seems to be a rather bizarre idea of a use
    case.
     
    Rainer Weikusat, May 14, 2013
    #5
  6. Dave Saville

    Willem Guest

    Rainer Weikusat wrote:
    ) Why would this be 'a fair test' when the keys are copied for no
    ) particular reason while no attempt is made to determine the values
    ) except for the 'values' and 'each in list context' cases? Iterating
    ) over the keys of a hash while not looking at the values associated
    ) with those keys at all seems to be a rather bizarre idea of a use
    ) case.

    Perl doesn't have a 'set' type, and typically a hash is used for that, and
    that is a perfectly legitimate use case for using only the keys of a hash.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, May 14, 2013
    #6
  7. Dave Saville

    Dr.Ruud Guest

    On 14/05/2013 18:40, Ben Morrow wrote:

    > I'm not sure why
    > 'values' is cheaper than 'keys', but I suspect it has something to do
    > with the fact that hash keys are shared.


    perldoc -f keys:

    The returned values are copies of the original keys in the hash, so
    modifying them will not affect the original hash.


    perldoc -f values:

    Note that the values are not copied, which means modifying them will
    modify the contents of the hash

    --
    Ruud
     
    Dr.Ruud, May 14, 2013
    #7
  8. Willem <> writes:
    > Rainer Weikusat wrote:
    > ) Why would this be 'a fair test' when the keys are copied for no
    > ) particular reason while no attempt is made to determine the values
    > ) except for the 'values' and 'each in list context' cases? Iterating
    > ) over the keys of a hash while not looking at the values associated
    > ) with those keys at all seems to be a rather bizarre idea of a use
    > ) case.
    >
    > Perl doesn't have a 'set' type, and typically a hash is used for that, and
    > that is a perfectly legitimate use case for using only the keys of a
    > hash.


    While you're of course right on that, I regard this is 'rather
    bizarre' use case for a hash.
     
    Rainer Weikusat, May 14, 2013
    #8
  9. "Dave Saville" <> writes:
    > On Tue, 14 May 2013 14:49:46 UTC, Rainer Weikusat
    > <> wrote:


    [...]

    >> use Benchmark qw(cmpthese);


    [...]

    >> cmpthese(-4,


    [...]

    > ===
    > 100 keys
    > ===
    > Rate sorted_keys scalar_each list_each keys
    > values
    > sorted_keys 5715/s -- -28% -36% -46%
    > -85%
    > scalar_each 7956/s 39% -- -10% -25%
    > -80%
    > list_each 8863/s 55% 11% -- -16%
    > -77%
    > keys 10602/s 85% 33% 20% --
    > -73%
    > values 39161/s 585% 392% 342% 269%
    > --
    >
    > Care to explain the numbers please?


    The first column is the 'speed' in terms of 'executions per second',
    the other columns show the 'speed differences' of the code of the
    current row relative to the others, expressed as percentage of the
    speed of the 'column algorithm'. Eg, for 'keys', the sorted_keys
    column says '85%'. This is calculated as

    int((10602 - 5715) / 5715 * 100)
     
    Rainer Weikusat, May 14, 2013
    #9
  10. Ben Morrow <> writes:

    [...]

    > In any case, I thought the purpose here was to benchmark 'for (keys)'
    > and 'while (each)' as methods of iterating over a hash. The difference
    > between them is so small that a single extra assignment, or an
    > assignment that copies a string rather than a number, will completely
    > swamp the difference you are trying to measure, at least until you get
    > to the point where the extra memory allocated by keys causes the process
    > to start swapping.


    Testing just this, the result is (as I already knew from a past
    thread) that keys is (on the computer where I tested this)
    significantly faster than each for hashes with up to 10,000 keys.

    ----------------
    use Benchmark qw(cmpthese);

    sub traversal_bench
    {
    my %h = map { $_, 1; } 0 .. $_[0];


    print("\n===\n$_[0] keys\n===\n");

    cmpthese(-5,
    {
    keys => sub {
    1 for keys(%h);
    },

    each => sub {
    my $k;

    1 while $k = each(%h);
    }});
    }

    traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;
    -----------------

    I'm somewhat uncertain what "it is possible to keep adding unrelated
    code to each example until that dominates the execution time so
    overwhelmingly that this difference can't be measured anymore" is
    supposed to communicate in this context ...
     
    Rainer Weikusat, May 14, 2013
    #10
  11. Στις 14/5/2013 16:37, ο/η Dave Saville έγÏαψε:
    > I am trying to do inner and outer joins amongst other things. The


    if you explain what to you mean with "inner and outer joins" I could
    provide a small piece of code.
     
    George Mpouras, May 15, 2013
    #11
  12. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >> Ben Morrow <> writes:


    [...]

    >> Testing just this, the result is (as I already knew from a past
    >> thread) that keys is (on the computer where I tested this)
    >> significantly faster than each for hashes with up to 10,000 keys.
    >>
    >> ----------------
    >> use Benchmark qw(cmpthese);
    >>
    >> sub traversal_bench
    >> {
    >> my %h = map { $_, 1; } 0 .. $_[0];
    >>
    >>
    >> print("\n===\n$_[0] keys\n===\n");
    >>
    >> cmpthese(-5,
    >> {
    >> keys => sub {
    >> 1 for keys(%h);
    >> },
    >>
    >> each => sub {
    >> my $k;
    >>
    >> 1 while $k = each(%h);

    >
    > Did you actually read what I wrote? That 'each' benchmark does an extra
    > assignment, so of *course* it's slower.


    The 'each' benchmark does not do 'an extra assignment': In order to
    use the key (similar to the earlier one which purposely included an
    operation extracting the value in every subroutine iterating over the
    keys) in a loop, it has to be stored somewhere (aka 'assigned to
    something') which is part of the cost of using each, whereas (term?)
    keys can be used with for like any other 'term' (term?) returning a
    list. Because of this

    1. If iterating over the keys and (usually) examining the values is
    desired, keys is the sensible choice in almost all cases and each
    becomes the sensible choice once the hashes become 'large'.

    2. If iterating over the values is sufficient, values is the way to
    go, except possibly for extremly large hashes (> 1,000,000 keys for
    this example).

    3. If a predictable 'iteration order' is desired/ required, there's
    not much which can be done except sorting the keylist beforehand (But
    when comparing hashes, this can be avoided by iterating over the keys
    of one hash, looking for each key in the other hash and deleteing it
    if it was found. If the loop didn't terminate early because of a
    mismatch, whatever is left in the second after it has was stuff not
    contained in the first).
     
    Rainer Weikusat, May 15, 2013
    #12
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. carl
    Replies:
    5
    Views:
    2,513
    James Kanze
    Nov 25, 2009
  2. Steven Hirsch

    Iterating over a hash of hash of hashes

    Steven Hirsch, Aug 19, 2008, in forum: Ruby
    Replies:
    0
    Views:
    162
    Steven Hirsch
    Aug 19, 2008
  3. Peter Hicks

    Iterating over an Array of Hashes

    Peter Hicks, May 3, 2011, in forum: Ruby
    Replies:
    14
    Views:
    265
    Peter Hicks
    Jun 11, 2011
  4. Jeff Thies

    iterating through hash of hashes

    Jeff Thies, Aug 30, 2003, in forum: Perl Misc
    Replies:
    7
    Views:
    178
    Benjamin Goldberg
    Sep 2, 2003
  5. Tim O'Donovan

    Hash of hashes, of hashes, of arrays of hashes

    Tim O'Donovan, Oct 27, 2005, in forum: Perl Misc
    Replies:
    5
    Views:
    237
Loading...

Share This Page