Get an arbitrary hash key, quickly.


xhoster

nolo contendere said:
On Jan 25, 12:32 pm, (e-mail address removed) wrote:
...

From this I infer that you are engaging in parallel processing. Is
this correct?

I have run into these constructs in parallel processing, but that was not
the immediate case here. I was computing a single-linkage hierarchical
clustering tree from a distance matrix that is far too large to be handled
by the COTS software I have access to. I've also run into them in certain
traversals of extremely large graphs.

Xho

 

nolo contendere

I have run into these constructs in parallel processing, but that was not
the immediate case here. I was computing a single-linkage hierarchical
clustering tree from a distance matrix that is far too large to be handled
by the COTS software I have access to. I've also run into them in certain
traversals of extremely large graphs.


yeah, well...that was my second guess.
 

comp.llang.perl.moderated

I need a work queue, but I don't really care about the order (LIFO, FIFO,
random) in which things come out of it. A stack or a queue is pretty simple
and efficient with a Perl array, and either would suffice. But I want the
queue not to hold duplicate entries. I could use an array as a stack or
queue, with a parallel hash to be checked to prevent duplicates from being
entered. But why keep parallel data structures? Just use the hash as the
queue:

while (keys %hash) {
    my $to_do = each %hash;
    delete $hash{$to_do};
    ## do stuff with $to_do, which might make new entries in %hash
}

Well, except that this gets really slow on large hashes.

I suspect the each operator does a linear scan of all the buckets until it
finds an occupied one. With a hash that used to be big but now isn't (and
so has lots of empty buckets), this behaves poorly, essentially being N^2
to empty a large hash.
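
A minimal timing sketch of my own to watch this happen (the hash size and
the Time::HiRes timing are arbitrary choices; the effect shows up on perls
where keys resets the iterator and each then rescans from the start):

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

my %hash = map { $_ => 1 } 1 .. 50_000;

my $t0 = time;
while (keys %hash) {          # scalar keys resets the iterator,
    my $to_do = each %hash;   # so each() rescans from bucket 0,
    delete $hash{$to_do};     # and emptying goes roughly O(N**2)
}
printf "emptied in %.2fs\n", time - $t0;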

If I just use scalar %hash instead of scalar keys %hash, then the slowdown
doesn't occur, because "each" scans the buckets from where it left off last
time instead of from the beginning each time. (At first I thought it was
scalar keys %hash itself that was slow, and I was going to post about
*that*, but then I realized that keys' resetting of the iterator was
causing "each" to be slow.) But you shouldn't change a hash at the same
time you iterate over it, because things might get missed. Presumably
anything missed will be found on the next pass through the iterator, so
this might work without the slowdown:

while (%hash) {  # does not reset the iterator
    my $to_do = each %hash;
    next unless defined $to_do;  # not a real hash key;
                                 # signals end of the iterator
    delete $hash{$to_do};
    ## do stuff with $to_do, which might make new entries in %hash
}

If my speculation on the internals is right, then this would still
perform poorly if the hash first grows very large, then shrinks to
be only a few elements, but those last few elements keep triggering
the addition of just one more element each time. In my case, that
shouldn't be a problem.

Any comments on this code? Is there some less quirky way to do this
efficiently? Is there a simple (as simple as perl's internals can get)
way to fix "each" so that it doesn't have this degenerate case?

Do you need 'each', since values don't seem to be retrieved? Some
simple benchmarks suggest just looping over the keys would be quite a
bit faster, if that's the case:

QUEUE: {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        ## make new entries in %hash
    }
    redo QUEUE if keys %hash;
}
 

nolo contendere

Do you need 'each', since values don't seem to be retrieved? Some
simple benchmarks suggest just looping over the keys would be quite a
bit faster, if that's the case:

QUEUE: {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        ## make new entries in %hash
    }
    redo QUEUE if keys %hash;
}

perldoc perlsyn:
...
    If any part of LIST is an array, "foreach" will get very
    confused if you add or remove elements within the loop body,
    for example with "splice". So don't do that.

Xho taught me that one :).
 

comp.llang.perl.moderated

On Jan 28, 5:52 pm, "comp.llang.perl.moderated" <c...@blv-
...
perldoc perlsyn:
...
    If any part of LIST is an array, "foreach" will get very
    confused if you add or remove elements within the loop body,
    for example with "splice". So don't do that.

Yes, that's true. Even deletes can be problematic unless you're iterating
with 'each', I think. But the original code Xho demoed was adding elements
during the loop as well... so I'm not sure how that could ever work.
 

nolo contendere

Yes, that's true. Even deletes can be problematic unless you're iterating
with 'each', I think. But the original code Xho demoed was adding elements
during the loop as well... so I'm not sure how that could ever work.

Those were 'while' loops, which don't suffer from the same issue.
 

comp.llang.perl.moderated

Those were 'while' loops, which don't suffer from the same issue.

No, I believe it's not just a 'while vs. foreach' issue, but rather a
matter of ensuring that the iterator is reset before the next each call.
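
For example, a three-line sketch of my own showing the reset:

my %h = (a => 1, b => 2, c => 3);
each %h;            # advance the internal iterator one step
my $n = keys %h;    # keys, even in scalar context, resets it
my ($k) = each %h;  # so this returns the first key again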
 

Ben Morrow

Quoth nolo contendere:
perldoc perlsyn:
...
    If any part of LIST is an array, "foreach" will get very
                        ^^^^^^^^^^^
It isn't. keys returns a list. What *is* true in this case is that if
any entries you haven't got to yet are deleted from the hash, they will
still be in for's list and will be returned anyway; since that isn't the
case here, it doesn't matter.
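
To make that concrete, a toy example of my own:

my %h = (a => 1, b => 2, c => 3);
for my $k (keys %h) {
    %h = ();        # empty the hash on the first pass
    print "$k\n";   # still prints all three original keys:
}                   # keys built its list before the loop began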

However, I would have thought that as the number of keys gets larger,
this would get slower, since it has to build a complete list of the keys
each time through QUEUE. Let's see...

#!/usr/bin/perl -l

use strict;
use warnings;

use Benchmark qw/cmpthese/;

my %h =
    map { ($_, 1) }
    map { join '', map { chr rand 256 } 1..10 }
    1..100_000;

cmpthese -5, {
    keys  => sub {
        for my $x (keys %h) { 1; }
    },
    leach => sub {
        while (my ($k, $v) = each %h) { 1; }
    },
    seach => sub {
        while (my $k = each %h) { 1; }
    },
};

__END__

         Rate leach  keys seach
leach  3.76/s    --  -28%  -30%
keys   5.21/s   38%    --   -3%
seach  5.37/s   43%    3%    --

Nope, I'm wrong even on large hashes, but scalar each is faster again,
though even that seems to depend on your malloc: FreeBSD's perl is built
with perl's malloc by default, and a version built with the native
malloc has keys > seach > leach. Hmmm. :)

Ben
 

nolo contendere

It isn't. keys returns a list. What *is* true in this case is that if
any entries you haven't got to yet are deleted from the hash, they will

hmm, I thought it was unsafe to delete any entries other than the one
just accessed.
still be in for's list and will be returned anyway; since that isn't the
case here, it doesn't matter.

right. my understanding is that for (and foreach) take a snapshot of
the list, and iterate through that snapshot.
However, I would have thought that as the number of keys gets larger,
this would get slower, since it has to build a complete list of the keys
each time through QUEUE. Let's see...

wouldn't every subsequent list be smaller for the most part? unless
"do stuff" generated more keys than were deleted on average.
 

Ben Morrow

Quoth nolo contendere:
hmm, I thought it was unsafe to delete any entries other than the one
just accessed.

No, this is a misconception (though one that is somewhat supported by
the docs). See http://blog.plover.com/prog/perl/undefined.html#3 .
Deleting the current entry is actually the potentially-unsafe case, but
perl special-cases it for you so it works correctly (the entry has to be
deleted lazily, after the iterator has moved on to the next entry).
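
A tiny sketch of my own showing the safe case:

my %h = map { $_ => 1 } 1 .. 5;
while (my ($k, $v) = each %h) {
    delete $h{$k};    # deleting the *current* entry is safe:
                      # perl frees it lazily, after the iterator
                      # has moved past it
}
print scalar(keys %h), "\n";    # prints 0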

In any case, none of this applies to

for (keys %h) { ... }

keys is in list context, so the complete list of keys is generated
before the loop even starts iterating.
right. my understanding is that for (and foreach) take a snapshot of
the list, and iterate through that snapshot.

Not quite; keys returns a list which is a snapshot of the current set of
keys. What you do with that list afterwards is irrelevant.
wouldn't every subsequent list be smaller for the most part? unless
"do stuff" generated more keys than were deleted on average.

No, that wasn't what I meant. I meant 'as the initial set of keys
becomes large, perhaps building a complete list of that set becomes
inefficient'. It seems it doesn't.

Ben
 

xhoster

nolo contendere said:
perldoc perlsyn:
...
    If any part of LIST is an array, "foreach" will get very
    confused if you add or remove elements within the loop body,
    for example with "splice". So don't do that.

But in this case, no part of the LIST is an array. keys %hash is not
an array.


Xho

 

xhoster

comp.llang.perl.moderated said:
Do you need 'each', since values don't seem to be retrieved? Some
simple benchmarks suggest just looping over the keys would be quite a
bit faster, if that's the case:

QUEUE: {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        ## make new entries in %hash
    }
    redo QUEUE if keys %hash;
}

I like it. It does have a 2nd caching structure, but that structure is
implicit and entirely managed by perl as part of the foreach loop. I might
change the outer loop syntax somewhat, as I prefer to avoid labels
whenever possible.

while (keys %hash) {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        ## make new entries in %hash
    }
}
(Causes one extra scalar "keys %hash", but that shouldn't be a problem)

The "do stuff" can only add new entries, not remove entries (other than
$to_do itself) without causing problems. Deleting wasn't part of the
original specification, and that limitation is not a problem in this
particular case.

Xho

 

xhoster

nolo contendere said:
hmm, I thought it was unsafe to delete any entries other than the one
just accessed.

That is with "each %hash", not with "keys %hash".
right. my understanding is that for (and foreach) take a snapshot of
the list, and iterate through that snapshot.

Except when the LIST of the foreach has an array in it, then it doesn't
take a snapshot--it does something else (presumably as an optimization),
the details of which are not documented but the effects of which are
(weirdness when changing the array).

Xho

 

xhoster

Ben Morrow said:
It isn't. keys returns a list. What *is* true in this case is that if
any entries you haven't got to yet are deleted from the hash, they will
still be in for's list and will be returned anyway; since that isn't the
case here, it doesn't matter.

However, I would have thought that as the number of keys gets larger,
this would get slower, since it has to build a complete list of the keys
each time through QUEUE.

I'm sure you already know this, but just to be clear for others: I think
the potential slowness you're hypothetically referring to is just a
constant multiplier of slowness (and a rather small one at that), right?
The slowness that I originally posted about was of a different
sort--pathological slowness that seems to be around O(N**2) when it really
"should" be O(N).

Once I take care of the O(N**2)->O(N), I'm not so interested in
microoptimizing the rest, except as a theoretical curiosity.

Xho

 

Ben Morrow

Quoth (e-mail address removed):
while (keys %hash) {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        ## make new entries in %hash
    }
}
(Causes one extra scalar "keys %hash", but that shouldn't be a problem)

The "do stuff" can only add new entries, not remove entries (other than
$to_do itself) without causing problems. Deleting wasn't part of the
original specification, and that limitation is not a problem in this
particular case.

Fixing that is just a matter of

    delete $hash{$to_do} or next;

isn't it? Of course, if you delete lots of entries you're going to waste
time spinning round that loop.
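
Putting that guard into Xho's loop (a sketch of my own; process() is a
hypothetical stand-in for the "do stuff", and the guard assumes the queued
values are true, like the 1s used elsewhere in this thread):

while (keys %hash) {
    foreach my $to_do (keys %hash) {
        # If "do stuff" already removed this entry, delete()
        # returns nothing (false), and we skip the stale key.
        delete $hash{$to_do} or next;
        process($to_do);    # may add or delete entries in %hash
    }
}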

Ben
 

comp.llang.perl.moderated

I like it. It does have a 2nd caching structure, but that structure is
implicit and entirely managed by perl as part of the foreach loop. I might
change the outer loop syntax somewhat, as I prefer to avoid labels
whenever possible.

while (keys %hash) {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        ## make new entries in %hash
    }
}

(Causes one extra scalar "keys %hash", but that shouldn't be a problem)

The "do stuff" can only add new entries, not remove entries (other than
$to_do itself) without causing problems. Deleting wasn't part of the
original specification, and that limitation is not a problem in this
particular case.

Memory might be the only issue then. And it's
really nice to have Ben explain how it really
all works :)
 

nolo contendere

No, this is a misconception (though one that is somewhat supported by
the docs). See http://blog.plover.com/prog/perl/undefined.html#3 .
Deleting the current entry is actually the potentially-unsafe case, but
perl special-cases it for you so it works correctly (the entry has to be
deleted lazily, after the iterator has moved on to the next entry).

cool!

In any case, none of this applies to

    for (keys %h) { ... }

keys is in list context, so the complete list of keys is generated
before the loop even starts iterating.

Not quite; keys returns a list which is a snapshot of the current set of
keys. What you do with that list afterwards is irrelevant.

right, that was my understanding.
No, that wasn't what I meant. I meant 'as the initial set of keys
becomes large, perhaps building a complete list of that set becomes
inefficient'. It seems it doesn't.

oh.
 

nolo contendere

That is with "each %hash", not with "keys %hash".





Except when the LIST of the foreach has an array in it, then it doesn't
take a snapshot--it does something else (presumably as an optimization),
the details of which are not documented but the effects of which are
(weirdness when changing the array).

hmm, could you, or anyone with >= your knowledge about this subject
expound?
 

xhoster

nolo contendere said:
hmm, could you, or anyone with >= your knowledge about this subject
expound?

I think the link to plover that Ben Morrow posted already did that.
It isn't actually weird; I just treat it as weird because it isn't
documented and so could in theory do something different in other versions.

Well, OK, it is weird, because the docs say "If any part of LIST is an
array", but it seems that, to get the behavior shown in the plover link,
the entirety of the LIST has to be an array. If you do something like:

foreach ("foo",@x) {

Then it appears, at least on my version, that it does in fact take a
snapshot of @x (more properly, a snapshot of the addresses of the elements
in @x), and any splices made into it during the loop have no effect on the
iteration.
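
For contrast, when the LIST is exactly one array, the splice does show
through; a little demo of my own (behavior as observed on the perls I've
tried):

my @x = (1 .. 5);
foreach my $e (@x) {
    print "$e\n";
    splice @x, 0, 1 if $e == 1;   # shift the array under the loop
}
# Prints 1, 3, 4, 5: element 2 is skipped, because the loop's
# internal index doesn't move when the array shifts beneath it.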

Xho

 
