Get an arbitrary hash key, quickly.

xhoster

I need a work queue, but I don't really care about the order (LIFO, FIFO,
random) in which things come out of it. Either is pretty simple and
efficient with a Perl array, and either would suffice. But I want the
queue to not hold duplicate entries. I could use an array as a stack or
queue, with a parallel hash to be checked to prevent duplicates from being
entered. But why keep parallel data structures? Just use the hash as the
queue:

while (keys %hash) {
    my $to_do=each %hash;
    delete $hash{$to_do};
    ## do stuff with $to_do, which might make new entries in %hash
}

Well, except that this gets really slow on large hashes.

I suspect the each operator does a linear scan of all the buckets until it
finds an occupied one. With a hash that used to be big but now isn't (and
so has lots of empty buckets), this behaves poorly, essentially being N^2
to empty a large hash.
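
A quick way to test that suspicion is to time the drain loop at a few
sizes (untested sketch; the sizes are arbitrary). If the bucket-scanning
guess is right, the total time should grow much faster than linearly:

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

for my $size (10_000, 50_000, 100_000) {
    my %hash = map { $_ => 1 } 1 .. $size;
    my $start = time;
    while (keys %hash) {          # resets the iterator each pass
        my $to_do = each %hash;   # so each() rescans from bucket 0
        delete $hash{$to_do};
    }
    printf "%7d keys drained in %.2f s\n", $size, time - $start;
}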

If I just use scalar %hash instead of scalar keys %hash, then the slowdown
doesn't occur because "each" scans the buckets from where it left off last
time, instead of from the beginning each time. (At first I thought it was
scalar keys %hash that was being slow and was going to post about *that*,
but then I realized that keys' resetting of the iterator was causing "each"
to be slow). But you shouldn't change a hash at the same time you iterate
over it because things might get missed. Presumably, anything missed will
be found on the next time through the iterator, so this might work without
the slowdown:


while (%hash) {                     # does not reset the iterator
    my $to_do=each %hash;
    next unless defined $to_do;     # undef is not a real hash key;
                                    # it signals the end of the iterator
    delete $hash{$to_do};
    ## do stuff with $to_do, which might make new entries in %hash
}

If my speculation on the internals is right, then this would still
perform poorly if the hash first grows very large, then shrinks to
be only a few elements, but those last few elements keep triggering
the addition of just one more element each time. In my case, that
shouldn't be a problem.

Any comments on this code? Is there some less quirky way to do this
efficiently? Is there a simple (as simple as perl's internals can get)
way to fix "each" so that it doesn't have this degenerate case?

Thanks,

Xho

nolo contendere

> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it.  Either is pretty simple and
> efficient with a Perl array, and either would suffice.  But I want the
> queue to not hold duplicate entries.  I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered.  But why keep parallel data structures?  Just use the hash as the
> queue:
>
> while (keys %hash) {
>     my $to_do=each %hash;
>     delete $hash{$to_do};
>     ## do stuff with $to_do, which might make new entries in %hash
> }

if "do stuff" might make entries in %hash, how do you guarantee you
won't be operating on a duplicate entry, since you delete
$hash{$to_do}?
 

xhoster

nolo contendere said:
if "do stuff" might make entries in %hash, how do you guarantee you
won't be operating on a duplicate entry, since you delete
$hash{$to_do}?

I want to guarantee that duplicate entries won't be in the queue *at the
same time*. Whether they can be re-entered at a later point is up to
the details of "do stuff", which vary from project to project and which
I am intentionally avoiding.

Xho

Uri Guttman

x> I need a work queue, but I don't really care about the order (LIFO, FIFO,
x> random) in which things come out of it. Either is pretty simple and
x> efficient with a Perl array, and either would suffice. But I want the
x> queue to not hold duplicate entries. I could use an array as a stack or
x> queue, with a parallel hash to be checked to prevent duplicates from being
x> entered. But why keep parallel data structures? Just use the hash as the
x> queue:

why not use the hash/array pair? and how do you generate dup work
request keys anyway? why not use a ref to the work thing as the key as
that will be unique. then an array will be fine.

x> while (keys %hash) {
x>     my $to_do=each %hash;

why not do this instead?

while( my $to_do = each %hash ) {

    # do work
    delete $hash{$to_do};
    keys %hash ;    # reset iterator
}

the docs say IIRC that keys in a void context will reset the
iterator. keys in a while condition is in a scalar/boolean context and
may just return the number of keys.

and as another poster says, adding hash elements in a loop may change the
iterator order and cause unknown results. you need to do a clean reset
of the hash iterator so the each will work correctly.

x> If I just use scalar %hash instead of scalar keys %hash, then the slowdown
x> doesn't occur because "each" scans the buckets from where it left off last
x> time, instead of from the beginning each time. (At first I thought it was
x> scalar keys %hash that was being slow and was going to post about *that*,
x> but then I realized that keys' resetting of the iterator was causing "each"
x> to be slow). But you shouldn't change a hash at the same time you iterate
x> over it because things might get missed. Presumably, anything missed will
x> be found on the next time through the iterator, so this might work without
x> the slowdown:

and you may not see all the elements of the hash as the iterator could
skip over new entries.

the best answer is what i said above, do a proper iterator reset after
you process the current work (actually just after the each() call should
work fine too and that would be clearer).
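
that variant might look something like this (untested sketch of the
placement i am describing):

while( defined( my $to_do = each %hash ) ) {
    keys %hash ;                # reset the iterator right after the each
    delete $hash{$to_do};
    # do work, which may add new entries to %hash
}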

uri
 

xhoster

Uri Guttman said:
> x> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> x> random) in which things come out of it. Either is pretty simple and
> x> efficient with a Perl array, and either would suffice. But I want the
> x> queue to not hold duplicate entries. I could use an array as a stack or
> x> queue, with a parallel hash to be checked to prevent duplicates from being
> x> entered. But why keep parallel data structures? Just use the hash as the
> x> queue:

> why not use the hash/array pair?

Parallel data structures are ugly and easy to screw up because they always
need to be kept in sync. Also, they need to use more memory, which in some
cases may be important. But other than that, there is no reason not to.
That is what I often do, and it works, I just want something neater.

> and how do you generate dup work
> request keys anyway? why not use a ref to the work thing as the key as
> that will be unique. then an array will be fine.

I don't understand what you are saying. What stops an array from
containing many duplicate values, be they references or anything else?

> x> while (keys %hash) {
> x>     my $to_do=each %hash;

> why not do this instead?
>
> while( my $to_do = each %hash ) {
>
>     # do work
>     delete $hash{$to_do};
>     keys %hash ;    # reset iterator
> }

Effectively the same thing, just re-worded. Works, but has the
same slowness issue, presumably for the same reason.

> the docs say IIRC that keys in a void context will reset the
> iterator. keys in a while condition is in a scalar/boolean context and
> may just return the number of keys.

Nah, it resets in all contexts. In a void context, that is all it does.
(i.e. it doesn't build the return list and then throw it away, or
something like that.)

Xho

Mark Clements

> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it. Either is pretty simple and
> efficient with a Perl array, and either would suffice. But I want the
> queue to not hold duplicate entries. I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered. But why keep parallel data structures? Just use the hash as the
> queue:
>
> while (keys %hash) {
>     my $to_do=each %hash;
>     delete $hash{$to_do};
>     ## do stuff with $to_do, which might make new entries in %hash
> }
>
> Well, except that this gets really slow on large hashes.

Have you looked at one of the Heap modules, e.g. Heap::Simple?

Mark
 

nolo contendere

> Effectively the same thing, just re-worded.  Works, but has the
> same slowness issue, presumably for the same reason.

So if you remove the keys %hash in the above scenario, would the
condition to 'while' evaluate to true after the iterator reached the
end if keys were added? will the iterator "know" that it needs to loop
around again? it's unclear from the perldoc -f each:

When the hash is entirely read, a null array is
returned in list context (which when assigned
produces a false (0) value), and "undef" in scalar
context. The next call to "each" after that will
start iterating again. There is a single iterator
for each hash, shared by all "each", "keys", and
"values" function calls in the program; it can be
reset by reading all the elements from the hash, or
by evaluating "keys HASH" or "values HASH". If you
add or delete elements of a hash while you're
iterating over it, you may get entries skipped or
duplicated, so don't. Exception: It is always safe
to delete the item most recently returned by
"each()", which means that the following code will
work:

    while (($key, $value) = each %hash) {
        print $key, "\n";
        delete $hash{$key};    # This is safe
    }

It says undef will be returned if the hash is entirely read. But since
the hash is not empty, is it considered "entirely read"? Also, it says
it's always safe to delete the most recently returned item, but sounds
like it's UNsafe to add elements while iterating.

based on this, Uri's way (which is essentially the same as your
original slow way, if a tad bit cleaner), seems the correct way.
 

A. Sinan Unur

> I need a work queue, but I don't really care about the order (LIFO,
> FIFO, random) in which things come out of it. Either is pretty simple
> and efficient with a Perl array, and either would suffice. But I want
> the queue to not hold duplicate entries. I could use an array as a
> stack or queue, with a parallel hash to be checked to prevent
> duplicates from being entered. But why keep parallel data structures?
> Just use the hash as the queue:
>
> while (keys %hash) {
>     my $to_do=each %hash;
>     delete $hash{$to_do};
>     ## do stuff with $to_do, which might make new entries in %hash
> }
>
> Well, except that this gets really slow on large hashes.

It is possible I am misunderstanding what you are doing, but I would
have written:

#!/usr/bin/perl

use strict;
use warnings;

my %queue;
$queue{ $_ } = undef for 1 .. 1_000_000;

while ( my @keys = keys %queue ) {
    for my $task ( @keys ) {
        delete $queue{ $task };
        #
        # Do something
        #
        # Add another task with a small probability
        #
        if ( 0.1 > rand ) {
            $queue{ 1 + int(rand 1_000_000) } = undef;
        }
    }
}


C:\Home\asu1\Src\test\hash_queue> timethis hq

TimeThis :  Command Line :  hq
TimeThis :    Start Time :  Thu Jan 24 14:42:36 2008
TimeThis :      End Time :  Thu Jan 24 14:42:46 2008
TimeThis :  Elapsed Time :  00:00:09.312

Sinan
 

nolo contendere

A. Sinan Unur wrote:

> It is possible I am misunderstanding what you are doing, but I would
> have written:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my %queue;
> $queue{ $_ } = undef for 1 .. 1_000_000;
>
> while ( my @keys = keys %queue ) {
>     for my $task ( @keys ) {
>         delete $queue{ $task };
>         #
>         # Do something
>         #
>         # Add another task with a small probability
>         #
>         if ( 0.1 > rand ) {
>             $queue{ 1 + int(rand 1_000_000) } = undef;
>         }
>     }
> }
>
> C:\Home\asu1\Src\test\hash_queue> timethis hq
>
> TimeThis :  Command Line :  hq
> TimeThis :    Start Time :  Thu Jan 24 14:42:36 2008
> TimeThis :      End Time :  Thu Jan 24 14:42:46 2008
> TimeThis :  Elapsed Time :  00:00:09.312

That was my initial thought as well (well, something like that
anyway), and I think that was Xho's initial thought. But I think he's
trying to devise a solution that doesn't require both %queue and
@keys.
 

xhoster

nolo contendere said:
> So if you remove the keys %hash in the above scenario, would the
> condition to 'while' evaluate to true after the iterator reached the
> end if keys were added?

No.  Once the iterator reaches the end, it will return undef and the loop
will exit. So as far as the loop is concerned, there is no "after"
the iterator reaches the end, because the loop is done.

> will the iterator "know" that it needs to loop
> around again?

Only if you happen to invoke it once more after it already indicated
that it was done.  But the loop, without the reset, doesn't invoke it
again.

> it's unclear from the perldoc -f each:

> When the hash is entirely read, a null array is
> returned in list context (which when assigned
> produces a false (0) value), and "undef" in scalar
> context. The next call to "each" after that will
> start iterating again. There is a single iterator
> for each hash, shared by all "each", "keys", and
> "values" function calls in the program; it can be
> reset by reading all the elements from the hash, or
> by evaluating "keys HASH" or "values HASH". If you
> add or delete elements of a hash while you're
> iterating over it, you may get entries skipped or
> duplicated, so don't. Exception: It is always safe
> to delete the item most recently returned by
> "each()", which means that the following code will
> work:
>
>     while (($key, $value) = each %hash) {
>         print $key, "\n";
>         delete $hash{$key};    # This is safe
>     }

> It says undef will be returned if the hash is entirely read. But since
> the hash is not empty, is it considered "entirely read"?

Of course.  Otherwise, if you did an each without a delete (which
is the most common way to use each), then it would be an infinite loop.

> Also, it says
> it's always safe to delete the most recently returned item, but sounds
> like it's UNsafe to add elements while iterating.

If you violate their warning about adding things to the hash while
iterating, then it may be deemed fully read even though some elements have
not been iterated over during this current round of iteration. So if you
violate that warning, then when each returns empty you have to use some
other way to make sure the hash itself is empty.

I'm pretty sure it is safe in the sense that it will not segfault or
anything like that, it is only unsafe towards your assumptions that
everything has been seen.
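
A belt-and-braces version of the loop (untested sketch, not something
anyone posted) just wraps the each() loop in that re-check:

while (%hash) {                     # true while anything is left; no reset
    while (defined(my $to_do = each %hash)) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might make new entries in %hash
    }
    # each() claimed to be done; the outer while re-checks whether %hash
    # is really empty and, if not, starts another round of iteration.
}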

> based on this, Uri's way (which is essentially the same as your
> original slow way, if a tad bit cleaner), seems the correct way.

I don't see that. Both methods always reset the iterator between the time
additions are made and the time each is called again. (Well, Uri's doesn't
reset the iterator before the very first each, which presumably comes
after some insert which was done before the loop was reached, but surely
that is not a problem.)

Xho

nolo contendere

> I don't see that.  Both methods always reset the iterator between the time
> additions are made and the time each is called again.  (Well, Uri's doesn't
> reset the iterator before the very first each, which presumably comes
> after some insert which was done before the loop was reached, but surely
> that is not a problem.)


True, the flaw in this algorithm is that it always resets the
iterator. Personally, I like Sinan's implementation.
 

A. Sinan Unur


>> It is possible I am misunderstanding what you are doing, but I would
>> have written: [...]

Obviously, I was.

> That was my initial thought as well (well, something like that
> anyway), and I think that was Xho's initial thought. But I think he's
> trying to devise a solution that doesn't require both %queue and
> @keys.

Thank you for the clarification.

Sinan
 

Ben Morrow

Quoth (e-mail address removed):
> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it. Either is pretty simple and
> efficient with a Perl array, and either would suffice. But I want the
> queue to not hold duplicate entries. I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered. But why keep parallel data structures? Just use the hash as the
> queue:
>
> while (keys %hash) {
>     my $to_do=each %hash;
>     delete $hash{$to_do};
>     ## do stuff with $to_do, which might make new entries in %hash
> }
>
> Well, except that this gets really slow on large hashes.

I would have thought the obvious method would be

my (@Queue, %done);

while (my $to_do = shift @Queue) {
    exists $done{$to_do} and next;
    $done{$to_do} = 1;

    process_stuff($to_do);
}

obviously, if you wanted to deliberately requeue something you would
need to know that, and delete the hash entry. I realize you're trying to
avoid parallel data structures, but %done is pretty much
self-maintaining, so I don't know if that applies here.

> I suspect the each operator does a linear scan of all the buckets until it
> finds an occupied one. With a hash that used to be big but now isn't (and
> so has lots of empty buckets), this behaves poorly, essentially being N^2
> to empty a large hash.

Presumably you could fix this with

%hash = %hash;

if that would be helpful. Of course, you'll copy everything.
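
Applied to the loop above, that might look like the following untested
sketch; whether the list-assignment really releases the oversized bucket
array depends on the perl internals (hence the "presumably"), and the 4x
threshold is arbitrary:

my $peak = keys %hash;
while (keys %hash) {
    my $to_do = each %hash;
    delete $hash{$to_do};
    ## do stuff with $to_do, which might make new entries in %hash

    my $now = keys %hash;
    $peak   = $now if $now > $peak;
    if ($now && $peak > 4 * $now) {
        %hash = %hash;              # copies everything, as noted
        $peak = $now;
    }
}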

> If I just use scalar %hash instead of scalar keys %hash, then the slowdown
> doesn't occur because "each" scans the buckets from where it left off last
> time, instead of from the beginning each time. (At first I thought it was
> scalar keys %hash that was being slow and was going to post about *that*,
> but then I realized that keys' resetting of the iterator was causing "each"
> to be slow). But you shouldn't change a hash at the same time you iterate
> over it because things might get missed. Presumably, anything missed will
> be found on the next time through the iterator, so this might work without
> the slowdown:

Presumably also things might get found twice, depending on how the
buckets get reorganized. Since you're always deleting the current
element before calling each again that shouldn't be a problem here.

> Any comments on this code? Is there some less quirky way to do this
> efficiently? Is there a simple (as simple as perl's internals can get)
> way to fix "each" so that it doesn't have this degenerate case?

Presumably switching to something like BerkeleyDB::BTREE would also fix
this problem, since btrees are designed to be iterated over. I don't
know how much grief the tie overhead would cause in this case;
presumably an implementation could be made that used U/u magic directly
rather than tie magic, and so avoided the method calls.
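
A rough, untested sketch of that idea using the core DB_File module (a
cousin of the BerkeleyDB module named above); passing undef as the
filename asks for an in-memory database, and seq() with R_FIRST pulls
out whatever key currently sorts first:

use strict;
use warnings;
use Fcntl;
use DB_File;

my $db = tie my %queue, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie in-memory btree: $!";

$queue{$_} = 1 for 1 .. 10;

my ($to_do, $val);
while ($db->seq($to_do, $val, R_FIRST) == 0) {   # grab the current first key
    delete $queue{$to_do};
    ## do stuff with $to_do, which might make new entries in %queue
}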

Ben
 

nolo contendere

> Presumably you could fix this with ...
> Presumably also things might get found twice, depending on how the ...
>
> Presumably switching to something like BerkeleyDB::BTREE would also fix ...
> presumably an implementation could be made that used U/u magic directly
> rather than tie magic, and so avoided the method calls.

You're very presumptuous ;-).
 

xhoster

A. Sinan Unur said:
> It is possible I am misunderstanding what you are doing, but I would
> have written:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my %queue;
> $queue{ $_ } = undef for 1 .. 1_000_000;
>
> while ( my @keys = keys %queue ) {
>     for my $task ( @keys ) {
>         delete $queue{ $task };
>         #
>         # Do something
>         #
>         # Add another task with a small probability
>         #
>         if ( 0.1 > rand ) {
>             $queue{ 1 + int(rand 1_000_000) } = undef;
>         }
>     }
> }

Now we've changed from an extra parallel structure to an extra caching
structure, which of course works but has the same inelegance. (Grepping
back through years of code, I see that I've actually used this method many
years ago, too, but from the comments it looks like I didn't understand why
the original was slow, so I regarded it as trial and error magic.) It does
require that "do something" can only add and not delete things from %queue,
which is fine with my original specification but lacks maximal flexibility.
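
For what it's worth, if "do something" were also allowed to delete
entries from %queue, the snapshot in @keys would go stale; re-checking
each task in the inner loop above (sketch only, not anything posted)
would be one way to relax that restriction:

for my $task ( @keys ) {
    next unless exists $queue{ $task };   # skip tasks deleted since the snapshot
    delete $queue{ $task };
    # Do something ...
}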

There are many ways to work around this degradation in performance, but it
seems like all of them impose other limitations or inelegances.  Oh well,
if these things were too easy I guess anyone could do it and I'd be out of
a job :)


Xho

xhoster

Ben Morrow said:
> Quoth (e-mail address removed):

> I would have thought the obvious method would be
>
> my (@Queue, %done);
>
> while (my $to_do = shift @Queue) {
>     exists $done{$to_do} and next;
>     $done{$to_do} = 1;
>
>     process_stuff($to_do);
> }

The problem here is that, under some uses, @Queue can accumulate a fatal
number of replicates.  While %done prevents them from being processed
more than once, they are still stored more than once.  If each of 10_000
work-items builds up on the queue 10_000 times, that becomes a storage
problem. So you need to use %done to guard against adding replicates into
@Queue to defend memory usage.

But replicates can still build up to unacceptable levels before the first
instance of one ever gets $done{}. So then you need another hash,
%queued_but_not_done, to reject duplicate entries at the earliest stage.
And then you say "This is a mess, let's just use one hash", then you
discover that "each" performs poorly on a shrinking hash, and then you post
a message here. :)
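
For concreteness, the "reject at the earliest stage" version might look
something like this sketch (the names enqueue() and %pending are mine,
not anything posted):

my (@queue, %pending);

sub enqueue {
    my ($item) = @_;
    return if $pending{$item}++;    # already waiting in @queue
    push @queue, $item;
}

while (defined(my $to_do = shift @queue)) {
    delete $pending{$to_do};        # may be re-queued later if "do stuff" wants
    ## do stuff with $to_do, possibly calling enqueue() again
}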

Xho

nolo contendere

On Jan 25, 12:32 pm, (e-mail address removed) wrote:
...
> But replicates can still build up to unacceptable levels before the first
> instance of one ever gets $done{}.  So then you need another hash,
> %queued_but_not_done, to reject duplicate entries at the earliest stage.

From this I infer that you are engaging in parallel processing. Is
this correct?
 

comp.llang.perl.moderated

> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it. Either is pretty simple and
> efficient with a Perl array, and either would suffice. But I want the
> queue to not hold duplicate entries. I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered. But why keep parallel data structures? Just use the hash as the
> queue:
>
> while (keys %hash) {
>     my $to_do=each %hash;
>     delete $hash{$to_do};
>     ## do stuff with $to_do, which might make new entries in %hash
> }
>
> Well, except that this gets really slow on large hashes.
>
> I suspect the each operator does a linear scan of all the buckets until it
> finds an occupied one. With a hash that used to be big but now isn't (and
> so has lots of empty buckets), this behaves poorly, essentially being N^2
> to empty a large hash.
>
> If I just use scalar %hash instead of scalar keys %hash, then the slowdown
> doesn't occur because "each" scans the buckets from where it left off last
> time, instead of from the beginning each time. (At first I thought it was
> scalar keys %hash that was being slow and was going to post about *that*,
> but then I realized that keys' resetting of the iterator was causing "each"
> to be slow). But you shouldn't change a hash at the same time you iterate
> over it because things might get missed. Presumably, anything missed will
> be found on the next time through the iterator, so this might work without
> the slowdown:
>
> while (%hash) {                     # does not reset the iterator
>     my $to_do=each %hash;
>     next unless defined $to_do;     # undef is not a real hash key;
>                                     # it signals the end of the iterator
>     delete $hash{$to_do};
>     ## do stuff with $to_do, which might make new entries in %hash
> }
>
> If my speculation on the internals is right, then this would still
> perform poorly if the hash first grows very large, then shrinks to
> be only a few elements, but those last few elements keep triggering
> the addition of just one more element each time. In my case, that
> shouldn't be a problem.
>
> Any comments on this code? Is there some less quirky way to do this
> efficiently? Is there a simple (as simple as perl's internals can get)
> way to fix "each" so that it doesn't have this degenerate case?

Even with tie slowness, a Tie::IxHash solution would at least be more
straightforward since insertion order is maintained... I think.

use Tie::IxHash;

tie my %hash, 'Tie::IxHash';
my $idx = 0;

while ( my @queue = keys %hash ) {
    my $to_do = $queue[$idx++];
    delete $hash{ $to_do };
    # do stuff with $to_do, which might make
    # new entries in %hash
}
 

comp.llang.perl.moderated

On Jan 24, 9:25 am, (e-mail address removed) wrote:


...

> Even with tie slowness, a Tie::IxHash solution would at least be more
> straightforward since insertion order is maintained... I think.
>
> tie my %hash, 'Tie::IxHash';
> my $idx = 0;
>
> while ( my @queue = keys %hash ) {
>     my $to_do = $queue[$idx++];
>     delete $hash{ $to_do };
>     # do stuff with $to_do, which might make
>     # new entries in %hash
> }

Well, it seemed like a good idea a minute ago.
But, you might need to tweak $idx for instance...
Maybe other problems too...

while ( my @queue = keys %hash ) {
    $idx = $#queue if $idx > $#queue;
    my $to_do = $queue[$idx++];

    delete $hash{ $to_do };
    # do stuff with $to_do, which might make
    # new entries in %hash
}
 

Ben Morrow

Quoth "comp.llang.perl.moderated said:
> On Jan 24, 9:25 am, (e-mail address removed) wrote:
>
> Even with tie slowness, a Tie::IxHash solution would at least be more
> straightforward since insertion order is maintained... I think.

I have to admit I've never seen the point of Tie::IxHash. It's just the
same as the parallel array/hash Xho'd already rejected, but slower...

Ben
 
