Get an arbitrary hash key, quickly.

Discussion in 'Perl Misc' started by xhoster, Jan 24, 2008.

  1. xhoster

    xhoster Guest

    I need a work queue, but I don't really care about the order (LIFO, FIFO,
    random) in which things come out of it. Any of those is pretty simple and
    efficient with a Perl array, and any would suffice. But I want the
    queue not to hold duplicate entries. I could use an array as a stack or
    queue, with a parallel hash that is checked to prevent duplicates from being
    entered. But why keep parallel data structures? Just use the hash as the
    queue:

    while (keys %hash) {
        my $to_do = each %hash;
        delete $hash{$to_do};
        ## do stuff with $to_do, which might make new entries in %hash
    }

    Well, except that this gets really slow on large hashes.

    I suspect the each operator does a linear scan of the buckets until it
    finds an occupied one. With a hash that used to be big but now isn't (and
    so has lots of empty buckets), this behaves poorly, essentially making it
    O(N^2) work to empty a large hash.
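
    Something along these lines reproduces the effect (purely illustrative; the
    hash size is arbitrary and timings will vary with perl version and hardware):

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my %hash = map { $_ => 1 } 1 .. 50_000;

    my $start = time;
    while (keys %hash) {            # keys resets the iterator on every pass...
        my $to_do = each %hash;     # ...so each() rescans from the first bucket
        delete $hash{$to_do};
    }
    printf "drained in %.2fs\n", time - $start;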

    If I just use scalar %hash instead of scalar keys %hash, then the slowdown
    doesn't occur, because "each" scans the buckets from where it left off last
    time instead of from the beginning each time. (At first I thought it was
    scalar keys %hash that was being slow and was going to post about *that*,
    but then I realized that keys' resetting of the iterator was causing "each"
    to be slow.) But you shouldn't change a hash at the same time you iterate
    over it, because things might get missed. Presumably, anything missed will
    be found on the next time through the iterator, so this might work without
    the slowdown:


    while (%hash) {   # does not reset iterator
        my $to_do = each %hash;
        next unless defined $to_do;   # not a real hash key, signals end of iterator
        ## do stuff with $to_do, which might make new entries in %hash
    }

    If my speculation on the internals is right, then this would still
    perform poorly if the hash first grows very large, then shrinks to
    be only a few elements, but those last few elements keep triggering
    the addition of just one more element each time. In my case, that
    shouldn't be a problem.

    Any comments on this code? Is there some less quirky way to do this
    efficiently? Is there a simple (as simple as perl's internals can get)
    way to fix "each" so that it doesn't have this degenerate case?

    Thanks,

    Xho

     
    xhoster, Jan 24, 2008
    #1

  2. if "do stuff" might make entries in %hash, how do you guarantee you
    won't be operating on a duplicate entry, since you delete
    $hash{$to_do}?
     
    nolo contendere, Jan 24, 2008
    #2

  3. xhoster

    xhoster Guest

    I want to guarantee that duplicate entries won't be in the queue *at the
    same time*. Whether they can be re-entered at a later point is up to
    the details of "do stuff", which vary from project to project and which
    I am intentionally avoiding.

    Xho

     
    xhoster, Jan 24, 2008
    #3
  4. Uri Guttman

    Uri Guttman Guest

    x> I need a work queue, but I don't really care about the order (LIFO, FIFO,
    x> random) in which things come out of it. Any of those is pretty simple and
    x> efficient with a Perl array, and any would suffice. But I want the
    x> queue not to hold duplicate entries. I could use an array as a stack or
    x> queue, with a parallel hash that is checked to prevent duplicates from being
    x> entered. But why keep parallel data structures? Just use the hash as the
    x> queue:

    why not use the hash/array pair? and how do you generate dup work
    request keys anyway? why not use a ref to the work thing as the key as
    that will be unique. then an array will be fine.

    x> while (keys %hash) {
    x>     my $to_do = each %hash;

    why not do this instead?

    while ( my $to_do = each %hash ) {

        # do work
        delete $hash{$to_do};
        keys %hash;    # reset iterator
    }

    the docs say IIRC that keys in a void context will reset the
    iterator. keys in a while condition is in a scalar/boolean context and
    may just return the number of keys.

    and as another poster says, adding hash elements in a loop may change the
    iterator order and cause unknown results. you need to do a clean reset
    of the hash iterator so the each will work correctly.

    x> If I just use scalar %hash instead of scalar keys %hash, then the slowdown
    x> doesn't occur, because "each" scans the buckets from where it left off last
    x> time instead of from the beginning each time. (At first I thought it was
    x> scalar keys %hash that was being slow and was going to post about *that*,
    x> but then I realized that keys' resetting of the iterator was causing "each"
    x> to be slow.) But you shouldn't change a hash at the same time you iterate
    x> over it, because things might get missed. Presumably, anything missed will
    x> be found on the next time through the iterator, so this might work without
    x> the slowdown:

    and you may not see all the elements of the hash as the iterator could
    skip over new entries.

    the best answer is what i said above, do a proper iterator reset after
    you process the current work (actually just after the each() call should
    work fine too and that would be clearer).
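
    Spelled out, that suggestion (resetting right after the each() call) might
    look like this sketch:

    while ( defined( my $to_do = each %hash ) ) {
        keys %hash;              # void-context keys: reset the iterator right away
        delete $hash{$to_do};
        # do work; keys added here will be picked up on a later pass
    }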

    uri
     
    Uri Guttman, Jan 24, 2008
    #4
  5. xhoster

    xhoster Guest

    Parallel data structures are ugly and easy to screw up because they always
    need to be kept in sync. Also, they use more memory, which in some
    cases may be important. But other than that, there is no reason not to.
    That is what I often do, and it works; I just want something neater.

    I don't understand what you are saying. What stops an array from
    containing many duplicate values, be they references or anything else?

    Effectively the same thing, just re-worded. Works, but has the
    same slowness issue, presumably for the same reason.

    Nah, it resets in all contexts. In a void context, that is all it does.
    (I.e., it doesn't build the return list and then throw it away, or
    something like that.)
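
    A quick check of that claim, for illustration:

    my %h = ( a => 1, b => 2, c => 3 );
    my $first = each %h;      # advance the iterator past one key
    my $count = keys %h;      # keys in scalar context also resets the iterator
    my $again = each %h;      # so this returns the same key as $first
    print "reset\n" if $again eq $first;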

    Xho

     
    xhoster, Jan 24, 2008
    #5
  6. Have you looked at one of the Heap modules, eg Heap::Simple?

    Mark
     
    Mark Clements, Jan 24, 2008
    #6
  7. So if you remove the keys %hash in the above scenario, would the
    condition to 'while' evaluate to true after the iterator reached the
    end if keys were added? will the iterator "know" that it needs to loop
    around again? it's unclear from the perldoc -f each:

        When the hash is entirely read, a null array is returned in list
        context (which when assigned produces a false (0) value), and "undef"
        in scalar context. The next call to "each" after that will start
        iterating again. There is a single iterator for each hash, shared by
        all "each", "keys", and "values" function calls in the program; it can
        be reset by reading all the elements from the hash, or by evaluating
        "keys HASH" or "values HASH". If you add or delete elements of a hash
        while you're iterating over it, you may get entries skipped or
        duplicated, so don't. Exception: It is always safe to delete the item
        most recently returned by "each()", which means that the following
        code will work:

            while (($key, $value) = each %hash) {
                print $key, "\n";
                delete $hash{$key};   # This is safe
            }

    It says undef will be returned if the hash is entirely read. But since
    the hash is not empty, is it considered "entirely read"? Also, it says
    it's always safe to delete the most recently returned item, but it sounds
    like it's UNsafe to add elements while iterating.

    Based on this, Uri's way (which is essentially the same as your
    original slow way, if a tad cleaner) seems to be the correct way.
     
    nolo contendere, Jan 24, 2008
    #7
  8. It is possible I am misunderstanding what you are doing, but I would
    have written:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my %queue;
    $queue{ $_ } = undef for 1 .. 1_000_000;

    while ( my @keys = keys %queue ) {
        for my $task ( @keys ) {
            delete $queue{ $task };
            #
            # Do something
            #
            # Add another task with a small probability
            #
            if ( 0.1 > rand ) {
                $queue{ 1 + int(rand 1_000_000) } = undef;
            }
        }
    }


    C:\Home\asu1\Src\test\hash_queue> timethis hq

    TimeThis : Command Line : hq
    TimeThis : Start Time : Thu Jan 24 14:42:36 2008
    TimeThis : End Time : Thu Jan 24 14:42:46 2008
    TimeThis : Elapsed Time : 00:00:09.312

    Sinan
     
    A. Sinan Unur, Jan 24, 2008
    #8
  9. That was my initial thought as well (well, something like that
    anyway), and I think that was Xho's initial thought. But I think he's
    trying to devise a solution that doesn't require both %queue and
    @keys.
     
    nolo contendere, Jan 24, 2008
    #9
  10. xhoster

    xhoster Guest

    No. Once the iterator reaches the end, it will return undef and the loop
    will exit. So as far as the loop is concerned, there is no "after"
    the iterator reaches the end, because the loop is done.

    Only if you happen to invoke it once more after it has already indicated
    that it was done. But the loop, without the reset, doesn't invoke it
    again.

    Of course. Otherwise, if you did an each without a delete (which
    is the most common way to use each), then it would be an infinite loop.

    If you violate their warning about adding things to the hash while
    iterating, then it may be deemed fully read even though some elements have
    not been iterated over during the current round of iteration. So if you
    violate that warning, then when each returns empty you have to use some
    other way to make sure the hash itself is empty.

    I'm pretty sure it is safe in the sense that it will not segfault or
    anything like that; it is only unsafe with respect to your assumption that
    everything has been seen.

    I don't see that. Both methods always reset the iterator between the time
    additions are made and the time each is called again. (Well, Uri's doesn't
    reset the iterator before the very first each, which presumably comes
    after some insert which was done before the loop was reached, but surely
    that is not a problem.)

    Xho

     
    xhoster, Jan 24, 2008
    #10

  11. True, the flaw in this algorithm is that it always resets the
    iterator. Personally, I like Sinan's implementation.
     
    nolo contendere, Jan 24, 2008
    #11
  12. Obviously, I was.

    ....
    Thank you for the clarification.

    Sinan
     
    A. Sinan Unur, Jan 24, 2008
    #12
  13. Ben Morrow

    Ben Morrow Guest

    Quoth :
    I would have thought the obvious method would be

    my (@Queue, %done);

    while (my $to_do = shift @Queue) {
        exists $done{$to_do} and next;
        $done{$to_do} = 1;

        process_stuff($to_do);
    }

    Obviously, if you wanted to deliberately requeue something you would
    need to know that, and delete the hash entry. I realize you're trying to
    avoid parallel data structures, but %done is pretty much
    self-maintaining, so I don't know if that applies here.
    Presumably you could fix this with

    %hash = %hash;

    if that would be helpful. Of course, you'll copy everything.
    Presumably also things might get found twice, depending on how the
    buckets get reorganized. Since you're always deleting the current
    element before calling each again, that shouldn't be a problem here.
    Presumably switching to something like BerkeleyDB::BTREE would also fix
    this problem, since btrees are designed to be iterated over. I don't
    know how much grief the tie overhead would cause in this case;
    presumably an implementation could be made that used U/u magic directly
    rather than tie magic, and so avoided the method calls.
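
    For what it's worth, a rough sketch of the tied-btree idea, assuming the
    BerkeleyDB module's tie interface (the filename and flags here are only
    illustrative, not tested):

    use BerkeleyDB;

    tie my %hash, 'BerkeleyDB::Btree',
        -Filename => 'workqueue.db',
        -Flags    => DB_CREATE
        or die "cannot tie: $BerkeleyDB::Error";

    # drain in key order, assuming the tie layer copes with deleting the key
    # just returned by each(); rerun the loop if work was added while draining
    while ( defined( my $to_do = each %hash ) ) {
        delete $hash{$to_do};
        # do stuff with $to_do
    }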

    Ben
     
    Ben Morrow, Jan 25, 2008
    #13
  14. You're very presumptuous ;-).
     
    nolo contendere, Jan 25, 2008
    #14
  15. xhoster

    xhoster Guest

    Now we've changed from an extra parallel structure to an extra caching
    structure, which of course works but has the same inelegance. (Grepping
    back through years of code, I see that I actually used this method many
    years ago too, but from the comments it looks like I didn't understand why
    the original was slow, so I regarded it as trial-and-error magic.) It does
    require that "do something" can only add and never delete things from %queue,
    which is fine with my original specification but lacks maximal flexibility.

    There are many ways to work around this degradation in performance, but it
    seems like all of them impose other limitations or inelegances. Oh well,
    if these things were too easy I guess anyone could do it and I'd be out of
    a job :)


    Xho

     
    xhoster, Jan 25, 2008
    #15
  16. xhoster

    xhoster Guest

    The problem here is that, under some uses, @Queue can accumulate a fatal
    number of replicates. While %done prevents them from being processed
    more than once, they are still stored more than once. If each of 10_000
    work-items builds up on the queue 10_000 times, that becomes a storage
    problem. So you need to use %done to guard against adding replicates into
    @Queue in order to defend memory usage.

    But replicates can still build up to unacceptable levels before the first
    instance of one ever gets its %done entry. So then you need another hash,
    %queued_but_not_done, to reject duplicate entries at the earliest stage.
    And then you say "This is a mess, let's just use one hash", then you
    discover that "each" performs poorly on a shrinking hash, and then you post
    a message here. :)
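
    Which is to say, the array version ends up needing a guard hash at enqueue
    time anyway. A sketch of that arrangement (%queued plays the role of
    %queued_but_not_done; the enqueue() helper is illustrative and
    process_stuff() is as in Ben's example):

    my ( @queue, %queued );

    sub enqueue {
        my ($item) = @_;
        return if $queued{$item};   # already waiting: drop the duplicate
        $queued{$item} = 1;
        push @queue, $item;
    }

    while (@queue) {
        my $to_do = shift @queue;
        delete $queued{$to_do};     # it may legitimately be re-queued later
        process_stuff($to_do);      # may call enqueue() to add new work
    }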

    Xho

     
    xhoster, Jan 25, 2008
    #16
  17. On Jan 25, 12:32 pm, wrote:
    ...
    From this I infer that you are engaging in parallel processing. Is
    this correct?
     
    nolo contendere, Jan 25, 2008
    #17
  18. Even with tie slowness, a Tie::IxHash solution would at least be more
    straightforward since insertion order is maintained... I think.

    tie my %hash, 'Tie::IxHash';
    my $idx = 0;

    while ( my @queue = keys %hash ) {
        my $to_do = $queue[$idx++];
        delete $hash{ $to_do };
        # do stuff with $to_do, which might make
        # new entries in %hash
    }
     
    comp.llang.perl.moderated, Jan 25, 2008
    #18
  19. Well, it seemed like a good idea a minute ago.
    But, you might need to tweak $idx for instance...
    Maybe other problems too...

    while ( my @queue = keys %hash ) {
        $idx = $#queue if $idx > $#queue;
        my $to_do = $queue[$idx++];

        delete $hash{ $to_do };
        # do stuff with $to_do, which might make
        # new entries in %hash
    }
     
    comp.llang.perl.moderated, Jan 25, 2008
    #19
  20. Ben Morrow

    Ben Morrow Guest

    I have to admit I've never seen the point of Tie::IxHash. It's just the
    same as the parallel array/hash Xho'd already rejected, but slower...

    Ben
     
    Ben Morrow, Jan 25, 2008
    #20
