Perfect hashing for Py

Discussion in 'Python' started by bearophileHUGS, Jul 11, 2008.

  1. Following links from this thread:
    http://groups.google.com/group/comp.lang.python/browse_thread/thread/179e1a45485ab36a#

    I have found this perfect hash (minimal too) implementation:
    http://burtleburtle.net/bob/hash/perfect.html

    I have already translated part of it to D, and it seems to work well
    enough. As discussed at PyConDue, I think this may be used in
    frozenset (and frozendict) to build a (minimal, too?) perfect hash
    on the fly, to allow (hopefully) faster retrieval of items that
    don't change.
    That code is in C and I think it's in the public domain, so if
    experiments show it gives enough of a speedup, it may be added to
    CPython 2.6/3.

    Bye,
    bearophile
     
    bearophileHUGS, Jul 11, 2008
    #1

  2. It would be interesting to see whether such an algorithm could
    actually provide a significant performance improvement for the set
    sizes that I suspect are most often used in practice. The chance of
    a hash collision with a good 32-bit general-purpose hash is fairly
    low even for a set of 1,000,000 unique elements, which seems to me
    to be a pretty large memory-based set. Compare that with the cost
    of determining a perfect hash (O(n**2)?).
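
    As a back-of-the-envelope check (standard birthday-problem
    arithmetic, not from the thread): the per-key odds really are tiny,
    although the birthday effect still makes a few colliding pairs
    likely across a million keys:

    from math import exp

    n, space = 10 ** 6, 2 ** 32

    # chance that one more key collides with any of the n existing ones
    per_key = float(n) / space                  # ~0.023%
    # expected number of colliding pairs among all n keys
    expected_pairs = n * (n - 1) / 2.0 / space  # ~116 pairs
    # probability that at least one pair collides somewhere in the set
    p_any = 1.0 - exp(-expected_pairs)          # ~1.0

    print(per_key, expected_pairs, p_any)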

    From my perspective, a perfect hash would certainly be a welcome
    addition to the Python library, perhaps as an optional algorithm
    supporting hash-based collections.
     
    Casey, Jul 12, 2008
    #2

  3. A few thoughts: Objects currently have their own hash function
    that is independent of a given container. We also have to hash
    types other than strings (used in the examples in the links you
    provided). While a container such as a dict or set is being built,
    we don't even know how many unique keys there are until the
    container is fully populated.

    Set-to-set operations and set-to-dict operations get their speed
    from knowing that both containers use the same hash values -- this
    would be disrupted if each container had its own hash function.
    Some types like strings (especially interned strings) remember
    their own hash value -- this makes it very fast to look up their
    values in a set or dict -- that would be defeated if each container
    had its own hash function which would need to be recomputed for
    every lookup.

    An important use case for sets is uniquification -- in that use
    case, speed is determined by insertion time; we just don't care
    about subsequent lookup time -- so anything that slows down
    insertion (such as computing a perfect hash value) works to the
    detriment of that use case.

    In the current design, sets and dicts are never more than 2/3 full
    and are usually much sparser than that -- accordingly, lookups
    average between 1 and 1.5 probes. This is somewhat hard to beat
    with perfect hashing since we typically get very few collisions
    anyway -- so you pay the cost of increased insertion time, the loss
    of objects being able to remember their own hash, challenges with
    non-string keys, a more complicated algorithm, and decreased
    interoperability for set-to-set and set-to-dict operations -- and
    the only benefit is reducing collisions to zero when they are
    already uncommon.
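
    A quick simulation (not from the thread; the probe recurrence is a
    simplified version of CPython's, so the numbers are only ballpark)
    backs up the probes-per-lookup figure and shows how sparseness
    drives it toward 1:

    import random

    def avg_probes(tablesize, load):
        # Insert random 32-bit "hash values" into an open-addressed
        # table and return the mean probes per insertion (a successful
        # lookup retraces the same probe sequence).
        table = [None] * tablesize
        mask = tablesize - 1
        keys = [random.getrandbits(32)
                for _ in range(int(tablesize * load))]
        total = 0
        for h in keys:
            i, perturb, probes = h & mask, h, 1
            while table[i] is not None:      # slot taken, keep probing
                perturb >>= 5
                i = (5 * i + perturb + 1) & mask
                probes += 1
            table[i] = h
            total += probes
        return total / float(len(keys))

    random.seed(123)
    print(avg_probes(2 ** 16, 2.0 / 3.0))   # at the 2/3 ceiling
    print(avg_probes(2 ** 16, 0.25))        # typical sparser table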

    For frozensets, it may be possible to add an optimize() method that
    takes a completely built frozenset and optimizes its insertion
    order and/or increases its size to make it arbitrarily sparse. That
    would only be useful when the cost of optimizing the table is fully
    repaid by an extensive number of lookups. There are some use cases
    where it would pay to accept the table optimization cost in order
    to win decreased lookup costs, but there are plenty of use cases
    where that would not be a winning trade-off. So, the optimize()
    step would need to be optional, not built in.


    Raymond

    FWIW, I mentioned all this at PyConDue but the message must have
    gotten lost.
     
    Raymond Hettinger, Jul 12, 2008
    #3
  4. I played around with the idea and have a rough implementation sketch:

    import collections

    class PerfectSet(collections.Set):
        __slots__ = ['len', 'hashvalue', 'hashfunc', 'hashtable']
        DUMMY = object()

        def __init__(self, keys, sparseness=0.5):
            keys = frozenset(keys)
            self.len = len(keys)
            self.hashvalue = hash(keys)
            n = int(len(keys) / sparseness) + 1
            assert n > self.len
            self.hashfunc = make_perfect_hash_function(keys, n)
            self.hashtable = [self.DUMMY] * n
            for k in keys:
                self.hashtable[self.hashfunc(k)] = k

        def __len__(self):
            return self.len

        def __iter__(self):
            return (k for k in self.hashtable if k is not self.DUMMY)

        def __contains__(self, k):
            # compare against the stored key; a non-member can still
            # hash to an occupied slot
            return self.hashtable[self.hashfunc(k)] == k
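
    For example (assuming some implementation of
    make_perfect_hash_function; one possible sketch follows below):

    colors = PerfectSet(['red', 'green', 'blue'], sparseness=0.5)
    assert len(colors) == 3
    assert 'red' in colors and 'mauve' not in colors
    assert sorted(colors) == ['blue', 'green', 'red']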

    The perfect hash function will need to use the regular hash(obj) as a
    starting point. This helps with non-string types and it preserves
    existing relationships between __hash__ and __eq__.
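
    make_perfect_hash_function is left undefined above. One possible
    sketch (an illustration only, not Jenkins' algorithm or anything
    from CPython) derives a family of hash functions from the built-in
    hash() by mixing in a displacement, then searches for a working
    displacement bucket by bucket:

    def make_perfect_hash_function(keys, n):
        # mix a displacement d into the regular hash
        def h(d, k):
            return hash((d, k)) % n
        # first level: partition the keys into buckets
        buckets = {}
        for k in keys:
            buckets.setdefault(h(0, k), []).append(k)
        # second level: for each bucket (largest first), find a
        # displacement d that lands all of its keys in distinct,
        # still-empty slots -- brute force, fine for small key sets
        disp = [0] * n
        used = [False] * n
        for b, bkeys in sorted(buckets.items(),
                               key=lambda it: -len(it[1])):
            d = 1
            while True:
                slots = [h(d, k) for k in bkeys]
                distinct = len(set(slots)) == len(slots)
                if distinct and not any(used[s] for s in slots):
                    break
                d += 1
            disp[b] = d
            for s in slots:
                used[s] = True
        # final hash: route through the key's bucket displacement
        return lambda k: h(disp[h(0, k)], k)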

    The construction is more expensive than with regular frozensets so it
    is only useful when lookups are very common.

    Playing around with the idea suggests that a sparseness variable for
    frozensets would largely accomplish the same goal without the overhead
    of creating and applying a perfect hash function. Sparse hash tables
    rarely collide.


    Raymond
     
    Raymond Hettinger, Jul 12, 2008
    #4