a promising hash

Discussion in 'Perl Misc' started by bob_jenkins@burtleburtle.net, Aug 16, 2006.

  1. Guest

    The C code
    for (hash=seed, i=0; i<length; ++i)
    hash = tab[(key ^ hash) & 0xff] ^ (hash >> 8);
    with the bytes of tab[] being permutations of 0..255 is a nice portable
    hash, but not all that fast. Or it's a CRC if you fill the table
    appropriately. A CRC is linear, has many nice mathematical properties
    you may want (or may want to not have). Filling tab[] with
    permutations, I've got to check whether that's linear in some sense,
    but I don't think it is. Nonlinear functions make better
    approximations of random mappings.

    A CRC can be made twice as fast,
    for (hash=seed, i=0; i<length; ++i)
    hash = tab[(key ^ hash) & 0xffff] ^ (hash >> 16);
    if key[] has 2-byte entries instead of 1-byte entries, and it can
    compute the same hash as the 1-byte version (though you'd need
    different tables for different endian machines). In fact you could
    wrap the one byte version around the two byte version, using the two
    byte version on an aligned subset of the data and the one byte version
    on any unaligned head and tail.

    Can the same trick be used where tab[] fills each byte with
    permutations, or was that trick possible only due to the linearity of
    CRC? Apparently the trick can be used no matter how tab[] is filled.
    Looking at individual bytes:
    a, b, c, d (let x = d^k)
    => ta[x], tb[x] ^ a, tc[x] ^ b, td[x] ^ c (let y = td[x] ^ (c^k[i+1]))
    => ta[y], tb[y] ^ ta[x], tc[y] ^ tb[x] ^ a, td[y] ^ tc[x] ^ b
    All the bytes of the 64k lookup table are well defined, using (x,y) as
    index and combinations of ta, tb, tc, td to define the values. Given
    d^k, c^k[i+1], and td[] we can get x,y. The ^(hash>>16) would take
    care of XORing a,b to the bottom two bytes. One iteration of the two
    byte version produces the same result as two iterations of the one byte
    version.

    I've got to do some more tuning and timings and tests, but this strikes
    me as a winner for Perl's internal hash function. It can be fast for
    short strings (using the 1-byte version), fast for long strings (mostly
    using the 2-byte version), doesn't care about alignment, and can give
    the same results on both big endian and little endian machines. The
    most likely thing to kill it would be if the 64k array for the two-byte
    version gets paged out often enough that the cost of loading it is more
    than the benefit of using it. Or if it's linear in GF(2^^8) with
    unfortunate properties or something like that. I've also got to time
    it against Paul Hsieh's hash.
     
    , Aug 16, 2006
    #1
    1. Advertising

  2. writes:

    > I've got to do some more tuning and timings and tests, but this strikes
    > me as a winner for Perl's internal hash function.


    I'd suggest bringing this up on p5p, aka the perl5-porters mailing list.
    Info on the list is here:

    <http://lists.cpan.org/showlist.cgi?name=perl5-porters>

    Keep in mind that working patches and benchmark results will be taken a
    lot more seriously than ideas and theory.

    sherm--

    --
    Web Hosting by West Virginians, for West Virginians: http://wv-www.net
    Cocoa programming in Perl: http://camelbones.sourceforge.net
     
    Sherm Pendley, Aug 16, 2006
    #2
    1. Advertising

  3. Guest

    Sherm Pendley wrote:
    > writes:
    >
    > > I've got to do some more tuning and timings and tests, but this strikes
    > > me as a winner for Perl's internal hash function.

    >
    > I'd suggest bringing this up on p5p, aka the perl5-porters mailing list.
    > Info on the list is here:
    >
    > <http://lists.cpan.org/showlist.cgi?name=perl5-porters>
    >
    > Keep in mind that working patches and benchmark results will be taken a
    > lot more seriously than ideas and theory.
    >
    > sherm--
    >
    > --
    > Web Hosting by West Virginians, for West Virginians: http://wv-www.net
    > Cocoa programming in Perl: http://camelbones.sourceforge.net


    Nevermind. At least on a Pentium 4 running gcc 3.3.3, my latest hash
    (http://burtleburtle.net/bob/c/lookup3.c) can also give consistent
    results regardless of endianness, is faster for 100-byte strings than
    the crc-like hash I described above regardless of endianness (2x faster
    on little endian, 1.5x faster on big endian), and its speed is roughly
    the same for 4-byte strings. Plus it has no tables instead of that 64k
    lookup table, so caching would be less of an issue.

    Last I checked Perl was using my one-at-a-time hash, which is slower
    than either of those hashes. Someone did benchmarks recently showing
    Paul Hsieh's hash would be much faster than one-at-a-time, which is
    true. Which of Paul's and lookup3's hash was faster varied with the
    chip and compiler I tried; but I can plug my lookup3 by saying it's
    more thorough and you can extract a 64-bit result from it for the same
    price as a 32-bit result. All these hashes (the crc-like one, lookup3,
    one-at-a-time, and Paul's) are thorough enough that you can use hash
    tables that are powers of two and probably never notice any flaws in
    the way they hash.
     
    , Aug 20, 2006
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Red Orchid
    Replies:
    3
    Views:
    1,063
  2. Pieter Claassen
    Replies:
    1
    Views:
    1,128
    CBFalconer
    Aug 4, 2004
  3. Bo Peng
    Replies:
    4
    Views:
    802
  4. rp
    Replies:
    1
    Views:
    556
    red floyd
    Nov 10, 2011
  5. Srijayanth Sridhar
    Replies:
    19
    Views:
    641
    David A. Black
    Jul 2, 2008
Loading...

Share This Page