String Hashing Algorithms

Discussion in 'Ruby' started by Phrogz, May 10, 2005.

  1. Phrogz

    Phrogz Guest

    Summary
    =============================================================
    A brief analysis of Ruby versions of a few string hashing algorithms,
    and my conclusions about their efficacy at preventing collisions.


    Background
    =============================================================
    At work, we have some (young) code which currently does a lot of
    string compares that need to be really fast. We're about to try
    using hashes of the strings to represent each string instead, but
    instead of a traditional hash table (which 'chains' collisions for the
    same key) we're going to require that no two (different) strings may
    share the same hash key.

    As a result, I wanted to do a preliminary investigation into various
    string hashing algorithms, to see how limiting that would be. (How
    often would we need to rename a bunch of our strings just to get around
    collisions?) Given the nature of hashing algorithms, it was important
    that this be based on ~real world strings.


    Methodology
    =============================================================
    I wrote a ruby script to scan all our (C++) source code and find ANY
    words that were valid C++ identifiers. This also meant that I grabbed
    all such words from inside comment blocks, as well as C++ keywords, but
    that was OK for this back-of-the-envelope testing. I ended up with
    almost 43,000 unique identifiers.

    I then ran through the list of words and used 7 different string
    hashing algorithms (which I ported from the C code I found on a web
    page[1]) to compute different hash keys. I used a Set (per hashing
    algorithm) to store these keys, and at the end compared the number of
    keys in the Set with the number of identifiers fed in.
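
    The harness can be sketched roughly like this (the identifier list
    below is a hypothetical stand-in for the real corpus, which came from
    scanning our C++ sources):

    ```ruby
    require 'set'

    # Hypothetical stand-in for the real corpus of ~43,000 identifiers.
    identifiers = %w[foo bar baz foo_bar fooBar]

    # Feed every word through the supplied hashing block, collect the keys
    # in a Set, and compare the key count against the word count.
    def collision_report(name, words)
      keys = Set.new
      words.each { |w| keys << yield(w) }
      collisions = words.size - keys.size
      coverage   = 100.0 * keys.size / words.size
      format("%s :: %.4f percent coverage (%d collisions out of %d)",
             name, coverage, collisions, words.size)
    end

    puts collision_report("ruby", identifiers) { |w| w.hash }
    ```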


    The Results
    =============================================================
    Three of the hashing algorithms ended up with no collisions whatsoever,
    and one of these (SDBM) is really simple and fast (in C).

    Here is the output of my test:
    djb :: 99.8601 percent coverage (60 collisions out of 42884)
    elf :: 99.5430 percent coverage (196 collisions out of 42884)
    sdbm :: 100.0000 percent coverage (0 collisions out of 42884)
    js :: 99.9044 percent coverage (41 collisions out of 42884)
    ap :: 100.0000 percent coverage (0 collisions out of 42884)
    bkdr :: 99.9953 percent coverage (2 collisions out of 42884)
    rs :: 100.0000 percent coverage (0 collisions out of 42884)


    The Algorithms
    =============================================================
    module HashEm
    SIGNEDSHORT = 0x7FFFFFFF  # mask results down to a positive 31-bit value

      # Note: str[i] returns the byte at index i under Ruby 1.8.

      def self.rs( str, len=str.length )
        a,b = 63689,378551
        hash = 0
        len.times{ |i|
          hash = hash*a + str[i]
          a *= b
        }
        hash & SIGNEDSHORT
      end

      def self.js( str, len=str.length )
        hash = 1315423911
        len.times{ |i|
          hash ^= ( ( hash << 5 ) + str[i] + ( hash >> 2 ) )
        }
        hash & SIGNEDSHORT
      end

      def self.elf( str, len=str.length )
        hash = 0
        x = 0
        len.times{ |i|
          hash = (hash << 4) + str[i]
          if (x = hash & 0xF0000000) != 0
            hash ^= (x >> 24)
            hash &= ~x
          end
        }
        hash & SIGNEDSHORT
      end

      def self.bkdr( str, len=str.length )
        seed = 131 # 31 131 1313 13131 131313 etc..
        hash = 0
        len.times{ |i|
          hash = ( hash*seed ) + str[i]
        }
        hash & SIGNEDSHORT
      end

      def self.sdbm( str, len=str.length )
        hash = 0
        len.times{ |i|
          hash = str[i] + ( hash << 6 ) + ( hash << 16 ) - hash
        }
        hash & SIGNEDSHORT
      end

      def self.djb( str, len=str.length )
        hash = 5381
        len.times{ |i|
          hash = ((hash << 5) + hash) + str[i]
        }
        hash & SIGNEDSHORT
      end

      def self.ap( str, len=str.length )
        hash = 0
        len.times{ |i|
          if (i & 1) == 0
            hash ^= (hash << 7) ^ str[i] ^ (hash >> 3)
          else
            hash ^= ~( (hash << 11) ^ str[i] ^ (hash >> 5) )
          end
        }
        hash & SIGNEDSHORT
      end
    end
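
    For anyone trying these on a current Ruby, here is the sdbm hash
    restated over str.bytes, since String#[] returns a character rather
    than a byte value on Ruby 1.9 and later:

    ```ruby
    SIGNEDSHORT = 0x7FFFFFFF  # mask down to a positive 31-bit value

    # sdbm hash over the string's bytes (portable across Ruby versions).
    def sdbm(str)
      hash = 0
      str.bytes.each do |b|
        hash = b + (hash << 6) + (hash << 16) - hash
      end
      hash & SIGNEDSHORT
    end

    puts sdbm("identifier")                        # a stable 31-bit key
    puts sdbm("identifier") == sdbm("identifier")  # => true
    ```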



    [1] http://www.code-source.org/algorithm.php?id=62
    Phrogz, May 10, 2005
    #1

  2. Phrogz wrote:
    > Summary
    > =============================================================
    > A brief analysis of Ruby versions of a few string hashing algorithms,
    > and my conclusions about their efficacy at preventing collisions.


    Is ruby's own String#hash one of the ones you tested? If not, how does
    it compare?
    Joel VanderWerf, May 10, 2005
    #2

  3. Phrogz

    Phrogz Guest

    Joel VanderWerf wrote:
    > Phrogz wrote:
    > > A brief analysis of Ruby versions of a few string hashing algorithms,
    > > and my conclusions about their efficacy at preventing collisions.
    >
    > Is ruby's own String#hash one of the ones you tested? If not, how does
    > it compare?

    Because I was interested in C algorithms, I didn't consider it. But, I
    just ran it again, and ruby came out with the best of them:

    ruby :: 100.0000 percent coverage (0 collisions out of 42884)
    Phrogz, May 10, 2005
    #3
  4. Phrogz wrote:
    > Joel VanderWerf wrote:
    >> Phrogz wrote:
    >>> A brief analysis of Ruby versions of a few string hashing algorithms,
    >>> and my conclusions about their efficacy at preventing collisions.
    >>
    >> Is ruby's own String#hash one of the ones you tested? If not, how does
    >> it compare?
    >
    > Because I was interested in C algorithms, I didn't consider it. But, I
    > just ran it again, and ruby came out with the best of them:
    >
    > ruby :: 100.0000 percent coverage (0 collisions out of 42884)


    It is coded in C, and looking at rb_str_hash() in string.c, I see that
    it is one of HASH_ELFHASH, HASH_PERL, or a third (very simple) one.
    Joel VanderWerf, May 10, 2005
    #4
  5. Phrogz

    Phrogz Guest

    How can I figure out which hash it's using? It's not the ELF hash,
    since that was the worst performer of the lot. Searching the source for
    HASH_PERL I don't see it #defined anywhere, but my C is really, really
    rusty. Does this mean that it's using the 'fallback' hash algorithm?
    Does it vary by build per platform?
    Phrogz, May 10, 2005
    #5
  6. Phrogz

    Zed A. Shaw Guest

    On Wed, 2005-05-11 at 05:00 +0900, Phrogz wrote:
    >
    > Background
    > =============================================================
    > At work, we have some (young) code which is currently doing a lot of
    > string compares, which needs to be really fast. We're about to try
    > using hashes of the strings to represent each string instead, but
    > instead of a traditional hash table (which 'chains' collisions for the
    > same key) we're going to require that no two (different) strings may
    > share the same hash key.


    Just out of curiosity, but have you considered some other
    structures/algorithms which might be alternatives depending on your
    usage? Off the top of my head I can think of:

    * Trie -- Should find other strings really fast, but gets pretty big the
    more strings you need to store. There's a C library for this at
    http://www.octavian.org/cs/software.html
    * PATRICIA -- Basically a compacted Trie which takes less space.
    Couldn't find a library for this one.
    * Suffix Array -- I have a Ruby binding for one of the faster C
    libraries which I use in FastCST. The big advantage of a Suffix Array
    is that you can store it so that you only need to calculate the suffix
    array once. A suffix array is really more useful for matching and
    exclusion.
    * Suffix Tree -- There's really not much of a reason to use suffix trees
    these days since newer suffix array construction algorithms are faster.
    The main advantage of a suffix tree is that searching for a result can
    be faster. The main disadvantage is that they are fat memory pigs.
    * Bloom Filter -- These are not as accurate, but they can be fast as all
    hell if you just want to match strings with some probability.

    Anyway, just thought I'd throw out some alternatives to hash tables for
    certain situations. I'd say if you need to match C++ keywords to a
    stored index then take a look at the libtst Trie implementation. It's
    quite fast for that application. If you just need to see if a certain
    word is "included" or "excluded" then a suffix array could do that
    really fast. The trick there is to build the suffix array by joining the
    strings with a | character (or something else) between them. Nice thing
    about a suffix array is that you can build it offline and then just load
    it directly for the searching.
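
    To make the trie suggestion concrete, here is a minimal hash-based
    trie sketch in Ruby; a production implementation (like the libtst
    library mentioned above) would be far more compact and faster:

    ```ruby
    # A minimal trie: each node is a Hash mapping a character to the next
    # node, with a :end marker for nodes that terminate a stored word.
    class Trie
      def initialize
        @root = {}
      end

      def insert(word)
        node = @root
        word.each_char { |ch| node = (node[ch] ||= {}) }
        node[:end] = true
      end

      def include?(word)
        node = @root
        word.each_char do |ch|
          node = node[ch] or return false
        end
        node.key?(:end)
      end
    end

    t = Trie.new
    %w[class const char].each { |w| t.insert(w) }
    puts t.include?("const")  # true
    puts t.include?("co")     # false: a prefix, not a stored word
    ```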

    Zed A. Shaw
    http://www.zedshaw.com/
    Zed A. Shaw, May 10, 2005
    #6
  7. Phrogz wrote:
    > Summary


    Good summary and test, Phrogz.

    Don't forget that in many hashing algorithms, the cost of computing
    the hash function far outweighs the occasional cost of traversing a
    chain. If you need guaranteed lookup time, you need a "perfect hash"
    (try Googling on that) but otherwise you just need a fast hash function
    with an excellent spread and a low load factor. Neither of those is
    hard to achieve.

    We use linear hashing (note, this is not the same as linear open
    addressing, a much older technique), which achieves excellent
    performance with any size hash table. Unfortunately I have no
    implementation I can publish.

    Clifford Heath.
    Clifford Heath, May 11, 2005
    #7
  8. Hi,

    In message "Re: String Hashing Algorithms"
    on Wed, 11 May 2005 07:20:26 +0900, "Phrogz" <> writes:

    |How can I figure out which hash it's using?

    We are using the last one, unless specifically defined.

    matz.
    Yukihiro Matsumoto, May 11, 2005
    #8
  9. Zed A. Shaw wrote:
    > Just out of curiosity, but have you considered some other
    > structures/algorithms which might be alternatives depending on your
    > usage? Off the top of my head I can think of:
    >
    > [snip: Trie, PATRICIA, Suffix Array, Suffix Tree, Bloom Filter]
    >
    > Zed A. Shaw
    > http://www.zedshaw.com/

    You may be interested in this project:
    http://www.rubyforge.org/projects/judy

    It has an implementation of something like the PATRICIA trie you
    describe above, as well as a sparse hash.

    -Charlie
    Charles Mills, May 11, 2005
    #9
  10. > At work, we have some (young) code which is currently doing a lot of
    > string compares, which needs to be really fast. We're about to try
    > using hashes of the strings to represent each string instead, but
    > instead of a traditional hash table (which 'chains' collisions for the
    > same key) we're going to require that no two (different) strings may
    > share the same hash key.


    If you know all possible strings in advance, you might want to generate
    a perfect hash:
    http://burtleburtle.net/bob/hash/perfect.html
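
    As a toy illustration of the idea: when the full string set is known
    in advance, you can simply search for hash parameters (here, the
    multiplier seed of a bkdr-style hash) that happen to produce zero
    collisions on that set. Real generators like the one linked above
    are far more sophisticated than this brute-force sketch:

    ```ruby
    require 'set'

    # Search for a multiplier seed that gives a collision-free hash over
    # the given words; returns nil if none is found below the limit.
    def find_collision_free_seed(words, limit = 10_000)
      (1...limit).each do |seed|
        keys = words.map do |w|
          h = 0
          w.each_byte { |b| h = (h * seed + b) & 0x7FFFFFFF }
          h
        end
        return seed if Set.new(keys).size == words.size
      end
      nil
    end

    seed = find_collision_free_seed(%w[if else while for return])
    puts seed  # a small seed with no collisions on this particular set
    ```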

    martinus
    Martin Ankerl, May 11, 2005
    #10
  11. On May 11, 2005, at 12:05 AM, Martin Ankerl wrote:
    > If you know all possible strings in advance, you might want to
    > generate a perfect hash:
    > http://burtleburtle.net/bob/hash/perfect.html


    Ooh, that's quite helpful, thanks!
    Gavin Kistner, May 11, 2005
    #11
  12. On May 10, 2005, at 4:38 PM, Zed A. Shaw wrote:
    > Just out of curiosity, but have you considered some other
    > structures/algorithms which might be alternatives depending on your
    > usage?


    Thank you for all the links. I've passed them on to the engineers in
    charge of this particular refactor. (Despite my love of Ruby, I'm not
    actually a programmer on the engineering team, so I only got involved
    in overhearing them talking and wanting to investigate how often
    hashing algorithms collide.)
    Gavin Kistner, May 11, 2005
    #12
  13. On May 10, 2005, at 5:58 PM, Yukihiro Matsumoto wrote:
    >> How can I figure out which hash it's using?

    >
    > We are using the last one, unless specifically defined.


    Thanks. FWIW, though I know fine-tuning the performance of Ruby is
    not a high priority, that last algorithm uses multiplication, while
    some of the other hashes I posted do not. I don't know how often
    String#hash is used, how much of a difference a change of
    implementation would make (in speed), or how the algorithm being used
    fares compared to something like SDBM in real-world tests...but if it
    did compare favorably, was much faster, and String#hash was used in
    tons of places in the code, it might be a nice little speed boost.
    Gavin Kistner, May 11, 2005
    #13
  14. On May 10, 2005, at 5:30 PM, Clifford Heath wrote:
    > Don't forget that in many hashing algorithms, the cost of computing
    > the hash function far outweighs the occasional cost of traversing a
    > chain.


    That's not something I knew, so it's hard to unforget it. Thanks :)

    I *think* that the proposed application within my company is a one-
    time optimization. (Hash the nice human/programmer-readable strings
    and from then on store and use the results only as shorts.)

    I have a BS comp-sci degree, but apparently I need to go back to
    school and learn myself some in-depth algorithm and data structure
    stuff.

    Thanks also for the note about "linear hashing". Looks interesting.
    Gavin Kistner, May 11, 2005
    #14
  15. Am Tuesday 10 May 2005 22:00 schrieb Phrogz:
    > Summary
    > =============================================================
    > A brief analysis of Ruby versions of a few string hashing algorithms,
    > and my conclusions about their efficacy at preventing collisions.


    You might be interested in the following page, which compares several
    different hash-algorithms:

    http://www.fantasy-coders.de/projects/gh/html/x435.html

    But it's written in German ;-)

    Regards,

    Michael
    Michael Neumann, May 11, 2005
    #15
  16. Gavin Kistner wrote:
    >> the cost of computing
    >> the hash function far outweighs the occasional cost of traversing a
    >> chain.

    > That's not something I knew, so it's hard to unforget it. Thanks :)


    It's probably a little more difficult and woolly now that CPUs
    are so very much faster than RAM, and caches are big and wide to
    compensate. But in the old days, compilers used techniques to
    hash identifiers like hashing the first few characters, the last
    few, and, if the identifier was long, some of the middle ones as
    well, ignoring any other characters. This was done to avoid spending
    CPU cycles integrating the characters. It also works better with the
    typical character spread of identifiers than with non-specific
    data.

    Nowadays, if you have to fill a cache line, you might as well hash
    all the bytes you loaded - which will often be 8 or 16 bytes. The
    extra CPU cycles required will disappear in wait states anyhow...

    If the data is ASCII, most of the entropy is in the bottom four
    bits. If you treat the characters in groups of four, you shouldn't
    just XOR all the groups together, because though you get a 32-bit
    result, 16 of those bits have very little entropy. I usually do a
    5-bit rotate of the accumulating hash value between each word,
    which is often enough. Not quite as good statistical spread as a
    CRC, which you can match by randomising each step with a multiply
    and add. Here's the function I use, which works pretty well, though
    I won't claim it's optimal - these things always have tradeoffs.
    For example, this doesn't assume that cp is passed in word-aligned,
    though that will often be the case.

    HashValT
    HashBlock(
        const char* cp,
        size_t bytes
    )
    {
        HashValT val = 0;
        union {
            char four[4];
            HashValT one;
        };

        for (bytes += 4; bytes > 4;) // Avoid unsigned woes...
        {
            one = 0; // Initialise all 4 at once
            switch (bytes -= 4) // Duff's device:
            {
            default: four[3] = *cp++; // Fall through
            case 3:  four[2] = *cp++; // Fall through
            case 2:  four[1] = *cp++; // Fall through
            case 1:  four[0] = *cp++;
            }
            val ^= one;
            val = val*1103515245L + 12345; // For good measure...
            val = ((val<<5) ^ (val>>(32-5))) ^ one; // Rotate 5
        }
        return val;
    }

    > I *think* that the proposed application within my company is a
    > one-time optimization.


    If you google on "perfect hashing", that will lead you to a nice tool
    for this job. It finds a guaranteed 1:1 hash with close to minimum
    computation on a predefined set of strings.

    Clifford Heath.
    Clifford Heath, May 12, 2005
    #16
