String Hashing Algorithms


Phrogz

Summary
=============================================================
A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.


Background
=============================================================
At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.

As a result, I wanted to do a preliminary investigation into various
string hashing algorithms, to see how limiting that requirement would
be. (How often would we need to rename a bunch of our strings just to
get around collisions?) Given the nature of hashing algorithms, it was
important that this test be based on real-world strings.


Methodology
=============================================================
I wrote a Ruby script to scan all our (C++) source code and find ANY
words that were valid C++ identifiers. This also meant that I grabbed
all such words from inside comment blocks, as well as C++ keywords, but
that was OK for this back-of-the-envelope testing. I ended up with
almost 43,000 unique identifiers.

I then ran through the list of words and used 7 different string
hashing algorithms (which I ported from the C code I found on a web
page[1]) to compute different hash keys. I used a Set (per hashing
algorithm) to store these keys, and at the end compared the number of
keys in the Set with the number of identifiers fed in.
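The per-algorithm test amounted to something like this (a sketch only;
`words` stands in for the scraped identifier list, and the hash
functions are passed in as callables, which is not necessarily how my
actual script was structured):

```ruby
require 'set'

# For each named hash function, hash every word into a Set and compare
# the number of distinct keys against the number of words fed in.
# Returns a hash of algorithm name => collision count.
def collision_report(words, algorithms)
  algorithms.map do |name, fn|
    keys = Set.new
    words.each { |w| keys << fn.call(w) }
    collisions = words.size - keys.size
    puts '%s :: %.4f percent coverage (%d collisions out of %d)' %
         [name, 100.0 * keys.size / words.size, collisions, words.size]
    [name, collisions]
  end.to_h
end
```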


The Results
=============================================================
Three of the hashing algorithms ended up with no collisions whatsoever,
and one of these (SDBM) is really simple and fast (in C).

Here is the output of my test:
djb :: 99.8601 percent coverage (60 collisions out of 42884)
elf :: 99.5430 percent coverage (196 collisions out of 42884)
sdbm :: 100.0000 percent coverage (0 collisions out of 42884)
js :: 99.9044 percent coverage (41 collisions out of 42884)
ap :: 100.0000 percent coverage (0 collisions out of 42884)
bkdr :: 99.9953 percent coverage (2 collisions out of 42884)
rs :: 100.0000 percent coverage (0 collisions out of 42884)


The Algorithms
=============================================================
module HashEm
  # Character values are taken with str.getbyte(i); in Ruby 1.8,
  # str[i] returned the byte value directly.
  SIGNEDSHORT = 0x7FFFFFFF # 31-bit mask; keeps results in signed 32-bit range

  def self.rs( str, len=str.length )
    a, b = 63689, 378551
    hash = 0
    len.times{ |i|
      hash = hash*a + str.getbyte(i)
      a *= b
    }
    hash & SIGNEDSHORT
  end

  def self.js( str, len=str.length )
    hash = 1315423911
    len.times{ |i|
      hash ^= ( ( hash << 5 ) + str.getbyte(i) + ( hash >> 2 ) )
    }
    hash & SIGNEDSHORT
  end

  def self.elf( str, len=str.length )
    hash = 0
    x = 0
    len.times{ |i|
      hash = (hash << 4) + str.getbyte(i)
      if (x = hash & 0xF0000000) != 0
        hash ^= (x >> 24)
        hash &= ~x
      end
    }
    hash & SIGNEDSHORT
  end

  def self.bkdr( str, len=str.length )
    seed = 131 # 31 131 1313 13131 131313 etc.
    hash = 0
    len.times{ |i|
      hash = ( hash*seed ) + str.getbyte(i)
    }
    hash & SIGNEDSHORT
  end

  def self.sdbm( str, len=str.length )
    hash = 0
    len.times{ |i|
      hash = str.getbyte(i) + ( hash << 6 ) + ( hash << 16 ) - hash
    }
    hash & SIGNEDSHORT
  end

  def self.djb( str, len=str.length )
    hash = 5381
    len.times{ |i|
      hash = ((hash << 5) + hash) + str.getbyte(i)
    }
    hash & SIGNEDSHORT
  end

  def self.ap( str, len=str.length )
    hash = 0
    len.times{ |i|
      if (i & 1) == 0
        hash ^= (hash << 7) ^ str.getbyte(i) ^ (hash >> 3)
      else
        hash ^= ~( (hash << 11) ^ str.getbyte(i) ^ (hash >> 5) )
      end
    }
    hash & SIGNEDSHORT
  end
end
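As a standalone sanity check of the simplest of these, the SDBM
recurrence can be written directly over the string's bytes (a sketch; it
assumes each character contributes its byte value):

```ruby
# Standalone SDBM hash, matching the recurrence above:
# hash = byte + (hash << 6) + (hash << 16) - hash, masked to 31 bits.
def sdbm_hash(str)
  hash = 0
  str.each_byte do |byte|
    hash = byte + (hash << 6) + (hash << 16) - hash
  end
  hash & 0x7FFFFFFF
end

sdbm_hash('identifier') # deterministic integer key for the string
```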



[1] http://www.code-source.org/algorithm.php?id=62
 

Joel VanderWerf

Phrogz said:
Summary
=============================================================
A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.

Is ruby's own String#hash one of the ones you tested? If not, how does
it compare?
 

Phrogz

Joel said:
Is ruby's own String#hash one of the ones you tested? If not, how does
it compare?

Because I was interested in C algorithms, I didn't consider it. But, I
just ran it again, and ruby came out with the best of them:

ruby :: 100.0000 percent coverage (0 collisions out of 42884)
 

Joel VanderWerf

Phrogz said:
Because I was interested in C algorithms, I didn't consider it. But, I
just ran it again, and ruby came out with the best of them:

ruby :: 100.0000 percent coverage (0 collisions out of 42884)

It is coded in C, and looking at rb_str_hash() in string.c, I see that
it is one of HASH_ELFHASH, HASH_PERL, or a third (very simple) one.
 

Phrogz

How can I figure out which hash it's using? It's not the ELF hash,
since that was the worst performer of the lot. Searching the source for
HASH_PERL I don't see it #defined anywhere, but my C is really, really
rusty. Does this mean that it's using the 'fallback' hash algorithm?
Does it vary by build per platform?
 

Zed A. Shaw

Background
=============================================================
At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.

Just out of curiosity, but have you considered some other
structures/algorithms which might be alternatives depending on your
usage? Off the top of my head I can think of:

* Trie -- Should find other strings really fast, but gets pretty big the
more strings you need to store. There's a C library for this at
http://www.octavian.org/cs/software.html
* PATRICIA -- Basically a compacted Trie which takes less space.
Couldn't find a library for this one.
* Suffix Array -- I have a Ruby binding for one of the faster C
libraries which I use in FastCST. The big advantage of a Suffix Array
is that you can store it so that you only need to calculate the suffix
array once. A suffix array is really more useful for matching and
exclusion.
* Suffix Tree -- There's really not much of a reason to use suffix trees
these days since newer suffix array construction algorithms are faster.
The main advantage of a suffix tree is that searching for a result can
be faster. The main disadvantages are that they are fat memory pigs.
* Bloom Filter -- These are not as accurate, but they can be fast as all
hell if you just want to match strings with some probability.

Anyway, just thought I'd throw out some alternatives to hash tables for
certain situations. I'd say if you need to match C++ keywords to a
stored index then take a look at the libtst Trie implementation. It's
quite fast for that application. If you just need to see if a certain
word is "included" or "excluded" then a suffix array could do that
really fast. The trick there is to build the suffix array by joining the
strings with a | character (or something else) between them. Nice thing
about a suffix array is that you can build it offline and then just load
it directly for the searching.
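Of the structures above, a bare-bones trie is easy to sketch in Ruby
(nested hashes, insert and exact-match lookup only; nothing like
libtst's tuned implementation):

```ruby
# Minimal trie over nested hashes. Each node is a hash from character
# to child node; the :end key marks that a complete word stops there.
class Trie
  def initialize
    @root = {}
  end

  def insert(word)
    node = @root
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true
  end

  def include?(word)
    node = @root
    word.each_char do |ch|
      node = node[ch]
      return false unless node
    end
    node.key?(:end)
  end
end
```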

Zed A. Shaw
http://www.zedshaw.com/
 

Clifford Heath

Phrogz said:

Good summary and test, Phrogz.

Don't forget that in many hashing algorithms, the cost of computing
the hash function far outweighs the occasional cost of traversing a
chain. If you need guaranteed lookup time, you need a "perfect hash"
(try Googling on that) but otherwise you just need a fast hash function
with an excellent spread and a low load factor. Neither of those is
hard to achieve.
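For contrast with the no-collisions-allowed approach, here is chaining
in miniature (a toy Ruby sketch, not the linear hashing described next):
a collision only costs a short walk along the bucket's list.

```ruby
# Toy chained hash table: each bucket holds an array ("chain") of
# [key, value] pairs, and a collision is resolved by a linear walk.
class ChainedTable
  def initialize(buckets = 64)
    @buckets = Array.new(buckets) { [] }
  end

  def []=(key, value)
    chain = @buckets[key.hash % @buckets.size]
    pair  = chain.find { |k, _| k == key }
    if pair
      pair[1] = value     # overwrite existing entry
    else
      chain << [key, value]
    end
  end

  def [](key)
    chain = @buckets[key.hash % @buckets.size]
    pair  = chain.find { |k, _| k == key }
    pair && pair[1]
  end
end
```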

We use linear hashing (note, this is not the same as linear open
addressing, a much older technique), which achieves excellent
performance with any size hash table. Unfortunately I have no
implementation I can publish.

Clifford Heath.
 

Yukihiro Matsumoto

Hi,

In message "Re: String Hashing Algorithms"

|How can I figure out which hash it's using?

We are using the last one, unless specifically defined.

matz.
 

Charles Mills

Zed said:
* PATRICIA -- Basically a compacted Trie which takes less space.
Couldn't find a library for this one.

You may be interested in this project:
http://www.rubyforge.org/projects/judy

It has an implementation of something like the PATRICIA you describe
above, as well as a sparse hash.

-Charlie
 

Martin Ankerl

At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.

If you know all possible strings in advance, you might want to generate
a perfect hash:
http://burtleburtle.net/bob/hash/perfect.html
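If the set really is fixed, the idea can even be brute-forced: keep
trying seeds for a simple multiplicative hash until no two strings in
the set collide (a toy Ruby sketch, nothing as clever as the tool linked
above):

```ruby
# Toy "perfect hash" search: vary the seed of a BKDR-style hash until
# every string in the fixed set lands in a distinct table slot.
# Returns the first working seed, or nil if none is found in range.
def find_perfect_seed(strings, table_size)
  (1..10_000).each do |seed|
    slots = strings.map do |s|
      h = 0
      s.each_byte { |b| h = (h * seed + b) & 0x7FFFFFFF }
      h % table_size
    end
    return seed if slots.uniq.size == strings.size
  end
  nil
end
```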

martinus
 

Gavin Kistner

Just out of curiosity, but have you considered some other
structures/algorithms which might be alternatives depending on your
usage?

Thank you for all the links. I've passed them on to the engineers in
charge of this particular refactor. (Despite my love of Ruby, I'm not
actually a programmer on the engineering team, so I only got involved
in overhearing them talking and wanting to investigate how often
hashing algorithms collide.)
 

Gavin Kistner

We are using the last one, unless specifically defined.

Thanks. FWIW, though I know fine-tuning the performance of Ruby is
not a high priority, that last algorithm uses multiplication, while
some of the other hashes I posted do not. I don't know how often
String#hash is used, or how much of a difference a change of
implementation would make (in speed), or how the algorithm being used
fares compared to something like SDBM in real-world tests...but if it
did compare favorably, was much faster, and String#hash was used in
tons of places in the code, it might be a nice little speed boost.
 

Gavin Kistner

Don't forget that in many hashing algorithms, the cost of computing
the hash function far outweighs the occasional cost of traversing a
chain.

That's not something I knew, so it's hard to unforget it. Thanks :)

I *think* that the proposed application within my company is a one-
time optimization. (Hash the nice human/programmer-readable strings
and from then on store and use the results only as shorts.)

I have a BS comp-sci degree, but apparently I need to go back to
school and learn myself some in-depth algorithm and data structure
stuff.

Thanks also for the note about "linear hashing". Looks interesting.
 

Michael Neumann

On Tuesday 10 May 2005 at 22:00, Phrogz wrote:
Summary
=============================================================
A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.

You might be interested in the following page, which compares several
different hash-algorithms:

http://www.fantasy-coders.de/projects/gh/html/x435.html

But it's written in German ;-)

Regards,

Michael
 

Clifford Heath

Gavin said:
That's not something I knew, so it's hard to unforget it. Thanks :)

It's probably a little more difficult and woolly now that CPUs
are so very much faster than RAM, and caches are big and wide to
compensate. But in the old days, compilers used techniques to
hash identifiers like hashing the first few characters, the last
few, and if the identifier was long, some of the middle ones as
well, ignoring any other characters. This was done to avoid CPU
cycles integrating the characters. It also works better with the
typical character spread of identifiers rather than non-specific
data.

Nowadays, if you have to fill a cache line, you might as well hash
all the bytes you loaded - which will often be 8 or 16 bytes. The
extra CPU cycles required will disappear in wait states anyhow...

If the data is ASCII, most of the entropy is in the bottom four
bits. If you treat the characters in groups of four, you shouldn't
just XOR all the groups together, because though you get a 32-bit
result, 16 of those bits have very little entropy. I usually do a
5-bit rotate of the accumulating hash value between each word,
which is often enough. Not quite as good statistical spread as a
CRC, which you can match by randomising each step with a multiply
and add. Here's the function I use, which works pretty well, though
I won't claim it's optimal - these things always have tradeoffs.
For example, this doesn't assume that cp is passed in word-aligned,
though that will often be the case.

#include <stddef.h>
#include <stdint.h>
typedef uint32_t HashValT;          // assumed 32-bit, per the (32-5) rotate below

HashValT
HashBlock(
    const char* cp,
    size_t bytes
)
{
    HashValT val = 0;
    union {                         // named "u"; anonymous unions at block
        char four[4];               // scope are a compiler extension
        HashValT one;
    } u;

    for (bytes += 4; bytes > 4;)    // Avoid unsigned woes...
    {
        u.one = 0;                  // Initialise all 4 at once
        switch (bytes -= 4)         // Duff's device:
        {
        default: u.four[3] = *cp++; // Fall through
        case 3:  u.four[2] = *cp++; // Fall through
        case 2:  u.four[1] = *cp++; // Fall through
        case 1:  u.four[0] = *cp++;
        }
        val ^= u.one;
        val = val*1103515245L + 12345;            // For good measure...
        val = ((val<<5) ^ (val>>(32-5))) ^ u.one; // Rotate 5
    }
    return val;
}

I *think* that the proposed application within my company is a one-time
optimization.

If you google on "perfect hashing", that will lead you to a nice tool
for this job. It finds a guaranteed 1:1 hash with close to minimum
computation on a predefined set of strings.

Clifford Heath.
 
