P
Phrogz
Summary
=============================================================
A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.
Background
=============================================================
At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.
As a result, I wanted to do a preliminary investigation into various
string hashing algorithms, to see how limiting that would be. (How
often would we need to rename a bunch of our strings just to get around
collisions?) Given the nature of hashing algorithms, it was important
that this be based on ~real world strings.
Methodology
=============================================================
I wrote a ruby script to scan all our (C++) source code and find ANY
words that were valid C++ identifiers. This also meant that I grabbed
all such words from inside comment blocks, as well as C++ keywords, but
that was OK for this back-of-the-envelope testing. I ended up with
almost 43,000 unique identifiers.
I then ran through the list of words and used 7 different string
hashing algorithms (which I ported from the C code I found on a web
page[1]) to compute different hash keys. I used a Set (per hashing
algorithm) to store these keys, and at the end compared the number of
keys in the Set with the number of identifiers fed in.
The Results
=============================================================
Three of the hashing algorithms ended up with no collisions whatsoever,
and one of these (SDBM) is really simple and fast (in C).
Here is the output of my test:
djb :: 99.8601 percent coverage (60 collisions out of 42884)
elf :: 99.5430 percent coverage (196 collisions out of 42884)
sdbm :: 100.0000 percent coverage (0 collisions out of 42884)
js :: 99.9044 percent coverage (41 collisions out of 42884)
ap :: 100.0000 percent coverage (0 collisions out of 42884)
bkdr :: 99.9953 percent coverage (2 collisions out of 42884)
rs :: 100.0000 percent coverage (0 collisions out of 42884)
The Algorithms
=============================================================
module HashEm
SIGNEDSHORT = 0x7FFFFFFF
def self.rs( str, len=str.length )
a,b = 63689,378551
hash = 0
len.times{ |i|
hash = hash*a + str
a *= b
}
hash & SIGNEDSHORT
end
def self.js( str, len=str.length )
hash = 1315423911
len.times{ |i|
hash ^= ( ( hash << 5 ) + str + ( hash >> 2 ) )
}
hash & SIGNEDSHORT
end
def self.elf( str, len=str.length )
hash = 0
x = 0
len.times{ |i|
hash = (hash << 4) + str
if (x = hash & 0xF0000000) != 0
hash ^= (x >> 24)
hash &= ~x
end
}
hash & SIGNEDSHORT
end
def self.bkdr( str, len=str.length )
seed = 131 # 31 131 1313 13131 131313 etc..
hash = 0
len.times{ |i|
hash = ( hash*seed )+str
}
hash & SIGNEDSHORT
end
def self.sdbm( str, len=str.length )
hash = 0
len.times{ |i|
hash = str + ( hash << 6 ) + ( hash << 16 ) - hash
}
hash & SIGNEDSHORT
end
def self.djb( str, len=str.length )
hash = 5381
len.times{ |i|
hash = ((hash << 5) + hash) + str
}
hash & SIGNEDSHORT
end
def self.ap( str, len=str.length )
hash = 0
len.times{ |i|
if (i & 1) == 0
hash ^= (hash << 7) ^ str ^ (hash >> 3)
else
hash ^= ~( (hash << 11) ^ str ^ (hash >> 5) )
end
}
hash & SIGNEDSHORT
end
end
[1] http://www.code-source.org/algorithm.php?id=62
=============================================================
A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.
Background
=============================================================
At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.
As a result, I wanted to do a preliminary investigation into various
string hashing algorithms, to see how limiting that would be. (How
often would we need to rename a bunch of our strings just to get around
collisions?) Given the nature of hashing algorithms, it was important
that this be based on ~real world strings.
Methodology
=============================================================
I wrote a ruby script to scan all our (C++) source code and find ANY
words that were valid C++ identifiers. This also meant that I grabbed
all such words from inside comment blocks, as well as C++ keywords, but
that was OK for this back-of-the-envelope testing. I ended up with
almost 43,000 unique identifiers.
I then ran through the list of words and used 7 different string
hashing algorithms (which I ported from the C code I found on a web
page[1]) to compute different hash keys. I used a Set (per hashing
algorithm) to store these keys, and at the end compared the number of
keys in the Set with the number of identifiers fed in.
The Results
=============================================================
Three of the hashing algorithms ended up with no collisions whatsoever,
and one of these (SDBM) is really simple and fast (in C).
Here is the output of my test:
djb :: 99.8601 percent coverage (60 collisions out of 42884)
elf :: 99.5430 percent coverage (196 collisions out of 42884)
sdbm :: 100.0000 percent coverage (0 collisions out of 42884)
js :: 99.9044 percent coverage (41 collisions out of 42884)
ap :: 100.0000 percent coverage (0 collisions out of 42884)
bkdr :: 99.9953 percent coverage (2 collisions out of 42884)
rs :: 100.0000 percent coverage (0 collisions out of 42884)
The Algorithms
=============================================================
module HashEm
SIGNEDSHORT = 0x7FFFFFFF
def self.rs( str, len=str.length )
a,b = 63689,378551
hash = 0
len.times{ |i|
hash = hash*a + str
a *= b
}
hash & SIGNEDSHORT
end
def self.js( str, len=str.length )
hash = 1315423911
len.times{ |i|
hash ^= ( ( hash << 5 ) + str + ( hash >> 2 ) )
}
hash & SIGNEDSHORT
end
def self.elf( str, len=str.length )
hash = 0
x = 0
len.times{ |i|
hash = (hash << 4) + str
if (x = hash & 0xF0000000) != 0
hash ^= (x >> 24)
hash &= ~x
end
}
hash & SIGNEDSHORT
end
def self.bkdr( str, len=str.length )
seed = 131 # 31 131 1313 13131 131313 etc..
hash = 0
len.times{ |i|
hash = ( hash*seed )+str
}
hash & SIGNEDSHORT
end
def self.sdbm( str, len=str.length )
hash = 0
len.times{ |i|
hash = str + ( hash << 6 ) + ( hash << 16 ) - hash
}
hash & SIGNEDSHORT
end
def self.djb( str, len=str.length )
hash = 5381
len.times{ |i|
hash = ((hash << 5) + hash) + str
}
hash & SIGNEDSHORT
end
def self.ap( str, len=str.length )
hash = 0
len.times{ |i|
if (i & 1) == 0
hash ^= (hash << 7) ^ str ^ (hash >> 3)
else
hash ^= ~( (hash << 11) ^ str ^ (hash >> 5) )
end
}
hash & SIGNEDSHORT
end
end
[1] http://www.code-source.org/algorithm.php?id=62