Benchmarking hashcode algorithms


Roedy Green

I have been benchmarking some hashing algorithms. The fastest
algorithms with the fewest collisions are near the top. I used
real-world data.

The big surprise for me is how good String.hashCode is.

Collecting 20000 filenames from drive E...
The best have the lowest elapsed times and the fewest collisions.

Benchmarking String.hashCode...
Computing 100000 String.hashCode hashes...
Time to compute 100000 iterations of String.hashCode was 12.200
seconds.
Checking 20000 String.hashCode for collisions...
There were 0 collisions.
------

Benchmarking SunCRC32...
Computing 100000 SunCRC32 hashes...
Time to compute 100000 iterations of SunCRC32 was 66.515 seconds.
Checking 20000 SunCRC32 for collisions...
There were 0 collisions.
------

Benchmarking FNV1a32...
Computing 100000 FNV1a32 hashes...
Time to compute 100000 iterations of FNV1a32 was 48.146 seconds.
Checking 20000 FNV1a32 for collisions...
There were 0 collisions.
------

Benchmarking Adler...
Computing 100000 Adler hashes...
Time to compute 100000 iterations of Adler was 45.458 seconds.
Checking 20000 Adler for collisions...
There were 39 collisions.
------

Benchmarking SunCRC16...
Computing 100000 SunCRC16 hashes...
Time to compute 100000 iterations of SunCRC16 was 215.462 seconds.
Checking 20000 SunCRC16 for collisions...
There were 2768 collisions.
------


I will be adding some more. If you want me to test your favourite,
just implement this interface:


/**
 * Interface an algorithm must implement to be benchmarked.
 */
interface HashAlgorithm
{
    /**
     * Identifying info.
     */
    String getAlgorithmName();

    /**
     * Compute the hash, converted to a String for collision checking.
     * Slow version, used for collision checking.
     */
    String computeHashString( String s );

    /**
     * Compute the hash without converting it to a String.
     * Fast version, used for timing.
     */
    void computeHash( String s );
}
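As a sketch of what an implementation might look like, here is 32-bit FNV-1a plugged into that interface. The constants are the standard FNV-1a parameters; the class name and the choice of hex output are mine, and the interface is repeated so the example compiles on its own:

```java
import java.nio.charset.StandardCharsets;

// Repeated from above so this sketch compiles on its own.
interface HashAlgorithm
{
    String getAlgorithmName();
    String computeHashString( String s );
    void computeHash( String s );
}

/** Sketch: 32-bit FNV-1a behind the benchmark interface. */
class FNV1a32Hash implements HashAlgorithm
{
    private static final int OFFSET_BASIS = 0x811c9dc5;
    private static final int PRIME = 0x01000193;

    public String getAlgorithmName()
    {
        return "FNV1a32";
    }

    /** Slow version: hash rendered as a hex String for collision checking. */
    public String computeHashString( String s )
    {
        return Integer.toHexString( fnv1a32( s ) );
    }

    /** Fast version: compute the hash, skip the String conversion. */
    public void computeHash( String s )
    {
        fnv1a32( s );
    }

    private static int fnv1a32( String s )
    {
        int hash = OFFSET_BASIS;
        for ( byte b : s.getBytes( StandardCharsets.UTF_8 ) )
        {
            hash ^= b & 0xff;  // xor in one byte
            hash *= PRIME;     // then multiply, per FNV-1a
        }
        return hash;
    }
}
```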

If you are inventing a hashing algorithm that does not examine every
byte, keep in mind that in the real world the Strings you are hashing
often begin and end with the same substrings.

e.g. E:\mindprod .... .html

You might be able to cook up some special algorithm just for absolute
filenames.
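To see why this matters, consider a hypothetical shortcut hash that samples only the first and last characters; on absolute filenames that share a drive prefix and an .html suffix it collides constantly. The method name and sample paths below are made up for illustration:

```java
class SampledHash
{
    /** Hypothetical shortcut hash that looks at only the first and
        last characters instead of every byte. */
    static int sampledHash( String s )
    {
        return s.charAt( 0 ) * 31 + s.charAt( s.length() - 1 );
    }

    public static void main( String[] args )
    {
        // Both begin with E and end with l, so they collide.
        String a = "E:\\mindprod\\jgloss\\atan.html";
        String b = "E:\\mindprod\\jgloss\\break.html";
        System.out.println( sampledHash( a ) == sampledHash( b ) );  // prints true
    }
}
```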
 
Marcel Müller

> I have been benchmarking some hashing algorithms. The fastest
> algorithms with the fewest collisions are near the top. I used
> real-world data.
>
> The big surprise for me is how good String.hashCode is.
>
> Collecting 20000 filenames from drive E...
> The best have the lowest elapsed times and the fewest collisions.

Speed: no surprise. I could bet that most VMs implement String.hashCode
as native code.
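Whether or not a VM does it natively, the result String.hashCode must return is fixed by its Javadoc: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]. Written out in plain Java:

```java
class StringHashDemo
{
    /** The polynomial hash String.hashCode is specified to compute. */
    static int stringHash( String s )
    {
        int h = 0;
        for ( int i = 0; i < s.length(); i++ )
        {
            h = 31 * h + s.charAt( i );  // uses the full 16-bit char value
        }
        return h;
    }

    public static void main( String[] args )
    {
        String s = "E:\\mindprod\\index.html";
        System.out.println( stringHash( s ) == s.hashCode() );  // prints true
    }
}
```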
> Benchmarking SunCRC16...
> Computing 100000 SunCRC16 hashes...
> Time to compute 100000 iterations of SunCRC16 was 215.462 seconds.
> Checking 20000 SunCRC16 for collisions...
> There were 2768 collisions.

Also no surprise with 16 bits.


Marcel
 
Roedy Green

Here is the latest set of results. These are 100 times faster than in
the previous cut. I was using 1_000_000d to convert from ns to seconds
instead of 1_000_000_000d. Serves me right for not using TimeUnit.
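The fix is mechanical: divide by 1_000_000_000d, or better, let TimeUnit name the conversion so the factor cannot be mistyped. A minimal sketch:

```java
import java.util.concurrent.TimeUnit;

class ElapsedSeconds
{
    public static void main( String[] args )
    {
        long start = System.nanoTime();
        // ... the work being timed goes here ...
        long elapsedNanos = System.nanoTime() - start;

        // Error-prone: one missing group of zeros gives answers 1000x off.
        double seconds = elapsedNanos / 1_000_000_000d;

        // Safer: TimeUnit makes the units explicit (whole millis here).
        long millis = TimeUnit.NANOSECONDS.toMillis( elapsedNanos );

        System.out.println( seconds + " s = " + millis + " ms" );
    }
}
```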

1. These are all very fast. For my purpose, implementing a cache, the
speed is negligible. I have to compute hashes on only 10,000 to
20,000 files.

2. Sun's code works on a String directly. Most hashing algorithms
require you to convert the String to a byte[] first, which is probably
as much work as computing the hash itself.

3. Any kind of benchmark code, because it does nothing useful, can be
optimised out of existence by a clever compiler. I took no measures to
defeat that.

4. I also discovered that Adler is not so hot for collisions.

5. I expected Sun's String.hashCode to be fast, but to have lots of
collisions. It had none.

6. The speed penalty of the cryptographic hashes is not nearly as high
as I expected.

7. Whoever writes these functions is gradually making them faster.
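Point 3 is usually handled by accumulating every hash into a value the program later prints, so the JIT cannot prove the loop useless (JMH's Blackhole does this rigorously; the sketch below is the crude home-made version):

```java
class HashTiming
{
    static int sink;  // consumed result keeps the loop alive

    static void timeHashes( String[] names )
    {
        long start = System.nanoTime();
        for ( String name : names )
        {
            sink += name.hashCode();  // result is used, not discarded
        }
        long elapsed = System.nanoTime() - start;
        // Printing sink gives the compiler a reason to keep the work.
        System.out.println( elapsed / 1_000_000_000d + " seconds, sink=" + sink );
    }

    public static void main( String[] args )
    {
        timeHashes( new String[] { "E:\\a.html", "E:\\b.html" } );
    }
}
```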


Time to compute 25,000,000 iterations of 32-bit String.hashCode was
0.393 seconds.
There were 0 collisions in 50,000 filenames
------
Time to compute 25,000,000 iterations of 32-bit FNV1a32 was 5.044
seconds.
There were 1 collisions in 50,000 filenames
------
Time to compute 25,000,000 iterations of 64-bit FNV1a64 was 4.217
seconds.
There were 0 collisions in 50,000 filenames
------
Time to compute 25,000,000 iterations of 32-bit SunCRC32 was 6.133
seconds.
There were 1 collisions in 50,000 filenames
------
Time to compute 25,000,000 iterations of 32-bit Adler was 6.327
seconds.
There were 946 collisions in 50,000 filenames
------
Time to compute 25,000,000 iterations of 128-bit MD5 was 12.967
seconds.
There were 0 collisions in 50,000 filenames
------
Time to compute 25,000,000 iterations of 160-bit SHA-1 was 23.390
seconds.
There were 0 collisions in 50,000 filenames
------
Time to compute 25,000,000 iterations of 160-bit SHA was 22.690
seconds.
There were 0 collisions in 50,000 filenames
------
Time to compute 25,000,000 iterations of 256-bit SHA-256 was 30.003
seconds.
There were 0 collisions in 50,000 filenames
------
Time to compute 25,000,000 iterations of 384-bit SHA-384 was 30.902
seconds.
There were 0 collisions in 50,000 filenames
------
Time to compute 25,000,000 iterations of 512-bit SHA-512 was 30.538
seconds.
There were 0 collisions in 50,000 filenames
 
Mike Amling

> Here is the latest set of results. These are 100 times faster than in
> the previous cut. I was using 1_000_000d to convert from ns to seconds
> instead of 1_000_000_000d. Serves me right for not using TimeUnit.
>
> 1. These are all very fast. For my purpose, implementing a cache, the
> speed is negligible. I have to compute hashes on only 10,000 to
> 20,000 files.
>
> 2. Sun's code works on a String directly. Most hashing algorithms
> require you to convert the String to a byte[] first, which is probably
> as much work as computing the hash itself.
>
> 3. Any kind of benchmark code, because it does nothing useful, can be
> optimised out of existence by a clever compiler. I took no measures to
> defeat that.
>
> 4. I also discovered that Adler is not so hot for collisions.

Yep.

> ...
>
> Time to compute 25,000,000 iterations of 32-bit String.hashCode was
> 0.393 seconds.
> There were 0 collisions in 50,000 filenames

Did you write out the code that Java uses to compute String.hashCode, or
did you just call s.hashCode()? s.hashCode() could have been precomputed
and cached.

Did you implement using all 16 bits of each char of the String or only
the low-order 8 bits?

Mike Amling
 
Roedy Green

> Did you write out the code that Java uses to compute String.hashCode, or
> did you just call s.hashCode()? s.hashCode() could have been precomputed
> and cached.
>
> Did you implement using all 16 bits of each char of the String or only
> the low-order 8 bits?

I just used String.hashCode. I figured the caching is a bonus that
would come in the real world too.

I wonder if anyone knows what File.hashCode does. It seems to be
implementation dependent. I wondered if it would be safer to use
String.hashCode on the filename converted to lower case and \ to /, so
you get the same hash no matter how you specify minor differences in
the filename.
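That normalisation is easy to sketch (the method name is mine; as Silvio notes later in the thread, lower-casing is only safe where the file system really is case-insensitive):

```java
class FileNameHash
{
    /** Hypothetical normalisation before hashing: lower-case the name
        and use / throughout, so variant spellings of the same Windows
        path hash alike. Only safe on case-insensitive file systems. */
    static int normalizedHash( String filename )
    {
        return filename.toLowerCase().replace( '\\', '/' ).hashCode();
    }

    public static void main( String[] args )
    {
        // Two spellings of the same Windows file get the same hash.
        System.out.println(
            normalizedHash( "E:\\MindProd\\index.html" )
            == normalizedHash( "e:/mindprod/INDEX.HTML" ) );  // prints true
    }
}
```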
 
Roedy Green

> Did you implement using all 16 bits of each char of the String or only
> the low-order 8 bits?

I converted to a byte[] using UTF-8 encoding. So the answer is mu.
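That is, the hash sees UTF-8 bytes rather than Java's 16-bit chars, so the 8-bit-versus-16-bit question dissolves. For example:

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes
{
    public static void main( String[] args )
    {
        // \u00fc (u-umlaut) is one 16-bit char but two UTF-8 bytes.
        String s = "M\u00fcller";
        byte[] bytes = s.getBytes( StandardCharsets.UTF_8 );
        System.out.println( s.length() + " chars, " + bytes.length + " bytes" );
        // prints 6 chars, 7 bytes
    }
}
```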
 
Silvio

> I just used String.hashCode. I figured the caching is a bonus that
> would come in the real world too.
>
> I wonder if anyone knows what File.hashCode does. It seems to be
> implementation dependent. I wondered if it would be safer to use
> String.hashCode on the filename converted to lower case and \ to /, so
> you get the same hash no matter how you specify minor differences in
> the filename.

Don't do that. You cannot convert filenames to lower case, since only
Windows has case-insensitive filenames and treats / as \ in the File class.

Perhaps you could use the canonical name of the file?
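A sketch of that idea: File.getCanonicalPath resolves ., .. and, on case-insensitive file systems, letter case, so equivalent spellings of the same file hash alike. Note it consults the file system and can throw IOException:

```java
import java.io.File;
import java.io.IOException;

class CanonicalHash
{
    public static void main( String[] args ) throws IOException
    {
        // Two spellings that denote the same file...
        File a = new File( "mindprod/../mindprod/index.html" );
        File b = new File( "mindprod/index.html" );

        // ...canonicalise to the same absolute path, hence one hash.
        System.out.println(
            a.getCanonicalPath().hashCode()
            == b.getCanonicalPath().hashCode() );  // prints true
    }
}
```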

Silvio
 
