HashMap vs linear table lookup

R

Roedy Green

For sufficiently small tables, a linear array lookup to find a string
would be faster than a HashMap. Has anyone experimented to get a rough
idea where the break-even point is?
 
M

Mark Space

Roedy said:
For sufficiently small tables, a linear array lookup to find a string
would be faster than a HashMap. Has anyone experimented to get a rough
idea where the break-even point is?


If you try to measure this, I think it will depend on the length of the
Strings. The hashing algorithm for Strings used to only pick a few
characters to "save time." This didn't work well for large hash tables
since many similar strings would have the letters chosen in common,
resulting in poor hash distribution. So now I think every letter in a
String is used when computing the hash.

So anyway, yeah, don't forget about string size.
 
C

Christian

Roedy said:
For sufficiently small tables, a linear array lookup to find a string
would be faster than a HashMap. Has anyone experimented to get a rough
idea where the break-even point is?
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

well Intresting.. hope you accept this as valid test:

public class TestString {

/**
* @param args
*/
public static void main(String[] args) {

for (int size=1;size < 100; size++) {
if (testHashMap(size) <= testArray(size)) {
System.out.println("hashmap wins at: "+size);
}
}


}


private static long testHashMap(int size) {
Set<String> map = new HashSet<String>();
for (int i=0 ; i < size; i++) {
map.add(i+"+"+(-i)+"= 0");
}

long time = System.nanoTime();
for (int repetitions=0; repetitions < 10000; repetitions++ ) {
for (int i=0;i < size; i++) {
if (!map.contains(i+"+"+(-i)+"= 0")) {
throw new IllegalStateException();
}
}
}
return System.nanoTime()-time;
}

private static long testArray(int size) {
String[] array = new String[size];
for (int i=0 ; i < size; i++) {
array =i+"+"+(-i)+"= 0";
}

long time = System.nanoTime();
for (int repetitions=0; repetitions < 10000; repetitions++ ) {
for (int i=0;i < size; i++) {
String onSearch = i+"+"+(-i)+"= 0";
for (int x=0; x < size; x++) {
if (onSearch.equals(array[x])) {
break;
}
}
}
}
return System.nanoTime()-time;
}
}

on my machine with 5 elements it seems about even .. sometimes the
HashSet wins .. and sometimes the array.. after that HashSet is the winner.

Christian
 
R

Roedy Green

For sufficiently small tables, a linear array lookup to find a string
would be faster than a HashMap. Has anyone experimented to get a rough
idea where the break-even point is?
--

I wrote this code to experiment. Turns out binary search is the pits.
HashSet starts to get better around 11 items.
 
R

Roedy Green

I wrote this code to experiment. Turns out binary search is the pits.
HashSet starts to get better around 11 items.

The sample keys are nicely scrambled to make HashSet happier than it
deserves to be.


package com.mindprod.example;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;

/*
* Test relative speed of 3 methods of looking up a key to see if it is
is the list of legal keys.
* HashSet, binary search, linear search for different numbers of keys
n.
* <p/>
* composed with IntelliJ IDEA
*
* @author Roedy Green, Canadian Mind Products
* @version 1.0, 2008-02-15
*/
public class TestFinding
{
// ------------------------------ FIELDS
------------------------------

/**
* lookup the random strings
*/
private static HashSet<String> h;

/**
* random number generator
*/
private final static Random wheel = new Random();

/**
* list of strings we want to look up
*/
private static String[] keys;

// -------------------------- STATIC METHODS
--------------------------

/**
* build lookup structure
*/
private static void build()
{
// build HashSet
h = new HashSet<String>( Arrays.asList( keys ) );

// build binary search
Arrays.sort( keys );
}

/**
* find all the keys by BinarySearch
*/
private static void findAllViaBinarySearch()
{
for ( String lookFor : keys )
{
int whereFound = Arrays.binarySearch( keys, lookFor );
// -ve means not found +ve tells where found.
}
}

/**
* find all the keys by HashSet
*/
private static void findAllViaHashSet()
{
for ( String lookFor : keys )
{
boolean found = h.contains( lookFor );
}
}

/**
* find all the keys by linear search
*/
private static void findAllViaLinearSearch()
{
for ( String lookFor : keys )
{
boolean found = false;
for ( String candidate : keys )
{
if ( lookFor.equals( candidate ) )
{
found = true;
break;
}
}
}
}

/**
* generate N random keys and their lookup structure
*/
private static void generate( int n )
{
keys = new String[n];
for ( int i = 0; i < n; i++ )
{
keys[ i ] = Long.toString( wheel.nextLong() );
}
}

// --------------------------- main() method
---------------------------

/**
* main method
*
* @param args not used
*/
public static void main( String[] args )
{

for ( int n = 1; n <= 10000; n *= 10 )
{
generate( n );
build();
final long t1 = System.nanoTime();
findAllViaBinarySearch();
final long t2 = System.nanoTime();
findAllViaHashSet();
final long t3 = System.nanoTime();
findAllViaLinearSearch();
final long t4 = System.nanoTime();

System.out
.println( "n:" +
n +
" binary search: " +
( t2 - t1 ) +
" HashSet: " +
( t3 - t2 ) +
" Linear: " +
( t4 - t3 ) );
}
// Here in output on my machine With JDK 1.6.0_04
//Linear is best for 10 or fewer, then HashSet.
// Binary search is never optimal.
// n:1 binary search: 21600 HashSet: 9720 Linear: 6400*
// n:10 binary search: 58720 HashSet: 11360 Linear: 11120*
// n:100 binary search: 707320 HashSet: 97520* Linear: 695960
// n:1000 binary search: 1294360 HashSet: 1255480* Linear:
10911040
// n:10000 binary search: 9823600 HashSet: 3920200* Linear:
1513846600

// Here in output on my machine with Jet
// n:1 binary search: 5520 HashSet: 4760 Linear: 1240*
// n:10 binary search: 3560 HashSet: 2480 Linear: 2400*
// n:100 binary search: 28160 HashSet: 5480* Linear: 44840
// n:1000 binary search: 408840 HashSet: 43560* Linear:
7862960
// n:10000 binary search: 9142440 HashSet: 2295320* Linear:
1852690520
}
}
 
L

Lord Zoltar

If you try to measure this, I think it will depend on the length of the
Strings.  The hashing algorithm for Strings used to only pick a few
characters to "save time."  This didn't work well for large hash tables
since many similar strings would have the letters chosen in common,
resulting in poor hash distribution.  So now I think every letter in a
String is used when computing the hash.

So anyway, yeah, don't forget about string size.

It would probably also be possible to write your own hash function, if
you know more about the input (such as avg length of strings, if the
strings only use a certain subset of all characters, etc...). Doing
that might provide better performance, in certain cases. You'd just
have to try it to see ;)
 
P

Patricia Shanahan

Roedy Green wrote:
....
The sample keys are nicely scrambled to make HashSet happier than it
deserves to be.
....

Although the approximate break-even point is interesting, if I had a
performance problem involving this sort of lookup I would do my own
experiments, using typical table contents and probe sequences from my
application.

Patricia
 
S

Stefan Ram

Lord Zoltar said:
It would probably also be possible to write your own hash function,

Here is an example, using a lookup indexed by the final
two letters of the month TLA. (This might actually be a
perfect hash, I am not sure about this.)

A small table might be used instead of the »switch« in »show«.
This is no hash map, but a direct array look up by a special
hash function, I believe O(1).

public class Main
{
int[] v;

public Main()
{ v = new int[ 256 ]; for( int i = 0; i < 256; ++i )v[ i ]=15;
v[ 97 ]= 0; v[ 98 ]= 9; v[ 99 ]= 4; v[ 101 ]= 2; v[ 103 ]= 9;
v[ 108 ]= 8; v[ 110 ]= 2; v[ 111 ]= 4; v[ 112 ]= 3; v[ 114 ]= 1;
v[ 116 ]= 3; v[ 117 ]= 1; v[ 118 ]= 4; v[ 121 ]= 0; }

int hash( final java.lang.String s )
{ return v[ s.charAt( 1 )]+v[ s.charAt( 2 )] ; }

void show( final java.lang.String s )
{ switch ( hash( s ))
{ case 4: case 3: case 5: case 8:
java.lang.System.out.println( 30 ); break;
case 11: break;
default: java.lang.System.out.println( 31 ); break; }}

public static void main( final java.lang.String[] args )
{ new Main().show( "Jun" ); new Main().show( "Oct" ); }}

It prints:

30
31
 
M

Mike Schilling

Roedy said:
For sufficiently small tables, a linear array lookup to find a
string
would be faster than a HashMap. Has anyone experimented to get a
rough
idea where the break-even point is?

I presume that it's small enough that, in all nromal cases, they're
 
N

none

Roedy said:
For sufficiently small tables, a linear array lookup to find a string
would be faster than a HashMap. Has anyone experimented to get a rough
idea where the break-even point is?
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

HashSet have to compute the hashCode of all Strings on the whole
content, but hashCode is compute only once per String.
String.equals first use length and then each caracter in order to return
result. You have to do it each time in linear comparison.

Basically,
if n is the total number of strings
p is the avg length,
hashset is O(n*p) to O(pn²), worse case if hash function is poor.
linear is O(pn²) to O(n²), best case if strings are very different
lengths or beginings.

Hence you have to profile your dataset (avg length, number of unique,
similarity, etc.) to know the break point.
 
L

Lew

none said:
HashSet have to compute the hashCode of all Strings on the whole
content, but hashCode is compute only once per String.
String.equals first use length and then each caracter in order to return
result. You have to do it each time in linear comparison.

The HashSet has to compute the hash exactly once for each entry when it's
inserted, as you seem to say.
if n is the total number of strings
p is the avg length,
hashset [sic] is O(n*p) to O(pn²), worse case if hash function is poor.

The String hashCode() computation should be reasonable for most purposes. But
even if the hash were worst-case (e.g., returned 17 for every String), there'd
still only be 'n' comparisons when doing a lookup.

And in big-O notation the 'p' factor goes away.
linear is O(pn²) to O(n²), best case if strings are very different
lengths or beginings.

How are you getting n squared? A linear lookup only needs to make n
comparisons to find a single input.
Hence you have to profile your dataset (avg length, number of unique,
similarity, etc.) to know the break point.

On lookup, i.e., to compare if a given input is already in the Set, you only
compute the hash for the input, not for the elements in the Set already. The
range for HashSet lookups should be O(1) to O(n) depending on the hash
quality. (O(n) occurring when all entries hash to the same value, which
devolves the HashSet into the linear lookup.)
 
R

Roedy Green

HashSet have to compute the hashCode of all Strings on the whole
content, but hashCode is compute only once per String.

So you will compute it to put the strings in the HashSet, and you will
compute it for each key to look them up. You don't need to compute it
when you chase a HashSet chain to compare colliding strings. They are
pre-computed.
 
B

Boris Stumm

Roedy said:
keys[ i ] = Long.toString( wheel.nextLong() );

What about intern'ed strings, i.e. Long.toString().intern()?

If string creation is rare and lookup is often needed, one
would probably do this, and then an .equals() would nearly
behave like an ==, which should much faster for the linear
case.
 
E

EJP

HashSet have to compute the hashCode of all Strings on the whole
So you will compute it to put the strings in the HashSet, and you will
compute it for each key to look them up.
As the poster said, the hashCode for a String is computed once per
String. It is not recomputed every time you call String.hashCode().
 
L

Lew

EJP said:
As the poster said, the hashCode for a String is computed once per
String. It is not recomputed every time you call String.hashCode().

Are you sure of that? The Javadocs don't mention that.

However, it is true that the Map does not invoke hashCode() but once for each
key, upon insertion, and once for the query object upon lookup.

P.S., please attribute your citations.
 
T

Thomas Schodt

Lew said:
Are you sure of that? The Javadocs don't mention that.

I guess java implementations are not required to
but the sun implementation caches the hash
(kinda changing the internal state of an immutable object...)

It is only recomputed every time for the empty String and for any other
String that has a hash of 0(zero).
 
L

Lew

Thomas said:
I guess java implementations are not required to
but the sun implementation caches the hash
(kinda changing the internal state of an immutable object...)

Actually, the immutability of an object only applies to its external state.
Lazy instantiation of things like hashCode() or (for classes other than
String) the toString() result is a standard idiom. One hopes that the String
implementation of hashCode() is suitably thread-safe, otherwise they are
changing the external state.
 
M

Mike Schilling

Lew said:
Actually, the immutability of an object only applies to its external
state. Lazy instantiation of things like hashCode() or (for classes
other than String) the toString() result is a standard idiom. One
hopes that the String implementation of hashCode() is suitably
thread-safe, otherwise they are changing the external state.

What would "suitably thread-safe" require here? The external behavior
is that all calls to hashCode() should return the same value. I think
this algorithm works fine:

1. define a private int _hashCode;
2. on a call to hashCode(), check the value of _hashCode. If it's
non-zero, return it.
3. otherwise, calculate the hash code of the string.
4. store this value in _hashcode
5. return _hashcode

The only thread-safety issue is that the store of _hashcode in step 4
be atomic, and I think that that's guaranteed by the JVM.. You could
minimize the number of hash code calculations by locking between steps
2 and 4, ensuring that all threads will see the changed value of
_hashcode once it's ready, but that's merely an optimization.
 
L

Lasse Reichstein Nielsen

As the poster said, the hashCode for a String is computed once per
String. It is not recomputed every time you call String.hashCode().

Even if they are computed every time, you don't need to know the hashCode
of elements in the HashSet to find them.

When you add an element to a HashSet/HashMap, the hash code is used
once, to pick a bucket, and the element is put into the linked list in
that bucket.

To look up an element, you use the hash code of the key/element to
search for, find a bucket based on that, and traverse the linked list
to find one that .equals() what you search for.

If the HashSet/HashMap grows its internal table, it needs the hash
codes again to put elements into new buckets.


The implementation of String in the Sun standard library does
cache the hashCode (unless it happens to be 0, which they take
as an uninitialized cache - but that's documented :).

/L
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,763
Messages
2,569,562
Members
45,038
Latest member
OrderProperKetocapsules

Latest Threads

Top