HashSet performance

zero

In a chess program I am working on I was using a HashSet to store objects
of class Pattern, which contained a HashSet of Pieces. During execution
there can be anywhere between 1 and several hundred thousand patterns in
the Set. I was using Sets because there should be no duplicates of Pieces
in a Pattern, and no duplicate Patterns in the set.

However, the first tests showed that the program was unacceptably slow, and
quickly threw OutOfMemoryErrors. I soon found that the bottleneck was
the Set.add() method. I replaced each occurrence of HashSet with ArrayList,
adding my own code to keep duplicates out. The result is a lot faster (I
can't give an estimate, but trust me it's *a lot*), and uses about a third
of the memory...

My reason for picking Sets was simply the elimination of duplicates, and
the constant time for add() sounded good too. Next time I'll think twice
before using a HashSet.
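
In outline, the replacement looks something like this (a minimal sketch
with illustrative names, not my exact code; it assumes Pattern has a
sensible equals()):

import java.util.ArrayList;

public class PatternStore
{
    private ArrayList<Pattern> patterns = new ArrayList<Pattern>();

    // Adds the pattern only if an equal one is not already present.
    // contains() is a linear scan using equals(), so add() is O(n) --
    // but it skips hashing and HashSet's per-element Entry objects.
    public boolean add(Pattern p)
    {
        if(patterns.contains(p))
            return false;
        return patterns.add(p);
    }
}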
 
John C. Bollinger

zero said:
In a chess program I am working on I was using a HashSet to store objects
of class Pattern, which contained a HashSet of Pieces. During execution
there can be anywhere between 1 and several hundred thousand patterns in
the Set. I was using Sets because there should be no duplicates of Pieces
in a Pattern, and no duplicate Patterns in the set.

However, the first tests showed that the program was unacceptably slow, and
quickly threw OutOfMemoryErrors. I soon found that the bottleneck was
the Set.add() method. I replaced each occurrence of HashSet with ArrayList,
adding my own code to keep duplicates out. The result is a lot faster (I
can't give an estimate, but trust me it's *a lot*), and uses about a third
of the memory...

My reason for picking Sets was simply the elimination of duplicates, and
the constant time for add() sounded good too. Next time I'll think twice
before using a HashSet.

You should always think carefully when choosing what type of object best
suits your purpose.

With HashSets you need to appreciate that every add() requires the hash
code of the new object to be computed, and may involve up to size()
equals() comparisons. If you have objects whose hash codes and equals()
methods are expensive, then storing large numbers of them in a HashSet
may be prohibitively expensive. All Sets' equals() and hashCode()
methods are based on their contents, and they can indeed be expensive
for perverse or numerous elements.

With ArrayLists, on the other hand, neither a hash code nor any equals()
comparisons need to be computed when a new element is added, but you
don't get guaranteed uniqueness. It typically _is_ faster to add
elements to an ArrayList than to a HashSet, but sometimes the guaranteed
uniqueness that comes with a Set is very handy. And HashSet generally
works nicely with elements that are suitable for it: those that have
fast (and good) hashCode() and equals() methods.
 
Roedy Green

If you have objects whose hash codes and equals()
methods are expensive, then storing large numbers of them in a HashSet
may be prohibitively expensive.

hashCode() methods are not necessarily expensive.

For example, Color.hashCode() just returns the int value of the colour.

String.hashCode() is quite expensive, though the result is cached so it is
computed only once per String instance. But if you are not reusing (e.g.
interned) String objects, you will end up computing it afresh for every
lookup.

public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}
 
Mark Thornton

John said:
You should always think carefully when choosing what type of object best
suits your purpose.

With HashSets you need to appreciate that every add() requires the hash
code of the new object to be computed, and may involve up to size()
equals() comparisons.

The problem with large HashSets is not usually the computation of hash
codes, but the memory consumed by each object added to the set. For
each member of the set there is an Entry object (it implements
Map.Entry). So you can end up pushing the limits of memory (or
encountering thrashing) earlier than with more compact Collection
implementations. The performance hit is caused either by excessive
garbage collection (cured by specifying a larger minimum heap size) or
by memory paging activity.
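
A rough way to see the difference for yourself (the readings from
Runtime are only approximate, and the class name and element count here
are made up for illustration):

import java.util.ArrayList;
import java.util.HashSet;

public class CollectionMemoryDemo {
    private static long usedMemory() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // request a collection so the reading is less noisy
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        final int n = 500000;

        long before = usedMemory();
        HashSet<Integer> set = new HashSet<Integer>();
        for (int i = 0; i < n; i++) set.add(new Integer(i));
        System.out.println("HashSet:   ~" + (usedMemory() - before) + " bytes");

        before = usedMemory();
        ArrayList<Integer> list = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) list.add(new Integer(i));
        System.out.println("ArrayList: ~" + (usedMemory() - before) + " bytes");

        // keep both collections reachable so they aren't collected early
        System.out.println(set.size() + " / " + list.size());
    }
}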

Mark Thornton
 
zero

If you have objects whose hash codes and equals()
methods are expensive, then storing large numbers of them in a HashSet
may be prohibitively expensive.

hashCode() methods are not necessarily expensive.

For example, Color.hashCode() just returns the int value of the colour.

String.hashCode() is quite expensive, though the result is cached so it is
computed only once per String instance. But if you are not reusing (e.g.
interned) String objects, you will end up computing it afresh for every
lookup.

public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}

I still use hashCode() and equals() to check if a certain object is
already in the ArrayList. I do this in my own code now, instead of
letting the HashSet do it. However, in the hashCode() implementation I
did change from using "pieces.toString().hashCode()" (where pieces was a
HashSet) to a different algorithm (see code below). So that may be part
of why it's so much faster now. I did not know computing the hashCode
of a String was that expensive.

This is my new hash computation code:

toChar() returns a char, square.getRank() and square.getFile() return
ints.

public abstract class Piece
{
    // ...

    public int hashCode()
    {
        return toChar() + square.getRank() + square.getFile();
    }
}

public class Pattern
{
    private ArrayList<Piece> pieces;

    // ...

    public int hashCode()
    {
        int code = 0;

        for(Piece p : pieces)
            code += p.hashCode();

        return code;
    }
}


I have now run into a different problem: when serializing a HashMap and
then de-serializing it on a subsequent run of the program, some of the
entries are missing. I'll do some testing, and if I can't figure it out
I'll ask you guys for help.
 
Chris Uppal

zero said:
public int hashCode()
{
    return toChar() + square.getRank() + square.getFile();
}

public int hashCode()
{
    int code = 0;

    for(Piece p : pieces)
        code += p.hashCode();

    return code;
}

I think that part of the unexpectedly poor performance of HashSet here is
that the above are very poor hash functions. They will map many naturally
occurring unequal instances of Piece to the same code, and similarly for
Pattern, so your HashSets will have many collisions, and consequently
perform much worse than you might expect.

It is, in general, a bad idea to use addition as the sole basis of a hash
function, unless you know something special about the values that are being
added. Good hash functions tend to be based on one of:

multiplication by a prime plus addition;
bit-shifting the values into non-overlapping (bitwise) ranges;
bit-shifting and XORing;
...

(Just XORing is nearly always a bad idea too, but it can be fine provided that
you bitshift the values first). Some examples (all untested, and almost
certainly suboptimal):

public int hashCode()
{
    return
        (toChar() * 999961)
        + (square.getRank() * 997)
        + square.getFile();
}


public int hashCode()
{
    return
        (toChar() << 16)
        | (square.getRank() << 8)
        | square.getFile();
}


public int hashCode()
{
    return
        (toChar() << 1)
        | (square.getRank() << 1)
        | square.getFile();
}

At a guess, the last is probably a lot worse than the first two in this case,
although the technique is valid in general (especially if you rotate the bits
rather than just allowing them to be shifted "off the end" as in my example).

-- chris
 
Chris Uppal

zero said:
public class Pattern
{
    private ArrayList<Piece> pieces;

    // ...

    public int hashCode()
    {
        int code = 0;

        for(Piece p : pieces)
            code += p.hashCode();

        return code;
    }
}

I've just noticed something else that might be worth mentioning. The
hashCode() is dependent on the ordering of the Pieces. That might be
deliberate, but it seems a bit suspicious to me. If your equals() method is
/not/ dependent on the order, then the combination will break (and may explain
your serialisation problem).

-- chris
 
zero

I've just noticed something else that might be worth mentioning. The
hashCode() is dependent on the ordering of the Pieces. That might be
deliberate, but it seems a bit suspicious to me. If your equals()
method is /not/ dependent on the order, then the combination will
break (and may explain your serialisation problem).

-- chris

How is it dependent on the ordering? It's a simple sum, and addition is
commutative.
 
zero

I think that part of the unexpectedly poor performance of HashSet
here is that the above are very poor hash functions. They will map
many naturally occurring unequal instances of Piece to the same code,
and similarly for Pattern, so your HashSets will have many collisions,
and consequently perform much worse than you might expect.

Actually the bad HashSet performance didn't use this code. But you're
right about the collisions, I didn't think of that.
It is, in general, a bad idea to use addition as the sole basis of a
hash function, unless you know something special about the values that
are being added. Good hash functions tend to be based on one of:

multiplication by a prime plus addition;
bit-shifting the values into non-overlapping (bitwise) ranges;
bit-shifting and XORing;
...

(Just XORing is nearly always a bad idea too, but it can be fine
provided that you bitshift the values first). Some examples (all
untested, and almost certainly suboptimal):

public int hashCode()
{
    return
        (toChar() * 999961)
        + (square.getRank() * 997)
        + square.getFile();
}


public int hashCode()
{
    return
        (toChar() << 16)
        | (square.getRank() << 8)
        | square.getFile();
}


public int hashCode()
{
    return
        (toChar() << 1)
        | (square.getRank() << 1)
        | square.getFile();
}

At a guess, the last is probably a lot worse than the first two in
this case, although the technique is valid in general (especially if
you rotate the bits rather than just allowing them to be shifted "off
the end" as in my example).

-- chris

Thanks for the suggestions, I'll check these out and do some testing.
 
Chris Uppal

zero said:
How is it dependent on the ordering? It's a sum of all positive numbers.

Sorry, my mistake. I had mentally translated it into one of the other forms I
mentioned, which would be order dependent.

(Coughs, and changes the subject hurriedly...)

Incidentally, from the code you posted, it looks to me as if you can only have
2**11 = 2048 possible distinct Pieces. If so then it would probably be worth
investigating creating them all upfront, and then only ever passing around
references to those instances. A sort of 2048-way Singleton pattern, very like
the way that Enums are implemented. Of course, you may already be doing that,
but if not then it should save a lot of space and also allow you to compare
them using == and the default (identity) hashCode(). The downside is that
moving Pieces and [de]serialisation would be more complicated.
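
Something along these lines (untested, like my other examples;
PieceCache and PieceFactory.create() are made-up names standing in for
however you construct Pieces now):

import java.util.HashMap;

public final class PieceCache
{
    private static final HashMap<Integer, Piece> CACHE =
        new HashMap<Integer, Piece>();

    // Returns the shared Piece for this type/rank/file combination,
    // creating it on first request. Thereafter == comparisons and the
    // default identity hashCode() are safe.
    public static synchronized Piece get(char type, int rank, int file)
    {
        Integer key = new Integer((type << 8) | (rank << 4) | file);
        Piece p = CACHE.get(key);
        if(p == null)
        {
            p = PieceFactory.create(type, rank, file); // hypothetical factory
            CACHE.put(key, p);
        }
        return p;
    }
}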

Also, by giving each Piece a unique 11-bit index, you could then probably
devise significantly better (in space and/or time, possibly both) internal
representations of a Pattern than an ArrayList of Pieces.
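
For instance (again untested; index() is an assumed accessor returning
each Piece's unique 11-bit index):

import java.util.BitSet;

public class BitPattern
{
    // One bit per possible Piece; equals() and hashCode() then come
    // straight from BitSet's contents, with no Iterator allocation.
    private final BitSet pieces = new BitSet(2048);

    public void add(Piece p)
    {
        pieces.set(p.index());
    }

    public boolean contains(Piece p)
    {
        return pieces.get(p.index());
    }

    public boolean equals(Object o)
    {
        return o instanceof BitPattern
            && pieces.equals(((BitPattern) o).pieces);
    }

    public int hashCode()
    {
        return pieces.hashCode();
    }
}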

-- chris
 
Roedy Green

return toChar() + square.getRank() + square.getFile();

You could make that slightly better distributed with:
return toChar() ^ square.getRank() ^ square.getFile();

You could make it even better distributed with:
return (toChar() << 8) | (square.getRank() << 4) | square.getFile();

Both of these are still pretty quick.
 
Roedy Green

Actually the bad HashSet performance didn't use this code. But you're
right about the collisions, I didn't think of that.

The only thing a hashCode has to do, beyond being consistent with
equals(), is avoid collisions as best it can. Other than speed, there
is nothing else to think about.
 
zero

Incidentally, from the code you posted, it looks to me as if you can
only have 2**11 = 2048 possible distinct Pieces. If so then it would
probably be worth investigating creating them all upfront, and then
only ever passing around references to those instances. A sort of
2048-way Singleton pattern, very like the way that Enums are
implemented. Of course, you may already be doing that, but if not
then it should save a lot of space and also allow you to compare them
using == and the default (identity) hashCode(). The downside is that
moving Pieces and [de]serialisation would be more complicated.

Also, by giving each Piece a unique 11-bit index, you could then
probably devise significantly better (in space and/or time, possibly
both) internal representations of a Pattern than an ArrayList of
Pieces.

-- chris

Since it took me quite some time to program all the move rules, I think
I'll keep it as is - at least for now. First I want to get the program
working correctly. Once it's up and running I'll see about improving
performance.

btw I believe I have found the serialization problem I was having, and as
usual it was a stupid (*hitting myself on the head*) mistake. Some Pieces
were being changed after the Pattern they were in was added to the HashMap.
This of course changed the hashCode of the Pattern, leaving some Patterns
stored in buckets that no longer matched their current hashCode.
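
In miniature, the mistake looks like this (MutableKey is a made-up
class just for illustration; the real culprit was Pattern):

import java.util.HashSet;

public class MutableKeyDemo
{
    static class MutableKey
    {
        int value;

        MutableKey(int value)
        {
            this.value = value;
        }

        public int hashCode()
        {
            return value;
        }

        public boolean equals(Object o)
        {
            return o instanceof MutableKey && ((MutableKey)o).value == value;
        }
    }

    public static void main(String[] args)
    {
        HashSet<MutableKey> set = new HashSet<MutableKey>();
        MutableKey k = new MutableKey(1);
        set.add(k);
        k.value = 2;                         // mutated after insertion
        System.out.println(set.contains(k)); // false: wrong bucket
    }
}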

I think I'll go crawl in some dark hole until the shame subsides...
 
Thomas Hawtin

You are allocating an Iterator every time this method is called.
Allocation isn't particularly expensive, but neither is it anywhere near
free. You might find it helps to set -Xms to the same value as -Xmx.

Combine this with the potential collision problems, and we have real
problems.
public int hashCode()
{
    return
        (toChar() << 16)
        | (square.getRank() << 8)
        | square.getFile();
}

You will probably find that the performance of this hash function is
highly JRE implementation dependent. IIRC, Sun's 1.4.1 HashMap/HashSet
required a good spread of values in the least significant bits.

The old trick of multiplying by a small but not too simple prime and
accumulating works for me (though it isn't so good if you have two
classes that do the same calculation on the same data).

private int hash;

@Override
public int hashCode() {
    int hash = this.hash;
    if (hash == 0) {
        hash = 1; // Possibly all fields are 0...
        hash = hash*37 + charValue;
        hash = hash*37 + rank;
        hash = hash*37 + file;
        this.hash = hash;
    }
    return hash;
}

Tom Hawtin
 
zero

The old trick of multiplying by a small but not too simple prime and
accumulating works for me (though it isn't so good if you have two
classes that do the same calculation on the same data).

private int hash;

@Override
public int hashCode() {
    int hash = this.hash;
    if (hash == 0) {
        hash = 1; // Possibly all fields are 0...
        hash = hash*37 + charValue;
        hash = hash*37 + rank;
        hash = hash*37 + file;
        this.hash = hash;
    }
    return hash;
}

Tom Hawtin

I currently have

in Piece:
public int hashCode()
{
    if(hash == 0)
        hash = (toChar()*19) + (square.getRank()*37) + square.getFile();
    return hash;
}

in Pattern:
public int hashCode()
{
    if(hash == 0)
    {
        for(Piece p : pieces)
            hash += p.hashCode();
    }

    return hash;
}

Given that there are only 12 possible values for toChar, 8 for getRank,
and 8 for getFile, I think the Piece hashCode is unique, and all
possible hash values fall in a small enough range.

For Pattern an Iterator is being created, but I think it's acceptable -
especially with caching the code. Of course the code has to be reset to
0 each time the Piece or Pattern changes, but that's still better than
re-calculating it every time it's needed.
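
The reset itself is trivial; any mutator that touches state used by
hashCode() just clears the cached value. Something like this (setSquare()
and the Square type are assumed names):

public void setSquare(Square square)
{
    this.square = square;
    hash = 0; // force recomputation on the next hashCode() call
}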

Thanks everyone for the suggestions, and if you can think of any more
don't hesitate to let me know :)

zero
 
