hashCode() for Custom classes

sasuke

Hello to all Java programmers out there.

I am currently faced with the task of providing a logical equals()
method for my domain / business classes. That job being done, I now
have to override hashCode() so that when an object of such a class
is used as a key in a Map, the Map's behavior is well defined.

I have thought of a couple of ways to do this, but would like some
input on my thoughts:

public class Agent {
    String name;
    String code;
    Address addr;

    /**
     * Approach 1: the hash code of Agent is the combined
     * hash code of its constituent elements.
     */
    public int hashCode() {
        return name.hashCode() +
               code.hashCode() +
               addr.hashCode();
    }

    /**
     * Approach 2 (would replace the method above): the hash
     * code of Agent is the hash code of the string
     * representation of its constituent elements. This is
     * guaranteed to be unique, since the string representation
     * of no two objects is the same unless they actually are
     * equal. This approach might also benefit from the fact
     * that Strings in Java are pooled, so a hashCode string
     * once calculated for an object will remain in the string
     * pool for its lifetime.
     */
    public int hashCode() {
        return (name + code + addr).hashCode();
    }
}

Which approach would be better: the one using nested calls to
hashCode() on the constituent elements, or the String approach?
Also, how would I prove whether the hashCode() algorithm I have
chosen will generate unique hash codes in all cases?

Any kind of input / feedback / links / suggestions would be greatly
appreciated.

Thanks and regards,
~/sasuke
 

Kevin McMurtrie

Lew said:
I thought the canonical multiplier for 32-bit hashes was 31.

I'll google around and get back on that if I find anything.

The idea is to shift data into the higher bits quickly while at the
same time continually remixing the old data. Either 31 or 37 should
work well in most uses; just about any moderately sized odd number
can be good enough. If the inputs are already hashed well, just XOR
them all.

For higher-powered hashing you can blend the data into the linear
congruential pseudorandom number generator cycle seen in Random.java.
In my experience, that magic multiplier 0x5DEECE66DL makes for very
even hashing even when the input data is full of patterns. Even
stronger would be a SHA-1 hash. Whether or not all that extra math is
better than a few hash collisions depends on how fast the equals()
method is.
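
For concreteness, here is a minimal sketch of the conventional
multiply-by-31 combination being discussed, applied to the fields of
sasuke's Agent class (the non-zero seed 17 is a common convention,
not something from this thread):

public int hashCode() {
    int h = 17;                    // arbitrary non-zero seed
    h = 31 * h + name.hashCode();  // multiply shifts old data up,
    h = 31 * h + code.hashCode();  // the add mixes in the new field
    h = 31 * h + addr.hashCode();
    return h;
}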
 

Thomas Fritsch

Eric said:
sasuke wrote: [...]
public class Agent {
    String name;
    String code;
    Address addr; [...]

    /**
     * The hash code of Agent is the hash code of the string
     * representation of its constituent elements. This is
     * guaranteed to be unique, since the string representation
     * of no two objects is the same unless they actually are
     * equal. This approach might also benefit from the fact
     * that Strings in Java are pooled, so a hashCode string
     * once calculated for an object will remain in the string
     * pool for its lifetime.
     */
    public int hashCode() {
        return (name + code + addr).hashCode();
    }
}

This is very bad, because it assumes that the toString()
method of the Address class produces nothing but characters
that are involved in its equals() method. Or, turning it
around, it assumes that toString() produces no characters
that are influenced by anything equals() doesn't consider.
(Also, it means that either Address or its toString() must
be final so they can't be overridden in a subclass that might
violate the assumption.)
This is also bad because
(name + code + addr)
tends to produce many temporary String objects, thus possibly costing
a lot of performance in object creation and garbage collection.
Remember that hashCode() should be designed to execute fast, especially
faster than equals(). This is because, if collisions are rare, HashMaps
will call hashCode() much more often than equals().
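
(For reference, here is roughly what the concatenation desugars to
on classic javac versions; the exact shape varies by compiler:)

// What (name + code + addr) compiles to, more or less:
new StringBuilder().append(name)
                   .append(code)
                   .append(addr)
                   .toString();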
 

Andreas Leitgeb

Thomas Fritsch said:
This is also bad because
(name + code + addr)
tends to produce many temporary String objects,

One new StringBuilder and one new String-object.
Remember that hashCode() should be designed to execute fast, especially
faster than equals().

Each HashMap access takes one hashCode() evaluation and
at least one, but possibly more, equals() calls. Even if
each hash bucket holds only one object (i.e. collision-free),
the map must still make sure that it is the right object
before returning it.
 

Lasse Reichstein Nielsen

Thomas Fritsch said:
Remember that hashCode() should be designed to execute fast, especially
faster than equals().

It doesn't have to. The point of a hash code is not that it speeds up
equality comparison (it doesn't; equality takes the time equality
takes, hash code or not).
Rather, it allows you to reduce the number of candidates for an
equality test.
This is *not* the same as just checking the hash code as the first
part of the equality test.

A hash code is *numeric*, which allows numeric computations that
you couldn't do on a full object, e.g., picking a bucket for a
hash table. Being numeric is important. Being faster than equality
isn't.
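
To make the "numeric" point concrete: a hash table can turn a hash
code into a bucket index with plain arithmetic. A minimal sketch
(hypothetical table array, not HashMap's actual code):

// Pick a bucket; masking off the sign bit keeps the index
// non-negative.
int index = (key.hashCode() & 0x7FFFFFFF) % buckets.length;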

If the only hashCode() implementation you can find takes twice as long
as an equality check, you would still want it, so that you can use a
HashMap. No matter how fast plain equality is, it alone won't let you
find an element faster than brute force search.
This is because, if collisions are rare, HashMaps
will call hashCode() much more often than equals().

If the hash table is sparse, then yes; otherwise, no.

Each lookup in a HashMap computes one hash-code (of the lookup
key) and then does equality comparison on each element in the
corresponding bucket (although that might start out by comparing
pre-computed hash codes).

/L
 

Patricia Shanahan

Peter said:
I admit...I don't get his reasoning.

First of all, 0 is a very unusual hash code to return. I suppose it's
always possible, but unless you're dealing with some type that is
32-bits or less and just returns its own value as the hash code, it
doesn't seem likely.

Integer has only 32 bits of state, returns its own value as the hash
code, and zero is not an unusual value for an integer.
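
Concretely:

// Integer's hashCode() is simply its own value, so zero is a
// perfectly ordinary result:
assert Integer.valueOf(0).hashCode() == 0;
assert Integer.valueOf(42).hashCode() == 42;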

Patricia
 

Mark Thornton

Patricia said:
Integer has only 32 bits of state, returns its own value as the hash
code, and zero is not an unusual value for an integer.

Patricia

Zero is a very reasonable value to return for something like an empty
set. I have quite a few classes which return zero if they represent null
operations.

Mark Thornton
 

Lasse Reichstein Nielsen

Peter Duniho said:
If I'm wrong about my reasoning, I want to know about it.

I don't think so. I was about to say something myself when I saw
that you had already covered it.

The argument:
"If zero was used as the initial value in step 1, the overall hash
value would be unaffected by any such initial fields, which could
increase collisions"
is bogus. Collisions are independent of initial value and zero-fields,
at least as long as we are only comparing hashCodes from one class.

If we have several similar classes that are going to be compared
against each other, I would probably pick a separate seed for each
class, so that instances of different classes with similar content
didn't collide as easily.
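
A minimal sketch of that idea, with hypothetical wrapper classes:
give each class its own seed so that instances with equal content
hash differently.

// The distinct seeds (17 vs. 19) keep a Name and a Code that wrap
// the same string from colliding. (equals() omitted for brevity.)
final class Name {
    private final String value;
    Name(String value) { this.value = value; }
    public int hashCode() { return 31 * 17 + value.hashCode(); }
}
final class Code {
    private final String value;
    Code(String value) { this.value = value; }
    public int hashCode() { return 31 * 19 + value.hashCode(); }
}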

/L
 

Stefan Ram

Peter Duniho said:
0 is a very unusual hash code

0 is as usual or unusual as 443939208.
to return.

If 0 were indeed unusual as a hash code, it should not matter
whether it is »returned«, »used«, »calculated«, or something else.
 

Mark Space

Eric said:
public int hashCode() {
    return 2;
}

Well, except this obviously produces a rather large number of
collisions, so in fact it's markedly worse than the abracadabra
version.
 

Mark Space

I'd vote for most of the comments on this thread, which are
overwhelmingly similar. One trick I did not yet see mentioned is
caching the hash code:

public class Agent {
    String name;
    String code;
    Address addr;

    int hashcode;   // 0 means "not yet computed"

    /**
     * The hash code of Agent is the combined hash code of its
     * constituent elements, computed lazily and then cached.
     * (If the combined hash happens to be 0, it is simply
     * recomputed on each call.)
     */
    public int hashCode() {
        if (this.hashcode == 0) {
            this.hashcode = name.hashCode() +
                            code.hashCode() +
                            addr.hashCode();
        }
        return this.hashcode;
    }
}

As others have mentioned, you should probably XOR these values
together instead of adding them.
 

Stefan Ram

Patricia Shanahan said:
More generally, it seems unlikely to me that hash codes are
uniformly distributed over all the int values.

There are (at least) two points of view:

The conceptual point of view merely defines the requirements
for a hash function. It does not refer to a specific programming
language or to specific classes, nor does it make special
assertions about any specific hash value.

Another point of view can study how a hash function is
implemented by the operation

http://download.java.net/jdk7/docs/api/java/lang/Object.html#hashCode()

in java.lang.Object and in classes extending java.lang.Object
of the Java SE standard library.

The second point of view might find values close to zero to be
overrepresented, using a certain set of programs to obtain the
frequency data.

A similar phenomenon in the real world is

http://en.wikipedia.org/wiki/Benford's_law

Usually, one will prefer a hash function with an even
distribution of hash values for a typical distribution
of data values or a typical set of objects.

There should not be a bias regarding special values.

So if one already has a good distribution, suppressing
certain values that are deemed »magical« will make the
distribution worse, just as a good random number generator
would be made worse if someone added code to suppress 0 as a
result because he does not deem 0 »to be random«.

Funny enough, some people consider 17 to be the most
random number.

http://consc.net/notes/pick-a-number.html
http://google.to/search?q=cache:web.media.mit.edu/~guy/blog/entry.php?24110401

Of course, »randomness« is not a property of a single number,
but of a distribution, and there are no »good hash values«
or »bad hash values« - only »good hash functions« and
»bad hash functions«.
 

Roedy Green

Mark Space said:
As others have mentioned, you should probably XOR these values
together instead of adding them.

I use XOR primarily as a sort of documentation that I am actually just
combining hashcodes.

I wonder whether anyone has actually worked out whether XOR has any
advantage. I know XOR is at least as good as plus, but I am not sure
it is actually better. In theory it could be faster to compute, since
there are no carries, but in practice all modern machines handle both
in one cycle.
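
For concreteness, the XOR combination under discussion, applied to
sasuke's Agent fields:

// Combine already-well-hashed fields with XOR; there are no
// carries, and the order of the fields doesn't matter.
public int hashCode() {
    return name.hashCode() ^ code.hashCode() ^ addr.hashCode();
}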
 

Roedy Green

What about cycling through a series of multiplications by a prime (e.g., 37),
as others have suggested?

XOR is going to be much faster than multiply. Multiply, though,
smears values that have only their low-order bits set across the
entire 32 bits.

I discuss some of the pros and cons of XOR, (but not comparing with
pure add) at http://mindprod.com/jgloss/hashcode.html
 

Roedy Green

Another point is that hashes are explicitly ints, not bitmaps. It's more
natural and self-documenting to use integer operations on them.

On the other hand, hashCodes are just patterns, not quantities, so
you could also argue it makes sense to use the integer
bit-manipulation operators on them.

I was wondering if you get any better spread with + or ^ under some
circumstances.

Assume your numbers were all multiples of 4. With +, the hash code
would always end in two zero bits, and with XOR it would as well; in
neither case do you get those lower two bits nicely scrambled.
 

Mark Space

Patricia said:
I don't think it is the only thing that matters. In order for hash
tables to work, hash codes need to lead to a reasonably uniform
distribution of objects among buckets.

Right. Hashcodes are not just random patterns. They are values. They
are indexes into a table and therefore integers.

They need to be random primarily in their lower-order bits, because
most hash tables just use modulo arithmetic to clip the hash code to
the table size.

However, they need to be random in all bits, because you don't know
the table size ahead of time (other than that it is, hopefully,
smaller than the range of an int).
 

Mark Space

Logan said:
One thing nobody else has mentioned so far (at least I think not -- this
thread is long and I may have missed it) is that when combining, sometimes
it may be best to leave out the hash code of some of the members.
This could be particularly true if, say, instances of D tend to
contain a really long string.

I'm pretty sure this is false. Early versions of Java's hashCode()
for Strings didn't use the whole string because of performance
concerns. It turns out that this leads to lots of collisions and poor
hashing when the parts you chose to leave out are the parts that
actually differ.

So the lesson was to eat the performance cost and hash the whole
object.

I gave an example up-thread how to compute the hash code once and then
just return the computed hash code. I think that's the better method if
performance is a concern.
 

Roedy Green

Patricia said:
I don't think it is the only thing that matters. In order for hash
tables to work, hash codes need to lead to a reasonably uniform
distribution of objects among buckets.

Consider that HashMap just truncates the high-order bits, so you
want maximal variation in the low-order bits of the hashCode.

I suppose we might learn something from some simulations using
different sorts of hashCode algorithms.
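
(For context: Sun's HashMap already compensates for low-quality low
bits by applying a supplemental hash before indexing. Paraphrasing
java.util.HashMap in JDK 6:)

// Spread higher-order bits downward, then mask to the
// power-of-two table length.
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
static int indexFor(int h, int length) {
    return h & (length - 1);
}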
 

Roedy Green

Mark Space said:
Early versions of Java's hashCode() for
Strings didn't use the whole string because of performance concerns.

In JDK 1.0.x and 1.1.x the hashCode function for long Strings worked
by sampling every nth character. This pretty much guaranteed that
many Strings would hash to the same value, thus slowing down
Hashtable lookup. In JDK 1.2 the function was improved to multiply
the result so far by 31 and then add the next character in sequence.
This is a little slower, but much better at avoiding collisions.
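
For reference, the improved JDK 1.2+ algorithm Roedy describes,
written out as a standalone method (equivalent to what
java.lang.String.hashCode() computes):

// h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
static int stringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);
    }
    return h;
}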
 
