Why, in class Boolean, is the hashCode() of "true" 1231 and of "false" 1237?

Patricia Shanahan

Lew said:
Please include a body to your posts.



Why not?

The Integer hashCode() is the primitive int the Integer represents. 0
and 1 are particularly common int values. Given the lack of any
advantage to choosing those values for the Boolean hash codes, it makes
sense to me to pick a different pair of values.
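
Those values are easy to verify directly:

public class HashCheck {
    public static void main(String[] args) {
        System.out.println(Integer.valueOf(0).hashCode()); // 0 -- the int itself
        System.out.println(Integer.valueOf(1).hashCode()); // 1
        System.out.println(Boolean.TRUE.hashCode());       // 1231
        System.out.println(Boolean.FALSE.hashCode());      // 1237
    }
}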

Patricia
 
Tom Anderson

Please include a body to your posts.



Why not?

Because they only differ in one bit. It is often thought a good thing for
hashes of different objects to differ in as many bits as possible.
The low-order bits are particularly important, given that hashes are
often reduced to a smaller range by modding, but all the bits might be
used for something, so having them be different is a useful hedge.
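
A sketch of that reduction, with an arbitrary 16-bucket table: a
power-of-two table keeps only the low-order bits of the hash.

public class BucketIndex {
    public static void main(String[] args) {
        final int buckets = 16; // arbitrary power-of-two table size
        // Masking with (buckets - 1) discards all but the low-order
        // bits, so that's where differing bits matter most.
        System.out.println(Boolean.TRUE.hashCode() & (buckets - 1));  // 1231 -> 15
        System.out.println(Boolean.FALSE.hashCode() & (buckets - 1)); // 1237 -> 5
    }
}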

That argument suggests that 0 and -1 might be the best values. I wonder
what Peter would think of that?

tom
 
Lew

Because they only differ in one bit. It is often thought a good thing for
hashes of different objects to differ in as many bits as possible.

By how many bits is it possible for a universe of two values to
differ?

1231 and 1237 differ in three bits. If differing in as many bits as
possible is good, why didn't they choose values that differed in more?
11 and 17 differ in exactly the same bit positions (the XOR is 26 in
both cases), after all.
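
A quick check of that claim:

public class BitDiff {
    public static void main(String[] args) {
        System.out.println(1231 ^ 1237);                   // 26 (binary 11010)
        System.out.println(11 ^ 17);                       // 26 as well
        System.out.println(Integer.bitCount(1231 ^ 1237)); // 3 differing bits
    }
}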
The low-order bits are particularly important, given that hashes are
often reduced to a smaller range by modding, but all the bits might be
used for something, so having them be different is a useful hedge.

That argument suggests that 0 and -1 might be the best values. I wonder
what Peter would think of that?

Patricia's argument that Boolean hashes need to differ from common
Integer hashes might apply, assuming that there's any benefit to
that. How often are Boolean hashes compared to Integer hashes? How
often are hashcodes picked for a class so that it can be compared to a
different class?

I suspect that some early Java implementor decided that two arbitrary
prime numbers were just fine, and didn't happen to like 17.
 
Lew

Lew said:
How often are Boolean hashes compared to Integer hashes?  

Answering my own question: as often as Integer and Boolean members
appear together in a class for which one must design a hash code that
distributes well over hash tables of arbitrary size.
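
For instance, a sketch with hypothetical field names, using the
conventional 31-based composition:

// A hypothetical class mixing Integer and Boolean members.
public class Flagged {
    private final Integer count;   // hypothetical field
    private final Boolean enabled; // hypothetical field

    public Flagged(Integer count, Boolean enabled) {
        this.count = count;
        this.enabled = enabled;
    }

    @Override public int hashCode() { // equals() omitted for brevity
        int result = 17;
        result = 31 * result + (count == null ? 0 : count.hashCode());
        result = 31 * result + (enabled == null ? 0 : enabled.hashCode());
        return result;
    }
}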
 
Roedy Green

why in class Boolean, hashcode() of "true" is 1231 and of "false" is 1237?
It is best to repeat your question in the body.

Primes scramble nicely when you xor, add, or modulo them. They don't
collapse down onto a small set of values.
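
A small illustration, with an arbitrary choice of moduli: a prime
such as 1231 leaves a nonzero remainder for every smaller modulus,
while a factor-rich value like 1024 collapses to zero for every
power-of-two modulus.

public class PrimeMod {
    public static void main(String[] args) {
        for (int n = 2; n <= 12; n++) {
            System.out.println(n + ": 1231 % n = " + (1231 % n)
                    + ", 1024 % n = " + (1024 % n));
        }
    }
}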


--
Roedy Green Canadian Mind Products
http://mindprod.com
Your old road is
Rapidly agin'.
Please get out of the new one
If you can't lend your hand
For the times they are a-changin'.
 
Lew

Peter Duniho said:
I'm not sure what you mean there.  Ideally, just because a class has both  
Integer and Boolean members, that wouldn't mean you'd actually be  
_comparing_ Integer hashes to Boolean hashes.

What I mean there is that comparison is not relevant, composition is.
 
Tom Anderson

By how many bits is it possible for a universe of two values to
differ?

A rhetorical question, i suspect, but the answer is all of them. Pick
an x, and use ~x for the other.
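
For example:

public class Complement {
    public static void main(String[] args) {
        int x = 1231;                                  // any value works
        System.out.println(Integer.bitCount(x ^ ~x)); // 32: every bit differs
        System.out.println(~x);                       // -1232
    }
}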
1231 and 1237 differ in three bits. If differing in as many bits as
possible is good, why didn't they choose values that differed by more?

A very good question - i don't know, and i'm very surprised by those
choices.
Patricia's argument that Boolean hashes need to differ from common
Integer hashes might apply, assuming that there's any benefit to
that.

True, but then why not, e.g., 1011818350 and -1011818351?
How often are Boolean hashes compared to Integer hashes? How often are
hashcodes picked for a class so that it can be compared to a different
class?

Might you want to have a map keyed by arbitrary objects for some reason?
I suspect that some early Java implementor decided that two arbitrary
prime numbers were just fine, and didn't happen to like 17.

I suspect some kind of kabbalistic significance.

tom
 
public boolean

He did -- he said " 3ks". Not intelligible, but it does seem to
constitute a body.
The Integer hashCode() is the primitive int the Integer represents.

Bletch!

That will have a pretty poor distribution of hash values, because the
integer values that arise in actual running code don't tend to be
completely randomly distributed. Integers richer in factors are more
common, as are smaller integers.

Multiplying the integer's value by a predetermined prime number near
(Integer.MAX_VALUE times 1.61803398) would have been a much better
choice of hash function, scattering the hash values pretty randomly
about the full range of integer values, varying a lot in each bit
position, whatever the input distribution. (Note that that multiplier
would actually be negative, due to wrap-around; technically, it
wouldn't be prime, but its negation would be. 1.61803398 is close to
the golden ratio, the "most irrational" number, thus producing a much
more uniform distribution that won't "resonate" with Integer.MAX_VALUE
and produce distinct clumps. A multiplier near MAX_VALUE itself would
produce clumps around 0, MAX_VALUE, and -MAX_VALUE-1; one near half of
MAX_VALUE, clumps there and at MAX_VALUE/2 and -MAX_VALUE/2; and so
on, given that the input is biased towards small integers.)
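
A sketch of that multiplicative hash; the constant here is the raw
wrapped product rather than the exact nearby prime, and the cast goes
through a long because Java clamps an out-of-range double-to-int cast
instead of wrapping it:

public class MultiplicativeHash {
    // Near Integer.MAX_VALUE * 1.61803398; wraps to a negative int.
    private static final int MULT =
            (int) (long) (Integer.MAX_VALUE * 1.61803398);

    static int hash(int k) {
        // int multiplication wraps mod 2^32, smearing the input
        // across the full range of int.
        return k * MULT;
    }

    public static void main(String[] args) {
        for (int k = 0; k < 5; k++) {
            System.out.println(k + " -> " + hash(k));
        }
    }
}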

Note that to the extent that the input just isn't very diverse,
neither will the hash values be, however they are distributed; but
the farther apart and the more evenly scattered the distinct hash
values tend to be, the better hash tables will perform within the
constraints imposed by that low diversity.

Integer as a key to a hash table isn't a ridiculously unlikely case,
either -- it can be more efficient than an array if most of the array's
cells are going to be nulls or zeros, and it's going to be large, and it
can arise if you have heterogeneous keys some of which turn out to be
integers. A sparse vector is one possible case -- for a hash table to be
more efficient than an array for this job, the vector may need to have
hundreds of dimensions, but even then the integer keys are distributed
towards very small numbers in comparison to Integer.MAX_VALUE. Those
from zero to 999 comprise less than one part in four million of the
total range.

This is what the present hash function does to a sparse vector with
ten non-zero values out of a thousand, put into a hash table with 16
buckets (ten entries at the default 75% load factor need a capacity
of about 14, which rounds up to the next power of two, 16).

If the occupied positions are 14, 29, 33, 57, 193, 385, 772, 807, 901,
and 917, you get these hashes: 14, 13, 1, 9, 1, 1, 4, 7, 5, and 5.

Quite a few hash collisions.
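
The arithmetic behind those numbers is a straight reduction of the
identity hash mod 16 (HashMap's supplemental bit-spreading, which
comes up later in the thread, is ignored here):

public class SparseVectorBuckets {
    public static void main(String[] args) {
        int[] positions = {14, 29, 33, 57, 193, 385, 772, 807, 901, 917};
        for (int p : positions) {
            System.out.print(p % 16 + " "); // identity hash into 16 buckets
        }
        System.out.println(); // 14 13 1 9 1 1 4 7 5 5
    }
}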

It gets worse if there's a tendency for the integers used as keys to
have biased low-order bits. It can get much worse.
 
Lew

I understand your confusion. I focused on comparison early on, and agree with
you that comparison between Booleans and Integers should be fairly rare,
occurring in the relatively uncommon situations that Patricia mentioned.
However, I then realized that comparison is not the only situation where you
need hash codes; composition is also relevant.

In that non-comparative context, where you compose the hash codes of several
class elements to get an object's overall hash, it would be useless for a
common non-null element value (say, FALSE) to hash to zero, and not very
useful to hash to one. Thus some clever person decided not to use zero and
one as the hash codes of Boolean.
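
To illustrate, a sketch with a pretend Boolean hash: under the usual
31-based recipe, an element hashing to zero contributes exactly what
a null element does.

import java.util.Objects;

public class ZeroHashDemo {
    // Composed hash of a hypothetical (Integer, Boolean) pair,
    // parameterized so a pretend Boolean hash can be plugged in.
    static int composed(int intHash, int boolHash) {
        int result = 17;
        result = 31 * result + intHash;
        result = 31 * result + boolHash;
        return result;
    }

    public static void main(String[] args) {
        // If FALSE hashed to 0, it would be indistinguishable from null:
        System.out.println(composed(42, 0));                      // pretend FALSE
        System.out.println(composed(42, Objects.hashCode(null))); // same value
        // With the real 1237, the two cases separate:
        System.out.println(composed(42, Boolean.FALSE.hashCode()));
    }
}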
 
Mark Space

public said:
If the occupied positions are 14, 29, 33, 57, 193, 385, 772, 807, 901,
and 917, you get these hashes: 14, 13, 1, 9, 1, 1, 4, 7, 5, and 5.

Quite a few hash collisions.


Choose 10 random numbers between 1 and 16. How many will be duplicates?

I don't see how this arbitrary list of numbers is somehow significant
in showing that the hash codes from Integer are poorly distributed.
 
Eric Sosman

public said:
He did -- he said " 3ks". Not intelligible, but it does seem to
constitute a body.


Bletch!

That will have a pretty poor distribution of hash values, because the
integer values that arise in actual running code don't tend to be
completely randomly distributed. Integers richer in factors are more
common, as are smaller integers.

So? If you compute a hash code h(k) for each Integer,
then whenever k is a frequently-occurring Integer value, h(k)
will be a frequently-occurring hash value. At best you
can achieve a permutation of the Integer values; if not so lucky
you might choose an h() that maps the entire Integer domain into
a smaller range. In no event will any conceivable h() be an
improvement over h(k)=k.intValue(). h(k1) will collide with h(k2)
exactly as often as k1 and k2 themselves collide.
Multiplying the integer's value by a predetermined prime number near
(Integer.MAX_VALUE times 1.61803398) would have been a much better
choice of hash function, scattering the hash values pretty randomly
about the full range of integer values, varying a lot in each bit
position, whatever the input distribution.

Nonsense. If the input consists of 100 zeroes, 10 ones,
and one each of 2..9, the hashes will consist of 100 h(0)s,
10 h(1)s, and one each of h(2..9). Nothing has changed,
except possibly for the worse if h(0)==h(6), say.
[... more pseudo-numerological nonsense ...]

As shown above, an Integer hashCode() other than identity
could not improve the scattering of different Integer values
from each other. There is one conceivable case where another
hashCode() might help: if the keys of the Hash{Table,Map,Set}
are a mixture of Integers and other types[*], it is possible that
a carefully-chosen hashCode() for Integers might help separate
them from the BigInteger and String and ... keys' hashCode()s.
That sort of thing needs far too many assumptions about the key
population to warrant a general solution.

[*] Personally, I've only once found a use for such a mixed-
key table: The keys in that case were Integers, Longs, Doubles,
and Strings. I maintain that it is *not* worthwhile to try to
discover a suite of hashCode() implementations that would in
some sense "optimize" the spread of hash values; it was a corner
case, and hashCode() should cater more to the common case of a
homogeneous key population.
 
public boolean

Mark said:
Choose 10 random numbers between 1 and 16. How many will be duplicates?

I don't see how this arbitrary list of numbers is somehow significant
in showing that the hash codes from Integer are poorly distributed.

It's not just the distribution of the hash values over the space of
possible integer values that counts. It's the distribution of the hash
values over the actual hash buckets produced in an actual hash table,
which typically is much smaller. You can have few collisions in a large
space and still end up with many in a small space, especially if the
small space's size is a power of two and the low-order bits of the hash
are biased. Even integers are more common than odd, so in any hash table
with integer keys using the current integer hash code and with an even
number of buckets, it's likely the odd buckets will be underutilized and
the even ones will have an unnecessary collision or several.
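
A sketch of that bias, using a straight power-of-two mask (without
the supplemental bit-spreading that java.util.HashMap applies):

public class EvenKeyBias {
    public static void main(String[] args) {
        final int buckets = 16;
        int[] counts = new int[buckets];
        for (int k = 0; k <= 100; k += 2) { // all-even keys, identity hash
            counts[k & (buckets - 1)]++;
        }
        for (int b = 0; b < buckets; b++) {
            System.out.println("bucket " + b + ": " + counts[b]);
        }
        // Every odd-numbered bucket ends up empty.
    }
}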
 
public boolean

Eric said:
Nonsense. If the input consists of 100 zeroes, 10 ones,
and one each of 2..9, the hashes will consist of 100 h(0)s,
10 h(1)s, and one each of h(2..9). Nothing has changed,
except possibly for the worse if h(0)==h(6), say.
[... more pseudo-numerological nonsense ...]

Your insulting attitude in response to a perfectly civil post baffles me.

I'll copy and paste my reasoning from my reply to Mark Space, who had a
similar objection but was far more polite.

It's not just the distribution of the hash values over the space of
possible integer values that counts. It's the distribution of the hash
values over the actual hash buckets produced in an actual hash table,
which typically is much smaller. You can have few collisions in a large
space and still end up with many in a small space, especially if the
small space's size is a power of two and the low-order bits of the hash
are biased. Even integers are more common than odd, so in any hash table
with integer keys using the current integer hash code and with an even
number of buckets, it's likely the odd buckets will be underutilized and
the even ones will have an unnecessary collision or several.
 
Lew

public said:
Eric said:
Nonsense. If the input consists of 100 zeroes, 10 ones,
and one each of 2..9, the hashes will consist of 100 h(0)s,
10 h(1)s, and one each of h(2..9). Nothing has changed,
except possibly for the worse if h(0)==h(6), say.
[... more pseudo-numerological nonsense ...]

Your insulting attitude in response to a perfectly civil post baffles me.

Oh, Christ, it *is* Twisted again. I was afraid of that.

Quit hiding who you are, or are you embarrassed to admit it?

Re-plonk.
 
Eric Sosman

public said:
Eric said:
Nonsense. If the input consists of 100 zeroes, 10 ones,
and one each of 2..9, the hashes will consist of 100 h(0)s,
10 h(1)s, and one each of h(2..9). Nothing has changed,
except possibly for the worse if h(0)==h(6), say.
[... more pseudo-numerological nonsense ...]

Your insulting attitude in response to a perfectly civil post baffles me.

I apologize for the tone of my response, and wish I had
written less antagonistically. However, I stand by the
content.
I'll copy and paste my reasoning from my reply to Mark Space, who had a
similar objection but was far more polite.

It's not just the distribution of the hash values over the space of
possible integer values that counts. It's the distribution of the hash
values over the actual hash buckets produced in an actual hash table,
which typically is much smaller. You can have few collisions in a large
space and still end up with many in a small space, especially if the
small space's size is a power of two and the low-order bits of the hash
are biased. Even integers are more common than odd, so in any hash table
with integer keys using the current integer hash code and with an even
number of buckets, it's likely the odd buckets will be underutilized and
the even ones will have an unnecessary collision or several.

"Even integers are more common than odd" is a claim that
fails to convince me. Have you instrumented a suite of programs
to count the abundance of even and odd Integer keys in maps?
Or can you provide a link to such a result?

But even if it were true, I do not see how it could matter.
Have you studied the implementation of java.util.HashMap? You
will find that the first thing it does with a hashCode() value
is to perturb it as follows:

static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

That is, bit 0 of the effective hash value depends on all of
bits 0, 4, 7, 12, 16, 19, 20, 24, and 27 of the hashCode().
The choice of even-numbered or odd-numbered bucket depends
on nine bits of the hashCode(), not on bit 0 alone.

The older java.util.Hashtable does not stir the bits this
way -- but then, Hashtable does not use power-of-two bucket
counts as HashMap does. (You can specify a power of two as
the initial Hashtable capacity, if you like, but after any
expansion the bucket count will be odd, not even.)

Again, I apologize for the tone of my earlier message.
But I still think your arguments are wide of the mark.
 
Mark Space

public said:
It's not just the distribution of the hash values over the space of
possible integer values that counts. It's the distribution of the hash
values over the actual hash buckets produced in an actual hash table,
which typically is much smaller. You can have few collisions in a large
space and still end up with many in a small space, especially if the
small space's size is a power of two and the low-order bits of the hash
are biased. Even integers are more common than odd, so in any hash table
with integer keys using the current integer hash code and with an even
number of buckets, it's likely the odd buckets will be underutilized and
the even ones will have an unnecessary collision or several.

I just want to point out that Java's HashMap doesn't use a straight
mapping between object.hashCode() and the hash index values. Here's
the actual algorithm, for a hash table size of 16:

import java.util.ArrayList;

public class HashIndexDemo {
    static int[] testHash2 = {0, 2, 4, 6, 8, 10, 12,
                              14, 16, 18, 20, 22, 24};

    public static void main(String[] args) {
        ArrayList<Integer> hashResults = new ArrayList<Integer>();
        final int LENGTH = 16; // number of buckets
        for (int i : testHash2) {
            int h = Integer.valueOf(i).hashCode();
            System.out.print(Integer.toHexString(h) + ", ");
            // HashMap's supplemental hash, then the power-of-two mask
            h ^= (h >>> 20) ^ (h >>> 12);
            h = h ^ (h >>> 7) ^ (h >>> 4);
            hashResults.add(h & (LENGTH - 1));
        }
        System.out.println();
        System.out.println(hashResults);
    }
}

Output:

0, 2, 4, 6, 8, a, c, e, 10, 12, 14, 16, 18,
[0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9]
 
