HashMap vs linear table lookup

E

Eric Sosman

Lasse said:
Even if they are computed every time, you don't need to know the hashCode
of elements in the HashSet to find them.
[...]
To look up an element, you use the hash code of the key/element to
search for, find a bucket based on that, and traverse the linked list
to find one that .equals() what you search for.
[...]

You don't *need* the hash code after the bucket is
located, but it's useful to store the hashes anyhow. As
you traverse the list, you can compare the key's hash to
the entry's hash and call .equals() only if they happen
to match. Comparing two already-computed int values is
likely to be faster than calling .equals(), so this can
save some time. HashMap uses this optimization, storing
each item's hash code in its HashMap.Entry.

<topicality level="marginal">

I once did some experiments to determine whether this
optimization was worth while in an "open addressed" hash
table (not the chained style used by HashMap). Storing a
hash value with every item pointer made the table twice as
big as one that contained just the unadorned pointers -- or
to look at it another way, it allowed for only half as many
table slots for a given amount of memory. I wondered whether
avoiding the .equals() calls was enough of a benefit to repay
doubling the table's "load factor" (entry count divided by
slot count); higher load factors lead to longer searches.

In my tests, at least, storing the hash was a clear
winner. Even when a search visited twice as many entries,
it almost never called .equals() in an unsuccessful search
or called it more than once in a successful search (when at
least one .equals() call is unavoidable). The pointers-only
table visited fewer entries, but had to call .equals() on
every single one of them, every time. Pointers-with-hashes
was so much faster that the memory investment to store the
hash values was clearly worth while.

</topicality>
 
M

Mike Schilling

Eric Sosman said:
Lasse said:
Even if they are computed every time, you don't need to know the
hashCode
of elements in the HashSet to find them. [...]
To look up an element, you use the hash code of the key/element to
search for, find a bucket based on that, and traverse the linked
list
to find one that .equals() what you search for.
[...]

You don't *need* the hash code after the bucket is
located, but it's useful to store the hashes anyhow. As
you traverse the list, you can compare the key's hash to
the entry's hash and call .equals() only if they happen
to match. Comparing two already-computed int values is
likely to be faster than calling .equals(), so this can
save some time. HashMap uses this optimization, storing
each item's hash code in its HashMap.Entry.

<topicality level="marginal">

I once did some experiments to determine whether this
optimization was worth while in an "open addressed" hash
table (not the chained style used by HashMap). Storing a
hash value with every item pointer made the table twice as
big as one that contained just the unadorned pointers -- or
to look at it another way, it allowed for only half as many
table slots for a given amount of memory. I wondered whether
avoiding the .equals() calls was enough of a benefit to repay
doubling the table's "load factor" (entry count divided by
slot count); higher load factors lead to longer searches.

In my tests, at least, storing the hash was a clear
winner. Even when a search visited twice as many entries,
it almost never called .equals() in an unsuccessful search
or called it more than once in a successful search (when at
least one .equals() call is unavoidable). The pointers-only
table visited fewer entries, but had to call .equals() on
every single one of them, every time. Pointers-with-hashes
was so much faster that the memory investment to store the
hash values was clearly worth while.

</topicality>


Cool. I wonder, does String.equals() make the same optimization?
(Looks....) As of 1.5, it does not. It checks the length, and if
that matches, it immediately begins marching down the array of bytes.
I suppose it doesn't want to take the hit of calculating the hash
codes unnecessarily.

One other oddity: a string whose hash code is 0 [1] recalculates its
hash code every times it's requested. I'd probably have defined
hashCode() for such a string to return -1, just to avoid that.

1. Say, one which contains all zero bytes.
 
P

Peter Duniho

The source code.

Or do you think I just make this stuff up?

Well, to be fair: there is in fact a difference between a general
statement such as yours ("the hashCode for a String is computed once per
String") and a qualified statement such as the one Thomas offers ("the sun
implementation caches the hash").

Looking at the source code for the Sun implementation doesn't tell you
what all implementations do, and unless the Java specification
specifically says that all implementations must cache the hash code,
there's no guarantee that they do.

You seem offended by Lew's question, but I think it's a reasonable one.
Is there in fact a genuine guarantee that every implementation of String
will always cache the hash code? Or is this an implementation-dependent
behavior?

Granted, the answer may be academic. It's not like one has much, if any,
control over the behavior and I don't think it's the sort of thing that
should, all else being equal, cause someone to dismiss a HashMap as a
solution.

But we're all programmers here, and as such we often have a need for being
very specific and literal in our understanding of what's being said.
Unless you have some reason to believe that _all_ Java implementations are
_guaranteed_ to cache the hash code, a statement about what the actual
behavior of String is should IMHO be qualified as Thomas's was, if for no
other reason than to offer enhanced clarity regarding what you're actually
saying.

Pete
 
M

Mike Schilling

Peter said:
Well, to be fair: there is in fact a difference between a general
statement such as yours ("the hashCode for a String is computed once
per String") and a qualified statement such as the one Thomas offers
("the sun implementation caches the hash").

Looking at the source code for the Sun implementation doesn't tell
you
what all implementations do, and unless the Java specification
specifically says that all implementations must cache the hash code,
there's no guarantee that they do.

You seem offended by Lew's question, but I think it's a reasonable
one. Is there in fact a genuine guarantee that every implementation
of String will always cache the hash code? Or is this an
implementation-dependent behavior?

Granted, the answer may be academic. It's not like one has much, if
any, control over the behavior and I don't think it's the sort of
thing that should, all else being equal, cause someone to dismiss a
HashMap as a solution.

But we're all programmers here, and as such we often have a need for
being very specific and literal in our understanding of what's being
said. Unless you have some reason to believe that _all_ Java
implementations are _guaranteed_ to cache the hash code, a statement
about what the actual behavior of String is should IMHO be qualified
as Thomas's was, if for no other reason than to offer enhanced
clarity regarding what you're actually saying.


I agree 100% with the points you're making. Anything not guaranteed
by the documentation is implementation-specific, and cannot be relied
on.

In this case, though, I'll qualifiy that: the main implementation, on
which all others not developed in a clean room are based, has AFAIK
always cached the hash code. Moreover, doing so is almost free, and
there are no disadvantages beyond the extra two allocated bytes. (I'm
not such an expert in JVM implementation that I know what the actual
cost in space is: it may be zero, or it may be as much as N bytes
where all objects are allocated on N-byte boundaries.) So I'm
willing, *in this case*, to act as if that all implementations do it
until I'm shown one that doesn't.
 
P

Peter Duniho

[...] So I'm
willing, *in this case*, to act as if that all implementations do it
until I'm shown one that doesn't.

I would as well. Frankly, it's such a no-brainer to cache the result that
even if I did run into an implementation that didn't, and as a result
caused some sort of serious performance problem, my first inclination
would be to tell whatever user ran across the problem "go use a decent
Java implementation". :)

But I still think it's reasonable to question whether this is documented
as guaranteed behavior, or is simply implementation-dependent, even if any
sensible implementation would in fact do it.

But that's just me. I've been known to be extra-particular about such
things, sometimes for no apparent reason except for the sake of being
particular. :)

Pete
 
M

Mike Schilling

Peter said:
[...] So I'm
willing, *in this case*, to act as if that all implementations do
it
until I'm shown one that doesn't.

I would as well. Frankly, it's such a no-brainer to cache the
result
that even if I did run into an implementation that didn't, and as a
result caused some sort of serious performance problem, my first
inclination would be to tell whatever user ran across the problem
"go
use a decent Java implementation". :)

But I still think it's reasonable to question whether this is
documented as guaranteed behavior, or is simply
implementation-dependent, even if any sensible implementation would
in fact do it.
But that's just me. I've been known to be extra-particular about
such
things, sometimes for no apparent reason except for the sake of
being
particular. :)

Well, hell, if we agree on this any more violently, one of us is going
to propose, and I'm already married, so I'll stop here.
 
P

Peter Duniho

Well, hell, if we agree on this any more violently, one of us is going
to propose, and I'm already married, so I'll stop here.

It wouldn't have worked out anyway. One of the reasons I love my wife so
much is that she argues with me. :)
 
L

Lew

Mike said:
What would "suitably thread-safe" require here? The external behavior
is that all calls to hashCode() should return the same value. I think
this algorithm works fine:

1. define a private int _hashCode;
2. on a call to hashCode(), check the value of _hashCode. If it's
non-zero, return it.
3. otherwise, calculate the hash code of the string.
4. store this value in _hashcode
5. return _hashcode

The only thread-safety issue is that the store of _hashcode in step 4
be atomic, and I think that that's guaranteed by the JVM.. You could
minimize the number of hash code calculations by locking between steps
2 and 4, ensuring that all threads will see the changed value of
_hashcode once it's ready, but that's merely an optimization.

You're right, because hashCode() is an idempotent calculation. If it gets
calculated twice it does no harm.

The safety in String's hashCode() case has nothing to do with whether storing
the value is atomic or not. Checking the value, then calculating it, then
storing it is not atomic.

Lazy initialization is a check-and-set, so there's no guarantee that the
calculation will occur only once, absent synchronization. It works all right
for idempotent operations but if the calculation were for something that could
change, you'd potentially lose more than time to failure to synchronize.

Nevertheless, it works for calculation of hashCode() for immutable instances
like String's, because the same value is calculated even if repeated.

Thus, "suitably thread-safe" is guaranteed for String hashCode(), but not
always for all calculations for all classes. I thank you for pointing out the
hole in my reasoning - thread safety really doesn't apply to String hashCode().
 
L

Lew

EJP said:
The source code.

Or do you think I just make this stuff up?

I was referring to your citations of other people's Usenet posts. I didn't
think you had made anything up, but I wasn't clear who said what in your
quotations.

As far as the one-time calculation of String hashCode(), the source code for
one JVM is not authoritative. Another JVM could do it differently. Couldn't it?

Furthermore, in the presence of multithreaded code, if String hashCode() is
not synchronized, the value could be calculated more than once. I'm not
familiar with the source you're using - is the hashCode() calculation
synchronized, or done in String construction?
 
M

Mike Schilling

Lew said:
You're right, because hashCode() is an idempotent calculation. If
it
gets calculated twice it does no harm.

The safety in String's hashCode() case has nothing to do with
whether
storing the value is atomic or not. Checking the value, then
calculating it, then storing it is not atomic.

There would be an issue if the value were a long or a double. There
is no guarantee that the whole 64 bits is stored or fetched as an
atomic operation, so that if one thread does

_hashCode64 = l;
return _hashCode64;

as another does

if (_hashCode64 != 0)
return _hashCode;

The return might contain partly the correct result and partly the
original 0.

At least under the original Java memory model. This might have
changed with the current one.
 
P

Patricia Shanahan

Lew wrote:
....
The safety in String's hashCode() case has nothing to do with whether
storing the value is atomic or not. Checking the value, then
calculating it, then storing it is not atomic.
....

It does depend on atomic stores. Suppose, for example, Java were
implemented on a machine with a maximum store width of 16 bits, so that
the only way to change the memory copy of an int is by doing two stores.
Suppose the actual hash code for a given String is 0xABCD.

The hash field would go through the following series of values:

0000 (initial value)
AB00 (store one half of hash field)
ABCD (store other half of hash field)

There is a window during which another thread, executing the hashCode
method in this String, could see AB00, a non-zero value, and use it as
the hashCode() result.

On a machine with an atomic 32 bit store, which includes all systems on
which Java runs, any read of the hash field will either see 0x0000 or
0xABCD, and consistently use 0xABCD as hash code.

Patricia
 
L

Lew

Patricia said:
On a machine with an atomic 32 bit store, which includes all systems on
which Java runs, any read of the hash field will either see 0x0000 or
0xABCD, and consistently use 0xABCD as hash code.

All right. Atomicity is important, but not sufficient for the safety of the
lazy initialization. I also pointed out that if the value is not
synchronized, the value might be set more than once. The safety also stems
from the idempotency. Without the idempotency, the atomicity wouldn't
guarantee correct reads without synchronization.

The point is that before there is a store, there is a read. One thread could
read zero and set the hashCode(). Let's say the calculated hashCode() is
0xCAFEBABE. So thread 1 sets the hashCode to that value. Meanwhile, another
thread reads the value. Because there is no synchronization, thread 2 also
sees zero as the value. Because the calculation is deterministic, it also
sets the hashCode to 0xCAFEBABE. Meanwhile thread 3 wants to use the
hashCode(). It, too, sees zero and calculates the value.

The store is atomic. The cycle of read, calculate and store is not atomic.
Because of that, a String without synchronization, such as by calculating the
hashCode during construction, might have its calculated hashCode() set more
than once.

It is also true, as Patricia and others point out, that a non-atomic write
would break lazy initialization without synchronization.
 
P

Patricia Shanahan

Lew said:
All right. Atomicity is important, but not sufficient for the safety of
the lazy initialization. I also pointed out that if the value is not
synchronized, the value might be set more than once. The safety also
stems from the idempotency. Without the idempotency, the atomicity
wouldn't guarantee correct reads without synchronization.

Agreed. I'm only claiming that atomic store is necessary, not
sufficient, for the String code to work without synchronization.

Patricia
 
E

EJP

Lew said:
I was referring to your citations of other people's Usenet posts. I
didn't think you had made anything up, but I wasn't clear who said what
in your quotations.

Sorry, I misunderstood: point taken.
As far as the one-time calculation of String hashCode(), the source code
for one JVM is not authoritative. Another JVM could do it differently.
Couldn't it?

I suppose so, but given the discussion here it's not very likely, is it?
It would about the first thing that any sane tuning process would show up.
Furthermore, in the presence of multithreaded code, if String hashCode()
is not synchronized, the value could be calculated more than once. I'm
not familiar with the source you're using - is the hashCode()
calculation synchronized, or done in String construction?

In the 1.5.0 source it's done in hashCode() which is not synchronized.
So yes, multiple threads could compute it lazily.

I first noticed the lazy evaluation in about 1997 so I don't think it's
changed much between versions ;-)
 
A

Arne Vajhøj

EJP said:
I suppose so, but given the discussion here it's not very likely, is it?
It would about the first thing that any sane tuning process would show up.

Does not really matter.

It is still a very bad habit to code assuming "any sane tuning process".

Arne
 
E

EJP

Arne said:
It is still a very bad habit to code assuming "any sane tuning process".
On its own, perhaps. It's debatable. I don't think it's a bad habit to
code assuming that String.hashCode is evaluated once per String, which
is what we are really talking about.
 
H

Hendrik Maryns

EJP schreef:
On its own, perhaps. It's debatable. I don't think it's a bad habit to
code assuming that String.hashCode is evaluated once per String, which
is what we are really talking about.

I think it is good habit to code what is most clearly doing what you
intend and then, when your program turns out to be a hog, check where
things are slowing down. You might as well stumble upon a ‘bad’ string
hash implementation (but probably won’t). You know, the premature
optimization thingy.

H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4-svn0 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org

iD8DBQFHxDlje+7xMGD3itQRAr5XAJ91zs8T/yrxtwo8dduHDC5jPSjkxgCfQmUx
e8YGunKMOz7CNEDajGxhfXA=
=HCj0
-----END PGP SIGNATURE-----
 
E

EJP

I agree. On further consideration the first program that would slow to a
crawl if String.hashCode() was re-evaluated would be 'javac' ... so the
likelihood of such an implementation is genuinenly pretty small.
 
L

Lew

EJP said:
I agree. On further consideration the first program that would slow to a
crawl if String.hashCode() was re-evaluated would be 'javac' ... so the
likelihood of such an implementation is genuinenly pretty small.

Just out of curiosity, how would that affect javac?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top