Performance of double-checked locking

Gerald Thaler

Hello

The double-checked locking idiom (thread-safe singleton pattern) now works
correctly under the current memory model:

public class MySingleton {

    private static volatile MySingleton instance;

    public static MySingleton getInstance() {
        if (instance == null) {
            synchronized (MySingleton.class) {
                if (instance == null) {
                    instance = new MySingleton();
                }
            }
        }
        return instance;
    }
}

Nevertheless, the Java experts still discourage its use, claiming that there
is no significant performance advantage over simply synchronizing the whole
getInstance() method, since a volatile variable access amounts to (half of)
a synchronization.

For example, see:
http://www-106.ibm.com/developerworks/library/j-jtp03304/?ca=dnt-513 [Brian
Goetz]

But is this really true? IMHO the volatile read in the common code path
should be much more efficient than a monitor enter can be in any reasonable
JVM implementation. All it usually has to do is cross a read barrier before
the read of the variable 'instance'. This affects only one processor and
costs very few cycles, if any. A monitor enter, in contrast, requires a bus
lock during a read-modify-write operation, which stalls every processor in
the system. So my feeling is that a monitor enter should be much more
expensive than a volatile read.

The linked site above gives the following advice:

"Instead of double-checked locking, use the Initialize-on-demand Holder
Class idiom, which provides lazy initialization, is thread-safe, and is
faster and less confusing than double-checked locking."
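
For reference, that idiom looks roughly like this:

public class MySingleton {

    private MySingleton() {}

    // Holder is not initialized until getInstance() first touches it;
    // the JVM's class-initialization locking makes this thread-safe
    // without any explicit synchronization on the read path.
    private static class Holder {
        static final MySingleton INSTANCE = new MySingleton();
    }

    public static MySingleton getInstance() {
        return Holder.INSTANCE;
    }
}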

I don't agree. First, the Initialize-on-demand Holder Class idiom doesn't
guarantee that the singleton is not constructed until its first use; JVMs
have great freedom here. Second, if it does perform lazy initialization, it
can't be any faster than DCL, because it too has to cross at least one read
barrier internally.

Am I overlooking something here?
 

xarax

Gerald Thaler said:
The double-checked locking idiom (thread-safe singleton pattern) now works
correctly under the current memory model:
[...]
Am I overlooking something here?

The main "problem" with DCL is that some JVM implementations are
allowed to assign the value of "instance" before the constructor
has finished.

That is, the line:

instance = new MySingleton();

creates a new object, assigns the reference for that object to "instance",
then calls the constructor. This makes the reference visible to other
threads before the object is fully constructed. This behavior is described
(vaguely) under "prescient stores" in the JVM/JLS specs.

However, declaring the field "instance" as volatile would seem to prevent
that kind of behavior. A volatile field must be fetched/stored in exactly
the same sequence as specified in the original source code (the compiler is
not allowed to perform code movement).

Another way to avoid the "prescient store" thing is to assign
the reference to a local variable, then enter/exit a synchronized
block after the constructor call, then assign the field.

public class MySingleton {

    private static volatile MySingleton instance;

    public static MySingleton getInstance() {
        if (instance == null) {
            synchronized (MySingleton.class) {
                if (instance == null) {
                    MySingleton foo;

                    foo = new MySingleton();
                    /* ensure constructor is finished */
                    synchronized (foo) {
                        instance = foo;
                    }
                }
            }
        }
        return instance;
    }
}

The assignment to "foo" can legally occur "early", before the constructor
is called. (Some compilers may inline the constructor immediately after
the assignment.)

However, entering the subsequent monitor block requires that the
constructor is completely finished. Inside the monitor block it is then
safe to assign the "instance" field. This is also a point of contention
among the "experts", because monitor "entry" only flushes fetches and
monitor "exit" only flushes stores. A reasonable JVM implementation will
flush both fetches and stores upon entry and exit of a monitor block.

A variation of this solution uses ThreadLocal storage, which
I can't recall at the moment.

HTH
 

Gerald Thaler

The main "problem" with DCL is that some JVM implementations are
allowed to assign the value of "instance" before the constructor
has finished.

That is, the line:

instance = new MySingleton();

Creates a new object, assigns the reference for that object
to "instance", then calls the constructor. This makes the
reference visible to other threads before the object is fully
constructed. This behavior is described (vaguely) under
"prescient stores" in the JVM/JLS specs.
However, declaring the field "instance" as a volatile would seem to
prevent that kind of behavior. A volatile field must be fetched/stored
in exactly the same sequence as specified in the original source
code (the compiler is not allowed to perform code movement).

The old memory model is broken anyway. Under the current memory model
(JSR 133), the volatile establishes a correct happens-before relationship,
so DCL works as expected.
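
A minimal sketch of what that happens-before edge buys (the class and
fields here are hypothetical, purely for illustration):

class HappensBeforeExample {

    int data;                 // plain, non-volatile field
    volatile boolean ready;   // volatile field

    void writer() {           // runs on thread A
        data = 42;            // this ordinary write...
        ready = true;         // ...happens-before this volatile write
    }

    void reader() {           // runs on thread B
        if (ready) {          // volatile read that sees the write above
            // JSR 133 guarantees data == 42 is visible here
            System.out.println(data);
        }
    }
}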
Another way to avoid the "prescient store" thing is to assign
the reference to a local variable, then enter/exit a synchronized
block after the constructor call, then assign the field.

[...]

The assignment to "foo" can legally occur "early", before the constructor
is called. (Some compilers may inline the constructor immediately after
the assignment.)

However, entering the subsequent monitor block requires that the
constructor is completely finished. Inside the monitor block it is then
safe to assign the "instance" field. This is also a point of contention
among the "experts", because monitor "entry" only flushes fetches and
monitor "exit" only flushes stores. A reasonable JVM implementation will
flush both fetches and stores upon entry and exit of a monitor block.

No, you're wrong here. Any JVM is legally allowed to move instructions
from outside into a synchronized block, and most will do so. So the local
variable and the synchronized(foo) don't help at all. The JVM is even
allowed to optimize the synchronized(foo) away entirely if it can prove
that no other thread will ever synchronize on foo/instance. For
synchronized(foo) to enforce memory-ordering constraints between two
threads in any way, _both_ threads must synchronize on foo. That isn't the
case here.

But since the instance variable is volatile anyway, there's no problem here.
A variation of this solution uses ThreadLocal storage, which
I can't recall at the moment.

Yes. But even this IMHO cannot be faster than DCL.
 

John C. Bollinger

Gerald said:
Hello

The double-checked locking idiom (thread-safe singleton pattern) now works
correctly under the current memory model:

Where "works correct" means "is thread safe".
[...]

[...]

But is this really true? IMHO the volatile read in the common code path
should be much more efficient than a monitor enter can be in any reasonable
JVM implementation. All it usually has to do is cross a read barrier before
the read of the variable 'instance'. This affects only one processor and
costs very few cycles, if any. A monitor enter, in contrast, requires a bus
lock during a read-modify-write operation, which stalls every processor in
the system. So my feeling is that a monitor enter should be much more
expensive than a volatile read.

Read the reference material you pointed to yourself. When a thread
performs a read of a volatile variable, it must (logically) discard its
local memory and reload from main memory, very much as on a monitor
enter. This is required by the new semantics of volatile, which prevent
accesses to nonvolatile variables from being reordered with accesses to
volatile ones in the program order. It may be that a JVM can provide a
very efficient implementation of this requirement, but in that case it
should be able to provide a similarly efficient implementation of
monitor entry.

Now there _is_ the point that a read (only) of a volatile variable is
not paired with any equivalent of a monitor exit, which does make it
somewhat lighter-weight than synchronization (and rather heavier-weight
than it used to be). That probably isn't relevant to the double-checked
locking situation, however: if the whole getInstance() method were
synchronized then an optimizer could observe that no writes occur within
the synchronized block (after the first time) and that therefore no
actions need be performed on memory at monitor exit. Thus,
synchronizing the whole getInstance() method can be very nearly as fast
as double-checked locking of access to a volatile variable.
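
For concreteness, the fully synchronized baseline being compared against
would be something like this:

public class MySingleton {

    private static MySingleton instance;

    private MySingleton() {}

    // Every call pays a monitor enter/exit, but after the first call
    // no writes occur inside the block, which is what an optimizer
    // could exploit as described above.
    public static synchronized MySingleton getInstance() {
        if (instance == null) {
            instance = new MySingleton();
        }
        return instance;
    }
}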
The linked site above gives the following advice:
[...]

I don't agree. First, the Initialize-on-demand Holder Class idiom doesn't
guarantee that the singleton is not constructed until its first use; JVMs
have great freedom here. Second, if it does perform lazy initialization, it
can't be any faster than DCL, because it too has to cross at least one read
barrier internally.

1) A high-performance JVM will perform lazy initialization if it
improves performance. Other JVMs are not relevant, nor is lazy
initialization if it doesn't improve performance.

2) Initialize-on-demand can be significantly faster than DCL on a
volatile variable, because after initialization (of the _final_
variable) no read barrier need be crossed to access it.


I really fail to understand the fascination with DCL. It never provided
the intended thread safety in Java, yet people kept trying to "fix" it.
Under the new memory model DCL can be made to work, but at a probable
performance cost relative to simpler alternatives that have always worked.
It is inherent in the design of DCL that it must place some kind of
synchronization demand on every access to the protected variable. There is
no way to avoid that within the scheme while still providing thread
safety. Consider, for instance, that the additional overhead required for
volatile access under the _old_ JMM *wasn't enough* to ensure thread
safety for DCL.

So GET OVER IT! Do not use DCL in Java. The plain flavor is broken, and
on older VMs the volatile-based flavor is both more expensive than the
plain one *and* broken. On VMs where the volatile-based version is
thread-safe, it is comparatively expensive relative to the alternatives.


John Bollinger
 

Gerald Thaler

Read the reference material you pointed to yourself. When a thread
performs a read of a volatile variable, it must (logically) discard its
local memory and reload from main memory, very much as on a monitor
enter. This is required by the new semantics of volatile, which prevent
accesses to nonvolatile variables from being reordered with accesses to
volatile ones in the program order. It may be that a JVM can provide a
very efficient implementation of this requirement, but in that case it
should be able to provide a similarly efficient implementation of monitor
entry.

This is just not true. Monitor entry does more: it provides mutual
exclusion. It's impossible to implement this without atomic
read-modify-write instructions at the machine level. They must lock the
bus and so affect _every_ processor in the system. Look at the x86, for
example: monitor entry must execute an instruction like LOCK XADD. This is
painfully slow. For a volatile read it suffices that the compiler/VM
doesn't reorder instructions. This may prevent some optimizations that
would otherwise apply, but other than that there are _no_ performance
penalties in the common code path at all. On x86, volatile reads are way
faster than monitor entry. On other architectures the cost may be somewhat
higher.
1) A high-performance JVM will perform lazy initialization if it improves
performance. Other JVMs are not relevant, nor is lazy initialization if
it doesn't improve performance.

I doubt that. Many VMs will load a class as soon as the control flow enters
a method that *could* access it:

public void method() {
    if (christmasIsToday()) {
        BigFatPresent present = BigFatPresent.getInstance();
        // Do something with it
    }
}

The VM cannot decide whether my static initialization is expensive or not.
2) Initialize-on-demand can be significantly faster than DCL on a volatile
variable, because after initialization (of the _final_ variable) no read
barrier need be crossed to access it.

There must be a read barrier somewhere.
I really fail to understand the fascination with DCL.
[...]
So GET OVER IT! Do not use DCL in Java. The plain flavor is broken, and
on older VMs the volatile-based flavor is both more expensive than the
plain one *and* broken. On VMs where the volatile-based version is
thread-safe, it is comparatively expensive relative to the alternatives.

"volatile" is a valid and useful means of thread-communication under the new
memory model and DCL isn't broken anymore. But i agree, that in 99.9% of all
situations it's simply not worth to optimize away the synchronization.
 

xarax

Gerald Thaler said:
The old memory model is broken anyway. Under the current memory model
(JSR 133), the volatile establishes a correct happens-before relationship,
so DCL works as expected.


No, you're wrong here. Any JVM is legally allowed to move instructions
from outside into a synchronized block, and most will do so. So the local
variable and the synchronized(foo) don't help at all.

I think we agree to disagree on this point.
The JVM is even allowed to optimize the synchronized(foo) away entirely
if it can prove that no other thread will ever synchronize on
foo/instance.

I doubt that is possible in the general case. Trivial cases
may be possible, but not worth the extra investment in the
JVM implementation.
For synchronized(foo) to
enforce memory ordering constraints between two threads in any way, _both_
threads must synchronize on foo. This isn't the case here.

The definition of monitor entry/exit does not depend on whether there is
another thread in contention for the monitor. Memory flushes happen
regardless of whether there is another thread in the picture.
 

Gerald Thaler

No, you're wrong here. Any JVM is legally allowed to move instructions
[...]

I think we agree to disagree on this point.

We are talking about the new memory model, JSR 133 (Java 1.5), not the
broken old one?
I doubt that is possible in the general case. Trivial cases
may be possible, but not worth the extra investment in the
JVM implementation.

It is possible. For example, use a java.util.Vector only locally in one
method, as a helper object. This is easy for the compiler/VM to verify. It
can then remove all synchronization inside the code of Vector, so the
resulting code will run at the same speed as a java.util.ArrayList. It was
one goal of JSR 133 to make such optimizations possible.
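
A sketch of the kind of case I mean (the class and method here are
hypothetical, just for illustration):

import java.util.Vector;

public class LockElisionExample {

    // 'v' never escapes this method, so a VM that performs escape
    // analysis may legally elide all of Vector's internal monitor
    // enters and exits here.
    public static int sum(int[] data) {
        Vector<Integer> v = new Vector<Integer>();
        for (int i = 0; i < data.length; i++) {
            v.add(data[i]);    // add() is synchronized, but uncontended
        }
        int total = 0;
        for (int i = 0; i < v.size(); i++) {
            total += v.get(i); // get() is synchronized as well
        }
        return total;
    }
}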
The definition of monitor entry/exit does not depend on whether there is
another thread in contention for the monitor. Memory flushes happen
regardless of whether there is another thread in the picture.

No, that's not true. The memory model is not defined in terms of "memory
flushes" but in terms of "actions" and "happens-before". synchronized (new
Object()) is a no-op; it does not "flush memory". This is explicitly
stated in JSR 133.
 

Patricia Shanahan

Gerald said:
This is just not true. Monitor entry does more: it provides mutual
exclusion. It's impossible to implement this without atomic
read-modify-write instructions at the machine level. They must lock the
bus and so affect _every_ processor in the system.
[...]
On x86, volatile reads are way faster than monitor entry. On other
architectures the cost may be somewhat higher.

The cost of atomic read-modify-write instructions and the consequences of
imposing memory-order rules are both hardware dependent.

If cache coherence is maintained through e.g. an MESI write-back protocol,
an atomic read-modify-write only requires locking out external demands on
the executing processor's cache, with the line containing the variable in
a modifiable state, for the time it takes that processor to complete the
operation.

Hardware memory order rules for multiprocessors are
architecture dependent. Rules other than sequential
consistency may require extra actions, and delays, to avoid
hardware reordering relative to a volatile access.

Has anyone published measurements of double checked locking
on different multiprocessors?

Patricia
 

John C. Bollinger

Gerald said:
This is just not true. Monitor entry does more: it provides mutual
exclusion.

A read or write of a volatile variable must also provide mutual exclusion
for the duration of the read or write. The operation might be atomic from
the Java perspective (but only if the value isn't a long or double), but
nothing can be said about whether or not it could be atomic on unspecified
computing hardware.
It's impossible to implement this without atomic read-modify-write
instructions at the machine level. [...] On x86, volatile reads are way
faster than monitor entry. On other architectures the cost may be somewhat
higher.

I am not sufficiently expert on x86 to argue your point about that
architecture, but for the same reason I am not prepared to accept your
assertions there on your word alone. I am particularly not prepared to
accept statements about what certain high-level operations "must" involve
at lower levels, especially when JIT compilation is taken into
consideration. Consider this, for instance:

public class MyClass {

    private static MyClass instance = new MyClass();

    public static MyClass getInstance() {
        MyClass rval;

        /*
         * Written with an inner synchronized block instead of a
         * synchronized method to avoid any question about what
         * work must happen within the scope of the synchronization.
         */
        synchronized (MyClass.class) {
            rval = instance;
        }

        return rval;
    }
}

I claim that under the new memory model, a JIT can safely handle that
exactly as if the variable "instance" were volatile and there were no
synchronization.
I doubt that. Many VMs will load a class as soon as the control flow
enters a method that *could* access it:

Initialization need not happen when the class is loaded, as long as it
happens before the class is used. The loading question is irrelevant
because it applies equally to a DCL scenario. On the other hand, modern
VMs generally do delay initialization.

[...]
The VM cannot decide whether my static initialization is expensive or not.

It could apply simple heuristics to come to a reasonable first-order
guess. There are other possibilities too. I flatly refuse to accept
your unsupported assertion.
There must be a read barrier somewhere.

Why? If the variable is final then I see no need for a read barrier.
All reads, forever, will return the same value, regardless of the
actions of any other threads. Even if the read is not atomic. Even on
a multi-CPU system with multiport memory and no cache-coherency support.
A thread could even safely cache a copy of the value and use it as long
as it liked.
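
That is, with eager initialization through a final field (a minimal
sketch):

public class MySingleton {

    // final: once the class is initialized, every thread may freely
    // cache this reference; no read barrier is needed on the access
    // path.
    private static final MySingleton INSTANCE = new MySingleton();

    private MySingleton() {}

    public static MySingleton getInstance() {
        return INSTANCE;
    }
}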


John Bollinger
 
