Please disprove this Double-Checked Locking "fix"


jl_post

Hi,

Recently I've been reading up on "Double-Checked Locking" in C++
and how it's often implemented imperfectly. The article "C++ and the
Perils of Double-Checked Locking" by Scott Meyers and Andrei
Alexandrescu ( http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
) provides a good overview of how it's usually done and why it's often
inadequate.

I won't go into details here, but the article states how this code:

Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      pInstance = new Singleton;
    }
  }
  return pInstance;
}

and its variants aren't completely thread safe.

Now, I've been thinking about how to make the code thread-safe, and
a few days ago I came up with a couple of solutions. As I implemented
them, I realized that these variants of mine also had problems.

One of the "solutions" I came up with was this:

// Assign the singleton object to a temp pointer:
Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      Singleton* temp = new Singleton;  // initialize to temp
      pInstance = temp;                 // assign temp to pInstance
    }
  }
  return pInstance;
}

Now, this approach isn't new (and is in fact covered in Meyers and
Alexandrescu's article). The immediate problem is that the compiler
can optimize away the temporary variable, so there's no guarantee that
pInstance won't get the memory before it is properly initialized.

So I thought about making sure pInstance was unset in the
constructor, like this:

Singleton::Singleton() {
  // initializing member data
  assert( pInstance == 0 );
}

This way, the assert() statement ensures that pInstance will stay NULL
until the constructor is finished.

BUT! I realized that this solution was flawed, because the
compiler can inline the constructor into the ::instance() method, like
this:

Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      Singleton* temp = (Singleton*) ::operator new(sizeof(Singleton));
      new (temp) Singleton;  // constructor runs inline here:
      // initializing member data
      assert( pInstance == 0 );
      pInstance = temp;  // assign temp to pInstance
    }
  }
  return pInstance;
}

and further rearrange the lines like this:


Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      Singleton* temp = (Singleton*) ::operator new(sizeof(Singleton));
      new (temp) Singleton;
      assert( pInstance == 0 );
      pInstance = temp;  // assign temp to pInstance
      // initializing member data (moved below the assignment!)
    }
  }
  return pInstance;
}

(The compiler is allowed to do this, I think, because the compiler can
reorder code as long as it's not detectable in a single-threaded
program.)

So I realized pInstance still has the possibility of being assigned
memory before it is properly initialized.

Then I thought: What if I included a second lock to make sure that
the constructor is finished before pInstance is set? So I tried this:

Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      Singleton* temp = new Singleton;  // initialize to temp
      secondLock.lock();
      pInstance = temp;  // assign temp to pInstance
      secondLock.unlock();
    }
  }
  return pInstance;
}

Singleton::Singleton() {
  secondLock.lock();
  // initializing member data
  secondLock.unlock();
}

But eventually I realized that the order of the locks is not
guaranteed, that pInstance could still get assigned before the
constructor gets entered!

So I shot down my own two solutions. However, I kept thinking of
variants, and I started to wonder: What if I combined the two
variants? In other words, I'd take advantage of both assert() and a
second lock, like this:


Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      Singleton* temp = new Singleton;  // initialize to temp
      secondLock.lock();
      pInstance = temp;  // assign temp to pInstance
      secondLock.unlock();
    }
  }
  return pInstance;
}

Singleton::Singleton() {
  secondLock.lock();
  // initializing member data
  assert( pInstance == 0 );
  secondLock.unlock();
}

In this way, the second lock guarantees that pInstance is not assigned
to inside the constructor code, and the assert() statement ensures
that the constructor code executes before the code that assigns to
pInstance.

This looks good as far as I can tell, but I'm thinking it's too
good to be true.

Since I'm a bit skeptical about this last solution, could someone
poke holes in it? (I'm eager to see if I really did find a solid
solution, or if it's just another pipe dream.)

Thanks!
 

Joshua Maurice

Hi,

   Recently I've been reading up on "Double-Checked Locking" in C++
and how it's often implemented imperfectly.  The article "C++ and the
Perils of Double-Checked Locking" by Scott Meyers and Andrei
Alexandrescu (http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
) provides a good overview of how it's usually done and why it's often
inadequate.
[Snipped discussion of double checked locking]
   Since I'm a bit skeptical about this last solution, could someone
poke holes in it?  (I'm eager to see if I really did find a solid
solution, or if it's just another pipe dream.)

I'm sorry for being so gruff in the following (not really), but it's
evident that you have not even fully read the paper which you cited,
the only paper, and that is highly irritating.

Look, follow this, and understand just how royally screwed your
approach is. Given the initial conditions:
int x = 0;
int y = 0;

Thread 1 executing:
x = 1;
y = 2;

And threads 2-5 each executing the following, concurrently with thread
1 and with each other:
cout << x << ' ' << y << endl;

On non-obscure hardware and compilers, that can print, /on a single
execution/, all 4 combinations:
0 0
0 2
1 0
1 2

Repeat, /on a single execution/. Different threads can see writes done
by another thread in different orders! Thread 2 could see the write to
x without seeing the write to y, and thread 3 could see the write to y
without seeing the write to x.
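
(If you want to watch this happen, here is a minimal sketch of that
experiment. It assumes a compiler with C++0x's std::thread, which is
beside the point of the C++03 discussion, and the program deliberately
contains a data race, so formally any output at all is permitted.)

#include <cstdio>
#include <thread>
#include <vector>

int x = 0;  // deliberately unsynchronized -- this is the data race
int y = 0;

int main() {
    std::thread writer([] { x = 1; y = 2; });
    std::vector<std::thread> readers;
    for (int i = 0; i < 4; ++i)
        readers.emplace_back([] { std::printf("%d %d\n", x, y); });
    writer.join();
    for (std::thread& t : readers) t.join();
}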

Technically, the standard guarantees even worse - undefined behavior,
but the above is an example that /really happens/ on /real/ hardware
and /real/ compilers. This is largely due to hardware "reorderings",
but let me quote the paper for the relevant bit:

Nothing you do can alter the fundamental problem: you need to be able
to specify a constraint on instruction ordering, and your language
gives you no way to do it.

To emphasize, it might be the compiler reordering it, it might be the
hardware reordering it, and it could even be some new kind of thing
which hasn't been invented yet! In practice the compiler writers and
hardware makers give guarantees for C++03 which mean that your code is
fundamentally broken. There is no standard or given guarantee of any
kind that any code written without proper synchronization will work.
None. Give your code to any compiler writer, and they will say "won't
work", which means that while it might incidentally work today,
tomorrow they might put in a new optimization (software or hardware)
and your code breaks. **This is the fundamental problem which you
cannot dodge!!**

Here's how you can break your code. Note again the general problem
that there are no guarantees that could even make it work, so I want
you to focus on the above basic problem, and do not spend too much
time on this, but I present it for completeness. For example:

Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      Singleton* temp = new Singleton;  // initialize to temp
      secondLock.lock();
      pInstance = temp;  // assign temp to pInstance
      secondLock.unlock();
    }
  }
  return pInstance;
}

A sufficiently smart compiler is allowed to move things from before a
lock to after the lock, and from after an unlock to before a lock. It
can transform the above to:

Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      secondLock.lock();
      Singleton* temp = new Singleton;  // initialize to temp
      pInstance = temp;  // assign temp to pInstance
      secondLock.unlock();
    }
  }
  return pInstance;
}

And once we get that, it's trivial to change it to:

Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      secondLock.lock();
      pInstance = new Singleton;
      secondLock.unlock();
    }
  }
  return pInstance;
}

Which means we're back to screwed for the reasons known to you. I
didn't even need to resort to the wonderful DEC Alpha and its split
cache, but I could have.

You need to reread the paper which you cited. Here's the link again
for your benefit.
http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
And pay attention this time.
 

Joshua Maurice

A sufficiently smart compiler is allowed to move things from before a
lock to after the lock, and from after an unlock to before a lock.

Ack, I meant:
A sufficiently smart compiler is allowed to move things from before a
lock to after the lock, and from after an unlock to before
**an unlock**.

Hopefully it's clear from context. My mistake. My apologies.
 

Joshua Maurice

This is largely due to hardware "reorderings",
but let me quote the paper for the relevant bit:
Nothing you do can alter the fundamental problem: you need to be able
to specify a constraint on instruction ordering, and your language
gives you no way to do it.
To emphasize, it might be the compiler reordering it, it might be the
hardware reordering it,

Let me underscore the problem, as the quotation above misses part of
the issue. Even when the instructions are executed in exactly the order
that you want them to be executed, different threads can see results in
a different order from the order in which the thread doing the stores
actually did them.

When data is shared between threads and at least one thread is writing
that data, all accesses to that data must be synchronized.

C++0x says that the behavior of a program that does not synchronize
such accesses is undefined.

I fully agree. I also stated that /very explicitly/ just before the
snipped quote with my 1+4 thread example. I did intend to clearly
state exactly that. At least I think I did. For future reference, was
this unclear? Or did you merely mean to emphasize this point? That is
a good point to emphasize.

I'm just a little confused.
 

James Kanze

Recently I've been reading up on "Double-Checked Locking" in
C++ and how it's often implemented imperfectly.

You mean, how it is impossible to implement in standard C++
(03).
The Article "C++ and the Perils of Double-Checked Locking" by
Scott Meyers and Andrei Alexandrescu
(http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf)
provides a good overview of how it's usually done and why it's
often inadequate.

The standard solution for implementing a singleton works just
fine:

Singleton* Singleton::instance()
{
    ScopedLock lock(mutex);
    if ( pInstance == NULL )
        pInstance = new Singleton;
    return pInstance;
}

If the lock is a performance bottleneck (a case I've yet to
encounter), then you can simply ensure that the singleton is
constructed before threads are started, and skip the lock.
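
(A sketch of that arrangement, with the lock-free fast path; it is
correct only under the stated discipline that instance() is called
once before any thread is started:)

Singleton* Singleton::instance()
{
    // No lock: safe only because instance() was called once in main(),
    // below, before any threads existed; after that, pInstance is
    // effectively read-only.
    if ( pInstance == NULL )
        pInstance = new Singleton;
    return pInstance;
}

int main()
{
    Singleton::instance();  // construct while still single-threaded
    // ... start threads; all later calls just read pInstance ...
}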

Anything else is simply added complexity, for no reason.
 

Pavel

Leigh said:
How does this prevent CPU reordering of stores to the pInstance pointer
and the object pInstance points to?
Any system-dependent mutex has to behave like a double-sided memory barrier (it
follows from the mutual exclusion requirement).

This does not prevent unnecessary overhead, though (grabbing a mutex in a
highly contended case can be expensive, and it is unnecessary when all the
contenders want is to read a pointer rather than change it -- which is the case
after the Singleton has been initialized).


-Pavel
 

James Kanze

How does this prevent CPU reordering of stores to the
pInstance pointer and the object pInstance points to?

Compiler magic. :-)

Presumably, somewhere in the constructor of ScopedLock, you end
up calling pthread_mutex_lock, or something similar. And if the
compiler is Posix compliant, it knows that it cannot move code
across a call to pthread_mutex_lock. (And of course, the code
in pthread_mutex_lock contains the necessary machine instructions
to ensure that the hardware respects the order. Posix requires
it.) (Replace Posix with Windows, and pthread_mutex_lock with the
corresponding Windows function, if that's your platform.)

Of course, if the compiler doesn't support multithreading (or
you didn't specify the options it requires for it to support
multithreading), then there is no solution; you can't use
multithreading, period.
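
(For illustration, a sketch of what such a ScopedLock might look like
over Posix; its implementation isn't shown in this thread, so this is
an assumption:)

#include <pthread.h>

class ScopedLock
{
public:
    explicit ScopedLock( pthread_mutex_t& m )
        : myMutex( m )
    {
        pthread_mutex_lock( &myMutex );    // code can't move across this...
    }
    ~ScopedLock()
    {
        pthread_mutex_unlock( &myMutex );  // ...nor across this
    }

private:
    pthread_mutex_t& myMutex;
    ScopedLock( ScopedLock const& );             // not copyable (C++03 style)
    ScopedLock& operator=( ScopedLock const& );
};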
 

James Kanze

On May 1, 10:26 pm, Pavel

[...]
This does not prevent unnecessary overhead, though
(grabbing a mutex in a highly contended case can be
expensive, and it is unnecessary when all the contenders want is to
read a pointer rather than change it -- which is the case
after the Singleton has been initialized).

Contention becomes relevant when two conditions are met: many
threads want to acquire the mutex, and having acquired the
mutex, a thread holds it for a certain time. In the case of the
singleton, once the object has been created, the thread holds
the mutex the time it takes for a comparison with null. That
is, a very, very short time. I have yet to see a case where
there was actual contention for the mutex in a singleton once
the singleton has been created. And if you do end up in such a
rare case, as I pointed out, calling Singleton::instance() once
before threading starts is sufficient to ensure that the mutex
isn't needed at all. Double check locking is a solution in
search of a problem, and to top it off, a solution which doesn't
work (at least in standard C++).
 

Pavel

James said:
On May 1, 10:26 pm, Pavel

[...]
This does not prevent unnecessary overhead, though
(grabbing a mutex in a highly contended case can be
expensive, and it is unnecessary when all the contenders want is to
read a pointer rather than change it -- which is the case
after the Singleton has been initialized).

Contention becomes relevant when two conditions are met: many
threads want to acquire the mutex, and having acquired the
mutex, a thread holds it for a certain time. In the case of the
singleton, once the object has been created, the thread holds
the mutex the time it takes for a comparison with null. That
is, a very, very short time. I have yet to see a case where
there was actual contention for the mutex in a singleton once
the singleton has been created.
I agree that most of the time it does not happen. However, I have met situations
where several threads start calling a short method on a single instance (like
sending/reading a simple but frequent message to/from a singleton queue or
similar) and, after they clash once, the situation called a "lock convoy"
emerges, where the threads spend most of their time switching from runnable to
blocked state.

In some old implementations of mutices this condition was made worse by CPU
cache traffic bursts occurring when the holding thread released a mutex -- all
blocked threads were awoken to see which would grab it first; some of the
current implementations have this aggravating issue solved (e.g. in the Linux
2.6 kernel it's solved via a futex-based implementation).

Regardless, as soon as a lock convoy is formed, it's kind of self-sustaining
until all contending threads but one receive a relatively long message to
process -- all at the same time -- for which you might have to wait a while.
And if you do end up in such a
rare case, as I pointed out, calling Singleton::instance() once
before threading starts is sufficient to ensure that the mutex
isn't needed at all. Double check locking is a solution in
search of a problem, and to top it off, a solution which doesn't
work (at least in standard C++).
I have never defended DCL. In fact, I have never defended the Singleton pattern.
IMHO it is the most-overused, the least-understood, the vaguest-defined of all
GoF patterns; and few people seem to have read the end of the "Consequences"
sub-section of the "Singleton" section of their book (acting on Consequences #4
and #5 there may render the whole discussion of DCL for Singleton irrelevant).

I can't recall when I last used one -- most of the time you are much better off
without. When I have to use one written by someone else, I try to acquire it
before starting any threads and cache it, as per your advice or per the article
referred to by the OP.
 

Scott Meyers

pinstance points to. The C++0x solution is:

#include <atomic>
std::atomic<Singleton*> pinstance;

if (pinstance == 0) {
  Lock lock;
  if (pinstance == 0)
    pinstance = new Singleton;
}
return pinstance;

Making pinstance an atomic variable ensures that all writes that "happen
before" the assignment to pinstance in one thread are seen before the
result of that assignment is seen in any other thread.

True, but that's not enough to make this code work. You also need to
know that assignment to pinstance can't take place until the Singleton
constructor has run to completion. You have that guarantee in C++0x,
too, so the code is correct, but it's not just the use of atomics that
gives you that correctness. You also need the additional guarantees
that C++0x offers regarding the behavior of new expressions (compared to
the weaker guarantees that C++03 gives you).
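
(For concreteness, a sketch of the quoted fragment fleshed out into a
complete function under C++0x. The explicit memory orders are one
conventional choice added here, not part of the quoted code; the
default sequentially consistent operations work too.)

#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton* instance();
private:
    Singleton() {}
    static std::atomic<Singleton*> pinstance;
    static std::mutex mtx;
};

std::atomic<Singleton*> Singleton::pinstance(0);
std::mutex Singleton::mtx;

Singleton* Singleton::instance() {
    Singleton* p = pinstance.load(std::memory_order_acquire);
    if (p == 0) {
        std::lock_guard<std::mutex> lock(mtx);
        p = pinstance.load(std::memory_order_relaxed);
        if (p == 0) {
            p = new Singleton;
            // The release store publishes the fully constructed object:
            // no thread can load the non-null pointer without also
            // seeing the constructor's writes.
            pinstance.store(p, std::memory_order_release);
        }
    }
    return p;
}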

Scott
 

Pavel

Leigh said:
The ellipsis was an indication that I find your claim that all mutex
implementations include memory barriers slightly dubious. :) From
Wikipedia:

"These primitives (mutexes) are *usually* implemented with the memory
barriers required to provide the expected memory visibility semantics.
In such environments explicit use of memory barriers is not *generally*
necessary."

I guess a mutex implementation that didn't include memory barriers on a
system where memory barriers are needed would not be very useful.
Yes, that's what I meant. Wikipedia is a great resource to find an explanation
of a concept but it does not have to tie all ends together with the strictness
of a specification. In our case, the article on Mutual Exclusion says:

"Mutual exclusion .. algorithms are used in concurrent programming to avoid the
simultaneous use of a common resource, such as a global variable, by pieces of
computer code called critical sections. A critical section is a piece of code in
which a process or thread accesses a common resource".

In our case, we want mutual exclusion for threads and our "common resource" is
memory containing pInstance. A mutex that in fact does *not* allow a thread to
access the common resource (e.g. due to not executing a memory barrier) does not
seem to satisfy the above definition.

Then, of course, another article (on memory barriers) written by other people
does not agree... I do not see an issue with that. Probably they saw something
called a "mutex" that required the mutex client code to additionally exercise a
memory barrier... in my view, and that of the first Wikipedia article, it would
not be a complete mutex anyway. I am fully content with being in this sort of
disagreement with one or more articles at Wikipedia.

-Pavel
 

James Kanze

James Kanze wrote:
[...]
In fact, I have never defended the Singleton pattern. IMHO it
is the most-overused, the least-understood, the
vaguest-defined of all GoF patterns; and few people seem to
have read the end of the "Consequences" sub-section of the
"Singleton" section of their book (acting on Consequences #4
and #5 there may render the whole discussion of DCL for
Singleton irrelevant).

I don't think that the presentation in the GoF was actually that
vague. The problem is that as presented in the GoF, it is a
form of contract enforcement: the design says that there may
only be one instance, so we design the class to enforce the
design. Which per se is probably overkill: the semantics of
most singletons are such that no reasonable programmer would
create more than one instance anyway, so it's not worth the
extra effort to forbid it. In the case of C++, however, the
singleton idiom is most often used primarily to solve order of
initialization issues; once you've created the code necessary to
solve those, making it a singleton (enforcing only one instance)
is merely a question of where you put the "public:".
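
(A sketch of that idiom, names mine: the function-local static solves
the order-of-initialization problem across translation units, and the
placement of the access specifier alone decides whether the class is
also a singleton. In C++03 the first call must still happen before
threading starts.)

class Config
{
public:
    static Config& instance()
    {
        static Config* theOne = new Config;  // constructed on first use,
        return *theOne;                      // never destructed
    }
private:
    Config() {}  // make this public and it's no longer a singleton
};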
 

Joshua Maurice

I didn't realize this was guaranteed for all systems...

Anything that calls itself a mutex or lock in the C++0x or Java memory
model, or any sane memory model for that matter, must guarantee that.
That's what the terms "mutex", "lock", and "critical section" mean. It
guarantees that properly synchronized code will execute according to
the guarantees of the language, and if some hardware needs a
"double-sided memory barrier", then it needs a "double-sided memory
barrier".
 

Joshua Maurice

True, but that's not enough to make this code work.  You also need to
know that assignment to pinstance can't take place until the Singleton
constructor has run to completion.  You have that guarantee in C++0x,
too, so the code is correct, but it's not just the use of atomics that
gives you that correctness.  You also need the additional guarantees
that C++0x offers regarding the behavior of new expressions (compared to
the weaker guarantees that C++03 gives you).

I don't think I'd describe it that way.

Consider:
pinstance = new Singleton;

For a simple pointer in C++03, there are no threading guarantees
given, so of course it can implement it however it wants, which is
apparently frequently enough something like the following pseudo-code
pinstance = malloc(sizeof(Singleton));
new(pinstance) Singleton();
Even under POSIX, to detect the difference between the above and the
following pseudo-code
temp = malloc(sizeof(Singleton));
new(temp) Singleton();
pinstance = temp;
would require code that has a race condition, which means undefined
behavior, which means that the compiler can do whatever it wants.

In C++0x, the situation is unchanged for a simple pointer. There are
no guarantees specific to "new expressions" which change this. The
compiler is allowed to implement it in exactly the same way. If you
have some code that could detect this, then you still have a race
condition, and thus undefined behavior, and thus all bets are off, so
the compiler is free to do as it wants.

What does change is when we throw std::atomic into the mix. When
pinstance is of type std::atomic<Singleton*>, the following:
pinstance = new Singleton;
is equivalent to the following, a la operator overloading:
pinstance.store(new Singleton);
And C++0x does have very specific guarantees concerning std::atomic.
It doesn't matter that the argument is a new expression. It could be
any sort of expression or function call. There is no guarantee
specific to new expressions. There are specific guarantees relating to
std::atomic::store and std::atomic::load, and there are guarantees
that describe all expressions, but not new expressions in particular.

Perhaps you were talking about the guarantees specific to function
local static variables? C++0x does guarantee that their initialization
is properly synchronized so that the first access will initialize it,
and all other threads will wait until it's done, and all other threads
will properly see the initialized function local static variable.
However, Pete Becker's code does not use a function local static, and
even then the guarantees about function local statics are not specific
to new expressions. You could initialize one with a normal function
call or some other expression which isn't a new expression, and the
same guarantees apply.
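
(For reference, a sketch of that function-local-static form; under the
C++0x guarantee just described, this is the simplest thread-safe lazy
singleton:)

class Singleton {
public:
    static Singleton& instance() {
        // C++0x guarantees this initialization runs exactly once, with
        // other threads blocking until it completes.
        static Singleton theInstance;
        return theInstance;
    }
private:
    Singleton() {}
};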
 

Chris M. Thomasson

[...]
On 01/05/2011 22:26, Pavel wrote: [...]
Any system-dependent mutex has to behave like a double-sided memory
barrier (it follows from the mutual exclusion requirement).

I didn't realize this was guaranteed for all systems...

The ellipsis was an indication that I find your claim that all mutex
implementations include memory barriers slightly dubious. :) From
Wikipedia:

"These primitives (mutexes) are *usually* implemented with the memory
barriers required to provide the expected memory visibility semantics. In
such environments explicit use of memory barriers is not *generally*
necessary."

I guess a mutex implementation that didn't include memory barriers on a
system where memory barriers are needed would not be very useful.

FWIW, there are mutex implementations that can "skip" explicit memory
barrier operations. You basically need to think in terms of "Asymmetric
Synchronization". Here is some more information on the subject:

http://blogs.sun.com/dave/resource/Asymmetric-Dekker-Synchronization.txt

http://www.trl.ibm.com/people/kawatiya/Kawachiya05phd.pdf
(chapter 6)
 

Pavel

James said:
James Kanze wrote:
[...]
In fact, I have never defended the Singleton pattern. IMHO it
is the most-overused, the least-understood, the
vaguest-defined of all GoF patterns; and few people seem to
have read the end of the "Consequences" sub-section of the
"Singleton" section of their book (acting on Consequences #4
and #5 there may render the whole discussion of DCL for
Singleton irrelevant).

I don't think that the presentation in the GoF was actually that
vague. The problem is that as presented in the GoF, it is a
form of contract enforcement: the design says that there may
only be one instance,
That's where the vagueness started (in the definition of the problem, not the
analysis, and especially the Consequences, which most people don't read to the
end).

"Only one" -- but in what context?
One per process?
a thread?
the whole world?
a CPU?
one computer?
corporate network?
a region of the corporate network?

All of the above make sense and are being actively used. 50% of singletons that
were defined in single-threaded programs as one-per-process (and, hence,
per-thread) should be moved to TLS and made "one-per-thread" (and hence,
"many-per-process") when converting to a multi-threaded program. The other 50%
should not (but of these 50%, some would gain from changing the requirement to
"only one per machine" and becoming servers, and yet others could well become a
pool with a many-to-many relationship to the available threads or CPUs -- or
cores or who knows what). The techniques for enforcing the "singleness"
contracts above are meaningfully different.

The other vagueness of "only one" is illustrated by the question "during which
period of time"? Should it be the only one intance per life of a process or a
thread? Or it should outlive some processes or threads? Or there should be
exactly one (or no more than one?) instance "at at time" but it can be
re-created several times during the life of a process (or a thread)? All of the
above are useful in particular circumstances and enforcement techniques of
differ significantly on the actual specific requirements (Andrei Alexandrescu
illustrated the latter requirement with his "Phoenix singleton" composed of a
set of different generic policies. Still he did not cover many of the above
variances and I can't imagine a useful implementation that does. Even his
singleton is more proof-of-concept than production-ready code despite relatively
high quality (and of course it does not even try to address the performance
issue of thread-safe singleton that DCL tried to solve. Phoenix singleton does
not work with DCL -- or fixed-with-atomic-DCL -- too well)).

so we design the class to enforce the
design. Which per se is probably overkill: the semantics of
most singletons are such that no reasonable programmer would
create more than one instance anyway, so it's not worth the
extra effort to forbid it. In the case of C++, however, the
singleton idiom is most often used primarily to solve order of
initialization issues; once you've created the code necessary to
solve those, making it a singleton (enforcing only one instance)
is merely a question of where you put the "public:".
Yeah, that's another bunch of good questions (and don't even get me started on
order-of-destruction issues). I prefer to explicitly give one object to the
constructor of another when one depends on the other, or to create them within
a manager object that ensures the sequence. Then you end up with one manager
for all interdependent "singletons" in a particular program. Take the next step
and create that manager in main() (or, if you are a library writer, make the
user create it in his main(); or, if your user is a library writer too, make
him include your manager in his own manager, which he has to make his user
create in main(), and so on recursively) and there is no need for a singleton.
BUT... this only works if no one in the chain of library writers decided to put
singleness inside his or her class (that is, if s/he did not deny his or her
users the choice to create an object of a class defined in his/her library at
will). I am trying not to be such a nuisance; that's why I can't remember the
last time I decided to encapsulate singleness in my class.
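
(A sketch of that manager arrangement, with made-up component names:
construct the interdependent objects in one place, in a known order,
and pass them down explicitly.)

#include <iostream>

struct Logger   { Logger()                   { std::cout << "logger up\n"; } };
struct Database { explicit Database(Logger&) { std::cout << "db up\n"; } };
struct Cache    { explicit Cache(Database&)  { std::cout << "cache up\n"; } };

// Members are constructed in declaration order and destructed in reverse,
// so order-of-initialization and order-of-destruction are both pinned down.
struct Services {
    Logger   logger;
    Database db;
    Cache    cache;
    Services() : logger(), db(logger), cache(db) {}
};

int main() {
    Services services;  // owned by main(); pass it (or its members) down
                        // explicitly instead of reaching for a singleton
}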

-Pavel
 

Pavel

Chris said:
[...]
On 01/05/2011 22:26, Pavel wrote: [...]
Any system-dependent mutex has to behave like a double-sided memory
barrier (it follows from the mutual exclusion requirement).

I didn't realize this was guaranteed for all systems...

The ellipsis was an indication that I find your claim that all mutex
implementations include memory barriers slightly dubious. :) From
Wikipedia:

"These primitives (mutexes) are *usually* implemented with the memory
barriers required to provide the expected memory visibility semantics. In
such environments explicit use of memory barriers is not *generally*
necessary."

I guess a mutex implementation that didn't include memory barriers on a
system where memory barriers are needed would not be very useful.

FWIW, there are mutex implementations that can "skip" explicit memory
barrier operations. You basically need to think in term of "Asymmetric
Synchronization". Here is some more information on the subject:

http://blogs.sun.com/dave/resource/Asymmetric-Dekker-Synchronization.txt

http://www.trl.ibm.com/people/kawatiya/Kawachiya05phd.pdf
(chapter 6)
Thanks, Chris, for the interesting resources! The techniques described there do
show how to avoid executing memory barriers in both threads and still achieve
mutual exclusion. My statement above still holds: these mutices "behave like a
double-sided memory barrier" (quoting myself) without actually executing one.
They ensure the other threads see the changes in the correct order, which
answers Leigh's original question ("How does this prevent CPU reordering of
stores to the pInstance pointer and the object pInstance points to?").

All that damned C++ "as-if" legalese... I never imagined myself using such
language by accident... but that's exactly what happened.

-Pavel
 

Joshua Maurice

Here's how you can break your code. Note again the general problem
that there are no guarantees that could even make it work, so I want
you to focus on the above basic problem, and do not spend too much
time on this, but I present it for completeness. For example:

    Singleton* Singleton::instance() {
      if (pInstance == 0) {
        Lock lock;
        if (pInstance == 0) {
          Singleton* temp = new Singleton;  // initialize to temp
          secondLock.lock();
          pInstance = temp;  // assign temp to pInstance
          secondLock.unlock();
        }
      }
      return pInstance;
    }

A sufficiently smart compiler is allowed to move things from before a
lock to after the lock, and from after an unlock to before a lock. It
can transform the above to:

    Singleton* Singleton::instance() {
      if (pInstance == 0) {
        Lock lock;
        if (pInstance == 0) {
          secondLock.lock();
          Singleton* temp = new Singleton;  // initialize to temp
          pInstance = temp;  // assign temp to pInstance
          secondLock.unlock();
        }
      }
      return pInstance;
    }

And once we get that, it's trivial to change it to:

    Singleton* Singleton::instance() {
      if (pInstance == 0) {
        Lock lock;
        if (pInstance == 0) {
          secondLock.lock();
          pInstance = new Singleton;
          secondLock.unlock();
        }
      }
      return pInstance;
    }

Which means we're back to screwed for the reasons known to you. I
didn't even need to resort to the wonderful DEC Alpha and its split
cache, but I could have.

You need to reread the paper which you cited. Here's the link again
for your benefit. http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
And pay attention this time.

I was looking over this while searching for interview questions, and I
realized I made a mistake. You cannot make the first reordering so
easily due to the lock nonsense inside the constructor. However, a
sufficiently smart compiler could notice your clever ruse, optimize
away the assert as always true, see a lock and unlock pair guarding
nothing, optimize that away, and then move the assignment to temp past
the mutex acquire, as demonstrated above.

Of course, as I tried to emphasize, and as I will emphasize again now,
that's not the only kind of thing which can break it. Cache coherency
problems can lead to quite bizarre situations, especially with a
processor with a weak memory model. Of course, the most important
detail is that you are not given guarantees that it will work. Thus
implementers, hardware and compiler and linker, and anything new which
hasn't been invented yet, are free to break your code, because it has
undefined behavior, because your program breaks the rules.
 

Gerhard Fiedler

Joshua said:
However, a sufficiently smart compiler could notice your clever ruse,
optimize away the assert as always true, see a lock and unlock pair
guarding nothing, optimize that away, and then move the assignment to
temp past the mutex acquire, as demonstrated above.

Regarding the compiler optimizing away a lock/unlock pair guarding
"nothing": AIUI both lock and unlock need to provide certain fences.
Therefore, again AIUI, they can't be optimized away by the compiler even
if there's nothing in between, because that would remove the fences and
alter the behavior.

Am I missing something here?

Gerhard
 

Joshua Maurice

Regarding the compiler optimizing away a lock/unlock pair guarding
"nothing": AIUI both lock and unlock need to provide certain fences.
Therefore, again AIUI, they can't be optimized away by the compiler even
if there's nothing in between, because that would remove the fences and
alter the behavior.

Am I missing something here?

That may be how they're commonly implemented, but those aren't the
guaranteed semantics. On a given implementation, two different mutexes
may in fact give "happens-before" effects between one another, but
there's nothing guaranteed about it.

I was ambiguously describing the situation, possibly to the extent of
being wrong. Allow me to correct myself.

Remember the code:

Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      Singleton* temp = new Singleton;  // initialize to temp
      secondLock.lock();
      pInstance = temp;  // assign temp to pInstance
      secondLock.unlock();
    }
  }
  return pInstance;
}

Singleton::Singleton() {
  secondLock.lock();
  assert( pInstance == 0 );
  secondLock.unlock();
}

The compiler can expand inline as follows (pseudo-code):

Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      Singleton* temp = (Singleton*) ::operator new(sizeof(Singleton));

      secondLock.lock();
      assert( pInstance == 0 );
      secondLock.unlock();

      secondLock.lock();
      pInstance = temp;  // assign temp to pInstance
      secondLock.unlock();
    }
  }
  return pInstance;
}

With that, it sees:
someMutex.lock();
<blah1>
someMutex.unlock();
<blah2>
someMutex.lock();
<blah3>
someMutex.unlock();

The compiler sees a lock, unlock, lock, unlock, in straight-line code,
without branching (or exceptions, or volatile (to keep signal handlers
correct)). The compiler is totally free to replace that with:
someMutex.lock();
<blah1>
<blah2>
<blah3>
someMutex.unlock();

It may be bad from a QoI (quality of implementation) standpoint to do
that. It depends heavily on the situation. I see it as quite
reasonable that the compiler could employ some heuristics to deduce
when it's a savings to combine critical sections. In this case,
<blah1> and <blah2> are actually empty, no-ops, so it's always a
savings to combine the two adjacent critical sections as such.

So, it can transform it to:

Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      Singleton* temp = (Singleton*) ::operator new(sizeof(Singleton));

      secondLock.lock();
      assert( pInstance == 0 );
      pInstance = temp;  // assign temp to pInstance
      secondLock.unlock();
    }
  }
  return pInstance;
}

With some simple data flow analysis, and allowed motions past locks,
it can transform it to:
Singleton* Singleton::instance() {
  if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
      secondLock.lock();
      pInstance = (Singleton*) ::operator new(sizeof(Singleton));
      secondLock.unlock();
    }
  }
  return pInstance;
}

----

Now, back to my original much more controversial statement - can a
compiler simply remove a lock unlock pair? Ex:
mutex.lock();
mutex.unlock();
Maybe. I mentioned "clever ruse" with whole program optimization in
mind. (However, upon thinking about it, I just showed that you don't
even need whole program optimization.) Without whole program
optimization, I think no. Could someone please more educated weigh
in?

Thus far, after 10 minutes of attempts just now to write a conforming
race-free program where you could tell the difference if a compiler
simply removed an "empty" mutex acquire release pair, the only
programs I can find are ones that would deadlock before the change,
and not deadlock after the change. A deadlock is observable behavior,
so a compiler cannot remove it for that reason.
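
(A sketch of the kind of program I mean, using C++0x threads for
concreteness; with the "empty" pair present this deadlocks, and
removing the pair would let it run to completion -- an observable
difference:)

#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;

int main() {
    m.lock();                          // main acquires m and never releases it
    std::thread t([] {
        m.lock();                      // the "empty" acquire/release pair:
        m.unlock();                    // blocks forever while main holds m
        std::cout << "unreachable\n";
    });
    t.join();                          // deadlock: t never returns
}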
 
