Compiler ordering barriers in C++0x

Dmitriy Vyukov · Apr 24, 2008

Will it be possible to issue only compiler ordering barrier (release,
acquire or full), but not hardware ordering barrier in C++0x?

Accesses to volatile variables are ordered only with respect to
accesses to other volatile variables. What I want it to order access
to variable with respect to accesses to all other volatile and non-
volatile variables (w/o any hardware barriers, only compiler
ordering).

Now it can be accomplished with "__asm__ __volatile__
("" : : :"memory")" on gcc, and with _ReadWriteBarrier() on msvc. I am
interested whether it will be possible to do this in C++0x language w/
o compiler dependency.

Such compiler barriers are useful in effective synchronization
algorithms like SMR+RCU:
http://sourceforge.net/project/showfiles.php?group_id=127837
(fastsmr package)
because they allows one to eliminate all hardware memory barriers from
fast-path.

Dmitriy V'jukov

nickf3 · Apr 28, 2008

Will it be possible to issue only compiler ordering barrier (release,
acquire or full), but not hardware ordering barrier in C++0x?

Accesses to volatile variables are ordered only with respect to
accesses to other volatile variables. What I want it to order access
to variable with respect to accesses to all other volatile and non-
volatile variables (w/o any hardware barriers, only compiler
ordering).

Now it can be accomplished with "__asm__ __volatile__
("" : : :"memory")" on gcc, and with _ReadWriteBarrier() on msvc. I am
interested whether it will be possible to do this in C++0x language w/
o compiler dependency.

Such compiler barriers are useful in effective synchronization
algorithms like SMR+RCU:http://sourceforge.net/project/showfiles.php?group_id=127837
(fastsmr package)
because they allows one to eliminate all hardware memory barriers from
fast-path.

Dmitriy V'jukov

Is this really possible without explicit hardware support
on archs like sparc with partial store-ordering?
Just wondering.
--
Nikolai

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Dmitriy Vyukov · May 1, 2008

Is this really possible without explicit hardware support
on archs like sparc with partial store-ordering?
Just wondering.

What is possible? Enforcement of compiler ordering? It's possible
everywhere, and it affects only compiler, not hardware platform.

Dmitriy V'jukov

Dmitriy Vyukov · May 1, 2008

Will it be possible to issue only compiler ordering barrier (release,
acquire or full), but not hardware ordering barrier in C++0x?

Accesses to volatile variables are ordered only with respect to
accesses to other volatile variables. What I want it to order access
to variable with respect to accesses to all other volatile and non-
volatile variables (w/o any hardware barriers, only compiler
ordering).

Now it can be accomplished with "__asm__ __volatile__
("" : : :"memory")" on gcc, and with _ReadWriteBarrier() on msvc. I am
interested whether it will be possible to do this in C++0x language w/
o compiler dependency.

I think that the answer is NO - C++0x will not allow only compiler
ordering. I just want to be sure.

It's bad news. It means that one will be able to implement only
'basic' synchronization algorithms in portable way. As for 'advanced'
synchronization algorithms like SMR+RCU:
http://sourceforge.net/project/showfiles.php?group_id=127837
(fastsmr package)

asymmetric reader-writer mutex:
http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/8c6c19698964d1f6

sequence mutex:
http://en.wikipedia.org/wiki/Seqlock

one still will have to build 'my own atomic library' and port it to
every hardware platform and compiler. C++0x was so promising for
portable synchronization algorithms...
Yikes!

Dmitriy V'jukov

Anthony Williams · May 3, 2008

Szabolcs Ferenczi said:
[...]
All accesses to shared data MUST be synchronized with atomics: [...]

Click to expand...

Can you elaborate this point please. How can you generally synchronise
N processes with help of atomics? Do you mean only two processes under
certain circumstances?

If any thread modifies shared data that is not of type atomic_xxx, the
developer must ensure appropriate synchronization with any other thread that
accesses that shared data in order to avoid a data race (and the undefined
behaviour that comes with that). If you're using atomics, that means you must
have (at least) a release on some atomic variable by the modifying thread
after the modify, and an acquire on the same atomic variable by the other
thread before its access (read or write).

If you have N threads accessing the shared data, and one does a modify, all
the other N-1 threads must be properly synchronized with the modifying thread
as above in order to avoid data races.

Of course, you could also just use a mutex lock or join with the thread doing
the modification.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Dmitriy V'jukov · May 3, 2008

Within a single thread, all operations *are* ordered. The ordering relations
are happens-before and sequenced-before. A is sequenced-before before B when
A physically comes before B in the source code, or the sequencing rules say so
(e.g. lhs of a comma op is sequenced before rhs).

If A is sequenced-before B, then A also happens-before B. If you then
introduce a synchronization edge between thread T1 and thread T2
(e.g. store(mem_order_release) on T1, load(mem_order_acquire) on T2), then
anything that happens-before the store also happens-before the load.

All accesses to shared data MUST be synchronized with atomics: if there are
two accesses to a non-atomic variable, at least one of which is a write, and
there is NOT a happens-before relation between them, you have a data race, and
undefined behaviour.

If you can prove that you have a happens-before relation, you're safe,
otherwise you need to do some synchronization, or perhaps use
mem_order_relaxed.

I can prove that I have happens-before relation if I can enforce
correct compiler ordering. Consider following code:

std::vector<int> g_nonatomic_user_data;
std::atomic_int g_atomic1;
std::atomic_int g_atomic2;

std::vector<int> thread()
{
g_atomic1.store(1, std::memory_order_relaxed);
// compiler store-load fence
if (g_atomic2.load(std::memory_order_relaxed))
{
g_atomic1.store(0, std::memory_order_relaxed);
return std::vector<int>();
}
// compiler acquire fence
std::vector<int> local = g_nonatomic_user_data;
// compiler release fence
g_atomic1.store(0, std::memory_order_relaxed);
return local;
}

It's a kind of Peterson's algorithm *but* w/o store-load memory
barrier, and w/o acquire and release barriers. The point is that I
still can prove happens-before relation wrt accesses to
g_nonatomic_user_data, if I can enforce mentioned compiler barriers.
Basically I need to ensure that accesses to g_nonatomic_user_data will
not hoist above or sink below accesses to g_atomic1 in generated
machine code.
Now I can do this with _ReadWriteBarrier (msvc) or "__asm__
__volatile__ ("" : : :"memory")" (gcc).

Dmitriy V'jukov

Anthony Williams · May 3, 2008

Dmitriy V'jukov said:
I can prove that I have happens-before relation if I can enforce
correct compiler ordering. Consider following code:

std::vector<int> g_nonatomic_user_data;
std::atomic_int g_atomic1;
std::atomic_int g_atomic2;

std::vector<int> thread()
{
g_atomic1.store(1, std::memory_order_relaxed);
// compiler store-load fence
if (g_atomic2.load(std::memory_order_relaxed))
{
g_atomic1.store(0, std::memory_order_relaxed);
return std::vector<int>();
}
// compiler acquire fence
std::vector<int> local = g_nonatomic_user_data;
// compiler release fence
g_atomic1.store(0, std::memory_order_relaxed);
return local;
}

It's a kind of Peterson's algorithm *but* w/o store-load memory
barrier, and w/o acquire and release barriers. The point is that I
still can prove happens-before relation wrt accesses to
g_nonatomic_user_data, if I can enforce mentioned compiler barriers.
Basically I need to ensure that accesses to g_nonatomic_user_data will
not hoist above or sink below accesses to g_atomic1 in generated
machine code.
Now I can do this with _ReadWriteBarrier (msvc) or "__asm__
__volatile__ ("" : : :"memory")" (gcc).

You're using relaxed atomics. This is dangerous. In particular, relaxed
atomics do NOT provide happens-before relations between threads. In
particular, the relaxed load of g_atomic2 doesn't give you any guarantee about
the state of g_nonatomic_user_data.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Chris Thomasson · May 4, 2008

[...]

I can prove that I have happens-before relation if I can enforce
correct compiler ordering. Consider following code:

std::vector<int> g_nonatomic_user_data;
std::atomic_int g_atomic1;
std::atomic_int g_atomic2;

std::vector<int> thread()
{
g_atomic1.store(1, std::memory_order_relaxed);
// compiler store-load fence
if (g_atomic2.load(std::memory_order_relaxed))

[...]

Are you sure that atomic store with 'memory_order_relaxed' semantics injects
a #StoreLoad after the operation?

Szabolcs Ferenczi · May 4, 2008

Szabolcs Ferenczi said:
Szabolcs Ferenczi said:

[...]
All accesses to shared data MUST be synchronized with atomics: [...]

Click to expand...

Click to expand...

Can you elaborate this point please. How can you generally synchronise
N processes with help of atomics? Do you mean only two processes under
certain circumstances?

Click to expand...

If any thread modifies shared data that is not of type atomic_xxx, the
developer must ensure appropriate synchronization with any other thread that
accesses that shared data in order to avoid a data race (and the undefined
behaviour that comes with that).

It is clear that you must synchronise access to shared variable.
Normally you must use a Critical Region for that.

I was curious how do you synchronise access to shared data with
atomics. Note that atomics only provide this synchronisation for the
access of the atomics themselves but you claimed something like with
atomics you can synchronise access to non-atomic shared data. How? Can
you provide example, please.

If you're using atomics, that means you must
have (at least) a release on some atomic variable by the modifying thread
after the modify, and an acquire on the same atomic variable by the other
thread before its access (read or write).

It is a bit too an abstract description to me. E.g. "after the modify"
of what? Can you provide some example? I hope you do not mean that the
access of an atomic variable can help in the synchronised access to
another variable that is not an atomic one. Let us see some examples.

[...]
Of course, you could also just use a mutex lock or join with the thread doing
the modification.

That is correct. You can synchronise access with mutexes (implementing
a Critical Region by hand).

I think you refer with the "join with the thread" phrase the end of a
structured parallel block where a shared variable becomes a non-shared
one. Again, an example could help. Please give an example illustrating
what you mean.

Best Regards,
Szabolcs

Anthony Williams · May 5, 2008

Szabolcs Ferenczi said:
Szabolcs Ferenczi said:

[...]
All accesses to shared data MUST be synchronized with atomics: [...]

Click to expand...

Can you elaborate this point please. How can you generally synchronise
N processes with help of atomics? Do you mean only two processes under
certain circumstances?

Click to expand...

If any thread modifies shared data that is not of type atomic_xxx, the
developer must ensure appropriate synchronization with any other thread that
accesses that shared data in order to avoid a data race (and the undefined
behaviour that comes with that).

Click to expand...

It is clear that you must synchronise access to shared variable.
Normally you must use a Critical Region for that.

I was curious how do you synchronise access to shared data with
atomics. Note that atomics only provide this synchronisation for the
access of the atomics themselves but you claimed something like with
atomics you can synchronise access to non-atomic shared data. How? Can
you provide example, please.

If you use acquire/release pairs or seq_cst atomics, then they introduce a
synchronizes-with relation between threads. This in turn introduces a
happens-before relation, which makes the data modified by the thread that did
the store/release visible to the thread that did the load/acquire. This allows
the use of atomics to build mutexes and other synchronization primitives.

std::atomic_flag f=ATOMIC_FLAG_INIT;
std::vector<std::string> shared_data;

void thread_a()
{
while(f.test_and_set(std::memory_order_acq_rel)); // spin lock
shared_data.push_back("hello");
f.clear(std::memory_order_release); // unlock
}

void thread_b()
{
while(f.test_and_set(std::memory_order_acq_rel)); // spin lock
if(!shared_data.empty())
{
std::cout<<shared_data.front()<<std::endl;
shared_data.pop_front();
}
f.clear(std::memory_order_release); // unlock
}

The clear() from thread_a is a release operation, so synchronizes with the
test_and_set (which is an acquire) in thread_b, which can therefore see the
modification to shared_data, since the push_back happens-before the clear,
which happens-before the test-and-set, which happens-before the accesses in
thread_b.

[...]
Of course, you could also just use a mutex lock or join with the thread doing
the modification.

Click to expand...

That is correct. You can synchronise access with mutexes (implementing
a Critical Region by hand).

I think you refer with the "join with the thread" phrase the end of a
structured parallel block where a shared variable becomes a non-shared
one. Again, an example could help. Please give an example illustrating
what you mean.

std::thread t(thread_a); // thread_a from above
t.join();

assert(shared_data.back()=="hello");

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Dmitriy Vyukov · May 5, 2008

[this is in response to msg #13688, which has not been processed yet... I
made a mistake, here is correction...]

[...]

I can prove that I have happens-before relation if I can enforce
correct compiler ordering. Consider following code:
std::vector<int> g_nonatomic_user_data;
std::atomic_int g_atomic1;
std::atomic_int g_atomic2;
std::vector<int> thread()
{
g_atomic1.store(1, std::memory_order_relaxed);
// compiler store-load fence
if (g_atomic2.load(std::memory_order_relaxed)) [...]

Click to expand...

Are you sure that atomic store with 'memory_order_relaxed' semantics
injects
a #StoreLoad after the operation?

Click to expand...

Whoops! I did not read the word COMPILER! Sorry about that. Anyway, I think
that
'g_atomic2.load()' can rise above 'g_atomic1.store()'...

.... if you don't use system_wide_synchronize() on "the other side"

It's a sketch of reader-side of asymmetric reader-writer mutex.

Dmitriy V'jukov

Chris Thomasson · May 5, 2008

Szabolcs Ferenczi said:
Szabolcs Ferenczi said:

[...]
All accesses to shared data MUST be synchronized with atomics: [...]

Click to expand...

Can you elaborate this point please. How can you generally synchronise
N processes with help of atomics? Do you mean only two processes under
certain circumstances?

Click to expand...

If any thread modifies shared data that is not of type atomic_xxx, the
developer must ensure appropriate synchronization with any other thread
that
accesses that shared data in order to avoid a data race (and the
undefined
behaviour that comes with that).

Click to expand...

It is clear that you must synchronise access to shared variable.
Normally you must use a Critical Region for that.

I was curious how do you synchronise access to shared data with
atomics.

Really? How do you think some mutexs, semaphores, non-blocking algorithms
ect, ect, are actually implemented? IMHO, C++ should be at a low enough
level to create fairly efficient custom sync primitives. You can use atomics
to create different forms of reader-writer patterns. Are you familiar with
basic concepts of RCU?

Note that atomics only provide this synchronisation for the
access of the atomics themselves but you claimed something like with
atomics you can synchronise access to non-atomic shared data. How? Can
you provide example, please.

[...]

Here is one way to use atomics:

#ifndef MUTEX_ERROR_UNEXPECTED
# define MUTEX_ERROR_UNEXPECTED assert(false), std::unexcepted
#endif

class mutex {
enum constant {
UNLOCKED = 0,
LOCKED = 1,
CONTENTION = 2
};

atomic_word m_state;
os_event m_waitset;

public:
mutex() : m_state(UNLOCKED), (os_event_create(...)) {
if (m_waitset == OS_EVENT_INVALID) {
throw std::exception();
}
}

~mutex() throw() {
if (m_state != UNLOCKED || ! os_event_destroy(m_waitset)) {
MUTEX_ERROR_UNEXPECTED();
}
}

void lock() throw() {
if (ATOMIC_SWAP(&m_state, LOCKED)) {
while (ATOMIC_SWAP(&m_state, CONTENTION)) {
if (! os_event_wait(m_waitset)) {
MUTEX_ERROR_UNEXPECTED();
}
}
}
MEMBAR #StoreLoad | #StoreStore;
}

bool trylock() throw() {
if (! ATOMIC_CAS(&m_state, UNLOCKED, LOCKED)) {
return false;
}
MEMBAR #StoreLoad | #StoreStore;
}

void unlock() throw() {
MEMBAR #LoadStore | #StoreStore;
if (ATOMIC_SWAP(&m_state, UNLOCKED) == CONTENTION) {
if (! os_event_set(m_waitset)) {
MUTEX_ERROR_UNEXPECTED();
}
}
}
};

;^)

Do you think that C++ should _not_ be at a level that is low enough to
create custom non-blocking synchronization primitives?

James Kanze · May 5, 2008

[...]
All accesses to shared data MUST be synchronized with atomics: [...]
Can you elaborate this point please. How can you
generally synchronise N processes with help of atomics?
Do you mean only two processes under certain
circumstances?
If any thread modifies shared data that is not of type
atomic_xxx, the developer must ensure appropriate
synchronization with any other thread that accesses that
shared data in order to avoid a data race (and the
undefined behaviour that comes with that).

Click to expand...

It is clear that you must synchronise access to shared
variable. Normally you must use a Critical Region for that.

Click to expand...

It would be interesting to know what you mean by a "Critical
Region". My understanding of it is simply that it is a section
of code which is protected by some synchronization mechanism, so
that 1) only one thread can enter it at a time, and 2) memory
synchronization occurs on entering and leaving. In which case,
it's neither necessary nor sufficient to ensure "appropriate
synchronization". But I think Microsoft has redefined it to
mean something else.

Really? How do you think some mutexs, semaphores, non-blocking
algorithms ect, ect, are actually implemented?

They synchronize access. On my usual platform (Sun Sparc), this
means inserting the appropriate membar instructions before
and/or after the access. Something which, of course, you can't
do in current C++; you need some inline assembler. (But you
know that.)

--
James Kanze (GABI Software) email:[email protected]
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Szabolcs Ferenczi · May 5, 2008

Szabolcs Ferenczi said:
Szabolcs Ferenczi said:

[...]
All accesses to shared data MUST be synchronized with atomics: [...]
Can you elaborate this point please. How can you generally synchronise
N processes with help of atomics? Do you mean only two processes under
certain circumstances?
If any thread modifies shared data that is not of type atomic_xxx, the
developer must ensure appropriate synchronization with any other thread that
accesses that shared data in order to avoid a data race (and the undefined
behaviour that comes with that).

Click to expand...

Click to expand...

It is clear that you must synchronise access to shared variable.
Normally you must use a Critical Region for that.

Click to expand...

I was curious how do you synchronise access to shared data with
atomics. Note that atomics only provide this synchronisation for the
access of the atomics themselves but you claimed something like with
atomics you can synchronise access to non-atomic shared data. How? Can
you provide example, please.

Click to expand...

If you use acquire/release pairs or seq_cst atomics, then they introduce a
synchronizes-with relation between threads. This in turn introduces a
happens-before relation, which makes the data modified by the thread that did
the store/release visible to the thread that did the load/acquire. This allows
the use of atomics to build mutexes and other synchronization primitives.

std::atomic_flag f=ATOMIC_FLAG_INIT;
std::vector<std::string> shared_data;

void thread_a()
{
while(f.test_and_set(std::memory_order_acq_rel)); // spin lock
shared_data.push_back("hello");
f.clear(std::memory_order_release); // unlock

}

void thread_b()
{
while(f.test_and_set(std::memory_order_acq_rel)); // spin lock
if(!shared_data.empty())
{
std::cout<<shared_data.front()<<std::endl;
shared_data.pop_front();
}
f.clear(std::memory_order_release); // unlock

}

The clear() from thread_a is a release operation, so synchronizes with the
test_and_set (which is an acquire) in thread_b, which can therefore see the
modification to shared_data, since the push_back happens-before the clear,
which happens-before the test-and-set, which happens-before the accesses in
thread_b.

Thank you for the clarifying code fragment. That makes it clear what
you meant. I think it works and it is correct (if we ignore the timing
issues that if thread_b wins the race you will see no output).

It can be a personal taste but I do not like the way you explain it
since your explanation misses the main issue why it works. I would say
it is not about any `visibility' issues nor is it about so-called
happens-before relation in the first place but there are two factors
that make it work:

1) the atomic test-and-set
2) the waiting loop

These two important issues were not clear from your initial statement:
"All accesses to shared data MUST be synchronized with atomics".
Actually, I would not put it this way either since it is the waiting
loop that synchronises and not the atomic operation alone but the
example is clear at least.

I would say: Based on atomics, you can make a hand built spin lock
and, in turn, from the spin lock you can make a hand built Critical
Region to synchronise access to shared data. Actually, the code
example illustrates this very well, no matter how we explain it.

[...]
Of course, you could also just use a mutex lock or join with the thread doing
the modification.

Click to expand...

Click to expand...

That is correct. You can synchronise access with mutexes (implementing
a Critical Region by hand).

Click to expand...

I think you refer with the "join with the thread" phrase the end of a
structured parallel block where a shared variable becomes a non-shared
one. Again, an example could help. Please give an example illustrating
what you mean.

Click to expand...

std::thread t(thread_a); // thread_a from above
t.join();

assert(shared_data.back()=="hello");

This was what I meant as well based on your wording. Thanks for the
clarification.

Best Regards,
Szabolcs

C++0x: memory_order_acq_rel vs memory_order_seq_cst	9	May 1, 2008
Subtle difference between C++0x MM and other MMs	25	Aug 24, 2008
Ill-formed C++0x code or compiler bug (GCC) ?	5	Jan 28, 2011
C++0x memory model and atomics, some questions	5	Sep 1, 2010
c++0x: Is it possible to make the compiler choose scope if unambiguous?	4	Mar 2, 2011
C++0x: Communication with signal handler	3	Sep 8, 2008
Which functions act as memory barriers?	4	Nov 29, 2007
Synchronization Algorithm Verificator for C++0x	31	Aug 2, 2008

Compiler ordering barriers in C++0x

Dmitriy Vyukov

nickf3

Dmitriy Vyukov

Dmitriy Vyukov

Anthony Williams

Dmitriy V'jukov

Anthony Williams

Chris Thomasson

Szabolcs Ferenczi

Anthony Williams

Dmitriy Vyukov

Chris Thomasson

James Kanze

Szabolcs Ferenczi

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads