C++0x memory model and atomics, some questions

J

Joshua Maurice

So, I'm trying to learn more about threading. Which papers are the
most current? Which are the most readable breakdowns which I might
comprehend without delving too deep into standardese?

I've thus far found these links to be of great use:

http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2664.htm
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2338.html

Here's some questions.

1-
So, we now have the stringent requirement that an implementation
cannot "introduce races", including specifically that an
implementation cannot introduce a write to a memory location on a code
path where there was none. This has big implications. No longer is an
implementation allowed to store an 8 bit unit by reading teh current
value of a whole 32 or 64 bit word, use bit shifting and bit masking,
and then write back the whole 32 or 64 bit word to modify the 8 bit
memory location. This means that writing to certain smaller types
might be a lot slower now (though it's my understanding most sane
architectures did this anyway).

It's also my understanding that the "int" type was meant at some point
to be the "fastest" integer type for the implementation. However, is
this the case nowadays when "int" on a lot of 64 bit systems is 32 bit
for "compatibility reasons"? Also specifically for the atomic compare
and exchange, how can I know which ones will be implemented via a
hardware compare and swap, and which ones need to be emulated?
Offhand, I would figure out on each system which is the "efficient"
base integer type, and then I would provide a typedef, and I would use
that typedef in portable code.

2-
Related: Is anyone going to use std::atomic for anything but integer
types? Some of the implementation notes from
http://gcc.gnu.org/wiki/Atomic/GCCMM
give me pause. How efficient would a std::atomic<some_complex_type> be
vs an explicit lock guarding the relevant pieces of code? What's the
expectation of use of std::atomic on complex user types and the
expected implementation? A (static) lock per type? A lock hidden in
each atomic object which is big enough or complex enough?

3-
http://gcc.gnu.org/wiki/Atomic/GCCMM/ImplemSumm
Optimizer effects
It is worth noting that the impact on the optimizer is slightly different with the different models. Although the existence of an atomic load or store operations will act as an optimization barrier, some optimizations can be performed on the atomic operations themselves if they are not in sequentially consistent mode.
For instance, dead store removal can be performed:
x.store (a, memory_order_release);
x.store (b, memory_order_release);
It is perfectly safe to note that storing 'a' into 'x' is in fact a dead store, and remove it. Its also possible to move statements around, subject to the various dependencies imposed by the memory model mode.

This is incorrect, right? Specifically,
x.store (a);
x.store (b);
aka
x.store (a, memory_order_seq_cst);
x.store (b, memory_order_seq_cst);
can be optimized to
x.store (b, memory_order_seq_cst);
right? An implementation does not need to faithfully do every single
atomic load and store with memory_order_seq_cst, right?

4-
Let's consider cooperative thread canceling. Suppose each thread has a
stop flag, a std::atomic<bool>. When you want to cancel a thread, you
set that thread's stop flag. The thread periodically checks its stop
flag's value, and if it's set, it starts returning error codes (or
throwing exceptions) to break out and finish.

As I understand it, with the default memory_order_seq_cst, there is a
total order across all memory_order_seq_cst operations on all memory
locations. So, if the write and reads to the stop flag all used
memory_order_seq_cst, then the thread will notice the stop request
quite quickly, basically as soon as the scheduler schedules it.
However, with any of the other memory_order's, the update may take
arbitrary long to become visible (though QoI should make it visible
"soon"). As I don't need this stop flag to give any other additional
guarantees, I only need to decide between memory_order_relaxed and
memory_order_seq_cst, right? Should I favor one over the other? Why?

5-
I remember this one thread from a while ago on comp.lang.c+
+.moderated.
http://groups.google.com.ng/group/comp.lang.c++.moderated/msg/2a9e0dc41f977db7

It was a large discussion over the usefulness and portability of
volatile as a threading primitive. I recall that I very strongly said
don't sprinkle your code with it, though it's perhaps acceptable to
use volatile to implement a more standard primitive on whatever
machines actually guarantee useful threading semantics (like some
windows ones).

I specifically remembered that post by Andy Venikov as a potentially
legitimate use of volatile. I replied to it saying that as long as the
compiler didn't do anything silly like speculatively load some things,
and reloaded all things, then it would work. I also said that it
seemed reasonable, but I'm not sure any machine actually guarantees
any of that. I felt that something was wrong, but I was not and am not
the most knowledgeable about assembly so I couldn't put my finger on
it. Now I think I can. He has a data dependency.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2664.htm

What Andy Venikov was trying to do seems to be exactly the use case
for std::memory_order_consume. Right? Without that, his volatile hack
is just that, a hack.

6-
I definitely do not get all of the nuances involved with
std::memory_order_consume. The paper
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2664.htm
goes into some details, but it's not quite formal enough for my
liking, and/or the standardese is too standardesy for my liking. Any
better sources on this?
 
J

Joshua Maurice

3-
http://gcc.gnu.org/wiki/Atomic/GCCMM/ImplemSumm


This is incorrect, right? Specifically,
  x.store (a);
  x.store (b);
aka
  x.store (a, memory_order_seq_cst);
  x.store (b, memory_order_seq_cst);
can be optimized to
  x.store (b, memory_order_seq_cst);
right? An implementation does not need to faithfully do every single
atomic load and store with memory_order_seq_cst, right?

My reading of the draft standard
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3035.pdf
confirms, though it strongly suggests that an implementation should
not do this in an unbounded loop.
4-
Let's consider cooperative thread canceling. Suppose each thread has a
stop flag, a std::atomic<bool>. When you want to cancel a thread, you
set that thread's stop flag. The thread periodically checks its stop
flag's value, and if it's set, it starts returning error codes (or
throwing exceptions) to break out and finish.

As I understand it, with the default memory_order_seq_cst, there is a
total order across all memory_order_seq_cst operations on all memory
locations. So, if the write and reads to the stop flag all used
memory_order_seq_cst, then the thread will notice the stop request
quite quickly, basically as soon as the scheduler schedules it.
However, with any of the other memory_order's, the update may take
arbitrary long to become visible (though QoI should make it visible
"soon"). As I don't need this stop flag to give any other additional
guarantees, I only need to decide between memory_order_relaxed and
memory_order_seq_cst, right? Should I favor one over the other? Why?

Again, the draft standard
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3035.pdf
says that an implementation "should" make atomic stores visible to
other threads in a "reasonable" amount of time. If the implementation
follows this note, then I guess memory_order_relaxed is the way to go.
I'm just concerned that the implementation might not.
 
J

Joshua Maurice

One other question.

-7
// Initial conditions
(not atomic) int x = 0, y = 0

// Thread 1
if (x == 1)
y = 1;

//Thread 2
if (y == 1)
x = 1;

// ----
So, given those initial conditions, for those two threads started
concurrently, can you end up with x == 1 and y == 1? Does this
(pseudo) C++ program have formally undefined behavior?

Is there real hardware which can end with x == 1 and y == 1 for a
simplistic naive translation to assembly? Or is there hypothesized
useful hardware which can do that? My naive understanding says no, but
I'm learning to question such simplistic sequentially consistent
reasoning.
 
A

Alexander Terekhov

Joshua said:
One other question.

-7
// Initial conditions
(not atomic) int x = 0, y = 0

// Thread 1
if (x == 1)
y = 1;

//Thread 2
if (y == 1)
x = 1;

// ----
So, given those initial conditions, for those two threads started
concurrently, can you end up with x == 1 and y == 1? Does this
(pseudo) C++ program have formally undefined behavior?

No and no.
Is there real hardware which can end with x == 1 and y == 1 for a
simplistic naive translation to assembly?
No.

Or is there hypothesized
useful hardware which can do that? My naive understanding says no, but
I'm learning to question such simplistic sequentially consistent
reasoning.

A hypothesized hardware can speculate that x/y is 1 and proceed with an
uncommitted modification (preserving the old value until commit) just to
roll it back (undo modification) a bit later when the speculation turns
out to be false. If such hypothesized hardware would make the
uncommitted modification visible to other threads then there would be a
problem.

regards,
alexander.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,022
Latest member
MaybelleMa

Latest Threads

Top