Synchronization Algorithm Verifier for C++0x

Discussion in 'C++' started by Dmitriy V'jukov, Aug 2, 2008.

  1. I want to announce a tool called Relacy Race Detector, which I've
    developed. It's a synchronization algorithm verifier for the C++0x
    relaxed memory model. The tool is implemented as a header-only
    library for C++03, which can be used for efficient execution of
    unit tests for synchronization algorithms. The tool executes a unit test
    many times under the control of a special scheduler; on every execution
    the scheduler models a different interleaving of threads, and at the same
    time every execution is exhaustively analyzed for data races, accesses
    to freed memory, failed assertions, etc. If no errors are found, then
    verification terminates when a particular number of interleavings has
    been verified (for the random scheduler), or when all possible
    interleavings have been verified (for the full-search scheduler). If an
    error is found, the tool outputs the detailed execution history leading
    to the error and terminates.
    The tool was designed for verification of algorithms like memory
    management, memory reclamation for lock-free algorithms, multithreaded
    containers (queues, stacks, maps), mutexes, eventcounts and so on.
    My personal subjective feeling to date is that the tool is capable of
    finding extremely subtle algorithmic errors, memory fence placement
    errors and memory fence type errors within a second. And after an error
    is detected, fixing it is relatively trivial, because one has the
    detailed execution history which leads to the error.

    Main features:
    - Relaxed ISO C++0x memory model: relaxed/acquire/release/acq_rel/
    seq_cst memory operations and fences. The only unsupported feature
    is memory_order_consume; it's simulated with memory_order_acquire.
    - Exhaustive automatic error checking (including ABA detection).
    - Full-fledged atomics library (with spurious failures in
    compare_exchange()).
    - Arbitrary number of threads.
    - Detailed execution history for failed tests.
    - Before/after/invariant functions for test suites.
    - No false positives.
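    Regarding the "spurious failures in compare_exchange()" feature: a
    spurious failure means the operation can fail even when the expected
    value matches, so tested code must retry in a loop. A minimal sketch of
    such a loop (shown with plain C++11 std::atomic for illustration;
    Relacy's rl:: atomics mirror this interface):

```cpp
#include <atomic>

// Increment via a CAS loop that tolerates spurious failures:
// compare_exchange_weak may fail even when 'a' still holds 'expected'.
int increment(std::atomic<int>& a)
{
    int expected = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(expected, expected + 1,
                                    std::memory_order_relaxed))
    {
        // on failure 'expected' is reloaded with the current value; retry
    }
    return expected + 1; // the value our increment produced
}
```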

    Types of detectable errors:
    - Race condition (according to ISO C++0x definition)
    - Access to uninitialized variable
    - Access to freed memory
    - Double free
    - Memory leak
    - Deadlock
    - Livelock
    - User assert failed
    - User invariant failed

    You can get some more information (tutorial, examples, etc.) here:
    http://groups.google.com/group/relacy/web

    And here is dedicated news group/discussion forum:
    http://groups.google.com/group/relacy/topics

    If you want to get a copy of the tool, please, contact me via
    .

    Any feedback, comments, suggestions are welcome.

    Relacy Race Detector is free for open-source, non-commercial
    development; research with non-patented, published results;
    educational purposes; home/private usage. Please contact me
    () about other usages.


    --------------------------------------------------------
    Here is a quick example.
    The code of the unit test:

    #include <relacy/relacy_std.hpp>

    // template parameter '2' is the number of threads
    struct race_test : rl::test_suite<race_test, 2>
    {
        std::atomic<int> a;
        rl::var<int> x;

        // executed in a single thread before the main thread function
        void before()
        {
            a($) = 0;
            x($) = 0;
        }

        // main thread function
        void thread(unsigned thread_index)
        {
            if (0 == thread_index)
            {
                // code executed by the first thread
                x($) = 1;
                a($).store(1, rl::memory_order_relaxed);
            }
            else
            {
                // code executed by the second thread
                if (1 == a($).load(rl::memory_order_relaxed))
                    x($) = 2;
            }
        }

        // executed in a single thread after the main thread function
        void after()
        {
        }

        // executed in a single thread after every
        // 'visible' action in the main threads;
        // disallowed to modify any state
        void invariant()
        {
        }
    };

    int main()
    {
        // start the simulation
        rl::simulate<race_test>();
    }

    And here is the output of the tool:

    struct race_test
    DATA RACE
    iteration: 8

    execution history:
    [0] 0: <00366538> atomic store, value=0, (prev value=0),
    order=seq_cst, in race_test::before, test.cpp(14)
    [1] 0: <0036655C> store, value=0, in race_test::before, test.cpp(15)
    [2] 0: <0036655C> store, value=1, in race_test::thread, test.cpp(23)
    [3] 0: <00366538> atomic store, value=1, (prev value=0),
    order=relaxed, in race_test::thread, test.cpp(24)
    [4] 1: <00366538> atomic load, value=1, order=relaxed, in
    race_test::thread, test.cpp(28)
    [5] 1: <0036655C> store, value=0, in race_test::thread, test.cpp(29)
    [6] 1: data race detected, in race_test::thread, test.cpp(29)

    thread 0:
    [0] 0: <00366538> atomic store, value=0, (prev value=0),
    order=seq_cst, in race_test::before, test.cpp(14)
    [1] 0: <0036655C> store, value=0, in race_test::before, test.cpp(15)
    [2] 0: <0036655C> store, value=1, in race_test::thread, test.cpp(23)
    [3] 0: <00366538> atomic store, value=1, (prev value=0),
    order=relaxed, in race_test::thread, test.cpp(24)

    thread 1:
    [4] 1: <00366538> atomic load, value=1, order=relaxed, in
    race_test::thread, test.cpp(28)
    [5] 1: <0036655C> store, value=0, in race_test::thread, test.cpp(29)
    [6] 1: data race detected, in race_test::thread, test.cpp(29)
    --------------------------------------------------------

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 2, 2008
    #1

  2. On 2 Aug, 20:47, "Dmitriy V'jukov" <> wrote:

    > I want to announce a tool called Relacy Race Detector, which I've
    > developed. It's a synchronization algorithm verifier for C++0x's
    > relaxed memory model.



    Q: Can I use Relacy Race Detector to check my algo against memory
    models other than C++0x's (x86, PPC, Java, CLI)?


    A: Yes, you can. Fortunately, the C++0x memory model is very relaxed, so
    for the most part it's a "superset" of basically any other memory
    model. You just have to define a "binding" between your target memory
    model and the C++0x memory model.

    Let's create such a binding, for example, for the x86 memory model:
    - a plain load operation is always "acquire" (i.e.
    memory_order_acquire)
    - a plain store operation is always "release" (i.e.
    memory_order_release)
    - an atomic RMW operation is always sequentially consistent (i.e.
    memory_order_seq_cst)
    - the mfence instruction is
    std::atomic_thread_fence(memory_order_seq_cst)
    That is all. You can create such bindings for other hardware memory
    models you are interested in (PPC, Itanium, SPARC, etc.).
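    A minimal sketch of such an x86 binding (the wrapper names here are
    invented for illustration, and plain C++11 std::atomic is used):

```cpp
#include <atomic>

// Hypothetical helpers expressing the x86 -> C++0x binding above.
template<class T>
T plain_load(const std::atomic<T>& a)
{
    return a.load(std::memory_order_acquire);   // plain x86 load ~ acquire
}

template<class T>
void plain_store(std::atomic<T>& a, T v)
{
    a.store(v, std::memory_order_release);      // plain x86 store ~ release
}

template<class T>
T locked_xadd(std::atomic<T>& a, T v)
{
    return a.fetch_add(v, std::memory_order_seq_cst); // LOCK'ed RMW ~ seq_cst
}

inline void mfence()
{
    std::atomic_thread_fence(std::memory_order_seq_cst); // mfence instruction
}
```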

    And you can define such a binding to another abstract memory model,
    like the Java MM. Let's see:
    - a plain load is "relaxed" (i.e. memory_order_relaxed)
    - a plain store is "relaxed" (i.e. memory_order_relaxed)
    - a volatile load is "acquire" (i.e. memory_order_acquire)
    - a volatile store operation is "release" (i.e. memory_order_release)
    - an atomic RMW operation is always sequentially consistent (i.e.
    memory_order_seq_cst)
    But here are some caveats. First, you have to emulate the work of the
    GC, i.e. put all allocated memory on a special list, and free all
    allocated memory in test_suite::after(). Second, you have to manually
    emit a sequentially consistent memory fence between every volatile store
    and volatile load. Third, you have to manually initialize all variables
    to the default value (0). Fourth, there is no such thing as a data race,
    so you have to define all variables as std::atomic; this will
    effectively disable the data race detection mechanism. Well, actually
    you can use rl::var variables if you know that there must be no
    concurrent accesses to a variable; this will enable some automatic error
    detection wrt data races.
    Sounds not very cool... So I'm going to add built-in support for Java
    and CLI. Then the user would have to define something like RL_JAVA_MODE/
    RL_CLI_MODE, and get all those things out of the box. But yes, it
    will still be a C++ library. What do you think?
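    The Java binding above can be sketched the same way (the j* helper
    names are made up for this illustration; plain C++11 std::atomic shown):

```cpp
#include <atomic>

// Hypothetical helpers expressing the Java MM -> C++0x binding above.
template<class T>
T jload(const std::atomic<T>& a)            // plain Java load ~ relaxed
{
    return a.load(std::memory_order_relaxed);
}

template<class T>
void jstore(std::atomic<T>& a, T v)         // plain Java store ~ relaxed
{
    a.store(v, std::memory_order_relaxed);
}

template<class T>
T jvolatile_load(const std::atomic<T>& a)   // volatile load ~ acquire
{
    return a.load(std::memory_order_acquire);
}

template<class T>
void jvolatile_store(std::atomic<T>& a, T v)
{
    a.store(v, std::memory_order_release);  // volatile store ~ release...
    // ...plus the manually emitted seq_cst fence mentioned above, to get
    // Java's total order over volatile accesses
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
```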

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 3, 2008
    #2

  3. "Dmitriy V'jukov" <> wrote in message
    news:...
    > I want to announce a tool called Relacy Race Detector, which I've
    > developed. It's a synchronization algorithm verifier for C++0x's
    > relaxed memory model.

    [...]

    Have you tested the following algorithm yet?

    http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf

    I thought about coding it up. However, I think the algorithm might be
    covered by a patent application.
    Chris M. Thomasson, Aug 5, 2008
    #3
  4. "Dmitriy V'jukov" <> wrote in message
    news:...
    > I want to announce a tool called Relacy Race Detector, which I've
    > developed. It's a synchronization algorithm verifier for C++0x's
    > relaxed memory model.


    [...]


    After you integrate a mutex and condvar into the detection framework, I
    would like to model an eventcount.

    Also, it seems like you could parallelize the testing process somewhat... I
    need to think about this some more.

    Anyway, nice work.


    BTW, you asked in an e-mail if there is any market for this type of
    tool; well, I hope you don't mind if I answer that here...



    IMVHO, when C++0x is finally released, it should have a fairly big impact.
    I can see it now... You trademark a logo, and give a company the chance to
    stick it on their products to give their customers peace of mind:

    Customer: "Well, the software has the Relacy logo on the front of the box;
    it must be correct concurrent software indeed!"

    Vendor: "Our software lives up to the prerequisites of the Relacy logo. We
    are proud to be able to display it on our software packaging."


    I hope you make tons of $$$ with this tool Dmitriy. Get rich and then a
    little richer; AFAICT it's a winner.


    Any thoughts?


    =^D
    Chris M. Thomasson, Aug 5, 2008
    #4
  5. "Chris M. Thomasson" <> wrote in message
    news:gH4mk.11785$...
    > "Dmitriy V'jukov" <> wrote in message
    > news:...
    >>I want to announce tool called Relacy Race Detector, which I've
    >> developed.

    [...]
    >
    > [...]
    >
    > After you integrate a mutex and condvar into the detection framework, I
    > would like to model an eventcount.


    I would be interested to see how long it takes to detect the following bug
    you found in an older version of AppCore:

    http://groups.google.com/group/comp.programming.threads/browse_frm/thread/cf4a952ff5af7d

    [...]
    Chris M. Thomasson, Aug 5, 2008
    #5
  6. On 6 Aug, 02:54, "Chris M. Thomasson" <> wrote:

    > I would be interested to see how long it takes to detect the following bug
    > you found in an older version of AppCore:
    >
    > http://groups.google.com/group/comp.programming.threads/browse_frm/th...


    Ok, let's see. I've just finished a basic mutex and condvar
    implementation.

    Here is eventcount:

    class eventcount
    {
    public:
        eventcount()
            : count(0)
            , waiters(0)
        {}

        void signal_relaxed()
        {
            unsigned cmp = count($).load(std::memory_order_relaxed);
            signal_impl(cmp);
        }

        void signal()
        {
            unsigned cmp = count($).fetch_add(0, std::memory_order_seq_cst);
            signal_impl(cmp);
        }

        unsigned get()
        {
            unsigned cmp = count($).fetch_or(0x80000000,
                std::memory_order_seq_cst);
            return cmp & 0x7FFFFFFF;
        }

        void wait(unsigned cmp)
        {
            unsigned ec = count($).load(std::memory_order_seq_cst);
            if (cmp == (ec & 0x7FFFFFFF))
            {
                guard.lock($);
                ec = count($).load(std::memory_order_seq_cst);
                if (cmp == (ec & 0x7FFFFFFF))
                {
                    waiters($) += 1;
                    cv.wait(guard, $);
                }
                guard.unlock($);
            }
        }

    private:
        std::atomic<unsigned> count;
        rl::var<unsigned> waiters;
        mutex guard;
        condition_variable_any cv;

        void signal_impl(unsigned cmp)
        {
            if (cmp & 0x80000000)
            {
                guard.lock($);
                while (false == count($).compare_swap(cmp,
                    (cmp + 1) & 0x7FFFFFFF, std::memory_order_relaxed));
                unsigned w = waiters($);
                waiters($) = 0;
                guard.unlock($);
                if (w)
                    cv.notify_all($);
            }
        }
    };

    Here is the queue:

    template<typename T>
    class spsc_queue
    {
    public:
        spsc_queue()
        {
            node* n = RL_NEW node ();
            head($) = n;
            tail($) = n;
        }

        ~spsc_queue()
        {
            RL_ASSERT(head($) == tail($));
            RL_DELETE((node*)head($));
        }

        void enqueue(T data)
        {
            node* n = RL_NEW node (data);
            head($)->next($).store(n, std::memory_order_release);
            head($) = n;
            ec.signal_relaxed();
        }

        T dequeue()
        {
            T data = try_dequeue();
            while (0 == data)
            {
                int cmp = ec.get();
                data = try_dequeue();
                if (data)
                    break;
                ec.wait(cmp);
                data = try_dequeue();
                if (data)
                    break;
            }
            return data;
        }

    private:
        struct node
        {
            std::atomic<node*> next;
            rl::var<T> data;

            node(T data = T())
                : next(0)
                , data(data)
            {}
        };

        rl::var<node*> head;
        rl::var<node*> tail;

        eventcount ec;

        T try_dequeue()
        {
            node* t = tail($);
            node* n = t->next($).load(std::memory_order_acquire);
            if (0 == n)
                return 0;
            T data = n->data($);
            RL_DELETE(t);
            tail($) = n;
            return data;
        }
    };

    Here is the test:

    struct spsc_queue_test : rl::test_suite<spsc_queue_test, 2>
    {
        spsc_queue<int> q;

        void thread(unsigned thread_index)
        {
            if (0 == thread_index)
            {
                q.enqueue(11);
            }
            else
            {
                int d = q.dequeue();
                RL_ASSERT(11 == d);
            }
        }
    };

    int main()
    {
        rl::simulate<spsc_queue_test>();
    }

    It took around 40 minutes.

    Let's run it!
    Here is the output (note the error is detected on iteration 3):

    struct spsc_queue_test
    DEADLOCK
    iteration: 3

    execution history:
    [0] 1: [BEFORE BEGIN], in rl::context_impl<struct
    spsc_queue_test,class rl::random_scheduler<2> >::fiber_proc_impl,
    context.hpp(341)
    [1] 1: <0035398C> atomic store, value=-1, (prev value=0),
    order=relaxed, in rl::generic_mutex<0>::generic_mutex, stdlib/
    mutex.hpp(77)
    [2] 1: memory allocation: addr=00353A80, size=52, in
    spsc_queue<int>::spsc_queue, peterson.cpp(397)
    [3] 1: <00353938> store, value=00353A80, in
    spsc_queue<int>::spsc_queue, peterson.cpp(398)
    [4] 1: <00353948> store, value=00353A80, in
    spsc_queue<int>::spsc_queue, peterson.cpp(399)
    [5] 1: [BEFORE END], in rl::context_impl<struct spsc_queue_test,class
    rl::random_scheduler<2> >::fiber_proc_impl, context.hpp(356)
    [6] 1: <00353948> load, value=00353A80, in
    spsc_queue<int>::try_dequeue, peterson.cpp(452)
    [7] 1: <00353A80> atomic load, value=00000000, order=acquire, in
    spsc_queue<int>::try_dequeue, peterson.cpp(453)
    [8] 1: <00353958> fetch_or , prev=0, op=2147483648, new=2147483648,
    order=seq_cst, in eventcount::get, peterson.cpp(349)
    [9] 1: <00353948> load, value=00353A80, in
    spsc_queue<int>::try_dequeue, peterson.cpp(452)
    [10] 0: memory allocation: addr=0035BCA0, size=52, in
    spsc_queue<int>::enqueue, peterson.cpp(410)
    [11] 0: <00353938> load, value=00353A80, in spsc_queue<int>::enqueue,
    peterson.cpp(411)
    [12] 1: <00353A80> atomic load, value=00000000, order=acquire, in
    spsc_queue<int>::try_dequeue, peterson.cpp(453)
    [13] 0: <00353A80> atomic store, value=0035BCA0, (prev
    value=00000000), order=release, in spsc_queue<int>::enqueue,
    peterson.cpp(411)
    [14] 0: <00353938> store, value=0035BCA0, in spsc_queue<int>::enqueue,
    peterson.cpp(412)
    [15] 0: <00353958> atomic load, value=0 [NOT CURRENT], order=relaxed,
    in eventcount::signal_relaxed, peterson.cpp(337)
    [16] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
    eventcount::wait, peterson.cpp(355)
    [17] 1: <0035398C> CAS succ orig=-1, cmp=-1, xchg=1, order=acquire, in
    rl::generic_mutex<0>::lock, stdlib/mutex.hpp(117)
    [18] 1: <0035398C> mutex: lock, in eventcount::wait, peterson.cpp(358)
    [19] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
    eventcount::wait, peterson.cpp(359)
    [20] 1: <0035397C> load, value=0, in eventcount::wait,
    peterson.cpp(362)
    [21] 1: <0035397C> store, value=1, in eventcount::wait,
    peterson.cpp(362)
    [22] 1: <0035398C> atomic store, value=-1, (prev value=1),
    order=release, in rl::generic_mutex<0>::unlock, stdlib/mutex.hpp(209)
    [23] 1: <0035398C> mutex: unlock, in eventcount::wait,
    peterson.cpp(363)
    [24] 1: blocking thread, in eventcount::wait, peterson.cpp(363)
    [25] 1: deadlock detected, in eventcount::wait, peterson.cpp(363)

    thread 0:
    [10] 0: memory allocation: addr=0035BCA0, size=52, in
    spsc_queue<int>::enqueue, peterson.cpp(410)
    [11] 0: <00353938> load, value=00353A80, in spsc_queue<int>::enqueue,
    peterson.cpp(411)
    [13] 0: <00353A80> atomic store, value=0035BCA0, (prev
    value=00000000), order=release, in spsc_queue<int>::enqueue,
    peterson.cpp(411)
    [14] 0: <00353938> store, value=0035BCA0, in spsc_queue<int>::enqueue,
    peterson.cpp(412)
    [15] 0: <00353958> atomic load, value=0 [NOT CURRENT], order=relaxed,
    in eventcount::signal_relaxed, peterson.cpp(337)

    thread 1:
    [0] 1: [BEFORE BEGIN], in rl::context_impl<struct
    spsc_queue_test,class rl::random_scheduler<2> >::fiber_proc_impl,
    context.hpp(341)
    [1] 1: <0035398C> atomic store, value=-1, (prev value=0),
    order=relaxed, in rl::generic_mutex<0>::generic_mutex, stdlib/
    mutex.hpp(77)
    [2] 1: memory allocation: addr=00353A80, size=52, in
    spsc_queue<int>::spsc_queue, peterson.cpp(397)
    [3] 1: <00353938> store, value=00353A80, in
    spsc_queue<int>::spsc_queue, peterson.cpp(398)
    [4] 1: <00353948> store, value=00353A80, in
    spsc_queue<int>::spsc_queue, peterson.cpp(399)
    [5] 1: [BEFORE END], in rl::context_impl<struct spsc_queue_test,class
    rl::random_scheduler<2> >::fiber_proc_impl, context.hpp(356)
    [6] 1: <00353948> load, value=00353A80, in
    spsc_queue<int>::try_dequeue, peterson.cpp(452)
    [7] 1: <00353A80> atomic load, value=00000000, order=acquire, in
    spsc_queue<int>::try_dequeue, peterson.cpp(453)
    [8] 1: <00353958> fetch_or , prev=0, op=2147483648, new=2147483648,
    order=seq_cst, in eventcount::get, peterson.cpp(349)
    [9] 1: <00353948> load, value=00353A80, in
    spsc_queue<int>::try_dequeue, peterson.cpp(452)
    [12] 1: <00353A80> atomic load, value=00000000, order=acquire, in
    spsc_queue<int>::try_dequeue, peterson.cpp(453)
    [16] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
    eventcount::wait, peterson.cpp(355)
    [17] 1: <0035398C> CAS succ orig=-1, cmp=-1, xchg=1, order=acquire, in
    rl::generic_mutex<0>::lock, stdlib/mutex.hpp(117)
    [18] 1: <0035398C> mutex: lock, in eventcount::wait, peterson.cpp(358)
    [19] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
    eventcount::wait, peterson.cpp(359)
    [20] 1: <0035397C> load, value=0, in eventcount::wait,
    peterson.cpp(362)
    [21] 1: <0035397C> store, value=1, in eventcount::wait,
    peterson.cpp(362)
    [22] 1: <0035398C> atomic store, value=-1, (prev value=1),
    order=release, in rl::generic_mutex<0>::unlock, stdlib/mutex.hpp(209)
    [23] 1: <0035398C> mutex: unlock, in eventcount::wait,
    peterson.cpp(363)
    [24] 1: blocking thread, in eventcount::wait, peterson.cpp(363)
    [25] 1: deadlock detected, in eventcount::wait, peterson.cpp(363)

    Operations on condition variables are not yet included in the execution
    history.

    Let's replace 'ec.signal_relaxed()' with 'ec.signal()'. Here is the
    output:

    struct spsc_queue_test
    6% (65536/1000000)
    13% (131072/1000000)
    19% (196608/1000000)
    26% (262144/1000000)
    32% (327680/1000000)
    39% (393216/1000000)
    45% (458752/1000000)
    52% (524288/1000000)
    58% (589824/1000000)
    65% (655360/1000000)
    72% (720896/1000000)
    78% (786432/1000000)
    85% (851968/1000000)
    91% (917504/1000000)
    98% (983040/1000000)
    iterations: 1000000
    total time: 2422
    throughput: 412881

    Test passed. Throughput is around 400'000 test executions per second.

    The test also reveals some errors with memory fences.
    In the signaling function you are using an acquire fence in the CAS,
    but relaxed is enough here:

    void signal_impl(unsigned cmp)
    {
        if (cmp & 0x80000000)
        {
            guard.lock($);
            while (false == count($).compare_swap(cmp,
                (cmp + 1) & 0x7FFFFFFF, std::memory_order_relaxed));
            unsigned w = waiters($);
            waiters($) = 0;
            guard.unlock($);
            if (w)
                cv.notify_all($);
        }
    }

    Also, the following naive approach doesn't work (at least in C++0x):

    void signal()
    {
        std::atomic_thread_fence(std::memory_order_seq_cst);
        signal_relaxed();
    }

    You have to execute a sequentially consistent atomic RMW on the 'ec'
    variable here, because a sequentially consistent fence and a sequentially
    consistent atomic RMW operation don't synchronize-with each other.
    Hmmm... This is strange. But I can't find anything about this in the
    draft... Anthony Williams said that the committee made some changes to
    the initial proposal on fences, but they are not yet published...


    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 6, 2008
    #6
  7. "Dmitriy V'jukov" <> wrote in message
    news:...
    On 6 Aug, 02:54, "Chris M. Thomasson" <> wrote:

    > > I would be interested to see how long it takes to detect the following
    > > bug
    > > you found in an older version of AppCore:
    > >
    > > http://groups.google.com/group/comp.programming.threads/browse_frm/th...


    > Ok, let's see. I've just finished basic mutex and condvar
    > implementation.


    Can you e-mail the new version to me please?




    > Here is eventcount:


    class eventcount
    {
        [...]
        void wait(unsigned cmp)
        {
            unsigned ec = count($).load(std::memory_order_seq_cst);
            if (cmp == (ec & 0x7FFFFFFF))
            {
                guard.lock($);
                ec = count($).load(std::memory_order_seq_cst);
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    I don't think that you need a memory barrier under the lock, because the
    only mutation which will make the following compare succeed is also
    performed under the same lock. The signal slow path is under the lock,
    and the waiter slow path is under the lock. However, I don't know if
    this violates some rule in C++0x.

                if (cmp == (ec & 0x7FFFFFFF))
                {
                    waiters($) += 1;
                    cv.wait(guard, $);
                }
                guard.unlock($);
            }
        }
    };

    > Here is the queue:


    [...]

    > Here is the test:


    [...]

    > It took around 40 minutes.


    > Let's run it!
    > Here is output (note error detected on iteration 3):


    > struct spsc_queue_test
    > DEADLOCK
    > iteration: 3


    Cool; it sure does detect the race-condition in the old AppCore! Very nice
    indeed.



    [...]

    > Operations on condition variables are not yet in execution history.


    > Let's replace 'ec.signal_relaxed()' with 'ec.signal()'. Here is the
    > output:


    > struct spsc_queue_test

    [...]
    > iterations: 1000000
    > total time: 2422
    > throughput: 412881


    > Test passed. Throughput is around 400'000 test executions per second.


    Beautiful. My eventcount algorithm works!

    =^D




    > Also test reveals some errors with memory fences.
    > In signaling function you are using acquire fence in cas, but relaxed
    > is enough here:
    > void signal_impl(unsigned cmp)
    > {
    >     if (cmp & 0x80000000)
    >     {
    >         guard.lock($);
    >         while (false == count($).compare_swap(cmp,
    >             (cmp + 1) & 0x7FFFFFFF, std::memory_order_relaxed));
    >         unsigned w = waiters($);
    >         waiters($) = 0;
    >         guard.unlock($);
    >         if (w)
    >             cv.notify_all($);
    >     }
    > }


    Right. The current public AppCore atomic API is not fine-grain enough to
    allow for relaxed operations. I need to change that.




    > Also. Following naive approach doesn't work (at least in C++0x):
    > void signal()
    > {
    > std::atomic_thread_fence(std::memory_order_seq_cst);
    > signal_relaxed();
    > }


    Good thing the MFENCE works on x86!




    > You have to execute a sequentially consistent atomic RMW on the 'ec'
    > variable here, because a sequentially consistent fence and a
    > sequentially consistent atomic RMW operation don't synchronize-with
    > each other.
    > Hmmm... This is strange. But I can't find anything about this in the
    > draft... Anthony Williams said that the committee made some changes to
    > the initial proposal on fences, but they are not yet published...


    This is weird. A seq_cst fence does not work with a seq_cst RMW!? Do you
    happen to know the rationale for this? IMVHO, it makes no sense at
    all... I must be missing something.

    :^/
    Chris M. Thomasson, Aug 6, 2008
    #7
  8. On Aug 6, 12:23 am, "Chris M. Thomasson" <> wrote:

    > Have you tested the following algorithm yet?
    >
    > http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf


    Yes and no. No, I didn't test this particular algorithm. But I tested my
    analogous algorithm for a work-stealing deque.
    It's an extremely hardcore lock-free algorithm with 500+ LOC. And it
    hung sometimes, and sometimes caused SIGSEGV. I tried to debug
    it manually, I tried to insert some asserts, I tried to
    insert some "yields", I tried to review it several times, I tried to
    analyze dumps. Uselessly.
    When I ran it under Relacy Race Detector I localized and fixed 2 subtle
    algo errors and clarified some points about memory fences in about 3
    hours. And it was almost mechanical work, not some black magic.

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 6, 2008
    #8
  9. Dmitriy V'jukov, Aug 6, 2008
    #9
  10. On Aug 6, 2:46 am, "Chris M. Thomasson" <> wrote:

    > IMVHO, when C++0x is finally released, it should have a fairly big impact.
    > I can see it now... You trademark a logo, and give a company the chance to
    > stick it on their products to give their customers peace of mind:
    >
    > Customer: "Well, the software has the Relacy logo on the front of the box;
    > it must be correct concurrent software indeed!"
    >
    > Vendor: "Our software lives up to the prerequisites of the Relacy logo. We
    > are proud to be able to display it on our software packaging."
    >
    > I hope you make tons of $$$ with this tool Dmitriy. Get rich and then a
    > little richer; AFAICT it's a winner.
    >
    > Any thoughts?
    >
    > =^D


    :)
    Sounds great!

    I think I can start developing 2 logos:
    [RELACY APPROVED]
    and the second:
    [RELACY BUSTED]
    :)

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 6, 2008
    #10
  11. Peter Dimov wrote:

    On Aug 6, 5:31 am, "Dmitriy V'jukov" <> wrote:

    > Also. Following naive approach doesn't work (at least in C++0x):
    >         void signal()
    >         {
    >                 std::atomic_thread_fence(std::memory_order_seq_cst);
    >                 signal_relaxed();
    >         }


    I'm not sure that I understand the algorithm here, but the reason this
    doesn't introduce a synchronize-with relationship is that the fence
    needs to follow the load, not precede it.

    load.relaxed count
    fence.seq_cst

    does synchronize with

    store.seq_cst count, v

    when the load sees the value v written by the store.

    If you do need a leading seq_cst fence, why not just use a seq_cst
    load?
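    Spelled out as a compilable sketch (written here with C++11 names,
    which match the C++0x draft on this point; the function names are
    invented for illustration):

```cpp
#include <atomic>

std::atomic<unsigned> count(0);

// The trailing fence after the relaxed load is what can synchronize
// with a seq_cst store: load first, fence follows.
unsigned reader()
{
    unsigned v = count.load(std::memory_order_relaxed); // load first
    std::atomic_thread_fence(std::memory_order_seq_cst); // fence follows it
    return v;
}

void writer(unsigned v)
{
    // synchronizes with the reader's fence when the load observes v
    count.store(v, std::memory_order_seq_cst);
}
```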
    Peter Dimov, Aug 6, 2008
    #11
  12. On Aug 6, 5:08 pm, Peter Dimov <> wrote:
    > On Aug 6, 5:31 am, "Dmitriy V'jukov" <> wrote:
    >
    > > Also, the following naive approach doesn't work (at least in C++0x):
    > > void signal()
    > > {
    > >     std::atomic_thread_fence(std::memory_order_seq_cst);
    > >     signal_relaxed();
    > > }

    >
    > I'm not sure that I understand the algorithm here, but the reason this
    > doesn't introduce a synchronize-with relationship is that the fence
    > needs to follow the load, not precede it.


    I will try to describe the issue in more detail in a separate post in
    the near future.

    > load.relaxed count
    > fence.seq_cst
    >
    > does synchronize with
    >
    > store.seq_cst count, v
    >
    > when the load sees the value v written by the store.
    >
    > If you do need a leading seq_cst fence, why not just use a seq_cst
    > load?


    Because this seq_cst fence must introduce synchronizes-with not only
    from the second thread to this thread; possibly it must also introduce
    synchronizes-with from this thread to the second thread. I.e. either the
    first thread must synchronize-with the second thread, or the second
    thread must synchronize-with the first thread (depending on which
    thread's seq_cst operation is first in the global order).
    If I use a seq_cst load, then it's only possible that the second thread
    synchronizes-with this thread, but not vice versa, because a seq_cst
    load is only an acquire operation; it's NOT a release.

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 6, 2008
    #12
  13. On Aug 6, 5:42 pm, "Dmitriy V'jukov" <> wrote:
    > On Aug 6, 5:08 pm, Peter Dimov <> wrote:
    >
    > > On Aug 6, 5:31 am, "Dmitriy V'jukov" <> wrote:

    >
    > > > Also. Following naive approach doesn't work (at least in C++0x):
    > > > void signal()
    > > > {
    > > > std::atomic_thread_fence(std::memory_order_seq_cst);
    > > > signal_relaxed();
    > > > }

    >
    > > I'm not sure that I understand the algorithm here, but the reason this
    > > doesn't introduce a synchronize-with relationship is that the fence
    > > needs to follow the load, not precede it.

    >
    > I will try to describe the issue in more detail in a separate post in
    > the near future.


    Here:
    http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a9513cf616dc168c

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 6, 2008
    #13
  14. "Dmitriy V'jukov" <> wrote in message
    news:...
    > On Aug 6, 12:23 am, "Chris M. Thomasson" <> wrote:
    >
    >> Have you testing the following algorithm yet?
    >>
    >> http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf

    >
    > Yes and no. No, I didn't test this particular algorithm. But I did
    > test my analogous algorithm for a work-stealing deque.
    > It's an extremely hardcore lock-free algorithm with 500+ LOC. And it
    > was hanging up sometimes, and sometimes caused SIGSEGV. I was trying
    > to debug it manually, I was trying to insert some asserts, I was
    > trying to insert some "yields", I was trying to review it several
    > times, I was trying to analyze dumps. Uselessly.
    > When I ran it under Relacy Race Detector I localized and fixed 2
    > subtle algo errors and clarified some moments with memory fences in
    > about 3 hours. And it was almost mechanical work, not some black
    > magic.


    Did you develop Relacy in part so that you can totally "deconstruct" your
    work-stealing algorithm into a form in which its "random" run-time behavior
    could be studied at another level and therefore be _reliably_ debugged?

    ;^D
    Chris M. Thomasson, Aug 7, 2008
    #14
  15. "Peter Dimov" <> wrote in message
    news:...
    On Aug 6, 5:31 am, "Dmitriy V'jukov" <> wrote:

    > > Also. Following naive approach doesn't work (at least in C++0x):
    > > void signal()
    > > {
    > > std::atomic_thread_fence(std::memory_order_seq_cst);
    > > signal_relaxed();
    > > }


    > I'm not sure that I understand the algorithm here, but the reason this
    > doesn't introduce a synchronize-with relationship is that the fence
    > needs to follow the load, not precede it.


    > load.relaxed count
    > fence.seq_cst


    > does synchronize with


    > store.seq_cst count, v


    > when the load sees the value v written by the store.


    > If you do need a leading seq_cst fence, why not just use a seq_cst
    > load?


    That can be done as long as the full fence is executed before the load of
    the eventcount variable. BTW Peter, here is a very simple pseudo-code impl
    of my algorithm:

    http://groups.google.com/group/comp.programming.threads/browse_frm/thread/aa8c62ad06dbb380

    Here is where Joe Seigh points out the membars:

    http://groups.google.com/group/comp.programming.threads/msg/8e7a0379b55557c0


    AFAICT, the release barrier in a user algorithm is in the wrong place. The
    load cannot be allowed to be hoisted up above the production of any item in
    a user container class. Basically, the production of an item into a user
    container usually only requires a release barrier. It's in the wrong place
    wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
    in order to allow the production to be completely visible BEFORE the load
    from the eventcount is performed... Where am I going wrong?

    Read here for more info:

    http://groups.google.com/group/comp.programming.threads/browse_frm/thread/cf4a952ff5af7d

    Humm...
    Chris M. Thomasson, Aug 7, 2008
    #15
  16. On Aug 7, 3:09 pm, "Chris M. Thomasson" <> wrote:

    > That can be done as long as the full fence is executed before the load of
    > the eventcount variable. BTW Peter, here is a very simple pseudo-code impl
    > of my algorithm:
    >
    > http://groups.google.com/group/comp.programming.threads/browse_frm/th...
    >
    > Here is where Joe Seigh points out the membars:
    >
    > http://groups.google.com/group/comp.programming.threads/msg/8e7a0379b...
    >
    > AFAICT, the release barrier in a user algorithm is in the wrong place. The
    > load cannot be allowed to be hoisted up above the production of any item in
    > a user container class. Basically, the production of an item into a user
    > container usually only requires a release barrier. It's in the wrong place
    > wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
    > in order to allow the production to be completely visible BEFORE the load
    > from the eventcount is performed... Where am I going wrong?



    The problem with C++0x in this context is the following.
    Assume the container is a lock-free stack, and 'production' is
    implemented as a seq_cst atomic RMW on the 'head' variable. In this
    situation you can reasonably presume that you can use 'relaxed
    signaling', i.e. a relaxed load of the eventcount w/o any additional
    fences, because the seq_cst atomic RMW on 'head' already contains a
    strong enough fence in the right place. WRONG! Why? See here:
    http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a9513cf616dc168c


    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 7, 2008
    #16
  17. On Aug 7, 2:09 pm, "Chris M. Thomasson" <> wrote:

    > AFAICT, the release barrier in a user algorithm is in the wrong place. The
    > load cannot be allowed to be hoisted up above the production of any item in
    > a user container class. Basically, the production of an item into a user
    > container usually only requires a release barrier. It's in the wrong place
    > wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
    > in order to allow the production to be completely visible BEFORE the load
    > from the eventcount is performed... Where am I going wrong?
    >
    > Read here for more info:
    >
    > http://groups.google.com/group/comp.programming.threads/browse_frm/th...
    >
    > Humm...


    Hmmm.

    In terms of the C++MM, the options I see are:

    1. store.seq_cst in push, load.seq_cst in signal, cas.seq_cst in get
    2. fence.seq_cst+load.relaxed in signal, cas.relaxed+fence.seq_cst in
    get
    3. RMW.seq_cst in signal, cas.seq_cst in get

    and maybe

    4. RMW.acq_rel in signal, cas.acq_rel in get
    Peter Dimov, Aug 7, 2008
    #17
  18. "Peter Dimov" <> wrote in message
    news:...
    On Aug 7, 2:09 pm, "Chris M. Thomasson" <> wrote:

    > > AFAICT, the release barrier in a user algorithm is in the wrong place.
    > > The
    > > load cannot be allowed to be hoisted up above the production of any item
    > > in
    > > a user container class. Basically, the production of an item into a user
    > > container usually only requires a release barrier. It's in the wrong
    > > place
    > > wrt coupling the algorithm with my eventcount. A StoreLoad needs to be
    > > there
    > > in order to allow the production to be completely visible BEFORE the
    > > load
    > > from the eventcount is performed... Where am I going wrong?
    > >
    > > Read here for more info:
    > >
    > > http://groups.google.com/group/comp.programming.threads/browse_frm/th...
    > >
    > > Humm...


    > Hmmm.


    > In terms of the C++MM, the options I see are:


    > 1. store.seq_cst in push, load.seq_cst in signal, cas.seq_cst in get
    > 2. fence.seq_cst+load.relaxed in signal, cas.relaxed+fence.seq_cst in
    > get
    > 3. RMW.seq_cst in signal, cas.seq_cst in get


    > and maybe


    > 4. RMW.acq_rel in signal, cas.acq_rel in get


    For now, I choose number 2 because in the current eventcount algorithm
    as-is, the `signal()' procedure needs a preceding StoreLoad, and the `get()'
    function needs a trailing StoreLoad. AFAICT, this is fine... As for the
    other choices, well, I don't like number 1 because it moves memory
    visibility rules directly into an end-user algorithm. I dislike 3-4 because
    of the dummy RMW; for instance, I could use a `seq_cst' fetch-and-add on the
    eventcount state with an addend of zero. However, IMVHO, that looks like a
    fairly dirty hack indeed... I mean, IMVHO, an RMW which does not perform a
    mutation is basically pointless...
    Chris M. Thomasson, Aug 7, 2008
    #18
  19. On Aug 7, 7:13 pm, "Chris M. Thomasson" <> wrote:

    > > In terms of the C++MM, the options I see are:
    > > 1. store.seq_cst in push, load.seq_cst in signal, cas.seq_cst in get
    > > 2. fence.seq_cst+load.relaxed in signal, cas.relaxed+fence.seq_cst in
    > > get
    > > 3. RMW.seq_cst in signal, cas.seq_cst in get
    > > and maybe
    > > 4. RMW.acq_rel in signal, cas.acq_rel in get

    >
    > For now, I choose number 2 because in the current eventcount algorithm
    > as-is, the `signal()' procedure needs a preceding StoreLoad, and the `get()'
    > function needs a trailing StoreLoad. AFAICT, this is fine... As for the
    > other choices, well, I don't like number 1 because it moves memory
    > visibility rules directly into an end-user algorithm. I dislike 3-4 because
    > of the dummy RMW; for instance, I could use a `seq_cst' fetch-and-add on the
    > eventcount state with an addend of zero. However, IMVHO, that looks like
    > fairly dirty hack indeed... I mean, IMVHO, a RMW which does not perform a
    > mutation is basically pointless...



    Consider an implementation of number 2 on x86.
    Producer:
    1. Interlocked RMW to enqueue the element
    2. mfence in signal (even more expensive than the interlocked RMW, and
    absolutely useless)

    Consumer:
    3. Interlocked RMW in get
    4. mfence in get (even more expensive than the interlocked RMW, and
    absolutely useless)

    Maybe the compiler can optimize 4 away, because 3 and 4 will be in one
    function. But I am not sure that the first versions of C++0x compilers
    will be so smart. And it's questionable whether the compiler will be
    able to optimize 2 away, because 1 and 2 will be in different functions
    and probably in different source files.

    Do you like this? Don't be in a hurry to forget your asm skills ;)

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 7, 2008
    #19
  20. On Aug 7, 6:31 pm, "Dmitriy V'jukov" <> wrote:

    > Consider implementation of number 2 on x86.
    > Producer:
    > 1. Interlocked RMW to enqueue element
    > 2. mfence in signal (even more expensive than Interlocked RMW, and
    > absolutely useless)


    It might be interesting to time the actual mfence overhead when the
    preceding locked RMW instruction has already drained the store queue.
    Peter Dimov, Aug 7, 2008
    #20