Synchronization Algorithm Verifier for C++0x

Discussion in 'C++' started by Dmitriy V'jukov, Aug 2, 2008.

  1. I want to announce a tool called Relacy Race Detector, which I've
    developed. It's a synchronization algorithm verifier for the C++0x
    relaxed memory model. The tool is implemented as a header-only
    library for C++03, which can be used for efficient execution of
    unit tests for synchronization algorithms. The tool executes a unit test
    many times under the control of a special scheduler; on every execution
    the scheduler models a different interleaving of threads, and at the same
    time every execution is exhaustively analyzed for data races, accesses
    to freed memory, failed assertions, etc. If no errors are found, then
    verification terminates when a particular number of interleavings has
    been verified (for the random scheduler), or when all possible
    interleavings have been verified (for the full-search scheduler). If an
    error is found, the tool outputs the detailed execution history leading
    to the error and terminates.
    The tool was designed for verification of algorithms like memory
    management, memory reclamation for lock-free algorithms, multithreaded
    containers (queues, stacks, maps), mutexes, eventcounts and so on.
    My personal subjective feeling to date is that the tool is capable of
    finding extremely subtle algorithmic errors, memory fence placement
    errors and memory fence type errors within a second. And after an error
    is detected, fixing it is relatively trivial, because one has the
    detailed execution history which leads to the error.

    Main features:
    - Relaxed ISO C++0x memory model: relaxed/acquire/release/acq_rel/
    seq_cst memory operations and fences. The only unsupported feature
    is memory_order_consume; it's simulated with memory_order_acquire.
    - Exhaustive automatic error checking (including ABA detection).
    - Full-fledged atomics library (with spurious failures in
    compare_exchange()).
    - Arbitrary number of threads.
    - Detailed execution history for failed tests.
    - Before/after/invariant functions for test suites.
    - No false positives.
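    Regarding the "spurious failures in compare_exchange()" feature: a
    spurious failure means the operation can fail even when the expected
    value matches, so tested code must retry in a loop. A minimal sketch of
    such a loop (shown with plain C++11 std::atomic for illustration;
    Relacy's rl:: atomics mirror this interface):

```cpp
#include <atomic>

// Increment via a CAS loop that tolerates spurious failures:
// compare_exchange_weak may fail even when 'a' still holds 'expected'.
int increment(std::atomic<int>& a)
{
    int expected = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(expected, expected + 1,
                                    std::memory_order_relaxed))
    {
        // on failure 'expected' is reloaded with the current value; retry
    }
    return expected + 1; // the value our increment produced
}
```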

    Types of detectable errors:
    - Race condition (according to ISO C++0x definition)
    - Access to uninitialized variable
    - Access to freed memory
    - Double free
    - Memory leak
    - Deadlock
    - Livelock
    - User assert failed
    - User invariant failed

    You can get some more information (tutorial, examples, etc.) here:
    http://groups.google.com/group/relacy/web

    And here is dedicated news group/discussion forum:
    http://groups.google.com/group/relacy/topics

    If you want to get a copy of the tool, please, contact me via
    .

    Any feedback, comments, suggestions are welcome.

    Relacy Race Detector is free for open-source, non-commercial
    development; research with non-patented, published results;
    educational purposes; home/private usage. Please contact me
    () about other usages.


    --------------------------------------------------------
    Here is a quick example.
    The code of the unit test:

    #include <relacy/relacy_std.hpp>

    // template parameter '2' is the number of threads
    struct race_test : rl::test_suite<race_test, 2>
    {
        std::atomic<int> a;
        rl::var<int> x;

        // executed in a single thread before the main thread function
        void before()
        {
            a($) = 0;
            x($) = 0;
        }

        // main thread function
        void thread(unsigned thread_index)
        {
            if (0 == thread_index)
            {
                // code executed by the first thread
                x($) = 1;
                a($).store(1, rl::memory_order_relaxed);
            }
            else
            {
                // code executed by the second thread
                if (1 == a($).load(rl::memory_order_relaxed))
                    x($) = 2;
            }
        }

        // executed in a single thread after the main thread function
        void after()
        {
        }

        // executed in a single thread after every
        // 'visible' action in the main threads;
        // disallowed to modify any state
        void invariant()
        {
        }
    };

    int main()
    {
        // start the simulation
        rl::simulate<race_test>();
    }

    And here is the output of the tool:

    struct race_test
    DATA RACE
    iteration: 8

    execution history:
    [0] 0: <00366538> atomic store, value=0, (prev value=0),
    order=seq_cst, in race_test::before, test.cpp(14)
    [1] 0: <0036655C> store, value=0, in race_test::before, test.cpp(15)
    [2] 0: <0036655C> store, value=1, in race_test::thread, test.cpp(23)
    [3] 0: <00366538> atomic store, value=1, (prev value=0),
    order=relaxed, in race_test::thread, test.cpp(24)
    [4] 1: <00366538> atomic load, value=1, order=relaxed, in
    race_test::thread, test.cpp(28)
    [5] 1: <0036655C> store, value=0, in race_test::thread, test.cpp(29)
    [6] 1: data race detected, in race_test::thread, test.cpp(29)

    thread 0:
    [0] 0: <00366538> atomic store, value=0, (prev value=0),
    order=seq_cst, in race_test::before, test.cpp(14)
    [1] 0: <0036655C> store, value=0, in race_test::before, test.cpp(15)
    [2] 0: <0036655C> store, value=1, in race_test::thread, test.cpp(23)
    [3] 0: <00366538> atomic store, value=1, (prev value=0),
    order=relaxed, in race_test::thread, test.cpp(24)

    thread 1:
    [4] 1: <00366538> atomic load, value=1, order=relaxed, in
    race_test::thread, test.cpp(28)
    [5] 1: <0036655C> store, value=0, in race_test::thread, test.cpp(29)
    [6] 1: data race detected, in race_test::thread, test.cpp(29)
    --------------------------------------------------------

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 2, 2008
    #1

  2. On 2 Aug, 20:47, "Dmitriy V'jukov" <> wrote:

    > I want to announce a tool called Relacy Race Detector, which I've
    > developed. It's a synchronization algorithm verifier for C++0x's
    > relaxed memory model.



    Q: Can I use Relacy Race Detector to check my algo against memory
    models other than C++0x's (x86, PPC, Java, CLI)?


    A: Yes, you can. Fortunately, the C++0x memory model is very relaxed, so
    for the most part it's a "superset" of basically any other memory
    model. You just have to define a "binding" between your target memory
    model and the C++0x memory model.

    Let's create such a binding, for example, for the x86 memory model:
    - a plain load operation is always "acquire" (i.e.
    memory_order_acquire)
    - a plain store operation is always "release" (i.e.
    memory_order_release)
    - an atomic RMW operation is always sequentially consistent (i.e.
    memory_order_seq_cst)
    - the mfence instruction is
    std::atomic_thread_fence(memory_order_seq_cst)
    That is all. You can create such bindings for other hardware memory
    models you are interested in (PPC, Itanium, SPARC, etc.).
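    A minimal sketch of such an x86 binding (the wrapper names here are
    invented for illustration, and plain C++11 std::atomic is used):

```cpp
#include <atomic>

// Hypothetical helpers expressing the x86 -> C++0x binding above.
template<class T>
T plain_load(const std::atomic<T>& a)
{
    return a.load(std::memory_order_acquire);   // plain x86 load ~ acquire
}

template<class T>
void plain_store(std::atomic<T>& a, T v)
{
    a.store(v, std::memory_order_release);      // plain x86 store ~ release
}

template<class T>
T locked_xadd(std::atomic<T>& a, T v)
{
    return a.fetch_add(v, std::memory_order_seq_cst); // LOCK'ed RMW ~ seq_cst
}

inline void mfence()
{
    std::atomic_thread_fence(std::memory_order_seq_cst); // mfence instruction
}
```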

    And you can define such a binding to another abstract memory model,
    like the Java MM. Let's see:
    - a plain load is "relaxed" (i.e. memory_order_relaxed)
    - a plain store is "relaxed" (i.e. memory_order_relaxed)
    - a volatile load is "acquire" (i.e. memory_order_acquire)
    - a volatile store operation is "release" (i.e. memory_order_release)
    - an atomic RMW operation is always sequentially consistent (i.e.
    memory_order_seq_cst)
    But here are some caveats. First, you have to emulate the work of the
    GC, i.e. put all allocated memory on a special list, and free all
    allocated memory in test_suite::after(). Second, you have to manually
    emit a sequentially consistent memory fence between every volatile store
    and volatile load. Third, you have to manually initialize all variables
    to the default value (0). Fourth, there is no such thing as a data race,
    so you have to define all variables as std::atomic; this will
    effectively disable the data race detection mechanism. Well, actually
    you can use rl::var variables if you know that there must be no
    concurrent accesses to a variable; this will enable some automatic error
    detection wrt data races.
    Sounds not very cool... So I'm going to add built-in support for Java
    and CLI. Then the user would have to define something like RL_JAVA_MODE/
    RL_CLI_MODE, and get all those things out of the box. But yes, it
    will still be a C++ library. What do you think?
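    The Java binding above can be sketched the same way (the j* helper
    names are made up for this illustration; plain C++11 std::atomic shown):

```cpp
#include <atomic>

// Hypothetical helpers expressing the Java MM -> C++0x binding above.
template<class T>
T jload(const std::atomic<T>& a)            // plain Java load ~ relaxed
{
    return a.load(std::memory_order_relaxed);
}

template<class T>
void jstore(std::atomic<T>& a, T v)         // plain Java store ~ relaxed
{
    a.store(v, std::memory_order_relaxed);
}

template<class T>
T jvolatile_load(const std::atomic<T>& a)   // volatile load ~ acquire
{
    return a.load(std::memory_order_acquire);
}

template<class T>
void jvolatile_store(std::atomic<T>& a, T v)
{
    a.store(v, std::memory_order_release);  // volatile store ~ release...
    // ...plus the manually emitted seq_cst fence mentioned above, to get
    // Java's total order over volatile accesses
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
```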

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 3, 2008
    #2

  3. "Dmitriy V'jukov" <> wrote in message
    news:...
    > I want to announce a tool called Relacy Race Detector, which I've
    > developed. It's a synchronization algorithm verifier for C++0x's
    > relaxed memory model.

    [...]

    Have you tested the following algorithm yet?

    http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf

    I thought about coding it up. However, I think the algorithm might be
    covered by a patent application.
    Chris M. Thomasson, Aug 5, 2008
    #3
  4. "Dmitriy V'jukov" <> wrote in message
    news:...
    > I want to announce a tool called Relacy Race Detector, which I've
    > developed. It's a synchronization algorithm verifier for C++0x's
    > relaxed memory model.


    [...]


    After you integrate a mutex and condvar into the detection framework, I
    would like to model an eventcount.

    Also, it seems like you could parallelize the testing process somewhat... I
    need to think about this some more.

    Anyway, nice work.


    BTW, you asked in an e-mail if there is any market for this type of
    tool; well, I hope you don't mind if I answer that here...



    IMVHO, when C++0x is finally released, it should have a fairly big impact.
    I can see it now... You trademark a logo, and give a company the chance to
    stick it on their products to give their customers peace of mind:

    Customer: "Well, the software has the Relacy logo on the front of the box;
    it must be correct concurrent software indeed!"

    Vendor: "Our software lives up to the prerequisites of the Relacy logo. We
    are proud to be able to display it on our software packaging."


    I hope you make tons of $$$ with this tool Dmitriy. Get rich and then a
    little richer; AFAICT it's a winner.


    Any thoughts?


    =^D
    Chris M. Thomasson, Aug 5, 2008
    #4
  5. "Chris M. Thomasson" <> wrote in message
    news:gH4mk.11785$...
    > "Dmitriy V'jukov" <> wrote in message
    > news:...
    >>I want to announce tool called Relacy Race Detector, which I've
    >> developed.

    [...]
    >
    > [...]
    >
    > After you integrate a mutex and condvar into the detection framework, I
    > would like to model an eventcount.


    I would be interested to see how long it takes to detect the following bug
    you found in an older version of AppCore:

    http://groups.google.com/group/comp.programming.threads/browse_frm/thread/cf4a952ff5af7d

    [...]
    Chris M. Thomasson, Aug 5, 2008
    #5
  6. On 6 Aug, 02:54, "Chris M. Thomasson" <> wrote:

    > I would be interested to see how long it takes to detect the following bug
    > you found in an older version of AppCore:
    >
    > http://groups.google.com/group/comp.programming.threads/browse_frm/th...


    Ok, let's see. I've just finished a basic mutex and condvar
    implementation.

    Here is eventcount:

    class eventcount
    {
    public:
        eventcount()
            : count(0)
            , waiters(0)
        {}

        void signal_relaxed()
        {
            unsigned cmp = count($).load(std::memory_order_relaxed);
            signal_impl(cmp);
        }

        void signal()
        {
            unsigned cmp = count($).fetch_add(0, std::memory_order_seq_cst);
            signal_impl(cmp);
        }

        unsigned get()
        {
            unsigned cmp = count($).fetch_or(0x80000000,
                std::memory_order_seq_cst);
            return cmp & 0x7FFFFFFF;
        }

        void wait(unsigned cmp)
        {
            unsigned ec = count($).load(std::memory_order_seq_cst);
            if (cmp == (ec & 0x7FFFFFFF))
            {
                guard.lock($);
                ec = count($).load(std::memory_order_seq_cst);
                if (cmp == (ec & 0x7FFFFFFF))
                {
                    waiters($) += 1;
                    cv.wait(guard, $);
                }
                guard.unlock($);
            }
        }

    private:
        std::atomic<unsigned> count;
        rl::var<unsigned> waiters;
        mutex guard;
        condition_variable_any cv;

        void signal_impl(unsigned cmp)
        {
            if (cmp & 0x80000000)
            {
                guard.lock($);
                while (false == count($).compare_swap(cmp,
                    (cmp + 1) & 0x7FFFFFFF, std::memory_order_relaxed));
                unsigned w = waiters($);
                waiters($) = 0;
                guard.unlock($);
                if (w)
                    cv.notify_all($);
            }
        }
    };

    Here is the queue:

    template<typename T>
    class spsc_queue
    {
    public:
        spsc_queue()
        {
            node* n = RL_NEW node ();
            head($) = n;
            tail($) = n;
        }

        ~spsc_queue()
        {
            RL_ASSERT(head($) == tail($));
            RL_DELETE((node*)head($));
        }

        void enqueue(T data)
        {
            node* n = RL_NEW node (data);
            head($)->next($).store(n, std::memory_order_release);
            head($) = n;
            ec.signal_relaxed();
        }

        T dequeue()
        {
            T data = try_dequeue();
            while (0 == data)
            {
                int cmp = ec.get();
                data = try_dequeue();
                if (data)
                    break;
                ec.wait(cmp);
                data = try_dequeue();
                if (data)
                    break;
            }
            return data;
        }

    private:
        struct node
        {
            std::atomic<node*> next;
            rl::var<T> data;

            node(T data = T())
                : next(0)
                , data(data)
            {}
        };

        rl::var<node*> head;
        rl::var<node*> tail;

        eventcount ec;

        T try_dequeue()
        {
            node* t = tail($);
            node* n = t->next($).load(std::memory_order_acquire);
            if (0 == n)
                return 0;
            T data = n->data($);
            RL_DELETE(t);
            tail($) = n;
            return data;
        }
    };

    Here is the test:

    struct spsc_queue_test : rl::test_suite<spsc_queue_test, 2>
    {
        spsc_queue<int> q;

        void thread(unsigned thread_index)
        {
            if (0 == thread_index)
            {
                q.enqueue(11);
            }
            else
            {
                int d = q.dequeue();
                RL_ASSERT(11 == d);
            }
        }
    };

    int main()
    {
        rl::simulate<spsc_queue_test>();
    }

    It took around 40 minutes.

    Let's run it!
    Here is the output (note the error is detected on iteration 3):

    struct spsc_queue_test
    DEADLOCK
    iteration: 3

    execution history:
    [0] 1: [BEFORE BEGIN], in rl::context_impl<struct
    spsc_queue_test,class rl::random_scheduler<2> >::fiber_proc_impl,
    context.hpp(341)
    [1] 1: <0035398C> atomic store, value=-1, (prev value=0),
    order=relaxed, in rl::generic_mutex<0>::generic_mutex, stdlib/
    mutex.hpp(77)
    [2] 1: memory allocation: addr=00353A80, size=52, in
    spsc_queue<int>::spsc_queue, peterson.cpp(397)
    [3] 1: <00353938> store, value=00353A80, in
    spsc_queue<int>::spsc_queue, peterson.cpp(398)
    [4] 1: <00353948> store, value=00353A80, in
    spsc_queue<int>::spsc_queue, peterson.cpp(399)
    [5] 1: [BEFORE END], in rl::context_impl<struct spsc_queue_test,class
    rl::random_scheduler<2> >::fiber_proc_impl, context.hpp(356)
    [6] 1: <00353948> load, value=00353A80, in
    spsc_queue<int>::try_dequeue, peterson.cpp(452)
    [7] 1: <00353A80> atomic load, value=00000000, order=acquire, in
    spsc_queue<int>::try_dequeue, peterson.cpp(453)
    [8] 1: <00353958> fetch_or , prev=0, op=2147483648, new=2147483648,
    order=seq_cst, in eventcount::get, peterson.cpp(349)
    [9] 1: <00353948> load, value=00353A80, in
    spsc_queue<int>::try_dequeue, peterson.cpp(452)
    [10] 0: memory allocation: addr=0035BCA0, size=52, in
    spsc_queue<int>::enqueue, peterson.cpp(410)
    [11] 0: <00353938> load, value=00353A80, in spsc_queue<int>::enqueue,
    peterson.cpp(411)
    [12] 1: <00353A80> atomic load, value=00000000, order=acquire, in
    spsc_queue<int>::try_dequeue, peterson.cpp(453)
    [13] 0: <00353A80> atomic store, value=0035BCA0, (prev
    value=00000000), order=release, in spsc_queue<int>::enqueue,
    peterson.cpp(411)
    [14] 0: <00353938> store, value=0035BCA0, in spsc_queue<int>::enqueue,
    peterson.cpp(412)
    [15] 0: <00353958> atomic load, value=0 [NOT CURRENT], order=relaxed,
    in eventcount::signal_relaxed, peterson.cpp(337)
    [16] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
    eventcount::wait, peterson.cpp(355)
    [17] 1: <0035398C> CAS succ orig=-1, cmp=-1, xchg=1, order=acquire, in
    rl::generic_mutex<0>::lock, stdlib/mutex.hpp(117)
    [18] 1: <0035398C> mutex: lock, in eventcount::wait, peterson.cpp(358)
    [19] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
    eventcount::wait, peterson.cpp(359)
    [20] 1: <0035397C> load, value=0, in eventcount::wait,
    peterson.cpp(362)
    [21] 1: <0035397C> store, value=1, in eventcount::wait,
    peterson.cpp(362)
    [22] 1: <0035398C> atomic store, value=-1, (prev value=1),
    order=release, in rl::generic_mutex<0>::unlock, stdlib/mutex.hpp(209)
    [23] 1: <0035398C> mutex: unlock, in eventcount::wait,
    peterson.cpp(363)
    [24] 1: blocking thread, in eventcount::wait, peterson.cpp(363)
    [25] 1: deadlock detected, in eventcount::wait, peterson.cpp(363)

    thread 0:
    [10] 0: memory allocation: addr=0035BCA0, size=52, in
    spsc_queue<int>::enqueue, peterson.cpp(410)
    [11] 0: <00353938> load, value=00353A80, in spsc_queue<int>::enqueue,
    peterson.cpp(411)
    [13] 0: <00353A80> atomic store, value=0035BCA0, (prev
    value=00000000), order=release, in spsc_queue<int>::enqueue,
    peterson.cpp(411)
    [14] 0: <00353938> store, value=0035BCA0, in spsc_queue<int>::enqueue,
    peterson.cpp(412)
    [15] 0: <00353958> atomic load, value=0 [NOT CURRENT], order=relaxed,
    in eventcount::signal_relaxed, peterson.cpp(337)

    thread 1:
    [0] 1: [BEFORE BEGIN], in rl::context_impl<struct
    spsc_queue_test,class rl::random_scheduler<2> >::fiber_proc_impl,
    context.hpp(341)
    [1] 1: <0035398C> atomic store, value=-1, (prev value=0),
    order=relaxed, in rl::generic_mutex<0>::generic_mutex, stdlib/
    mutex.hpp(77)
    [2] 1: memory allocation: addr=00353A80, size=52, in
    spsc_queue<int>::spsc_queue, peterson.cpp(397)
    [3] 1: <00353938> store, value=00353A80, in
    spsc_queue<int>::spsc_queue, peterson.cpp(398)
    [4] 1: <00353948> store, value=00353A80, in
    spsc_queue<int>::spsc_queue, peterson.cpp(399)
    [5] 1: [BEFORE END], in rl::context_impl<struct spsc_queue_test,class
    rl::random_scheduler<2> >::fiber_proc_impl, context.hpp(356)
    [6] 1: <00353948> load, value=00353A80, in
    spsc_queue<int>::try_dequeue, peterson.cpp(452)
    [7] 1: <00353A80> atomic load, value=00000000, order=acquire, in
    spsc_queue<int>::try_dequeue, peterson.cpp(453)
    [8] 1: <00353958> fetch_or , prev=0, op=2147483648, new=2147483648,
    order=seq_cst, in eventcount::get, peterson.cpp(349)
    [9] 1: <00353948> load, value=00353A80, in
    spsc_queue<int>::try_dequeue, peterson.cpp(452)
    [12] 1: <00353A80> atomic load, value=00000000, order=acquire, in
    spsc_queue<int>::try_dequeue, peterson.cpp(453)
    [16] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
    eventcount::wait, peterson.cpp(355)
    [17] 1: <0035398C> CAS succ orig=-1, cmp=-1, xchg=1, order=acquire, in
    rl::generic_mutex<0>::lock, stdlib/mutex.hpp(117)
    [18] 1: <0035398C> mutex: lock, in eventcount::wait, peterson.cpp(358)
    [19] 1: <00353958> atomic load, value=2147483648, order=seq_cst, in
    eventcount::wait, peterson.cpp(359)
    [20] 1: <0035397C> load, value=0, in eventcount::wait,
    peterson.cpp(362)
    [21] 1: <0035397C> store, value=1, in eventcount::wait,
    peterson.cpp(362)
    [22] 1: <0035398C> atomic store, value=-1, (prev value=1),
    order=release, in rl::generic_mutex<0>::unlock, stdlib/mutex.hpp(209)
    [23] 1: <0035398C> mutex: unlock, in eventcount::wait,
    peterson.cpp(363)
    [24] 1: blocking thread, in eventcount::wait, peterson.cpp(363)
    [25] 1: deadlock detected, in eventcount::wait, peterson.cpp(363)

    Operations on condition variables are not yet included in the execution
    history.

    Let's replace 'ec.signal_relaxed()' with 'ec.signal()'. Here is the
    output:

    struct spsc_queue_test
    6% (65536/1000000)
    13% (131072/1000000)
    19% (196608/1000000)
    26% (262144/1000000)
    32% (327680/1000000)
    39% (393216/1000000)
    45% (458752/1000000)
    52% (524288/1000000)
    58% (589824/1000000)
    65% (655360/1000000)
    72% (720896/1000000)
    78% (786432/1000000)
    85% (851968/1000000)
    91% (917504/1000000)
    98% (983040/1000000)
    iterations: 1000000
    total time: 2422
    throughput: 412881

    Test passed. Throughput is around 400'000 test executions per second.

    The test also reveals some errors with memory fences.
    In the signaling function you are using an acquire fence in the CAS,
    but relaxed is enough here:

    void signal_impl(unsigned cmp)
    {
        if (cmp & 0x80000000)
        {
            guard.lock($);
            while (false == count($).compare_swap(cmp,
                (cmp + 1) & 0x7FFFFFFF, std::memory_order_relaxed));
            unsigned w = waiters($);
            waiters($) = 0;
            guard.unlock($);
            if (w)
                cv.notify_all($);
        }
    }

    Also, the following naive approach doesn't work (at least in C++0x):

    void signal()
    {
        std::atomic_thread_fence(std::memory_order_seq_cst);
        signal_relaxed();
    }

    You have to execute a sequentially consistent atomic RMW on the 'ec'
    variable here, because a sequentially consistent fence and a sequentially
    consistent atomic RMW operation don't synchronize-with each other.
    Hmmm... This is strange. But I can't find anything about this in the
    draft... Anthony Williams said that the committee made some changes to
    the initial proposal on fences, but they are not yet published...


    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 6, 2008
    #6
  7. "Dmitriy V'jukov" <> wrote in message
    news:...
    On 6 Aug, 02:54, "Chris M. Thomasson" <> wrote:

    > > I would be interested to see how long it takes to detect the following
    > > bug
    > > you found in an older version of AppCore:
    > >
    > > http://groups.google.com/group/comp.programming.threads/browse_frm/th...


    > Ok, let's see. I've just finished basic mutex and condvar
    > implementation.


    Can you e-mail the new version to me please?




    > Here is eventcount:


    class eventcount
    {
        [...]
        void wait(unsigned cmp)
        {
            unsigned ec = count($).load(std::memory_order_seq_cst);
            if (cmp == (ec & 0x7FFFFFFF))
            {
                guard.lock($);
                ec = count($).load(std::memory_order_seq_cst);
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    I don't think that you need a memory barrier under the lock, because the
    only mutation which will make the following compare succeed is also
    performed under the same lock. The signal slow path is under the lock,
    and the waiter slow path is under the lock. However, I don't know if
    this violates some rule in C++0x.

                if (cmp == (ec & 0x7FFFFFFF))
                {
                    waiters($) += 1;
                    cv.wait(guard, $);
                }
                guard.unlock($);
            }
        }
    };

    > Here is the queue:


    [...]

    > Here is the test:


    [...]

    > It took around 40 minutes.


    > Let's run it!
    > Here is output (note error detected on iteration 3):


    > struct spsc_queue_test
    > DEADLOCK
    > iteration: 3


    Cool; it sure does detect the race-condition in the old AppCore! Very nice
    indeed.



    [...]

    > Operations on condition variables are not yet in execution history.


    > Let's replace 'ec.signal_relaxed()' with 'ec.signal()'. Here is the
    > output:


    > struct spsc_queue_test

    [...]
    > iterations: 1000000
    > total time: 2422
    > throughput: 412881


    > Test passed. Throughput is around 400'000 test executions per second.


    Beautiful. My eventcount algorithm works!

    =^D




    > Also test reveals some errors with memory fences.
    > In signaling function you are using acquire fence in cas, but relaxed
    > is enough here:
    > void signal_impl(unsigned cmp)
    > {
    >     if (cmp & 0x80000000)
    >     {
    >         guard.lock($);
    >         while (false == count($).compare_swap(cmp,
    >             (cmp + 1) & 0x7FFFFFFF, std::memory_order_relaxed));
    >         unsigned w = waiters($);
    >         waiters($) = 0;
    >         guard.unlock($);
    >         if (w)
    >             cv.notify_all($);
    >     }
    > }


    Right. The current public AppCore atomic API is not fine-grain enough to
    allow for relaxed operations. I need to change that.




    > Also. Following naive approach doesn't work (at least in C++0x):
    > void signal()
    > {
    > std::atomic_thread_fence(std::memory_order_seq_cst);
    > signal_relaxed();
    > }


    Good thing the MFENCE works on x86!




    > You have to execute a sequentially consistent atomic RMW on the 'ec'
    > variable here, because a sequentially consistent fence and a
    > sequentially consistent atomic RMW operation don't synchronize-with
    > each other.
    > Hmmm... This is strange. But I can't find anything about this in the
    > draft... Anthony Williams said that the committee made some changes to
    > the initial proposal on fences, but they are not yet published...


    This is weird. A seq_cst fence does not work with a seq_cst RMW!? Do you
    happen to know the rationale for this? IMVHO, it makes no sense at
    all... I must be missing something.

    :^/
    Chris M. Thomasson, Aug 6, 2008
    #7
  8. On Aug 6, 12:23 am, "Chris M. Thomasson" <> wrote:

    > Have you tested the following algorithm yet?
    >
    > http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf


    Yes and no. No, I didn't test this particular algorithm. But I tested my
    analogous algorithm for a work-stealing deque.
    It's an extremely hardcore lock-free algorithm with 500+ LOC. And it
    hung sometimes, and sometimes caused SIGSEGV. I tried to debug
    it manually, I tried to insert some asserts, I tried to
    insert some "yields", I tried to review it several times, I tried to
    analyze dumps. Uselessly.
    When I ran it under Relacy Race Detector I localized and fixed 2 subtle
    algo errors and clarified some points about memory fences in about 3
    hours. And it was almost mechanical work, not some black magic.

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 6, 2008
    #8
  9. Dmitriy V'jukov, Aug 6, 2008
    #9
  10. On Aug 6, 2:46 am, "Chris M. Thomasson" <> wrote:

    > IMVHO, when C++0x is finally released, it should have a fairly big impact.
    > I can see it now... You trademark a logo, and give a company the chance to
    > stick it on their products to give their customers peace of mind:
    >
    > Customer: "Well, the software has the Relacy logo on the front of the box;
    > it must be correct concurrent software indeed!"
    >
    > Vendor: "Our software lives up to the prerequisites of the Relacy logo. We
    > are proud to be able to display it on our software packaging."
    >
    > I hope you make tons of $$$ with this tool Dmitriy. Get rich and then a
    > little richer; AFAICT it's a winner.
    >
    > Any thoughts?
    >
    > =^D


    :)
    Sounds great!

    I think I can start developing 2 logos:
    [RELACY APPROVED]
    and the second:
    [RELACY BUSTED]
    :)

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 6, 2008
    #10
  11. Peter Dimov wrote:

    On Aug 6, 5:31 am, "Dmitriy V'jukov" <> wrote:

    > Also. Following naive approach doesn't work (at least in C++0x):
    >         void signal()
    >         {
    >                 std::atomic_thread_fence(std::memory_order_seq_cst);
    >                 signal_relaxed();
    >         }


    I'm not sure that I understand the algorithm here, but the reason this
    doesn't introduce a synchronize-with relationship is that the fence
    needs to follow the load, not precede it.

    load.relaxed count
    fence.seq_cst

    does synchronize with

    store.seq_cst count, v

    when the load sees the value v written by the store.

    If you do need a leading seq_cst fence, why not just use a seq_cst
    load?
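    Spelled out as a compilable sketch (written here with C++11 names,
    which match the C++0x draft on this point; the function names are
    invented for illustration):

```cpp
#include <atomic>

std::atomic<unsigned> count(0);

// The trailing fence after the relaxed load is what can synchronize
// with a seq_cst store: load first, fence follows.
unsigned reader()
{
    unsigned v = count.load(std::memory_order_relaxed); // load first
    std::atomic_thread_fence(std::memory_order_seq_cst); // fence follows it
    return v;
}

void writer(unsigned v)
{
    // synchronizes with the reader's fence when the load observes v
    count.store(v, std::memory_order_seq_cst);
}
```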
    Peter Dimov, Aug 6, 2008
    #11
  12. On Aug 6, 5:08 pm, Peter Dimov <> wrote:
    > On Aug 6, 5:31 am, "Dmitriy V'jukov" <> wrote:
    >
    > > Also, the following naive approach doesn't work (at least in C++0x):
    > > void signal()
    > > {
    > >     std::atomic_thread_fence(std::memory_order_seq_cst);
    > >     signal_relaxed();
    > > }

    >
    > I'm not sure that I understand the algorithm here, but the reason this
    > doesn't introduce a synchronize-with relationship is that the fence
    > needs to follow the load, not precede it.


    I will try to describe the issue in more detail in a separate post in
    the near future.

    > load.relaxed count
    > fence.seq_cst
    >
    > does synchronize with
    >
    > store.seq_cst count, v
    >
    > when the load sees the value v written by the store.
    >
    > If you do need a leading seq_cst fence, why not just use a seq_cst
    > load?


    Because this seq_cst fence must introduce synchronizes-with not only
    from the second thread to this thread; possibly it must also introduce
    synchronizes-with from this thread to the second thread. I.e. either the
    first thread must synchronize-with the second thread, or the second
    thread must synchronize-with the first thread (depending on which
    thread's seq_cst operation is first in the global order).
    If I use a seq_cst load, then it's only possible that the second thread
    synchronizes-with this thread, but not vice versa, because a seq_cst
    load is only an acquire operation; it's NOT a release.

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 6, 2008
    #12
  13. On Aug 6, 5:42 pm, "Dmitriy V'jukov" <> wrote:
    > On Aug 6, 5:08 pm, Peter Dimov <> wrote:
    >
    > > On Aug 6, 5:31 am, "Dmitriy V'jukov" <> wrote:

    >
    > > > Also. Following naive approach doesn't work (at least in C++0x):
    > > > void signal()
    > > > {
    > > > std::atomic_thread_fence(std::memory_order_seq_cst);
    > > > signal_relaxed();
    > > > }

    >
    > > I'm not sure that I understand the algorithm here, but the reason this
    > > doesn't introduce a synchronize-with relationship is that the fence
    > > needs to follow the load, not precede it.

    >
    > I will try to describe the issue in more detail in a separate post in
    > the near future.


    Here:
    http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a9513cf616dc168c

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 6, 2008
    #13
  14. "Dmitriy V'jukov" <> wrote in message
    news:...
    > On Aug 6, 12:23 am, "Chris M. Thomasson" <> wrote:
    >
    >> Have you testing the following algorithm yet?
    >>
    >> http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf

    >
    > Yes and no. No, I didn't test this particular algorithm. But I did
    > test my analogous algorithm for a work-stealing deque.
    > It's an extremely hardcore lock-free algorithm with 500+ LOC. And it
    > was hanging up sometimes, and sometimes caused SIGSEGV. I was trying
    > to debug it manually, I was trying to insert some asserts, I was
    > trying to insert some "yields", I was trying to review it several
    > times, I was trying to analyze dumps. Uselessly.
    > When I ran it under Relacy Race Detector I localized and fixed 2
    > subtle algo errors and clarified some moments with memory fences in
    > about 3 hours. And it was almost mechanical work, not some black
    > magic.


    Did you develop Relacy in part so that you can totally "deconstruct" your
    work-stealing algorithm into a form in which its "random" run-time behavior
    could be studied at another level and therefore be _reliably_ debugged?

    ;^D
    Chris M. Thomasson, Aug 7, 2008
    #14
  15. "Peter Dimov" <> wrote in message
    news:...
    On Aug 6, 5:31 am, "Dmitriy V'jukov" <> wrote:

    > > Also. Following naive approach doesn't work (at least in C++0x):
    > > void signal()
    > > {
    > > std::atomic_thread_fence(std::memory_order_seq_cst);
    > > signal_relaxed();
    > > }


    > I'm not sure that I understand the algorithm here, but the reason this
    > doesn't introduce a synchronize-with relationship is that the fence
    > needs to follow the load, not precede it.


    > load.relaxed count
    > fence.seq_cst


    > does synchronize with


    > store.seq_cst count, v


    > when the load sees the value v written by the store.


    > If you do need a leading seq_cst fence, why not just use a seq_cst
    > load?


    That can be done as long as the full fence is executed before the load of
    the eventcount variable. BTW Peter, here is a very simple pseudo-code impl
    of my algorithm:

    http://groups.google.com/group/comp.programming.threads/browse_frm/thread/aa8c62ad06dbb380

    Here is where Joe Seigh points out the membars:

    http://groups.google.com/group/comp.programming.threads/msg/8e7a0379b55557c0


    AFAICT, the release barrier in a user algorithm is in the wrong place. The
    load cannot be allowed to be hoisted up above the production of any item in
    a user container class. Basically, the production of an item into a user
    container usually only requires a release barrier. It's in the wrong place
    wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
    in order to allow the production to be completely visible BEFORE the load
    from the eventcount is performed... Where am I going wrong?

    Read here for more info:

    http://groups.google.com/group/comp.programming.threads/browse_frm/thread/cf4a952ff5af7d

    Humm...
    Chris M. Thomasson, Aug 7, 2008
    #15
  16. On Aug 7, 3:09 pm, "Chris M. Thomasson" <> wrote:

    > That can be done as long as the full fence is executed before the load of
    > the eventcount variable. BTW Peter, here is a very simple pseudo-code impl
    > of my algorithm:
    >
    > http://groups.google.com/group/comp.programming.threads/browse_frm/th...
    >
    > Here is where Joe Seigh points out the membars:
    >
    > http://groups.google.com/group/comp.programming.threads/msg/8e7a0379b...
    >
    > AFAICT, the release barrier in a user algorithm is in the wrong place. The
    > load cannot be allowed to be hoisted up above the production of any item in
    > a user container class. Basically, the production of an item into a user
    > container usually only requires a release barrier. It's in the wrong place
    > wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
    > in order to allow the production to be completely visible BEFORE the load
    > from the eventcount is performed... Where am I going wrong?



    The problem with C++0x in this context is the following.
    Assume the container is a lock-free stack, and 'production' is
    implemented as a seq_cst atomic RMW on the 'head' variable. In this
    situation you can reasonably presume that you can use 'relaxed
    signaling', i.e. a relaxed load of the eventcount w/o any additional
    fences, because the seq_cst atomic RMW on 'head' already contains a
    strong enough fence in the right place. WRONG! Why? See here:
    http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a9513cf616dc168c


    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 7, 2008
    #16
  17. On Aug 7, 2:09 pm, "Chris M. Thomasson" <> wrote:

    > AFAICT, the release barrier in a user algorithm is in the wrong place. The
    > load cannot be allowed to be hoisted up above the production of any item in
    > a user container class. Basically, the production of an item into a user
    > container usually only requires a release barrier. It's in the wrong place
    > wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
    > in order to allow the production to be completely visible BEFORE the load
    > from the eventcount is performed... Where am I going wrong?
    >
    > Read here for more info:
    >
    > http://groups.google.com/group/comp.programming.threads/browse_frm/th...
    >
    > Humm...


    Hmmm.

    In terms of the C++MM, the options I see are:

    1. store.seq_cst in push, load.seq_cst in signal, cas.seq_cst in get
    2. fence.seq_cst+load.relaxed in signal, cas.relaxed+fence.seq_cst in
    get
    3. RMW.seq_cst in signal, cas.seq_cst in get

    and maybe

    4. RMW.acq_rel in signal, cas.acq_rel in get
    Peter Dimov, Aug 7, 2008
    #17
  18. "Peter Dimov" <> wrote in message
    news:...
    On Aug 7, 2:09 pm, "Chris M. Thomasson" <> wrote:

    > > AFAICT, the release barrier in a user algorithm is in the wrong place.
    > > The
    > > load cannot be allowed to be hoisted up above the production of any item
    > > in
    > > a user container class. Basically, the production of an item into a user
    > > container usually only requires a release barrier. It's in the wrong
    > > place
    > > wrt coupling the algorithm with my eventcount. A StoreLoad needs to be
    > > there
    > > in order to allow the production to be completely visible BEFORE the
    > > load
    > > from the eventcount is performed... Where am I going wrong?
    > >
    > > Read here for more info:
    > >
    > > http://groups.google.com/group/comp.programming.threads/browse_frm/th...
    > >
    > > Humm...


    > Hmmm.


    > In terms of the C++MM, the options I see are:


    > 1. store.seq_cst in push, load.seq_cst in signal, cas.seq_cst in get
    > 2. fence.seq_cst+load.relaxed in signal, cas.relaxed+fence.seq_cst in
    > get
    > 3. RMW.seq_cst in signal, cas.seq_cst in get


    > and maybe


    > 4. RMW.acq_rel in signal, cas.acq_rel in get


    For now, I choose number 2 because in the current eventcount algorithm
    as-is, the `signal()' procedure needs a preceding StoreLoad, and the `get()'
    function needs a trailing StoreLoad. AFAICT, this is fine... As for the
    other choices, well, I don't like number 1 because it moves memory
    visibility rules directly into an end-user algorithm. I dislike 3-4 because
    of the dummy RMW; for instance, I could use a `seq_cst' fetch-and-add on the
    eventcount state with an addend of zero. However, IMVHO, that looks like a
    fairly dirty hack indeed... I mean, IMVHO, an RMW which does not perform a
    mutation is basically pointless...
    Chris M. Thomasson, Aug 7, 2008
    #18
  19. On Aug 7, 7:13 pm, "Chris M. Thomasson" <> wrote:

    > > In terms of the C++MM, the options I see are:
    > > 1. store.seq_cst in push, load.seq_cst in signal, cas.seq_cst in get
    > > 2. fence.seq_cst+load.relaxed in signal, cas.relaxed+fence.seq_cst in
    > > get
    > > 3. RMW.seq_cst in signal, cas.seq_cst in get
    > > and maybe
    > > 4. RMW.acq_rel in signal, cas.acq_rel in get

    >
    > For now, I choose number 2 because in the current eventcount algorithm
    > as-is, the `signal()' procedure needs a preceding StoreLoad, and the `get()'
    > function needs a trailing StoreLoad. AFAICT, this is fine... As for the
    > other choices, well, I don't like number 1 because it moves memory
    > visibility rules directly into an end-user algorithm. I dislike 3-4 because
    > of the dummy RMW; for instance, I could use a `seq_cst' fetch-and-add on the
    > eventcount state with an addend of zero. However, IMVHO, that looks like
    > fairly dirty hack indeed... I mean, IMVHO, a RMW which does not perform a
    > mutation is basically pointless...



    Consider an implementation of number 2 on x86.
    Producer:
    1. Interlocked RMW to enqueue the element
    2. mfence in signal (even more expensive than the interlocked RMW, and
    absolutely useless)

    Consumer:
    3. Interlocked RMW in get
    4. mfence in get (even more expensive than the interlocked RMW, and
    absolutely useless)

    Maybe the compiler can optimize 4 away, because 3 and 4 will be in one
    function. But I am not sure that the first versions of C++0x compilers
    will be so smart. And it's questionable whether the compiler will be
    able to optimize 2 away, because 1 and 2 will be in different functions
    and probably in different source files.

    Do you like this? Don't be in a hurry to forget your asm skills ;)

    Dmitriy V'jukov
    Dmitriy V'jukov, Aug 7, 2008
    #19
  20. On Aug 7, 6:31 pm, "Dmitriy V'jukov" <> wrote:

    > Consider implementation of number 2 on x86.
    > Producer:
    > 1. Interlocked RMW to enqueue element
    > 2. mfence in signal (even more expensive than Interlocked RMW, and
    > absolutely useless)


    It might be interesting to time the actual mfence overhead when the
    preceding locked RMW instruction has already drained the store queue.
    Peter Dimov, Aug 7, 2008
    #20