STL Queue and multithreading


naveen

Hi guys,
For an STL queue, I would like to analyze the following scenarios:
1. Multiple threads writing to the queue and multiple threads reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
2. Multiple threads writing to the queue and a **single** thread reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
3. A single thread writing to the queue and a single thread (different from
the first one) reading from it. Is the mutex needed in this case
as well, or can we live without a mutex?

I would like to minimize mutex usage, so please go through the
scenarios written above.
If a mutex is needed in all the cases, can you suggest some other
data structure where messages can be written and read without locks, or
with minimal locking.
Thanks,
Naveen
 

Anthony Williams

naveen said:
For an STL queue, I would like to analyze the following scenarios:
1. Multiple threads writing to the queue and multiple threads reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
2. Multiple threads writing to the queue and a **single** thread reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
3. A single thread writing to the queue and a single thread (different from
the first one) reading from it. Is the mutex needed in this case
as well, or can we live without a mutex?

For a std::queue<T> you need a mutex in all cases. None of the C++
standard library types can be modified by multiple threads safely
without a mutex, apart from a few documented cases in the C++0x library
(e.g. std::atomic<T>).

naveen said:
If a mutex is needed in all the cases, can you suggest some other
data structure where messages can be written and read without locks, or
with minimal locking.

If you want no locks, search for "lock free queue". There are several
around, and some have been posted to this newsgroup.

For lock-free stuff, the use cases are important --- the "best"
structure for each case you outlined above is likely to be different.
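
For the mutex-based version, the idea is simply to take the lock around
every access. Here is a minimal sketch of a locking wrapper around
std::queue (illustrative only, not production code, using the C++0x
<mutex> and <condition_variable> facilities):
____________________________________________________
#include <queue>
#include <mutex>
#include <condition_variable>

// Minimal sketch of a locked queue: every access takes the mutex,
// and pop() blocks until an element is available.
template <typename T>
class locked_queue
{
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;

public:
    void push(T value)
    {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(value);
        cv_.notify_one();
    }

    T pop()
    {
        std::unique_lock<std::mutex> lock(m_);
        while (q_.empty())
            cv_.wait(lock);
        T value = q_.front();
        q_.pop();
        return value;
    }
};
____________________________________________________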

Anthony
 

Goran Pusic

Hi guys,
For an STL queue, I would like to analyze the following scenarios:
1. Multiple threads writing to the queue and multiple threads reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
2. Multiple threads writing to the queue and a **single** thread reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
3. A single thread writing to the queue and a single thread (different from
the first one) reading from it. Is the mutex needed in this case
as well, or can we live without a mutex?

You need synchronization all the time.
I would like to minimize mutex usage

Minimize? Why? What performance numbers do you have to back that? Do you
have heavy synchronization that is hurting your speed? If so, you should
be looking for a mutex that matches your usage pattern. You should be
looking at various read/write lock implementations. You should be able
to find ready-made containers, too.

Goran.
 

Juha Nieminen

Goran Pusic said:
Minimize? Why? What performance numbers you have to back that?

It's a well-known fact that software mutexes are extremely slow and
can seriously impact performance in situations where the data container
needs to be accessed a very large number of times very quickly.
 

Fred Zwarts

Juha Nieminen said:
It's a well-known fact that software mutexes are extremely slow and
can seriously impact performance in situations where the data
container needs to be accessed a very large number of times very quickly.

Can you back this up?
My experience with POSIX mutexes is very different.
Normally the time for (un)locking a mutex is negligible compared to other processing
(except in the case where locking results in a wait for another thread,
or unlocking results in waking another thread).
 

James Kanze

It's a well-known fact that software mutexes are extremely
slow and can seriously impact performance in situations where
the data container needs to be accessed a very large number of
times very quickly.

Beware of "well known facts" that aren't backed by actual
measures. I don't know about Windows, but under Solaris, an
uncontested pthread_mutex_lock is not significantly slower than
any of the alternatives, at least in the benchmarks I've run.
From what I understand, Linux and CriticalSection under Windows
use a similar technique, avoiding a switch to system state
unless there is contention.
 

Chris M. Thomasson

Juha Nieminen said:
It's a well-known fact that software mutexes are extremely slow and
can seriously impact performance in situations where the data container
needs to be accessed a very large number of times very quickly.

FWIW, mutex lock/unlock usually costs 2 atomic RMW operations and 2 memory
barriers, one of which may very well be a #StoreLoad. All of that can be
fairly expensive...

Also, if you don't carefully lay out the mutex data structures in memory, you
can suffer from performance-destroying false sharing... For instance, it's
probably a good idea to pad a `CRITICAL_SECTION' data structure on Windows.
Something like:
____________________________________________________
#define L2CACHELINE 128

union CRITICAL_SECTION_PAD
{
    CRITICAL_SECTION m_lock;
    char l2cache_pad[L2CACHELINE];
};
____________________________________________________



Then you have to make sure that this structure is aligned on a cache line
boundary in memory. Sometimes it helps to strive to store the actual shared
data and the mutex data-structure within a single cache line. Something
like:
____________________________________________________
#define L2CACHELINE 128

struct shared_data
{
    char m_var0;
    int m_var1;
    int m_var2;
};

union SHARED_DATA_CS_PAD
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
    char l2cache_pad[L2CACHELINE];
};
____________________________________________________




Using user-space mutexes efficiently can be "tricky"...


;^)
 

Jorgen Grahn

Beware of "well known facts" that aren't backed by actual
measures. I don't know about Windows, but under Solaris, an
uncontested pthread_mutex_lock is not significantly slower than
any of the alternatives, at least in the benchmarks I've run.

What alternatives do you mean? I think the proper question here is
more like "what useful work could I have done in the time I used to
lock and unlock mutexes"?

But the answer is (you surely agree): "If it really matters to you,
benchmark".

/Jorgen
 

James Kanze

What alternatives do you mean?

I was thinking of various lock free algorithms, using atomic
modifications (CAS, atomic increment, etc.). These all require
some sort of fence or membar, which is all that
pthread_mutex_lock does if there is no contention.
I think the proper question here is
more like "what useful work could I have done in the time I used to
lock and unlock mutexes"?

*IF* there is a performance problem, the question is: what are
the alternatives, and will they be faster? But that's a big if.
But the answer is (you surely agree): "If it really matters to you,
benchmark".

Well, I'd start by assuming that it doesn't matter, until proven
otherwise. :) Even benchmarking is unnecessary extra work if
the code is fast enough when written normally. (But of course,
if it isn't, you don't do anything until you've profiled.)
 

James Kanze

FWIW, mutex lock/unlock usually costs 2 atomic RMW operations
and 2 memory barriers, one of which may very well be
a #StoreLoad. All of that can be fairly expensive...

Compared to what? (And why two atomic operations? I think
I could implement a Mutex lock with a single CAS.)

The memory barriers are expensive compared to any single
instruction, but in most programs, you shouldn't require the
lock very often. And they aren't expensive, or even
significant, compared to any other important calculations.
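
To make the single-CAS point concrete, here is a hedged sketch of a toy
spinlock whose lock path is one CAS and whose unlock is a plain release
store rather than an RMW (illustrative only; a real mutex also has to
block and wake waiters, which is why its unlock path usually needs an
atomic RMW as well):
____________________________________________________
#include <atomic>

// Toy spinlock, illustration only: lock() is a single CAS (retried
// under contention), unlock() is a release store, not an atomic RMW.
class spinlock
{
    std::atomic<bool> locked_;

public:
    spinlock() : locked_(false) {}

    void lock()
    {
        bool expected = false;
        while (!locked_.compare_exchange_weak(expected, true,
                                               std::memory_order_acquire))
        {
            expected = false; // CAS updated 'expected'; reset and retry
        }
    }

    void unlock()
    {
        locked_.store(false, std::memory_order_release);
    }
};
____________________________________________________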
 

Chris M. Thomasson

James Kanze said:
[...]
FWIW, mutex lock/unlock usually costs 2 atomic RMW operations
and 2 memory barriers, one of which may very well be
a #StoreLoad. All of that can be fairly expensive...

Compared to what?

Well, perhaps compared to efficient synchronization algorithms that are
based on plain atomic loads and stores for instance. Various
single-producer/single-consumer algorithms come to mind; it's basically
impossible for a mutex to outperform them... Period.
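
To make the single-producer/single-consumer claim concrete, here is a
hedged sketch of a bounded SPSC ring buffer built only from atomic loads
and stores with acquire/release ordering; there is no atomic RMW and no
#StoreLoad barrier on either the push or the pop path (illustrative only,
not any particular published queue):
____________________________________________________
#include <atomic>
#include <cstddef>

// Bounded single-producer/single-consumer ring buffer (sketch).
// push() may only be called from the one producer thread and pop()
// from the one consumer thread. N should be a power of two so the
// index arithmetic stays correct across counter wrap-around.
template <typename T, size_t N>
class spsc_ring
{
    T buf_[N];
    std::atomic<size_t> head_; // advanced only by the consumer
    std::atomic<size_t> tail_; // advanced only by the producer

public:
    spsc_ring() : head_(0), tail_(0) {}

    bool push(const T& value) // producer thread only
    {
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == N) return false; // full
        buf_[tail % N] = value;
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& value) // consumer thread only
    {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return false; // empty
        value = buf_[head % N];
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
};
____________________________________________________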



(And why two atomic operations? I think
I could implement a Mutex lock with a single CAS.)

Nearly all of the mutex implementations I have checked out/disassembled have
two atomic RMW operations. One for locking, and the other one for unlocking.



The memory barriers are expensive compared to any single
instruction, but in most programs, you shouldn't require the
lock very often. And they aren't expensive, or even
significant, compared to any other important calculations.

The `#StoreLoad' style memory barrier is fairly expensive. It forces a
strict ordering and basically ruins any accumulated cache that the executing
processor may have had...
 

Chris M. Thomasson

Chris M. Thomasson said:
James Kanze said:
On Nov 10, 10:26 pm, "Chris M. Thomasson" <[email protected]> wrote:
[...]
(And why two atomic operations? I think
I could implement a Mutex lock with a single CAS.)

Nearly all of the mutex implementations I have checked out/disassembled
have two atomic RMW operations. One for locking, and the other one for
unlocking.

For instance, take a look at the following code snippet from OpenSolaris:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/atomic.h
________________________________________________________
/*
 * Generic memory barrier used during lock entry, placed after the
 * memory operation that acquires the lock to guarantee that the lock
 * protects its data. No stores from after the memory barrier will
 * reach visibility, and no loads from after the barrier will be
 * resolved, before the lock acquisition reaches global visibility.
 */
extern void membar_enter(void);

/*
 * Generic memory barrier used during lock exit, placed before the
 * memory operation that releases the lock to guarantee that the lock
 * protects its data. All loads and stores issued before the barrier
 * will be resolved before the subsequent lock update reaches visibility.
 */
extern void membar_exit(void);
________________________________________________________




Now, take a look at how they are implemented on a SPARC V9:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/common/atomic/sparcv9/atomic.s#915
________________________________________________________
        ENTRY(membar_enter)
        membar  #StoreLoad|#StoreStore
        retl
        nop
        SET_SIZE(membar_enter)

        ENTRY(membar_exit)
        membar  #LoadStore|#StoreStore
        retl
        nop
        SET_SIZE(membar_exit)
________________________________________________________


That #StoreLoad ruins performance! Ouch... ;^o




They are expensive even on x86. Take a look at how they are implemented on AMD64:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/lock_prim.s#1361
________________________________________________________
#if defined(__amd64)

        ENTRY(membar_enter)
        ALTENTRY(membar_exit)
        ALTENTRY(membar_sync)
        mfence                  /* lighter weight than lock; xorq $0,(%rsp) */
        ret
        SET_SIZE(membar_sync)
        SET_SIZE(membar_exit)
        SET_SIZE(membar_enter)

        ENTRY(membar_producer)
        sfence
        ret
        SET_SIZE(membar_producer)

        ENTRY(membar_consumer)
        lfence
        ret
        SET_SIZE(membar_consumer)

#else

        ENTRY(membar_enter)
        ALTENTRY(membar_exit)
        ALTENTRY(membar_sync)
        lock
        xorl    $0, (%esp)
        ret
        SET_SIZE(membar_sync)
        SET_SIZE(membar_exit)
        SET_SIZE(membar_enter)
________________________________________________________


They seem to even put an MFENCE in `membar_exit()' as well! ;^(...


BTW, I wonder why they are actually calling `LFENCE' and `SFENCE'. Humm.
Those are only good for non-temporal memory:

http://stackoverflow.com/questions/37070/what-is-the-meaning-of-non-temporal-memory-accesses-in-x86


Humm...

[...]
 

Chris M. Thomasson

Jorgen Grahn said:
What alternatives do you mean? I think the proper question here is
more like "what useful work could I have done in the time I used to
lock and unlock mutexes"?

FWIW, you can take advantage of try_lock functionality in order to use
mutexes more efficiently in "certain cases". For instance, imagine a pool of
threads that receive messages, analyze them and then update some global
state that is protected by a single mutex. Now, how can we possibly even try
to reduce the contention on this obvious bottleneck? Well, we can attempt to
get "clever" with a simple adaptive try_lock scheme. Here is some high-level
pseudo-code that demonstrates the overall technique:

<quickly typed directly into newsreader; please forgive any typos! ;^) >
___________________________________________________________
struct message
{
    virtual void process() = 0;
};


#define CACHE_MAX 128


static std::mutex g_single_global_mutex_bottleneck; // ;^(...


void worker_thread()
{
    message* cache[CACHE_MAX];
    size_t cache_sz = 0;

    for (;;)
    {
        if (cache_sz < CACHE_MAX)
        {
            if (! (cache[cache_sz] = try_to_gather_a_message()))
            {
                if (cache_sz) goto worker_thread_lock_jump;
                cache[cache_sz++] = wait_for_a_message();
            }
            else
            {
                ++cache_sz;
            }

            if (! g_single_global_mutex_bottleneck.try_lock()) continue;
        }
        else
        {
        worker_thread_lock_jump:
            g_single_global_mutex_bottleneck.lock();
        }

        // we are within the critical section!
        // process all of the messages we have cached

        for (size_t i = 0; i < cache_sz; ++i)
        {
            cache[i]->process();
        }

        g_single_global_mutex_bottleneck.unlock();

        cache_sz = 0;
    }
}
___________________________________________________________



See what's going on here? Basically, a worker thread will gather a message
and then try to lock the mutex. If that attempt fails, it simply tries to
gather another message and then tries the lock again. It can build up to
`CACHE_MAX' messages before it actually blocks on the mutex. Here is the
rationale... Why should a worker thread block on the mutex when it can do
other useful work and try the lock again? This is basically like an adaptive
mutex, except the thread does actual work instead of spinning on a
mutex-internal state variable.


Any thoughts on the overall approach?
 

James Kanze

[...]
FWIW, mutex lock/unlock usually costs 2 atomic RMW operations
and 2 memory barriers, one of which may very well be
a #StoreLoad. All of that can be fairly expensive...
Compared to what?

Well, perhaps compared to efficient synchronization algorithms that are
based on plain atomic loads and stores for instance. Various
single-producer/single-consumer algorithms come to mind; it's basically
impossible for a mutex to outperform them... Period.

Certainly. But when is synchronization a bottleneck (except in
cases of contention)?
Nearly all of the mutex implementations I have checked
out/disassembled have two atomic RMW operations. One for
locking, and the other one for unlocking.

Oops. Yes, I misread you. I was just thinking of the lock.
You have to do the same thing when unlocking, of course.
The `#StoreLoad' style memory barrier is fairly expensive. It
forces a strict ordering and basically ruins any accumulated
cache that the executing processor may have had...

I know, but if you're invoking it that often, you probably have
a problem of granularity. (But I'm sure that there are
exceptions.)

It's not a question of the ratio of time: you can probably
synchronize twice as fast as a mutex for some uses. But the
fact remains that the total time in both cases is an
insignificant part of the time of most applications. If I'm
passing a message in a queue, who cares whether it takes 750ns,
or 1.5us, if the calculations behind it take hundreds of
microseconds?
 

Niklas Holsti

Chris said:
Also, if you don't carefully lay out the mutex data structures in memory, you
can suffer from performance-destroying false sharing... ...
Then you have to make sure that this structure is aligned on a cache line
boundary in memory. Sometimes it helps to strive to store the actual shared
data and the mutex data-structure within a single cache line. Something
like:
____________________________________________________
#define L2CACHELINE 128

struct shared_data
{
    char m_var0;
    int m_var1;
    int m_var2;
};

union SHARED_DATA_CS_PAD
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
    char l2cache_pad[L2CACHELINE];
};

I'm only a novice in C++, but it seems to me that the above union overlays
the CRITICAL_SECTION with the shared_data, which would most likely cause
a mess at run-time. Did you intend something like this:

struct lock_and_data
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
};

union SHARED_DATA_CS_PAD
{
    lock_and_data m_ld;
    char l2cache_pad[L2CACHELINE];
};
 

Chris M. Thomasson

Niklas Holsti said:
Chris M. Thomasson wrote: [...]
union SHARED_DATA_CS_PAD
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
    char l2cache_pad[L2CACHELINE];
};

I'm only a novice in C++, but it seems to me that the above union overlays
the CRITICAL_SECTION with the shared_data, which would most likely cause a
mess at run-time. Did you intend something like this:

struct lock_and_data
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
};

union SHARED_DATA_CS_PAD
{
    lock_and_data m_ld;
    char l2cache_pad[L2CACHELINE];
};

YIKES! Of course you are 100% correct!!!

:^o
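
For what it's worth, the same intent can also be expressed directly with the
C++0x alignas specifier (or a compiler-specific attribute on older
compilers); a hedged sketch, assuming a 128-byte cache line:
____________________________________________________
#include <windows.h>

#define L2CACHELINE 128

struct shared_data
{
    char m_var0;
    int m_var1;
    int m_var2;
};

// Sketch only: keep the lock and the data it protects together on one
// cache line, and keep unrelated objects off that line. sizeof() of an
// over-aligned type is rounded up to a multiple of its alignment, so no
// explicit tail padding is needed. Older compilers would need
// __declspec(align(128)) or __attribute__((aligned(128))) instead.
struct alignas(L2CACHELINE) lock_and_data
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
};
____________________________________________________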
 

Chris M. Thomasson

James Kanze said:
FWIW, mutex lock/unlock usually costs 2 atomic RMW operations
and 2 memory barriers, one of which may very well be
a #StoreLoad. All of that can be fairly expensive...
Compared to what?

Well, perhaps compared to efficient synchronization algorithms that are
based on plain atomic loads and stores for instance. Various
single-producer/single-consumer algorithms come to mind; it's basically
impossible for a mutex to outperform them... Period.

Certainly. But when is synchronization a bottleneck (except in
cases of contention)?

IMVHO, shared writes to a single location are a "bottleneck" by default; they
simply cannot scale...



Oops. Yes, I misread you. I was just thinking of the lock.
You have to do the same thing when unlocking, of course.

Indeed. ;^)



I know, but if you're invoking it that often, you probably have
a problem of granularity. (But I'm sure that there are
exceptions.)

Well, no matter how fine-grained the data-structure is, a thread will always
need to execute the `#StoreLoad' memory barrier. Think about it for a
moment. That barrier is executed regardless of whether there is any
contention or not...



It's not a question of the ratio of time: you can probably
synchronize twice as fast as a mutex for some uses.
^^^^^^^^^^^^^^

Order of magnitude... Indeed I have seen clever non-blocking synchronization
simply blow a nice lock-based solution completely out of the damn water;
bloody mess!

Ouch. :^/



But the
fact remains that the total time in both cases is an
insignificant part of the time of most applications. If I'm
passing a message in a queue, who cares whether it takes 750ns,
or 1.5us, if the calculations behind it take hundreds of
microseconds?

Fair point. Well, all of that time does add up. Take a data-structure
protected by a read/write mutex versus one that is protected by a mechanism
that allows readers to execute concurrently with writers. Okay, let's focus
on the reads. Many rw-mutex implementations I have seen use 2 atomic RMW
operations and 2 memory barriers for a read_lock/unlock pair. On the other
hand, the "exotic" reader/writer solution does not require any atomic RMW
operations or any explicit memory barriers. Here is the scenario:

10 reader threads receive search queries and execute them:
___________________________________________
void reader_thread()
{
    for (;;)
    {
        msg* m = wait_for_message();
        data_structure.search(m);
    }
}
___________________________________________


Let's say that all of the reader threads happened to get 10,000 search
requests each. Well, in the lock-based version, that would be 20,000 atomic
RMW and 20,000 memory barriers per-thread. In the non-blocking reader
version... Well, let's just say that there would be zero atomic/explicit
membars!

Keep in mind that the lock-based version would execute 200,000 atomic RMW
and 200,000 membars in _total_ to process 100,000 queries!!!! ;^/

:^o




Do the numbers make sense to you, James? What do you think?
 

Jorgen Grahn

I was thinking of various lock free algorithms, using atomic
modifications (CAS, atomic increment, etc.). These all require
some sort of fence or membar, which is all that
pthread_mutex_lock does if there is no contention.


*IF* there is a performance problem, the question is: what are
the alternatives, and will they be faster? But that's a big if.

It's common in my experience to find code that locks and unlocks
mutexes in tight loops, for no good reason at all. I.e. the
alternative is often more coarse-grained locking.

Not that I know much about any of this: *all* multithreaded code I
have worked with has been broken by design, and in the only cases
where this was successfully fixed, the fix was to remove the
multithreading, which served no purpose anyway.

I am sure you can do multithreading correctly, but I haven't seen it
myself yet.

/Jorgen
 
