STL Queue and multithreading


naveen

Hi guys,
For an STL queue, I would like to analyze the following scenarios:
1. Multiple threads writing to the queue and multiple threads reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
2. Multiple threads writing to the queue and a **single** thread reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
3. A single thread writing to the queue and a single thread (different from
the first one) reading from it. Is the mutex needed in this case
as well, or can we live without a mutex?

I would like to minimize mutex usage, so please go through the
scenarios written above.
If a mutex is needed in all the cases, can you suggest some other
data structure where messages can be written and read without locks, or
with minimal locking.
Thanks,
Naveen
 

Anthony Williams

naveen said:
For an STL queue, I would like to analyze the following scenarios:
1. Multiple threads writing to the queue and multiple threads reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
2. Multiple threads writing to the queue and a **single** thread reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
3. A single thread writing to the queue and a single thread (different from
the first one) reading from it. Is the mutex needed in this case
as well, or can we live without a mutex?

For a std::queue<T> you need a mutex in all cases. None of the C++
standard library types can be modified by multiple threads safely
without a mutex, apart from a few documented cases in the C++0x library
(e.g. std::atomic<T>).

naveen said:
If a mutex is needed in all the cases, can you suggest some other
data structure where messages can be written and read without locks, or
with minimal locking.

If you want no locks, search for "lock free queue". There are several
around, and some have been posted to this newsgroup.

For lock-free stuff, the use cases are important --- the "best"
structure for each case you outlined above is likely to be different.
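
For the mutex-based version, the idea is simply to take the lock around
every access. Here is a minimal sketch of a locking wrapper around
std::queue (illustrative only, not production code, using the C++0x
<mutex> and <condition_variable> facilities):
____________________________________________________
#include <queue>
#include <mutex>
#include <condition_variable>

// Minimal sketch of a locked queue: every access takes the mutex,
// and pop() blocks until an element is available.
template <typename T>
class locked_queue
{
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;

public:
    void push(T value)
    {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(value);
        cv_.notify_one();
    }

    T pop()
    {
        std::unique_lock<std::mutex> lock(m_);
        while (q_.empty())
            cv_.wait(lock);
        T value = q_.front();
        q_.pop();
        return value;
    }
};
____________________________________________________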

Anthony
 

Goran Pusic

Hi guys,
For an STL queue, I would like to analyze the following scenarios:
1. Multiple threads writing to the queue and multiple threads reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
2. Multiple threads writing to the queue and a **single** thread reading
from it. A mutex will be needed, since multiple threads might be
accessing the same address in the queue.
3. A single thread writing to the queue and a single thread (different from
the first one) reading from it. Is the mutex needed in this case
as well, or can we live without a mutex?

You need synchronization all the time.
I would like to minimize mutex usage

Minimize? Why? What performance numbers do you have to back that? Do you
have heavy synchronization that is hurting your speed? If so, you should
be looking for a mutex that matches your usage pattern. You should be
looking at various read/write lock implementations. You should be able
to find ready-made containers, too.

Goran.
 

Juha Nieminen

Goran Pusic said:
Minimize? Why? What performance numbers you have to back that?

It's a well-known fact that software mutexes are extremely slow and
can seriously impact performance in situations where the data container
needs to be accessed a very large number of times very quickly.
 

Fred Zwarts

Juha Nieminen said:
It's a well-known fact that software mutexes are extremely slow and
can seriously impact performance in situations where the data
container needs to be accessed a very large number of times very quickly.

Can you back this up?
My experience with POSIX mutexes is very different.
Normally the time for (un)locking a mutex is negligible compared to other processing
(except in the case where locking results in a wait for another thread,
or unlocking results in waking another thread).
 

James Kanze

It's a well-known fact that software mutexes are extremely
slow and can seriously impact performance in situations where
the data container needs to be accessed a very large number of
times very quickly.

Beware of "well known facts" that aren't backed by actual
measures. I don't know about Windows, but under Solaris, an
uncontested pthread_mutex_lock is not significantly slower than
any of the alternatives, at least in the benchmarks I've run.
From what I understand, Linux and CriticalSection under Windows
use a similar technique, avoiding a switch to system state
unless there is contention.
 

Chris M. Thomasson

Juha Nieminen said:
It's a well-known fact that software mutexes are extremely slow and
can seriously impact performance in situations where the data container
needs to be accessed a very large number of times very quickly.

FWIW, mutex lock/unlock usually costs 2 atomic RMW operations and 2 memory
barriers, one of which may very well be a #StoreLoad. All of that can be
fairly expensive...

Also, if you don't carefully lay out the mutex data structures in memory, you
can suffer from performance-destroying false sharing... For instance, it's
probably a good idea to pad a `CRITICAL_SECTION' data structure on Windows.
Something like:
____________________________________________________
#define L2CACHELINE 128

union CRITICAL_SECTION_PAD
{
    CRITICAL_SECTION m_lock;
    char l2cache_pad[L2CACHELINE];
};
____________________________________________________



Then you have to make sure that this structure is aligned on a cache line
boundary in memory. Sometimes it helps to strive to store the actual shared
data and the mutex data-structure within a single cache line. Something
like:
____________________________________________________
#define L2CACHELINE 128

struct shared_data
{
    char m_var0;
    int m_var1;
    int m_var2;
};

union SHARED_DATA_CS_PAD
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
    char l2cache_pad[L2CACHELINE];
};
____________________________________________________




Using user-space mutexes efficiently can be "tricky"...


;^)
 

Jorgen Grahn

Beware of "well known facts" that aren't backed by actual
measures. I don't know about Windows, but under Solaris, an
uncontested pthread_mutex_lock is not significantly slower than
any of the alternatives, at least in the benchmarks I've run.

What alternatives do you mean? I think the proper question here is
more like "what useful work could I have done in the time I used to
lock and unlock mutexes"?

But the answer is (you surely agree): "If it really matters to you,
benchmark".

/Jorgen
 

James Kanze

What alternatives do you mean?

I was thinking of various lock free algorithms, using atomic
modifications (CAS, atomic increment, etc.). These all require
some sort of fence or membar, which is all that
pthread_mutex_lock does if there is no contention.
I think the proper question here is
more like "what useful work could I have done in the time I used to
lock and unlock mutexes"?

*IF* there is a performance problem, the question is: what are
the alternatives, and will they be faster? But that's a big if.
But the answer is (you surely agree): "If it really matters to you,
benchmark".

Well, I'd start by assuming that it doesn't matter, until proven
otherwise. :) Even benchmarking is unnecessary extra work if
the code is fast enough when written normally. (But of course,
if it isn't, you don't do anything until you've profiled.)
 

James Kanze

FWIW, mutex lock/unlock usually costs 2 atomic RMW operations
and 2 memory barriers, one of which may very well be
a #StoreLoad. All of that can be fairly expensive...

Compared to what? (And why two atomic operations? I think
I could implement a Mutex lock with a single CAS.)

The memory barriers are expensive compared to any single
instruction, but in most programs, you shouldn't require the
lock very often. And they aren't expensive, or even
significant, compared to any other important calculations.
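
To make the single-CAS point concrete, here is a hedged sketch of a toy
spinlock whose lock path is one CAS and whose unlock is a plain release
store rather than an RMW (illustrative only; a real mutex also has to
block and wake waiters, which is why its unlock path usually needs an
atomic RMW as well):
____________________________________________________
#include <atomic>

// Toy spinlock, illustration only: lock() is a single CAS (retried
// under contention), unlock() is a release store, not an atomic RMW.
class spinlock
{
    std::atomic<bool> locked_;

public:
    spinlock() : locked_(false) {}

    void lock()
    {
        bool expected = false;
        while (!locked_.compare_exchange_weak(expected, true,
                                               std::memory_order_acquire))
        {
            expected = false; // CAS updated 'expected'; reset and retry
        }
    }

    void unlock()
    {
        locked_.store(false, std::memory_order_release);
    }
};
____________________________________________________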
 

Chris M. Thomasson

James Kanze said:
[...]
FWIW, mutex lock/unlock usually costs 2 atomic RMW operations
and 2 memory barriers, one of which may very well be
a #StoreLoad. All of that can be fairly expensive...

Compared to what?

Well, perhaps compared to efficient synchronization algorithms that are
based on plain atomic loads and stores for instance. Various
single-producer/single-consumer algorithms come to mind; it's basically
impossible for a mutex to outperform them... Period.
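
To make the single-producer/single-consumer claim concrete, here is a
hedged sketch of a bounded SPSC ring buffer built only from atomic loads
and stores with acquire/release ordering; there is no atomic RMW and no
#StoreLoad barrier on either the push or the pop path (illustrative only,
not any particular published queue):
____________________________________________________
#include <atomic>
#include <cstddef>

// Bounded single-producer/single-consumer ring buffer (sketch).
// push() may only be called from the one producer thread and pop()
// from the one consumer thread. N should be a power of two so the
// index arithmetic stays correct across counter wrap-around.
template <typename T, size_t N>
class spsc_ring
{
    T buf_[N];
    std::atomic<size_t> head_; // advanced only by the consumer
    std::atomic<size_t> tail_; // advanced only by the producer

public:
    spsc_ring() : head_(0), tail_(0) {}

    bool push(const T& value) // producer thread only
    {
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == N) return false; // full
        buf_[tail % N] = value;
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& value) // consumer thread only
    {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return false; // empty
        value = buf_[head % N];
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
};
____________________________________________________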



(And why two atomic operations? I think
I could implement a Mutex lock with a single CAS.)

Nearly all of the mutex implementations I have checked out/disassembled have
two atomic RMW operations. One for locking, and the other one for unlocking.



The memory barriers are expensive compared to any single
instruction, but in most programs, you shouldn't require the
lock very often. And they aren't expensive, or even
significant, compared to any other important calculations.

The `#StoreLoad' style memory barrier is fairly expensive. It forces a
strict ordering and basically ruins any accumulated cache that the executing
processor may have had...
 

Chris M. Thomasson

Chris M. Thomasson said:
James Kanze said:
On Nov 10, 10:26 pm, "Chris M. Thomasson" <[email protected]> wrote:
[...]
(And why two atomic operations? I think
I could implement a Mutex lock with a single CAS.)

Nearly all of the mutex implementations I have checked out/disassembled
have two atomic RMW operations. One for locking, and the other one for
unlocking.

For instance, take a look at the following code snippet from OpenSolaris:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/atomic.h
________________________________________________________
/*
 * Generic memory barrier used during lock entry, placed after the
 * memory operation that acquires the lock to guarantee that the lock
 * protects its data. No stores from after the memory barrier will
 * reach visibility, and no loads from after the barrier will be
 * resolved, before the lock acquisition reaches global visibility.
 */
extern void membar_enter(void);

/*
 * Generic memory barrier used during lock exit, placed before the
 * memory operation that releases the lock to guarantee that the lock
 * protects its data. All loads and stores issued before the barrier
 * will be resolved before the subsequent lock update reaches visibility.
 */
extern void membar_exit(void);
________________________________________________________




Now, take a look at how they are implemented on a SPARC V9:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/common/atomic/sparcv9/atomic.s#915
________________________________________________________
        ENTRY(membar_enter)
        membar  #StoreLoad|#StoreStore
        retl
        nop
        SET_SIZE(membar_enter)

        ENTRY(membar_exit)
        membar  #LoadStore|#StoreStore
        retl
        nop
        SET_SIZE(membar_exit)
________________________________________________________


That #StoreLoad ruins performance! Ouch... ;^o




They are expensive even on x86. Take a look at how they are implemented on AMD64:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/lock_prim.s#1361
________________________________________________________
#if defined(__amd64)

        ENTRY(membar_enter)
        ALTENTRY(membar_exit)
        ALTENTRY(membar_sync)
        mfence                  /* lighter weight than lock; xorq $0,(%rsp) */
        ret
        SET_SIZE(membar_sync)
        SET_SIZE(membar_exit)
        SET_SIZE(membar_enter)

        ENTRY(membar_producer)
        sfence
        ret
        SET_SIZE(membar_producer)

        ENTRY(membar_consumer)
        lfence
        ret
        SET_SIZE(membar_consumer)

#else

        ENTRY(membar_enter)
        ALTENTRY(membar_exit)
        ALTENTRY(membar_sync)
        lock
        xorl    $0, (%esp)
        ret
        SET_SIZE(membar_sync)
        SET_SIZE(membar_exit)
        SET_SIZE(membar_enter)
________________________________________________________


They seem to even put an MFENCE in `membar_exit()' as well! ;^(...


BTW, I wonder why they are actually calling `LFENCE' and `SFENCE'. Humm.
Those are only good for non-temporal memory:

http://stackoverflow.com/questions/37070/what-is-the-meaning-of-non-temporal-memory-accesses-in-x86


Humm...

[...]
 

Chris M. Thomasson

Jorgen Grahn said:
What alternatives do you mean? I think the proper question here is
more like "what useful work could I have done in the time I used to
lock and unlock mutexes"?

FWIW, you can take advantage of try_lock functionality in order to use
mutexes more efficiently in "certain cases". For instance, imagine a pool of
threads that receive messages, analyze them and then update some global
state that is protected by a single mutex. Now, how can we possibly even try
to reduce the contention on this obvious bottleneck? Well, we can attempt to
get "clever" with a simple adaptive try_lock scheme. Here is some high-level
pseudo-code that demonstrates the overall technique:

<quickly typed directly into newsreader; please forgive any typos! ;^) >
___________________________________________________________
struct message
{
    virtual void process() = 0;
};


#define CACHE_MAX 128


static std::mutex g_single_global_mutex_bottleneck; // ;^(...


void worker_thread()
{
    message* cache[CACHE_MAX];
    size_t cache_sz = 0;

    for (;;)
    {
        if (cache_sz < CACHE_MAX)
        {
            if (! (cache[cache_sz] = try_to_gather_a_message()))
            {
                if (cache_sz) goto worker_thread_lock_jump;
                cache[cache_sz++] = wait_for_a_message();
            }
            else
            {
                ++cache_sz;
            }

            if (! g_single_global_mutex_bottleneck.try_lock()) continue;
        }
        else
        {
        worker_thread_lock_jump:
            g_single_global_mutex_bottleneck.lock();
        }

        // we are within the critical section!
        // process all of the messages we have cached

        for (size_t i = 0; i < cache_sz; ++i)
        {
            cache[i]->process();
        }

        g_single_global_mutex_bottleneck.unlock();

        cache_sz = 0;
    }
}
___________________________________________________________



See what's going on here? Basically, a worker thread will gather a message
and then try to lock the mutex. If that attempt fails, it simply tries to
gather another message and then tries the lock again. It can build up to
`CACHE_MAX' messages before it actually blocks on the mutex. Here is the
rationale... Why should a worker thread block on the mutex when it can do
other useful work and try the lock again? This is basically like an adaptive
mutex, except the thread does actual work instead of spinning on a
mutex-internal state variable.


Any thoughts on the overall approach?
 

James Kanze

[...]
FWIW, mutex lock/unlock usually costs 2 atomic RMW operations
and 2 memory barriers, one of which may very well be
a #StoreLoad. All of that can be fairly expensive...
Compared to what?

Well, perhaps compared to efficient synchronization algorithms that are
based on plain atomic loads and stores for instance. Various
single-producer/single-consumer algorithms come to mind; it's basically
impossible for a mutex to outperform them... Period.

Certainly. But when is synchronization a bottleneck (except in
cases of contention)?
Nearly all of the mutex implementations I have checked
out/disassembled have two atomic RMW operations. One for
locking, and the other one for unlocking.

Oops. Yes, I misread you. I was just thinking of the lock.
You have to do the same thing when unlocking, of course.
The `#StoreLoad' style memory barrier is fairly expensive. It
forces a strict ordering and basically ruins any accumulated
cache that the executing processor may have had...

I know, but if you're invoking it that often, you probably have
a problem of granularity. (But I'm sure that there are
exceptions.)

It's not a question of the ratio of time: you can probably
synchronize twice as fast as a mutex for some uses. But the
fact remains that the total time in both cases is an
insignificant part of the time of most applications. If I'm
passing a message in a queue, who cares whether it takes 750ns,
or 1.5us, if the calculations behind it take hundreds of
microseconds?
 

Niklas Holsti

Chris said:
Also, if you don't carefully lay out the mutex data structures in memory, you
can suffer from performance-destroying false sharing... ...
Then you have to make sure that this structure is aligned on a cache line
boundary in memory. Sometimes it helps to strive to store the actual shared
data and the mutex data-structure within a single cache line. Something
like:
____________________________________________________
#define L2CACHELINE 128

struct shared_data
{
    char m_var0;
    int m_var1;
    int m_var2;
};

union SHARED_DATA_CS_PAD
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
    char l2cache_pad[L2CACHELINE];
};

I'm only a novice in C++, but it seems to me that the above union overlays
the CRITICAL_SECTION with the shared_data, which would most likely cause
a mess at run-time. Did you intend something like this:

struct lock_and_data
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
};

union SHARED_DATA_CS_PAD
{
    lock_and_data m_ld;
    char l2cache_pad[L2CACHELINE];
};
 

Chris M. Thomasson

Niklas Holsti said:
Chris M. Thomasson wrote: [...]
union SHARED_DATA_CS_PAD
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
    char l2cache_pad[L2CACHELINE];
};

I'm only a novice in C++, but it seems to me that the above union overlays
the CRITICAL_SECTION with the shared_data, which would most likely cause a
mess at run-time. Did you intend something like this:

struct lock_and_data
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
};

union SHARED_DATA_CS_PAD
{
    lock_and_data m_ld;
    char l2cache_pad[L2CACHELINE];
};

YIKES! Of course you are 100% correct!!!

:^o
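
For what it's worth, the same intent can also be expressed directly with the
C++0x alignas specifier (or a compiler-specific attribute on older
compilers); a hedged sketch, assuming a 128-byte cache line:
____________________________________________________
#include <windows.h>

#define L2CACHELINE 128

struct shared_data
{
    char m_var0;
    int m_var1;
    int m_var2;
};

// Sketch only: keep the lock and the data it protects together on one
// cache line, and keep unrelated objects off that line. sizeof() of an
// over-aligned type is rounded up to a multiple of its alignment, so no
// explicit tail padding is needed. Older compilers would need
// __declspec(align(128)) or __attribute__((aligned(128))) instead.
struct alignas(L2CACHELINE) lock_and_data
{
    CRITICAL_SECTION m_lock;
    shared_data m_shared;
};
____________________________________________________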
 

Chris M. Thomasson

James Kanze said:
FWIW, mutex lock/unlock usually costs 2 atomic RMW operations
and 2 memory barriers, one of which may very well be
a #StoreLoad. All of that can be fairly expensive...
Compared to what?

Well, perhaps compared to efficient synchronization algorithms that are
based on plain atomic loads and stores for instance. Various
single-producer/single-consumer algorithms come to mind; it's basically
impossible for a mutex to outperform them... Period.

Certainly. But when is synchronization a bottleneck (except in
cases of contention)?

IMVHO, shared writes to a single location are a "bottleneck" by default; they
simply cannot scale...



Oops. Yes, I misread you. I was just thinking of the lock.
You have to do the same thing when unlocking, of course.

Indeed. ;^)



I know, but if you're invoking it that often, you probably have
a problem of granularity. (But I'm sure that there are
exceptions.)

Well, no matter how fine-grained the data-structure is, a thread will always
need to execute the `#StoreLoad' memory barrier. Think about it for a
moment. That barrier is executed regardless of whether there is any
contention or not...



It's not a question of the ratio of time: you can probably
synchronize twice as fast as a mutex for some uses.
^^^^^^^^^^^^^^

Order of magnitude... Indeed I have seen clever non-blocking synchronization
simply blow a nice lock-based solution completely out of the damn water;
bloody mess!

Ouch. :^/



But the
fact remains that the total time in both cases is an
insignificant part of the time of most applications. If I'm
passing a message in a queue, who cares whether it takes 750ns,
or 1.5us, if the calculations behind it take hundreds of
microseconds?

Fair point. Well, all of that time does add up. Take a data-structure
protected by a read/write mutex versus one that is protected by a mechanism
that allows readers to execute concurrently with writers. Okay, let's focus
on the reads. Many rw-mutex implementations I have seen use 2 atomic RMW
operations and 2 memory barriers for a read_lock/unlock pair. On the other
hand, the "exotic" reader/writer solution does not require any atomic RMW
operations or any explicit memory barriers. Here is the scenario:

10 reader threads receive search queries and execute them:
___________________________________________
void reader_thread()
{
    for (;;)
    {
        msg* m = wait_for_message();
        data_structure.search(m);
    }
}
___________________________________________


Let's say that all of the reader threads happened to get 10,000 search
requests each. Well, in the lock-based version, that would be 20,000 atomic
RMW and 20,000 memory barriers per-thread. In the non-blocking reader
version... Well, let's just say that there would be zero atomic/explicit
membars!

Keep in mind that the lock-based version would execute 200,000 atomic RMW
and 200,000 membars in _total_ to process 100,000 queries!!!! ;^/

:^o




Do the numbers make sense to you, James? What do you think?
 

Jorgen Grahn

I was thinking of various lock free algorithms, using atomic
modifications (CAS, atomic increment, etc.). These all require
some sort of fence or membar, which is all that
pthread_mutex_lock does if there is no contention.


*IF* there is a performance problem, the question is: what are
the alternatives, and will they be faster? But that's a big if.

It's common in my experience to find code that locks and unlocks
mutexes in tight loops, for no good reason at all. I.e. the
alternative is often more coarse-grained locking.

Not that I know much about any of this: *all* multithreaded code I
have worked with has been broken by design, and in the only cases
where this was successfully fixed, the fix was to remove the
multithreading, which served no purpose anyway.

I am sure you can do multithreading correctly, but I haven't seen it
myself yet.

/Jorgen
 
