why boost:shared_ptr so slower?

James Kanze

On Aug 23, 2:25 pm, James Kanze <[email protected]> wrote:

[...]
However, and I think we agree on this, it is a sign at least
that we should be a bit more careful than just claiming a class
is "thread-safe".

Totally agreed. I don't use the word in a positive sense in my
class documentation---I either specify a precise contract, or
state that it is not thread safe (which means that I don't make
any guarantees whatsoever). On the other hand, we do need a
term for classes which can be used in multi-threaded
environments; i.e. which specify what you can and cannot do, and
what precautions you have to take. Boost::shared_ptr can
definitely be used in multithreaded environments, provided you
respect its contract---the implementation of std::string in g++
2.95.2 couldn't, because it didn't specify any contract (and
even something as simple as constructing a new string object
involved incrementing and decrementing a global counter).

[...]
Yes, I agree with that entirely and that really is the crux of
my point. And specifically one should not go around saying boost
shared_ptr is just plain unqualified "thread-safe".

It's partially a question of audience: it's generally clearer to
state something along the lines of "it defines a contract for
multithreaded use". But otherwise: what would you suggest that is
clear and concise to inform a potential user that it can be used
in a multi-threaded environment? ("Thread-aware"?)
 
James Kanze

(Now, the GCC manual is not as clear as I'd like, so I'm not
the most comfortable posting this, but I think I'm right.
Correct me if I'm wrong.)

I think you're right.
I'm not sure if you're misspeaking or actually
misunderstanding. When the term "barrier" is used in the
context of threading, the results are not immediately visible
to other threads, nor even necessarily visible after the next
matching barrier. Barriers provide only conditional visibility. Ex:
// static init
int a = 0;
int b = 0;

int main()
{
    // start thread 1 (runs the "thread 1" code below)
    // start thread 2 (runs the "thread 2" code below)
}

// thread 1
a = 1;
write_barrier();   // hypothetical write barrier
b = 2;

// thread 2
cout << b << " ";
read_barrier();    // hypothetical read barrier
cout << a << endl;
Without the barriers, you may see any of the four possible
outputs:
0 0
0 1
2 0
2 1
With the barriers in place, only one possible output is removed
("2 0"), leaving three possibilities:
0 0
0 1
2 1
The definition of visibility semantics is effectively: "If a read
before a read_barrier sees a write after a write_barrier, then all
reads after that read_barrier see all writes before that
write_barrier."
To nitpick your quote:
It is not the case that the write will be immediately visible
to all other threads. Moreover, even if the other thread
executes the correct barrier instruction(s), that write may
still not be visible.

That's a very important point. Without going into details,
there is nothing thread a can do which will force thread b to
see the change. When we say that the barrier makes the change
immediately visible to all other threads, we really mean "to all
other threads that want to see it, and take the necessary steps
to do so". It takes a collaborative effort.
If you want guaranteed visibility, use mutexes. However, even
a mutex in one thread does not guarantee that the write
becomes immediately visible to other threads. The other
threads still need to execute the matching "mutex lock"
instruction(s).

Barriers, used correctly, will also work, but as you say, they
need to be present in all concerned threads.
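
In C++11 terms (again an anachronism here), the collaborative effort
might look like the following sketch; both threads must use the same
mutex, the writer's unlock publishing the write and the reader's lock
acquiring it:

#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
int shared_value = 0;

void writer() {
    std::lock_guard<std::mutex> lock(m);
    shared_value = 42; // published by the unlock at end of scope
}

void reader() {
    std::lock_guard<std::mutex> lock(m);
    // the lock above acquires whatever the last unlock published
    std::cout << shared_value << std::endl; // 0 or 42, never garbage
}

int main() {
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
}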
 
Joshua Maurice

Barriers, used correctly, will also work, but as you say, they
need to be present in all concerned threads.

Actually, I was thinking about this more, and I was mistaken
(partially) in my first post. To accomplish atomic increment, you need
mutex-like acquire and release semantics, the semantics people are
used to from posix mutexes. This is generally different from read and
write memory barriers, or fences, as I've seen the terms discussed in
context of the linux kernel (and other places). Acquire and release
semantics imply a global order on the acquiring and releasing,
necessary for atomic increment. There is no such global ordering for
read and write memory barriers as defined by the linux kernel docs,
which only offer conditional visibility as I outlined in my first
post.

- If a read before a read_barrier sees a write after a write_barrier,
then all reads after that read_barrier see all writes before that
write_barrier.
- All (mutex) acquire and release operations (of a particular mutex)
are globally ordered across all threads in a single program. All
writes which occurred in a thread before a release are seen by all
threads after an acquire which followed the release.

That is, two threads could each execute their read/write barriers
chronologically after some other thread's write, and yet one may see
that write while the other does not. With acquire and release
semantics, if the acquire/release happens after the write, then the
thread must see the write. Conditional vs. guaranteed visibility.
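
In C++11 terms, this global ordering is what a read-modify-write
operation such as fetch_add provides: all modifications of a single
atomic object form one total order, so no increment is ever lost. A
sketch:

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> counter{0};

void worker() {
    for (int i = 0; i < 100000; ++i)
        counter.fetch_add(1, std::memory_order_acq_rel); // globally ordered
}

int main() {
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    std::cout << counter.load() << std::endl; // always 200000
}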

I thought that barrier and fence in the literature referred to the
conditional visibility kind, not the guaranteed visibility kind of
acquire and release semantics. Apparently the puddle is murkier than I
thought. That's quite confusing and annoying.

PS: to add to the quagmire of this thread, I feel that the term
"thread-safe" when applied to a class or interface without context and
additional explanation material is like calling a solution "bad", or
"good", or "correct". All are overly vague, and as a consequence,
mostly devoid of useful content.
 
Keith H Duggar

It's partially a question of audience: it's generally clearer to
state something along the lines of "it defines a contract for
multithreaded use". But otherwise: what would you suggest that is
clear and concise to inform a potential user that it can be used
in a multi-threaded environment? ("Thread-aware"?)

Personally I think Bloch's terminology is a good place to start.
So in these cases "conditionally thread-safe" (and when external
requirements are greater "thread compatible").

The problem I have with "thread-aware" is that it reminds one of
"cache-aware" used (as you probably know) to describe algorithms
that optimize cache performance. So if someone were to call some
class (or function) "thread-aware" I would expect it might spawn
or otherwise utilize additional (perhaps some optimal number of)
threads to perform its work concurrently.

And thanks again for the POSIX lessons ;-)

KHD
 
Noah Roberts

Pete said:
Um, there's another qualifier that got left out:

In a multiple processor system, *when shared_ptr objects
are shared between threads*, avoiding mutations to the
reference count does give you significant gains indeed.

And therein lies a great debate about whether shared_ptr should be
sharable between threads. But that's different from the argument that
objects of class type should always be passed by reference, which is
where this particular subthread started.

LOL!!! Thanks. It's been a hard morning but reading this made me smile.
 
Chris M. Thomasson

Actually, I was thinking about this more, and I was mistaken
(partially) in my first post. To accomplish atomic increment, you need
mutex-like acquire and release semantics, the semantics people are
used to from posix mutexes.

WRT a reference counting algorithm with basic thread-safety, the
atomic increment needs acquire semantics, the decrement needs release
semantics, and an acquire is issued when the count drops to zero:
________________________________________________________
void inc() {
    ATOMIC_FAA(&m_count, 1); // atomic fetch-and-add

    MEMBAR #LoadStore | #LoadLoad; // acquire
}


void dec() {
    MEMBAR #LoadStore | #StoreStore; // release

    if (ATOMIC_FAA(&m_count, -1) == 1)
    {
        MEMBAR #LoadStore | #LoadLoad; // acquire

        // reclaim managed object
    }
}
________________________________________________________

You can even do the following in the `dec()' procedure, because this
algorithm only supports the basic thread-safety level:
________________________________________________________
void dec() {
    MEMBAR #LoadStore | #StoreStore; // release

    unsigned count = ATOMIC_LOAD(&m_count);

    // if we hold the only reference, skip the atomic decrement
    if (count == 1 || ATOMIC_FAA(&m_count, -1) == 1)
    {
        MEMBAR #LoadStore | #LoadLoad; // acquire

        // reclaim managed object
    }
}
________________________________________________________
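
For what it's worth, a C++11 translation of the first variant might
look like the following sketch (std::atomic postdates this post; the
memory orders mirror the MEMBARs above, and `ref_counted' is a
hypothetical name):
________________________________________________________
#include <atomic>

struct ref_counted {
    std::atomic<unsigned> m_count{1};

    void inc() {
        // ATOMIC_FAA + #LoadStore|#LoadLoad ~= fetch_add with acquire
        m_count.fetch_add(1, std::memory_order_acquire);
    }

    void dec() {
        // #LoadStore|#StoreStore before the FAA ~= release on the decrement
        if (m_count.fetch_sub(1, std::memory_order_release) == 1) {
            // #LoadStore|#LoadLoad before reclaiming ~= acquire fence
            std::atomic_thread_fence(std::memory_order_acquire);
            // reclaim managed object
        }
    }
};
________________________________________________________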
 
Juha Nieminen

Sam said:
So, if I understand you correctly, it makes more sense to you to blindly
forge ahead, corrupting all memory in sight, rather than having a
well-defined failure case, such as throwing an exception or dumping core
at the exact point of failure, rather than some time later when your
memory is hopelessly corrupted and the root cause may not be immediately
apparent?

It depends. Do you always turn on boundary checks for all STL
containers (if supported by your compiler; many compilers do)? Do you
always use the at() method instead of operator[]? Do you always make the
destructor of every single class and struct virtual? Do you always
inherit virtually?

If your answer to these questions was "yes", then in that context it
might make sense to always use dynamic_cast even if a static_cast would
do. You know, just in case.
 
James Kanze

On Aug 25, 6:01 am, James Kanze <[email protected]> wrote:

[...]
Personally I think Bloch's terminology is a good place to
start. So in these cases "conditionally thread-safe" (and
when external requirements are greater "thread compatible").

Thread compatible sounds good to me.
The problem I have with "thread-aware" is that it reminds one
of "cache-aware" used (as you probably know) to describe
algorithms that optimize cache performance. So if someone were
to call some class (or function) "thread-aware" I would expect
it might spawn or otherwise utilize additional (perhaps some
optimal number of) threads to perform its work concurrently.

OK. It was just the first thing which came to mind. (I've been
using "thread-safe" in this context, but paying attention to my
audience, so as not to say anything to someone who is more or
less naïve in this respect.)
 
James Kanze

Actually, I was thinking about this more, and I was mistaken
(partially) in my first post. To accomplish atomic increment,
you need mutex-like acquire and release semantics, the
semantics people are used to from posix mutexes.

And what do you think the implementations of Posix mutexes use
to provide those semantics? The implementations I know do so
without executing protected code (if there is no conflict); they
pretty much do an atomic increment or an atomic exchange at the
user level, and the same machine instructions are available to
any function willing to use a bit of assembler.
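
As a toy illustration of that point (a sketch, not how any real
pthread implementation is written), an uncontended lock can be
nothing more than a user-level atomic exchange:

#include <atomic>

class spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // one atomic exchange in user space when uncontended
        while (locked.exchange(true, std::memory_order_acquire))
            ; // spin (a real mutex would fall back to the kernel here)
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};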
This is generally different from read and write memory
barriers, or fences, as I've seen the terms discussed in
context of the linux kernel (and other places). Acquire and
release semantics imply a global order on the acquiring and
releasing, necessary for atomic increment. There is no such
global ordering for read and write memory barriers as defined
by the linux kernel docs, which only offer conditional
visibility as I outlined in my first post.

I'm not familiar with the Linux kernel documents, but I do know
what the membar instruction does on a Sparc architecture. I
also know that on a Sparc, Solaris doesn't enter kernel mode
in the pthread_mutex_t operations unless there is contention; I believe
that this is true under Linux as well. (I've been told it is,
but I've not actually verified it myself.) Which means that
there are some sort of fence or barrier instructions which can
be used in non-privileged mode to ensure the desired acquire
and release semantics.

[...]
PS: to add to the quagmire of this thread, I feel that the
term "thread-safe" when applied to a class or interface
without context and additional explanation material is like
calling a solution "bad", or "good", or "correct". All are
overly vague, and as a consequence, mostly devoid of useful
content.

Given that experienced people seem to disagree as to what it
means exactly, that's probably the case. With the difference
that with words like "bad" or "good", the vagueness is patent,
whereas with "thread-safe", everyone thinks they know what it
means.
 
