How to make this program more efficient?


courpron

The question is whether or not those are always a problem. They are a
problem if determinism is required but determinism is not always required,
e.g. when caching the results of pure computations.
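The caching case mentioned above can be sketched concretely. This is only an illustration with hypothetical names: a single-slot cache for a pure function, where losing an update in a race is harmless because any thread that misses simply recomputes the same value. Reclamation of replaced entries is deliberately omitted.

```cpp
#include <atomic>

static int square(int x) { return x * x; }  // the pure computation (hypothetical)

struct Memo { int arg; int result; };
static std::atomic<Memo*> g_cache{nullptr};

int cached_square(int x) {
    Memo* m = g_cache.load(std::memory_order_acquire);
    if (m && m->arg == x)
        return m->result;                   // hit: no lock, no atomic RMW
    int r = square(x);
    // Release store publishes the fully built entry. A concurrent
    // store may win the race and overwrite ours; for a pure function
    // that is non-deterministic but never wrong.
    g_cache.store(new Memo{x, r}, std::memory_order_release);
    return r;
}
```

The point of the sketch: determinism is lost (which entry survives depends on timing), but correctness is not.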


I said you get the ordering problem and the caching problem on most SMP
architectures (i.e. they are inherent problems of SMP architectures); I
didn't say that you get those problems with all algorithms. In fact, I
can write a simple algorithm where no synchronization at all is needed
because the shared object is accessed (read and written) concurrently
but never used.
Back to the topic: without any specific details about the
architecture, RCU, by default, needs memory barriers to access the
pointer.

Alexandre Courpron.
 

Chris M. Thomasson

[comp.programming.threads added]


If there's any thread modifying the object, then all accesses
must be synchronized. Period.

Did you know that reader threads can access data structures without sync?
Or perhaps, without any explicit sync whatsoever? No atomic RMW, no membars...
Think about it. Virtually zero overhead for reader threads indeed. Sounds too
good to be true? Why? Perhaps I can clarify.



Given the usage pattern, it may
be more interesting to use a rwlock than a mutex, but this
really depends on how long the readers hold the mutex, and how
many of them there are.

Sure. For read-mostly access patterns, WHY use a rwmutex? Did you know that
a rwmutex can be totally suboptimal even in the presence of concurrent
mutators?

[...]
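What Chris is alluding to can be sketched as follows (an assumption on my part, restated with C++ atomics and hypothetical names): a mutable atomic pointer to an immutable snapshot. Readers perform a single acquire load -- no lock, no atomic RMW, and on most architectures (x86, SPARC TSO) no fence instruction at all.

```cpp
#include <atomic>

struct Snapshot {
    int threshold;
    int limit;
};

static std::atomic<const Snapshot*> g_snap{nullptr};

// Writer: fully initialize the snapshot, then publish it. The
// release store keeps the initialization ordered before the swap.
void publish(const Snapshot* fresh) {
    g_snap.store(fresh, std::memory_order_release);
}

// Reader: virtually zero overhead, no explicit sync.
const Snapshot* current() {
    return g_snap.load(std::memory_order_acquire);
}
```

The hard part this sketch omits is reclaiming old snapshots while readers may still hold them; that is exactly what RCU-style deferred reclamation provides.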
 

Chris M. Thomasson

I said you get the ordering problem and the caching problem on most SMP
architectures (i.e. they are inherent problems of SMP architectures); I
didn't say that you get those problems with all algorithms. In fact, I
can write a simple algorithm where no synchronization at all is needed
because the shared object is accessed (read and written) concurrently
but never used.
Back to the topic: without any specific details about the
architecture, RCU, by default, needs memory barriers to access the
pointer.

RCU, from the point of view of reader threads, does NOT need explicit membars
to access and/or dereference the pointer on virtually all major archs out
there; DEC Alpha aside for a moment. Dependent load barrier, anyone?

;^D
 

courpron

RCU, from the point of view of reader threads, does NOT need explicit membars
to access and/or dereference the pointer on virtually all major archs out
there; DEC Alpha aside for a moment. Dependent load barrier, anyone?

Or on any non-cache-coherent (non-CC) architecture. As I said, *without any
specific details about the architecture*, RCU needs, by default, membars
for reads and writes. However, for read accesses, membars should indeed
not be needed on most common architectures.

Alexandre Courpron.
 

James Kanze

[comp.programming.threads added]
Did you know that reader threads can access data-structures
without sync?

Not according to Posix, and not on many machines. Reading
without a sync will probably give you some value, but not
necessarily the last written.
Or perhaps, without any explicit sync whatsoever? No atomic RMW,
no membars... Think about it. Virtually zero overhead for reader
threads indeed. Sounds too good to be true? Why? Perhaps I can
clarify.
Sure. For read-mostly access patterns, WHY use a rwmutex? Did
you know that rwmutex can be totally suboptimal even in the
presence of concurrent mutators?

And? A rwlock, even when acquired for just reading, ensures
memory synchronization.
 

James Kanze


[...]
Interestingly, the layout I'm talking about will generally work
correctly even on an architecture with virtually _no_ support for atomic
operations -- modifying one bit in a word will normally be atomic
regardless, simply because it's essentially impossible to spread a write
of a single bit over more than one clock cycle.

Writing a single bit is NOT atomic on any of the machines I'm
familiar with. Because it is essentially impossible to write just a
single bit, writing a single bit is implemented as
read/modify/write.
 

peter koch

You are assuming that it is necessary to get the most recent value when, in
fact, it is not.
So you are describing a situation where a stale computation (or
no computation at all) is satisfactory. You are also describing a
situation where the result must be quite small: representable in
one word. If not, one could imagine one part of the result
representing a recent computation while another part represents an
old one.

I imagine there might be some scenarios where this is okay, but
it must be a very minor niche that fulfills these properties.

/Peter
 

Jerry Coffin

[...]
Interestingly, the layout I'm talking about will generally work
correctly even on an architecture with virtually _no_ support for atomic
operations -- modifying one bit in a word will normally be atomic
regardless, simply because it's essentially impossible to spread a write
of a single bit over more than one clock cycle.

Writing a single bit is NOT atomic on any of the machines I'm
familiar with. Because it is essentially impossible to write just a
single bit, writing a single bit is implemented as
read/modify/write.

Sorry, poor wording on my part. If you tried to change only that bit,
it wouldn't be atomic -- but here you're writing an entire word, of
which only one bit ever changes. Even on a machine where writing the
word isn't atomic (e.g. a 64-bit word on a 32-bit machine), only the
one bit ever changes, so the write to the minimum storage unit
containing it ends up atomic. In that example, the word is written in
two 32-bit pieces, but one of those pieces never changes the value;
the only part that matters is the single bit that actually changes,
and that bit is always contained in a single unit of the size the
processor can write atomically.
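Jerry's single-writer flag can be restated with C++ atomics (a sketch on my part; the original discussion predates them, where the same effect relied on architectural guarantees). One thread owns the word and rewrites it with exactly one bit changed; readers just load the whole word.

```cpp
#include <atomic>
#include <cstdint>

static std::atomic<std::uint32_t> g_word{0};

constexpr std::uint32_t READY_BIT = 1u;

// Writer (single thread): flip one bit by storing the whole word.
void set_ready() {
    std::uint32_t v = g_word.load(std::memory_order_relaxed);
    g_word.store(v | READY_BIT, std::memory_order_release);
}

// Reader: sees either the old word or the new one, never a torn mix.
bool is_ready() {
    return (g_word.load(std::memory_order_acquire) & READY_BIT) != 0;
}
```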
 

peter koch

No. The one word can be a pointer to an arbitrarily-large data structure.
The pointer is mutable and the data structure is typically persistent and
immutable.


On the contrary, there are a huge number of applications ranging from
high-performance numerics in scientific computing to ordinary databases.

For example, if I post a blog comment just as someone downloads the blog
article, it makes no practical difference whether or not they see the latest
comment. The only thing that matters is that nobody gets corrupted data
(e.g. half of the old pointer and half of the new pointer), which requires
an atomic word-size write but not locks, and that updates propagate
sufficiently quickly in practice.

But I do not see how you avoid the problem that half of the result is
propagated to the other thread along with the new pointer value, while
the other half of the result is not?

struct result
{
int value;
int res;
};

result result_vec[2] = { {1, 1}, {0, 0} };

// global
result* pres = &result_vec[0];

// thread 1:
result temp = { 2, func(2) };
result_vec[1] = temp;
pres = result_vec + 1;

// thread 2:
result* local = pres;
std::cout << "Result: " << local->value << ", " << local->res;

What prevents thread 2 from outputting, e.g., { 2, 0 }?

/Peter
 

peter koch

But I do not see how you avoid the problem that half of the result is
propagated to the other thread along with the new pointer value, while
the other half of the result is not?
struct result
{
   int value;
   int res;
};
result result_vec[2] = { {1, 1}, {0, 0} };
// global
result* pres = &result_vec[0];
// thread 1:
result temp = { 2, func(2) };
result_vec[1] = temp;
pres = result_vec + 1;
// thread 2:
result* local = pres;
std::cout << "Result: " << local->value << ", " << local->res;
What prevents thread 2 from outputting, e.g., { 2, 0 }?

You are representing the value as an unboxed mutable struct. You need to
represent it as a mutable pointer to an immutable data structure:

  struct result { const int value, res; };

To update, you write a single "result *".

Immutable data structures can be shared between threads safely. However,
they cannot be implemented efficiently without a performant run-time
(allocator and GC), so this is not feasible in C++.
I still don't get it. Are you telling me that if we changed to your
structure definition and my thread 1 code to

// thread 1:
result temp = { 2, func(2) };
pres = &temp;

(fulfilling both of your requirements), then thread 2 will always
have a consistent result - without any kind of synchronisation?
I say it will not, and that you need a memory barrier when writing the
pointer (and nothing more).

/Peter
 

courpron

[...]
I say you will not and that you need a memory barrier writing the
pointer (and nothing more).

Assuming you mean "deterministic" when you say "consistent", I think you are
wrong. Even if both the reader and writer lock the pointer (which is
pointless if the operations are already atomic) there is still a race
condition. Specifically, the reader (thread 2) might read either the old or
new value.

You are still confusing locks with memory barriers.

You have to take care of the memory ordering problem, at least on
SMP architectures. Even on a uniprocessor architecture, instructions
can be reordered statically, by the compiler. You use memory barriers
to deal with these problems. In the case of static reordering, the
compiler will see the memory barriers and take them into account (if
it doesn't, then I don't consider that compiler to handle
multithreading).
[...]
void set(int n, int m)
{
  struct pair * volatile t = (struct pair *)malloc(sizeof(struct pair));
  t->i = n;
  t->j = m;
  /* Atomic pointer write. */
#ifdef LOCKING
   pthread_mutex_lock(&mutex);
#endif
  global = t;
#ifdef LOCKING
   pthread_mutex_unlock(&mutex);
#endif

}
[...]

In your set function, without the locks, the instructions can be
reordered either at runtime on an SMP architecture or by the compiler.
Let's say we are on a uniprocessor and the compiler reorders the
instructions. Data dependence analysis won't prevent a reordering like
this:

t->i = n;
global = t;
// what happens if a reader uses global here ?
t->j = m;

Here the unfortunate reader will see a correct value for i but an
incorrect one for j. To prevent things like this, you need memory
barriers (and not locks which are overkill for this case).


Alexandre Courpron.
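The fix Alexandre prescribes can be sketched by restating the thread's C snippet with C++11 atomics (an assumption on my part; the original code used pthreads and a mutex). The release store acts as the memory barrier: neither the compiler nor the CPU may sink the field writes below the pointer publication. Reclamation of old pairs is omitted, as in the original.

```cpp
#include <atomic>

struct pair_t { int i; int j; };

static std::atomic<pair_t*> global{nullptr};

void set(int n, int m) {
    pair_t* t = new pair_t;
    t->i = n;
    t->j = m;
    // Publish only the fully built pair: the release store may not
    // be reordered before the writes to t->i and t->j.
    global.store(t, std::memory_order_release);
}

// A reader pairs the release with an acquire load; after this load,
// both fields of the pair are guaranteed visible.
pair_t* get() {
    return global.load(std::memory_order_acquire);
}
```

No lock is taken anywhere, which is exactly the "barriers, not locks" point being made.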
 

courpron

[...]
t->i = n;
global = t;
// what happens if a reader uses global here ?
t->j = m;

Here the unfortunate reader will see a correct value for i but an
incorrect one for j.

Note : I should have said that, in this case, j is uninitialized, not
that j gets an incorrect value.

Alexandre Courpron.
 

courpron

[...]
In your set function, without the locks, the instructions can be
reordered either at runtime on an SMP architecture or by the compiler.

Does my use of the "volatile" keyword not prohibit that?

Ah yes, you used a volatile pointer.
It doesn't prevent the problem, at least not runtime reordering. In
the case of static reordering, the compiler shall not reorder memory
accesses in the presence of a volatile object (says the standard), but
consulting the compiler manual is a necessity, since volatile
sometimes does not have standard C++ behavior, depending on the
compiler (e.g. Visual C++).

In an SMP environment, volatile has no effect on the reordering
problem. That's the main problem with the volatile keyword, and why
volatile is often said to be useless in a multithreaded environment.
That's also why memory barriers are needed in this case.


Alexandre Courpron.
 

peter koch

[...]
I say you will not and that you need a memory barrier writing the
pointer (and nothing more).
Assuming you mean "deterministic" when you say "consistent", I think you
are wrong. Even if both the reader and writer lock the pointer (which is
pointless if the operations are already atomic) there is still a race
condition. Specifically, the reader (thread 2) might read either the old
or new value.
You still make a confusion between locks and memory barriers.

Sorry: I read "synchronization" and assumed Peter was referring to locks.
You have to take care about the memory ordering problem, at least on
SMP architectures...
Yes.

In your set function, without the locks, the instructions can be
reordered either at runtime on an SMP architecture or by the compiler.

Does my use of the "volatile" keyword not prohibit that?
 

peter koch

[...]
I say you will not and that you need a memory barrier writing the
pointer (and nothing more).
Assuming you mean "deterministic" when you say "consistent", I think you
are wrong. Even if both the reader and writer lock the pointer (which is
pointless if the operations are already atomic) there is still a race
condition. Specifically, the reader (thread 2) might read either the old
or new value.
You still make a confusion between locks and memory barriers.

Sorry: I read "synchronization" and assumed Peter was referring to locks.

Well - I can speak for myself and did not mean locks - simply
synchronisation.
Does my use of the "volatile" keyword not prohibit that?
I don't believe so. Extensions might give stricter guarantees (and
Microsoft does so, if I understand correctly).
But volatile is not needed as long as you have a memory barrier on the
pointer. If the pointer is written by "normal" means, the structure it
points to is still not guaranteed to be fully written before the
pointer becomes visible.

/Peter
 

gpderetta

[...]
In your set function, without the locks, the instructions can be
reordered either at runtime on an SMP architecture or by the compiler.
Does my use of the "volatile" keyword not prohibit that?

Ah yes, you used a volatile pointer.
It doesn't prohibit this problem at least with runtime reordering. In
case of static reordering, the compiler shall not reorder memory
access in the presence of volatile object (says the standard), but
consulting the compiler manual is a necessity since volatile has
sometimes not a C++ standard behavior depending on the compiler (e.g.
visual C++).

Also, IIRC the standard prohibits only the reordering of accesses to
volatile objects among themselves. Reads and writes to non-volatile
objects can still be reordered even across a volatile access.
So volatile doesn't even inhibit many compiler reorderings.
 

courpron

 [...]
In your set function, without the locks, the instructions can be
reordered either at runtime on an SMP architecture or by the compiler.
Does my use of the "volatile" keyword not prohibit that?
Ah yes, you used a volatile pointer.
It doesn't prevent the problem, at least not runtime reordering. In
the case of static reordering, the compiler shall not reorder memory
accesses in the presence of a volatile object (says the standard), but
consulting the compiler manual is a necessity, since volatile
sometimes does not have standard C++ behavior, depending on the
compiler (e.g. Visual C++).

Also, IIRC the standard prohibits only the reordering of accesses to
volatile objects among themselves. Reads and writes to non-volatile
objects can still be reordered even across a volatile access.

You're right.
So volatile doesn't even inhibit many compiler reordering operations.

Indeed, unless you declare all the variables concerned as volatile,
which would probably lead to very inefficient code and still wouldn't
prevent the runtime reordering problem.


Alexandre Courpron.
 

gpderetta

[...]
In your set function, without the locks, the instructions can be
reordered either at runtime on an SMP architecture or by the compiler.
Does my use of the "volatile" keyword not prohibit that?
Ah yes, you used a volatile pointer.
It doesn't prevent the problem, at least not runtime reordering...

This raises the question: how can this be written correctly in C or C++?

It can't be done in C99 or C++03, neither of which even acknowledges
the existence of threads. POSIX C requires mutual exclusion around all
shared state that could potentially be concurrently written.

Until C++0x, which will have atomics and memory barriers, you have to
rely on unportable compiler/OS extensions.
 

James Kanze

On Sep 23, 9:12 pm, (e-mail address removed) wrote:
Also, IIRC the standard prohibits only the reordering of accesses
to volatile objects among themselves. Reads and writes to non-volatile
objects can still be reordered even across a volatile access.

Not really. Almost all of the actual specification of volatile
is implementation-defined, so an implementation can pretty much do
what it wants. The expressed intent in C++ is that it should
work as in C, and the expressed intent in C corresponds pretty
much to what you say, but it remains intent, not a formal
requirement.

In practice, all of the compilers (Sun CC and g++ on Sparc, g++
and VC++ on Intel/AMD) I know do allow reordering, even when
volatile is used. So the keyword is for all intents and
purposes useless (and when ordering is required, you have to use
some system function or resort to assembler).
So volatile doesn't even inhibit many compiler reordering
operations.

Even less of them than one would expect from its expressed
intent.
 

James Kanze

[...]
I don't believe so. Extensions might give stricter guarantees
(and Microsoft does so, if I understand correctly).

Microsoft has proposed such an extension to the meaning of
volatile, and presumably will implement it in future versions of
the compiler, but the version I have available (version 14, from
Visual Studio 8) doesn't implement any additional guarantees for
volatile. In fact, it doesn't even prevent reordering or fusing
of successive accesses to a volatile variable. That is IMHO
contrary to the intent of the standard (though the actual
normative specification is "implementation defined"), but it is
almost universal.
But volatile is not needed as long as you have a memory
barrier on the pointer.

Provided the compiler knows this. Memory barriers are necessary
at the hardware level, but you still have to prevent code
movement by the compiler. This can be done in two ways:

-- the memory barriers (except that I think they're called
fences by Intel) are implemented in a system routine (e.g.
the Interlocked... functions under Windows, or most of the
threading functions under either Windows or Unix), and the
compiler "knows" about this (required by Posix), or

-- the memory barriers are implemented in embedded assembler,
and you've taken steps to prevent code movement by the
compiler across this assembler (or verified that the
compiler doesn't do it).
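The second option can be made concrete with GCC-style constructs (an assumption on my part; other compilers spell these differently). The empty asm with a "memory" clobber is a pure compiler barrier, emitting no instruction; the legacy builtin __sync_synchronize() additionally emits a hardware fence (e.g. MFENCE on x86).

```cpp
// Pure compiler barrier: the compiler may not move memory accesses
// across the empty asm because of the "memory" clobber.
inline void compiler_barrier() {
    asm volatile("" ::: "memory");
}

// Full hardware fence plus compiler barrier (GCC builtin).
inline void full_barrier() {
    __sync_synchronize();
}

// Hypothetical publication using the fence: the payload write is
// ordered before the pointer write, for both compiler and CPU.
static int payload;
static int* g_ptr;

void publish_payload(int value) {
    payload = value;
    full_barrier();
    g_ptr = &payload;
}
```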
 
