Memory access vs variable access


Gerhard Fiedler

Hello,

I'm not sure whether this is a problem or not, or how to determine whether
it is one.

Say memory access (read and write) happens in 64-bit chunks, and I'm
looking at 32-bit variables. This would mean that either some other
variable is also written when writing a 32-bit variable (which means that
all access to 32-bit variables is of the read-modify-write type, affecting
some other variable also), or that all 32-bit variables are stored in their
own 64-bit chunk.

With single-threaded applications, that's a mere performance question. But
with multi-threaded applications, there's no way I can imagine that would
avoid the read-modify-write problems the first alternative would create, as
it is nowhere defined what the other variable is that is also written -- so
it can't be protected by a lock. Without it being protected by a lock,
there's nothing that prevents a thread from altering it while it is in the
middle of the read-modify-write cycle, which means that the end of it will
overwrite the altered value with the old value.

However, there must be a way to deal with this, otherwise multi-threaded
applications in C++ wouldn't be possible.

What am I missing?

Thanks,
Gerhard
 

gpderetta

The fact that C++ does not specify any of that, maybe.

But C++0x will. IIRC, according to the draft standard, an
implementation is prohibited from performing many kinds of speculative
writes (with the exception of bitfields) to locations that wouldn't be
written unconditionally anyway (or something like that).

If a specific architecture didn't allow 32-bit loads/stores to 32-bit
objects, it would require the implementation to pad every object up to
the smallest supported load/store granularity. Pretty much all common
architectures allow access to memory at least at 8/16/32-bit
granularity (except for DSPs, I guess), so it is not a problem.

Current compilers do not implement the rule above, but thread-aware
compilers approximate it well enough that, as long as you use correct
locks, things work correctly *most of the time* (some compilers have
been known to miscompile code that used trylocks, for example).
Try 'comp.programming.threads' as your starting point since it's the
multi-threading that you're concerned about.  The problem does not seem
to be language-specific, and as such does not belong to a language
newsgroup.

Actually, discussing whether the next C++ standard prohibits
speculative writes is language-specific and definitely on topic.
 

Gerhard Fiedler

Just for the record: I didn't really miss that. I just thought that how a
very common problem present in a sizable part of C++ applications is being
handled across compilers and platforms is actually on topic in a group
about the C++ language.
But C++0x will. IIRC, according to the draft standard, an implementation
is prohibited from performing many kinds of speculative writes (with the
exception of bitfields) to locations that wouldn't be written
unconditionally anyway (or something like that).

If a specific architecture didn't allow 32-bit loads/stores to 32-bit
objects, it would require the implementation to pad every object up to
the smallest supported load/store granularity. Pretty much all common
architectures allow access to memory at least at 8/16/32-bit
granularity (except for DSPs, I guess), so it is not a problem.

Ah, I didn't know that. So on common hardware (maybe x86, x64, AMD, AMD64,
IA-64, PowerPC, ARM, Alpha, PA-RISC, MIPS, SPARC), memory access is
possible in byte granularity? Which then means that no common compiler
would write to locations that are not the actual purpose of the write
access?
Current compilers do not implement the rule above, but thread-aware
compilers approximate it well enough that, as long as you use correct
locks, things work correctly *most of the time* (some compilers have
been known to miscompile code that used trylocks, for example).

Do you have any links about which compilers specifically don't create code
that works correctly? One objective of mine is to be able to separate this
"most of the time" into two clearly defined subsets, one of which works
"all of the time" :)
Actually, discussing whether the next C++ standard prohibits
speculative writes is language-specific and definitely on topic.

Is "speculative writes" the technical term for the situation I described?

Thanks,
Gerhard
 

gpderetta

Ah, I didn't know that. So on common hardware (maybe x86, x64, AMD, AMD64,
IA-64, PowerPC, ARM, Alpha, PA-RISC, MIPS, SPARC), memory access is
possible in byte granularity? Which then means that no common compiler
would write to locations that are not the actual purpose of the write
access?

All x86 derivatives allow 8/16/32/64-bit access at any offset. I think
both PowerPC and ARM allow access at any granularity, as long as the
access is properly aligned. IIRC, very old Alphas only allowed
accessing aligned 32/64 bits (no byte access), but that was fixed
because it was extremely inconvenient. I do not know about IA-64,
MIPS, SPARC, and PA-RISC, but I would be extremely surprised if they
didn't allow it.
Do you have any links about which compilers specifically don't create code
that works correctly? One objective of mine is to be able to separate this
"most of the time" into two clearly defined subsets, one of which works
"all of the time" :)

Many do, in corner cases. Usually these are considered bugs and are
fixed when they are encountered.
See for example http://www.airs.com/blog/archives/79
Is "speculative writes" the technical term for the situation I described?

I'm not sure if it applies to this example. I think that "speculative
store" is defined as the motion of a store outside of its position in
program order (usually sinking it outside of loops or branches). It
doesn't take much to generalize the concept to that of the *addition*
of a store not present in the original program (i.e. adjacent-field
overwrites).

For details see "Concurrency memory model compiler consequences" by
Hans Boehm:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2338.html

HTH,
 

James Kanze

I'm not sure whether this is a problem or not, or how to
determine whether it is one.

It's potentially one.
Say memory access (read and write) happens in 64-bit chunks,
and I'm looking at 32-bit variables. This would mean that
either some other variable is also written when writing a
32-bit variable (which means that all access to 32-bit
variables is of the read-modify-write type, affecting some
other variable also), or that all 32-bit variables are stored
in their own 64-bit chunk.
With single-threaded applications, that's a mere performance
question. But with multi-threaded applications, there's no way
I can imagine that would avoid the read-modify-write problems
the first alternative would create, as it is nowhere defined
what the other variable is that is also written -- so it can't
be protected by a lock. Without it being protected by a lock,
there's nothing that prevents a thread from altering it while
it is in the middle of the read-modify-write cycle, which
means that the end of it will overwrite the altered value with
the old value.
However, there must be a way to deal with this, otherwise
multi-threaded applications in C++ wouldn't be possible.

Most hardware provides for single byte writes (even when the
read is always 64 bits), and takes care that it works correctly.
From what I understand, this wasn't the case on some early DEC
Alphas, and it certainly wasn't the case on many older
platforms, where when you wrote a byte, the hardware would read
a word, and rewrite it.

The upcoming version of the standard will address this problem;
if nothing changes, it will require that *most* accesses to a
single "object" work. (The major exception is bit fields. If
you access an object that is declared as a bit field, and any
other thread may modify any object in the containing class, you
need to explicitly synchronize.) Implementations for processors
where the hardware doesn't support this have their work cut out
for them (but better them than us), and byte accesses on such
implementations are likely to be very slow.
 

Gerhard Fiedler

Most hardware provides for single byte writes (even when the read is
always 64 bits), and takes care that it works correctly.

What I find a bit disconcerting is that it seems so difficult to find out
whether a given hardware actually does this. Reality seems to confirm that
it actually is "most" (or otherwise "most" programs would probably crash a
lot more than they do), but I haven't found any documentation about any
specific guarantees of specific compilers on specific platforms. (I'm
mainly interested in VC++ and gcc.) Does somebody have any pointers for me?

Thanks,
Gerhard
 

Jerry Coffin

[ ... ]
What I find a bit disconcerting is that it seems so difficult to find out
whether a given hardware actually does this. Reality seems to confirm that
it actually is "most" (or otherwise "most" programs would probably crash a
lot more than they do), but I haven't found any documentation about any
specific guarantees of specific compilers on specific platforms. (I'm
mainly interested in VC++ and gcc.) Does somebody have any pointers for me?

There are a number of problems with that. The first is that when you get
to exotic multiprocessors, a lot of ideas have been tried, and even
though only a few have really gained much popularity, there are still
some that bend almost any rule you'd like to make.

Another problem is that even on a given piece of hardware, the behavior
can be less predictable than you'd generally like. For example, recent
versions of the Intel x86 processors all have Memory Type and Range
Registers (MTRRs). Using an MTRR, one can adjust the behavior of memory
writes individually for ranges of memory. You can get write-back
caching, write-through caching, write combining, or no caching at all --
all on the same machine at the same time for different ranges of memory.

Also keep in mind that most modern computers use caching. In a typical
case, any read from or write to main memory happens an entire cache line
at a time. Bookkeeping is also done on the basis of entire cache lines,
so the processor doesn't care how many bits in a cache line have been
modified -- from its viewpoint, the cache line as a whole is either
modified or not. If, for example, another processor attempts to read
memory that falls in that cache line, the entire line is written to
memory before the other processor can read it. Even if the two are
entirely disjoint, if they fall in the same cache line, the processor
treats them as a unit.
 

James Kanze

[ ... ]
What I find a bit disconcerting is that it seems so
difficult to find out whether a given hardware actually does
this. Reality seems to confirm that it actually is "most"
(or otherwise "most" programs would probably crash a lot
more than they do), but I haven't found any documentation
about any specific guarantees of specific compilers on
specific platforms. (I'm mainly interested in VC++ and gcc.)
Does somebody have any pointers for me?

It depends mostly on the hardware architecture, not the
compiler. The compiler will generate byte, half-word, etc. load
and store machine instructions (assuming they exist, of course);
the problem is what the hardware does with them.

For Sparc architecture, see
http://www.sparc.org/specificationsDocuments.html. I presume
that other architecture providers (e.g. Intel, AMD, etc.) have
similar pages.

[...]
Also keep in mind that most modern computers use caching. In a
typical case, any read from or write to main memory happens an
entire cache line at a time. Bookkeeping is also done on the
basis of entire cache lines, so the processor doesn't care how
many bits in a cache line have been modified -- from its
viewpoint, the cache line as a whole is either modified or
not. If, for example, another processor attempts to read
memory that falls in that cache line, the entire line is
written to memory before the other processor can read it. Even
if the two are entirely disjoint, if they fall in the same
cache line, the processor treats them as a unit.

That's true to a point. Most modern architectures also ensure
cache coherence at the hardware level: if one thread writes to
the first byte in a cache line, and a different thread (on a
different core) writes to the second byte, the hardware will
ensure that both writes eventually end up in main memory; that
the write back of the cache line from one core won't overwrite
the changes made by the other core.

This issue was discussed in detail by the committee; in the end,
it was decided that given something like:

struct S { char a; char b; };
or
char a[2];

one thread could modify S::a or a[0], and the other S::b or
a[1], without any explicit synchronization, and the compiler had
to make it work. This was accepted because in fact, just
emitting store byte instructions is sufficient for all of the
current architectures.
 

Gerhard Fiedler

It depends mostly on the hardware architecture, not the compiler. The
compiler will generate byte, half-word, etc. load and store machine
instructions (assuming they exist, of course); the problem is what the
hardware does with them.

For Sparc architecture, see http://www.sparc.org/specificationsDocuments.html.
I presume that other architecture providers (e.g. Intel, AMD, etc.)
have similar pages.

Thanks. I thought that it would also depend on how the compiler generates
the code, but I guess you're right in assuming that any (halfway decent)
compiler will generate 8-bit writes for 8-bit variables if that is possible
:)
That's true to a point. Most modern architectures also ensure cache
coherence at the hardware level: if one thread writes to the first byte
in a cache line, and a different thread (on a different core) writes to
the second byte, the hardware will ensure that both writes eventually
end up in main memory; that the write back of the cache line from one
core won't overwrite the changes made by the other core.

Taken all this together, it seems that on "most modern architectures" cache
coherency is mostly guaranteed by the hardware, and for example it is not
necessary to use memory barriers or locks for access to volatile boolean
variables that are only read or written (never using a read-modify-write
cycle). Is this correct? What is all this talk about different threads
seeing values out of order about, if the cache coherency is maintained by
the hardware in this way?

Gerhard
 

gpderetta

Taken all this together, it seems that on "most modern architectures" cache
coherency is mostly guaranteed by the hardware, and for example it is not
necessary to use memory barriers or locks for access to volatile boolean
variables that are only read or written (never using a read-modify-write
cycle). Is this correct? What is all this talk about different threads
seeing values out of order about, if the cache coherency is maintained by
the hardware in this way?

Cache coherency is not the only part of a system that can reorder loads
and stores. Write buffers and out-of-order (OoO) execution machinery
are also responsible. Even x86, which has an otherwise fairly strong
memory model, requires StoreLoad memory barriers in some cases (i.e.
mfence or locked operations).

So, AFAIK the answer is no: in general, and for most compilers, even
volatile is not enough.
 

James Kanze

On 2008-06-25 04:58:41, James Kanze wrote: [...]
For Sparc architecture, see
http://www.sparc.org/specificationsDocuments.html. I
presume that other architecture providers (e.g. Intel, AMD,
etc.) have similar pages.
Thanks. I thought that it would also depend on how the
compiler generates the code, but I guess you're right in
assuming that any (halfway decent) compiler will generate
8-bit writes for 8-bit variables if that is possible :)

Well, it would be nice if they'd document it. But in practice,
I don't worry too much about a compiler generating code to load
a word, change one byte of it, and then storing it, if the
hardware has a single instruction byte store.
Taken all this together, it seems that on "most modern
architectures" cache coherency is mostly guaranteed by the
hardware, and for example it is not necessary to use memory
barriers or locks for access to volatile boolean variables
that are only read or written (never using a read-modify-write
cycle). Is this correct? What is all this talk about different
threads seeing values out of order about, if the cache
coherency is maintained by the hardware in this way?

Several things. The first, of course, is that what we've just been
talking about only concerns a single cache line; the hardware might
not be so careful between cache lines (which results in multiple
physical writes). But the real reason is that reads and writes, even
to the cache, are pipelined in the processor itself, and can be
reordered in the pipeline. Thus, for example, if we suppose two ints,
i and j, both initially 0, and one processor executes:

store #1, i
store #1, j

a second processor can still see the condition i==0, j==1, either
because the first processor reordered the writes (because of pipeline
considerations), or because the second recognized that it already had
a read of the cache line containing j in its pipeline, and used the
results of that read for j.
 

Jerry Coffin

[ ... ]
Taken all this together, it seems that on "most modern architectures" cache
coherency is mostly guaranteed by the hardware, and for example it is not
necessary to use memory barriers or locks for access to volatile boolean
variables that are only read or written (never using a read-modify-write
cycle). Is this correct? What is all this talk about different threads
seeing values out of order about, if the cache coherency is maintained by
the hardware in this way?

Yes and no. The hardware normally ensures coherency for a single
variable -- but it doesn't know anything about the relationships you've
established between variables. For example, assume a really simple
situation where you have some data and a bool to tell when the data is
valid:

struct whatever {
    int data1;
    float data2;
    bool valid;
public:
    whatever() : valid(false) {}
} thing;

If you have code like:

thing.data1 = 1;
thing.data2 = 2.0f;
thing.valid = true;

The hardware will assure that when a write has taken place to any of the
variables, any other core looking at the memory location of that
variable will see the value that was written.

Now, we don't care at all about the relative order in which data1 and
data2 are written -- whichever way the hardware can do it the fastest
is fine by us. BUT we need to assure that 'valid' is only seen as true
AFTER the values have been written to both data1 and data2.

The hardware doesn't know this on its own. It just sees three separate
assignments to three separate variables. As such, the programmer needs
to "inform" the hardware about the relationship involved.
 

Gerhard Fiedler

Taken all this together, it seems that on "most modern architectures"
cache coherency is mostly guaranteed by the hardware, and for example
it is not necessary to use memory barriers or locks for access to
volatile boolean variables that are only read or written (never using a
read-modify-write cycle). Is this correct? What is all this talk about
different threads seeing values out of order about, if the cache
coherency is maintained by the hardware in this way?

Yes and no. [Lots of useful stuff snipped.]

Thanks to all who responded in this thread. It has helped me a good deal in
understanding what I can rely on and what not.

Gerhard
 
