[ ... ]
That's why caching is used.
Yes, but what you're advocating would negate (most of) the benefits of
the cache.
Specifically, if you keep the memory in sync with the contents of the
cache, you have a write-through cache.
Most modern systems use write-back caches. These allow the system
memory to become out of sync with the contents of the cache. In this
situation, the cache typically marks the cache line as modified (dirty).
Even with a write-back cache, a machine that wants to can still provide
a coherent view of memory. All the processors snoop all bus
transactions. When a processor sees a read transaction on a location it
holds in a modified cache line, it puts the transaction on hold, writes
its modified cache line back to memory, and then allows the original
bus transaction to complete.
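To make the snooping sequence concrete, here is a rough sketch of it as
a toy simulation. Everything in it (ToyBus, ToyCacheLine, a single
shared memory word, one cache line per processor) is made up for
illustration, and invalidating the other copies on a write is my own
simplifying assumption, not something any particular machine is
guaranteed to do this way.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model, not real hardware: one shared memory word, and one
// single-line write-back cache per processor.
struct ToyCacheLine {
    bool valid;
    bool modified;  // true when the line is newer than memory
    int value;
};

class ToyBus {
public:
    // The vector value-initializes its elements, so every line starts
    // out invalid and unmodified.
    explicit ToyBus(std::size_t processors)
        : caches_(processors), memory_(0), writebacks_(0) {}

    // Write-back store: only the local cache line is updated, so memory
    // goes stale. Assumption: other copies are invalidated first.
    void write(std::size_t cpu, int value) {
        assert(cpu < caches_.size());
        for (std::size_t i = 0; i < caches_.size(); ++i) {
            if (i != cpu) {
                write_back_if_modified(i);
                caches_[i].valid = false;
            }
        }
        caches_[cpu].valid = true;
        caches_[cpu].modified = true;
        caches_[cpu].value = value;
    }

    // Load: on a miss, every other cache "snoops" the read, and a cache
    // holding the line as modified writes it back before the read
    // completes from memory.
    int read(std::size_t cpu) {
        assert(cpu < caches_.size());
        if (!caches_[cpu].valid) {
            for (std::size_t i = 0; i < caches_.size(); ++i) {
                if (i != cpu) write_back_if_modified(i);
            }
            caches_[cpu].valid = true;
            caches_[cpu].modified = false;
            caches_[cpu].value = memory_;
        }
        return caches_[cpu].value;
    }

    int memory() const { return memory_; }
    int writebacks() const { return writebacks_; }

private:
    void write_back_if_modified(std::size_t i) {
        if (caches_[i].valid && caches_[i].modified) {
            memory_ = caches_[i].value;
            caches_[i].modified = false;
            ++writebacks_;
        }
    }

    std::vector<ToyCacheLine> caches_;
    int memory_;
    int writebacks_;
};
```

With two processors, a write(0, 42) leaves memory() stale until
processor 1 does a read(1); that read forces the write-back and only
then completes.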
This works well for a small number of processors (e.g. 8 or fewer).
You can extend it a little by adding some direct processor-processor
links: instead of the first processor writing the data to memory and
the second reading it back from memory, the processor with the modified
line sends it directly to the processor that needs it. Beyond a few
dozen or so processors, that starts to break down again.
If you want to support a really large number of processors (e.g. tens of
thousands) you pretty much NEED to decouple the processors from each
other to a greater degree. In the process, you essentially always end up
with a situation where processors do not have a coherent view of memory
unless you do something special to cause it (and you generally want to
avoid that something special, because it's almost inevitably extremely
expensive).
How do you think mutexes work? They rely on the fact that a write is
atomic. Keep in mind that the OP was not asking about a read after
write ("RAW"), only a single write.
No -- a mutex usually works by internally doing things that are
non-portable (e.g. atomic instructions and memory barriers specific to
the hardware).
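As a rough illustration of what "non-portable" tends to mean here, the
sketch below builds a toy spin lock on the GCC/Clang __sync builtins.
Those builtins are compiler extensions, not standard C++, and the
ToySpinLock class is purely hypothetical; a real mutex would also
involve the OS so that waiting threads can sleep instead of spinning.

```cpp
#include <cassert>

// Assumes GCC or Clang: __sync_lock_test_and_set and __sync_lock_release
// are compiler builtins, which is exactly the kind of non-portable
// machinery a mutex hides from you.
class ToySpinLock {
public:
    ToySpinLock() : flag_(0) {}

    // Atomically stores 1 and returns the previous value (an acquire
    // barrier). We own the lock only if the previous value was 0.
    bool try_lock() {
        return __sync_lock_test_and_set(&flag_, 1) == 0;
    }

    // Spin until the lock is acquired.
    void lock() {
        while (!try_lock()) {
            // busy-wait
        }
    }

    // Atomically stores 0 (a release barrier).
    void unlock() {
        assert(flag_ == 1);  // must only be called while the lock is held
        __sync_lock_release(&flag_);
    }

private:
    volatile int flag_;
};
```

Note that a plain atomic write is not enough for this: the lock depends
on an atomic read-modify-write plus the barriers that come with it.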
In the end, the bottom line is pretty simple: the C++ standard really
only addresses atomicity in one very limited area -- in the presence of
signals, sig_atomic_t is a type that can be manipulated atomically.
That doesn't guarantee you _anything_ about threads though. For
small-scale multiprocessing, chances are it'll work, but there's never
any guarantee that it will, and beyond a certain scale it's almost
guaranteed to fail (though, if it's any comfort, so is nearly everything
that's portable).
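For reference, the one use of sig_atomic_t that is covered looks
roughly like the sketch below: a flag set in a signal handler and read
by the interrupted code. The names g_signal_seen, on_signal and
raise_and_check are made up for the example, and nothing here says
anything about a second thread reading the flag.

```cpp
#include <cassert>
#include <csignal>

// A flag the handler may write and the interrupted code may read.
volatile std::sig_atomic_t g_signal_seen = 0;

// Signal handlers should have C linkage; all this one does is set the
// flag.
extern "C" void on_signal(int) {
    g_signal_seen = 1;
}

// Installs the handler, raises SIGINT in this same thread, and reports
// whether the flag was set. std::raise does not return until the
// handler has run.
bool raise_and_check() {
    void (*previous)(int) = std::signal(SIGINT, on_signal);
    assert(previous != SIG_ERR);
    (void)previous;  // silence unused warnings when NDEBUG is defined
    std::raise(SIGINT);
    return g_signal_seen == 1;
}
```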