How to make this program more efficient?

B

Bill David

If you don't want it to.  I've used garbage collection with C++,
and it works quite well.  For almost all applications, it's a
definite win in terms of development times and robustness, and
for a lot of applications, it's also a win in terms of
performance, and for others, it's a win in terms of perceived
performance, even if total performance isn't as good.

--
James Kanze (GABI Software)             email:[email protected]
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Thank you all for your suggestions. Let me relate them to the
important scenario I mentioned:
With the original solution based on std::map, both getData (which
calls map::find) and copyData (which iterates over the whole map)
will run into errors if we don't lock around every method that
accesses (reads or updates) the map.
For that problem, it seems the garbage-collection-based solution is
the only one that solves it, at the cost of extra memory. A getData/
copyData request that started before an update can keep using the
original map pointer and contents, while requests arriving after the
update will use the new one.
As for RCU, I'm afraid it doesn't fit this scenario, although it can
sometimes be lock-free. I have some RCU implementation experience: we
attach a timestamp to the data, read the data out, and try to update
it. Before we actually write the data back to the DB, we check the
current timestamp again. If it has been modified, the update request
fails; otherwise, we write the data with a new timestamp. But I have
no idea how to apply that to the scenario above.
As for a reader-writer lock (RWL), if it permits concurrent reads
without the problem I described in the scenario above, it would also
be a great performance enhancement. But I am not sure about this
solution; I will think about it over the next few hours and come back
again. :)
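
A minimal sketch of the snapshot-swap idea described above, in C++11,
using std::shared_ptr as the "garbage collector" (the Table class and
the getData/copyData/update names are illustrative, taken from the
discussion; a single writer thread is assumed):

    #include <map>
    #include <memory>
    #include <mutex>
    #include <string>

    class Table {
        typedef std::map<int, std::string> Map;
        std::shared_ptr<const Map> current_;
        mutable std::mutex mtx_;  // guards only the pointer, never the map contents

        // Grab the current snapshot; in-flight readers keep the old map alive.
        std::shared_ptr<const Map> snapshot() const {
            std::lock_guard<std::mutex> lock(mtx_);
            return current_;
        }

    public:
        Table() : current_(std::make_shared<Map>()) {}

        // getData: map::find against whatever snapshot was current at the start.
        bool getData(int key, std::string& out) const {
            std::shared_ptr<const Map> snap = snapshot();
            Map::const_iterator it = snap->find(key);
            if (it == snap->end()) return false;
            out = it->second;
            return true;
        }

        // copyData: iterate a stable snapshot, unaffected by concurrent updates.
        Map copyData() const { return *snapshot(); }

        // update: copy, modify, and swap the pointer in. Old snapshots are
        // reclaimed automatically when the last reader drops them -- exactly
        // the extra-memory trade-off mentioned above.
        void update(int key, const std::string& value) {
            std::shared_ptr<Map> next(new Map(*snapshot()));  // copy-on-write
            (*next)[key] = value;
            std::lock_guard<std::mutex> lock(mtx_);
            current_ = next;  // assumes a single writer thread
        }
    };

A reader that started before update() completes keeps using the old
map through its own shared_ptr, which is the behavior described above.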
 
J

Jerry Coffin

[ ... ]
All accesses, including reads, still need to be synchronized.

Well, yes. For the moment, I'm ignoring the read lock because it'll be
needed for _any_ solution that shares access to the map, regardless of
how the updating is done, and an arbitrary number of threads can acquire
read locks simultaneously anyway.
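
As an aside, the read-lock side looks like the following under C++17's
std::shared_mutex (a much later facility than anything in this thread;
the names are illustrative): any number of threads can hold the lock
in shared mode, while the writer takes it exclusively.

    #include <map>
    #include <mutex>
    #include <shared_mutex>
    #include <string>

    std::shared_mutex rwLock;              // many readers or one writer
    std::map<int, std::string> sharedMap;

    // Reader: shared lock -- any number of these can run concurrently.
    bool getData(int key, std::string& out) {
        std::shared_lock<std::shared_mutex> lock(rwLock);
        auto it = sharedMap.find(key);
        if (it == sharedMap.end()) return false;
        out = it->second;
        return true;
    }

    // Writer: exclusive lock -- waits until all shared locks are released.
    void update(int key, const std::string& value) {
        std::unique_lock<std::shared_mutex> lock(rwLock);
        sharedMap[key] = value;
    }
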
Provided you do it in assembler. std::swap doesn't guarantee
atomicity and synchronization. (And you likely need additional
instructions for synchronization of the swap on any architecture
other than 80x86.)

You can't do it entirely portably, but you rarely (if ever) have to use
assembly language either -- nearly any platform that provides threading
of any sort is also going to include some synchronization primitives,
and one of them is almost certain to be adequate to this purpose. For
one obvious example, on Win32 you'd use InterlockedExchange (or a
variation such as InterlockedExchange64). Under C++0x it will be
provided under the name atomic_exchange().
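
A sketch of that pointer swap in (what became) standard C++, with an
illustrative publish() helper; the std::atomic_exchange free function
is the standardized form of the atomic_exchange() mentioned above:

    #include <atomic>
    #include <map>
    #include <string>

    typedef std::map<int, std::string> Map;

    std::atomic<Map*> currentMap(new Map);

    // Publish a new map and receive the old one in a single atomic step --
    // the standardized counterpart of Win32's InterlockedExchange.
    Map* publish(Map* newMap) {
        return std::atomic_exchange(&currentMap, newMap);
    }

    // NB: deleting the Map returned by publish() is only safe once no
    // reader can still be using it; that reclamation question is what
    // the rest of this thread is about.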
 
J

Jerry Coffin

[ ... ]
You'll have to explain that one to me. (I agree that this is a
pattern in which reference counting works extremely well, but
I'm pretty sure you still need more than a single bit to
implement it.)

As I said, it depends on the usage -- in a pretty fair number of cases,
a thread isn't capable of creating more than one reference to the
contents of the collection at any given time. When that's the case, the
maximum reference count is one...
 
J

James Kanze

[ ... ]
You'll have to explain that one to me. (I agree that this
is a pattern in which reference counting works extremely
well, but I'm pretty sure you still need more than a single
bit to implement it.)
As I said, it depends on the usage -- in a pretty fair number
of cases, a thread isn't capable of creating more than one
reference to the contents of the collection at any given time.
When that's the case, the maximum reference count is one...

But in that case, you don't need a reference count at all. Any
time you would decrement the reference count, you're guaranteed
that the result will be 0, and you'll have to delete, so you
might as well just delete immediately, without bothering with
the reference count.
 
J

Jerry Coffin

[ ... ]
But in that case, you don't need a reference count at all. Any
time you would decrement the reference count, you're guaranteed
that the result will be 0, and you'll have to delete, so you
might as well just delete immediately, without bothering with
the reference count.

Not so -- each thread has only a single use, but an arbitrary number of
threads can have access, so you need to wait until all the threads clear
their bits before you dispose of the map.
 
J

James Kanze

[ ... ]
But in that case, you don't need a reference count at all.
Any time you would decrement the reference count, you're
guaranteed that the result will be 0, and you'll have to
delete, so you might as well just delete immediately,
without bothering with the reference count.
Not so -- each thread has only a single use, but an arbitrary
number of threads can have access, so you need to wait until
all the threads clear their bits before you dispose of the
map.

So it's one bit per thread, rather than a single bit. So for n
threads, you need n bits, rather than just lg(n).

There's something I'm not understanding.
 
J

Jerry Coffin

[ ... ]
So it's one bit per thread, rather than a single bit. So for n
threads, you need n bits, rather than just lg(n).

There's something I'm not understanding.

Yes -- my original article was rather misleading, but this does not
optimize space. Rather, it optimizes speed at the expense of a little
extra space. That is, you avoid all your reader threads contending for a
single resource (the reference count). Instead, each has its own
"reference count" and only has to contend with the writer thread for its
use. From a viewpoint of space utilization this typically won't be any
different from having a full-blown reference count for each thread --
i.e. you want to store the bits unpacked so each word can be accessed
independently of the others.

With a reference count, a single use of the map (or whatever
resource you're sharing) involves the following steps:

1) Obtain the mutex on the reference count.
2) Read, modify and write the reference count.
3) Release the mutex on the reference count.
4) Use the shared resource.
5) Obtain the mutex on the reference count.
6) Read, modify and write the reference count.
7) Release the mutex on the reference count.

With a mutex for each thread the cycle is:

1) Set the mutex.
2) Use the shared resource.
3) Clear the mutex.

The number of steps can be somewhat misleading though: steps 2 and 6 of
the original set each involve an RMW cycle by themselves, which is
completely absent in the second set. In addition, obtaining the mutex
(steps 1 and 5) takes longer as the number of threads grows, another
feature that's absent from the second set.
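
A sketch of the per-thread-flag cycle as described above, in C++11
(the thread count, the indexing, and all names are illustrative; the
sequentially consistent defaults of std::atomic stand in for the
synchronization, and a single writer is assumed):

    #include <atomic>
    #include <map>
    #include <string>

    const int kMaxThreads = 64;           // illustrative fixed thread count

    typedef std::map<int, std::string> Map;
    std::atomic<Map*> currentMap(new Map);

    // One flag per thread, one full word each ("unpacked"), so reader i
    // only ever contends with the writer, never with the other readers.
    std::atomic<int> inUse[kMaxThreads];

    // Reader cycle: 1) set the flag, 2) use the resource, 3) clear the flag.
    std::string getData(int myThread, int key) {
        inUse[myThread].store(1);
        Map* map = currentMap.load();
        Map::const_iterator it = map->find(key);
        std::string result = (it != map->end()) ? it->second : std::string();
        inUse[myThread].store(0);
        return result;
    }

    // Writer: swap in the new map, then wait for every flag to clear
    // before disposing of the old one. (Naive: a steady stream of
    // readers re-setting their flags can delay the writer.)
    void update(Map* newMap) {
        Map* old = currentMap.exchange(newMap);
        for (int i = 0; i < kMaxThreads; ++i)
            while (inUse[i].load() != 0)
                ;   // spin or yield
        delete old;
    }
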
 
J

James Kanze

[ ... ]
So it's one bit per thread, rather than a single bit. So
for n threads, you need n bits, rather than just lg(n).
There's something I'm not understanding.
Yes -- my original article was rather misleading, but this
does not optimize space. Rather, it optimizes speed at the
expense of a little extra space. That is, you avoid all your
reader threads contending for a single resource (the reference
count). Instead, each has its own "reference count" and only
has to contend with the writer thread for its use. From a
viewpoint of space utilization this typically won't be any
different from having a full-blown reference count for each
thread -- i.e. you want to store the bits unpacked so each
word can be accessed independently of the others.

You mean you have to store the bits unpacked, if it is to
work :). When you use bit fields (or do the masking by hand),
the compiler generates a read/modify/write cycle, with no
intervening locks or synchronization. (On some machines, e.g.
early DEC Alphas, even if you use a complete byte per bit, the
hardware would transform this into a read/modify/write, with no
intervening locks or synchronization. From what I understand,
the upcoming thread handling in the standard will not allow
this, however.)

I'm still not too sure how your scheme works. In order to know
whether to free or not, you have to access the "bits" written by
the other threads, which means some sort of synchronization is
necessary.

And of course, this could increase memory use enormously, if,
say, you had a couple of thousand threads. (One of the servers
I currently work on regularly runs with several hundred threads,
and the number is growing.)
With a reference count, a single use of the map (or
whatever resource you're sharing) involves the following
steps:
1) Obtain the mutex on the reference count.
2) Read, modify and write the reference count.
3) Release the mutex on the reference count.
4) Use the shared resource.
5) Obtain the mutex on the reference count.
6) Read, modify and write the reference count.
7) Release the mutex on the reference count.
With a mutex for each thread the cycle is:
1) Set the mutex.
2) Use the shared resource.
3) Clear the mutex.
The number of steps can be somewhat misleading though: steps 2
and 6 of the original set each involve an RMW cycle by
themselves, which is completely absent in the second set. In
addition, obtaining the mutex (steps 1 and 5) takes longer as
the number of threads grows, another feature that's absent
from the second set.

Most modern machines have the necessary instructions to support
a non-locking atomic increment or decrement. When the reference
count scheme is used, this can be used instead of the mutexes;
with reference counting and swapping pointers, I'm pretty sure
that I could get the whole thing to work on a modern (v9 or
later) Sparc without a single mutex. It would take a little bit
of assembler, however. Even with relaxed memory order.
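
In C++0x terms, that non-locking increment/decrement is std::atomic's
fetch_add/fetch_sub; a minimal mutex-free reference-count sketch under
that assumption (names illustrative -- note that safely *acquiring* a
reference to the currently published object still needs the
pointer-swap machinery discussed above):

    #include <atomic>
    #include <map>
    #include <string>

    struct CountedMap {
        std::map<int, std::string> map;
        std::atomic<int> refs;
        CountedMap() : refs(1) {}     // one reference held by the publisher
    };

    // A single atomic instruction on most modern hardware; no mutex.
    void addRef(CountedMap* p) {
        p->refs.fetch_add(1);
    }

    // fetch_sub returns the *previous* value: the thread that takes the
    // count from 1 to 0 is the unique owner and may delete.
    void release(CountedMap* p) {
        if (p->refs.fetch_sub(1) == 1)
            delete p;
    }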
 
J

James Kanze

Or the Bible, which is also irrelevant.
Any major ones?

DEC Alpha, Sun Sparc, and Intel 64 bits, that I know of.
Probably most others as well.
That is incorrect. There are, of course, already guarantees
such as the one given by .NET as I have already said.

Ensuring that guarantee is going to make programs run almost
an order of magnitude slower.
That is just another serious limitation of the new C++.

That's because it's a limitation of real hardware.
 
C

courpron

[...]
That's because it's a limitation of real hardware.

Another misconception.

Not at all.

Accessing an (aligned) pointer atomically can still lead to some
problems (the same problems you can have with the volatile keyword).
Basically, on an SMP architecture:
- instructions can be reordered at runtime, and there is no
dependence analysis as there is for parallel instructions in a
superscalar uniprocessor;
- the cache of the CPU concerned may be outdated (even in a
cache-coherent architecture, depending on the coherence protocol).

In such a context, the only portable way to safely access a memory
word is to use memory barriers.
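
In (what became) standard C++, those barriers are spelled as memory
orderings on atomic operations; a minimal message-passing sketch with
illustrative names, showing the write barrier on the producer side and
the read barrier on the consumer side:

    #include <atomic>

    int payload = 0;                      // ordinary, non-atomic data
    std::atomic<bool> ready(false);

    // Producer: the release store is the write barrier -- the write to
    // payload cannot be reordered after the store to ready.
    void producer() {
        payload = 42;
        ready.store(true, std::memory_order_release);
    }

    // Consumer: the acquire load is the read barrier -- once it sees
    // ready == true, it is guaranteed to see payload == 42, even on a
    // weakly ordered SMP machine.
    void consumer() {
        while (!ready.load(std::memory_order_acquire))
            ;   // spin
        int v = payload;   // guaranteed to be 42 here
        (void)v;
    }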

Alexandre Courpron.
 
P

peter koch

[...]
But that's not the issue; the issue is one of ordering.  And
the upcoming C++ standard will be clear here: if more than
one thread accesses the container, and any thread modifies
it, then all accesses must be synchronized.
That is just another serious limitation of the new C++.
That's because it's a limitation of real hardware.
Another misconception.

Not at all.
[snip]

In such context, the only portable way to access safely a memory word
is to use memory barriers.

Or use a mutex. In both cases, synchronisation takes place - which I
believe must be evident to everybody except perhaps Jon Harrop.

/Peter
 
C

courpron

[...]
But that's not the issue; the issue is one of ordering.  And
the upcoming C++ standard will be clear here: if more than
one thread accesses the container, and any thread modifies
it, then all accesses must be synchronized.
That is just another serious limitation of the new C++.
That's because it's a limitation of real hardware.
Another misconception.
Not at all.
[snip]

In such context, the only portable way to access safely a memory word
is to use memory barriers.

Or use a mutex. In both cases, synchronisation takes place - which I
believe must be evident to everybody except perhaps Jon Harrop.

Generally a mutex internally uses memory barriers, except when they
are not needed on the target architecture (on those architectures, a
"naked" access to a memory word is probably safe as well).

Alexandre Courpron.
 
P

peter koch

On 20 Sep., 15:47, (e-mail address removed) wrote:
[...]
But that's not the issue; the issue is one of ordering.  And
the upcoming C++ standard will be clear here: if more than
one thread accesses the container, and any thread modifies
it, then all accesses must be synchronized.
That is just another serious limitation of the new C++.
That's because it's a limitation of real hardware.
Another misconception.
Not at all.
In such context, the only portable way to access safely a memory word
is to use memory barriers.
Or use a mutex. In both cases, synchronisation takes place - which I
believe must be evident to everybody except perhaps Jon Harrop.

Generally a mutex internally uses memory barriers, except when they
are not needed on the target architecture (on those architectures, a
"naked" access to a memory word is probably safe as well).

Of course it might, but except for the implementors, there's really no
need to care.

/Peter
 
C

courpron

On 20 Sep., 15:47, (e-mail address removed) wrote:
[...]
But that's not the issue; the issue is one of ordering.  And
the upcoming C++ standard will be clear here: if more than
one thread accesses the container, and any thread modifies
it, then all accesses must be synchronized.
That is just another serious limitation of the new C++.
That's because it's a limitation of real hardware.
Another misconception.
Not at all.
[snip]
In such context, the only portable way to access safely a memory word
is to use memory barriers.
Or use a mutex. In both cases, synchronisation takes place - which I
believe must be evident to everybody except perhaps Jon Harrop.
Generally a mutex internally uses memory barriers, except when they
are not needed on the target architecture (on those architectures, a
"naked" access to a memory word is probably safe as well).

Of course it might, but except for the implementors, there's really no
need to care.

I brought up this precision to show that memory barriers are
ultimately needed, because what an atomic access lacks is "memory
synchronization". If the implementation of a mutex on a specific
architecture doesn't need memory barriers, then a simple atomic
access probably doesn't need memory barriers either.

Alexandre Courpron.
 
C

courpron

That is not true in general.

The answer to this is contained in the part of my message you cut
(i.e.: "If the implementation of a mutex on a specific architecture
doesn't need memory barriers, then a simple atomic access probably
doesn't need memory barriers either.")

Alexandre Courpron.
 
C

courpron

Only if you add portability as a new requirement. That is extremely and
needlessly inefficient on platforms which support atomic pointer writes
though (which ARM, AMD and Intel all do). In practice, you would use
specialized code generation wherever possible because it will generate
vastly faster code. That is exactly what the JVM and CLR do, of course.
They are not going to put a lock around every pointer write if they can
possibly avoid it...

You are confusing the atomicity of an operation with the fact that
such an operation, **even if it is atomic**, still needs to be
*memory* synchronized, at least on SMP architectures, to overcome the
ordering and cache problems which I described earlier.
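
That distinction is easy to show in (what became) standard C++: with
memory_order_relaxed, the store below is perfectly atomic (no torn
write) yet performs no memory synchronization. A hedged sketch, with
illustrative names:

    #include <atomic>

    int data = 0;
    std::atomic<int*> ptr(nullptr);

    void writer() {
        data = 42;
        // Atomic: no thread can ever see a torn pointer value. But
        // relaxed means no barrier: another CPU may observe the
        // pointer before the write to data.
        ptr.store(&data, std::memory_order_relaxed);
        // What is actually needed is atomicity *plus* synchronization:
        //   ptr.store(&data, std::memory_order_release);
    }

    void reader() {
        int* p = ptr.load(std::memory_order_relaxed);
        if (p) {
            int v = *p;  // may read 0 (formally a data race): an atomic
            (void)v;     // access alone is not memory synchronization
        }
    }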

Alexandre Courpron.
 
J

James Kanze

In this context (aligned pointers) that is incorrect.

Not according to the processors' specifications. I've seen
actual errors due to it as well.
Note that .NET already runs on x64 and is not an "order of
magnitude slower" precisely because your assertions are
incorrect.

Note that you've almost certainly misunderstood the
specifications of .NET. I'd be very surprised if they were
significantly different than those of Java.
Another misconception.

So you know more about the Sparc architecture than the people
who actually designed it.
 
C

courpron

Within the confines of your assumptions I agree but, in general, that
is not true.

What exactly is wrong? That you get the ordering and caching problems
on most SMP architectures?

Alexandre Courpron.
 
C

courpron

You are confusing the atomicity of an operation with the fact that
such an operation, **even if it is atomic**, still needs to be
*memory* synchronized, at least on SMP architectures, to overcome the
ordering and cache problems which I described earlier.

[...] the distinction can be useful.

Do you mean the distinction between atomic operations and memory
synchronization?
You did not make it when you said, for example, about memory barriers:

Did you know that all of those CPUs, in an SMP configuration, have a
weak memory-ordering model?

Alexandre Courpron.
 
J

Jerry Coffin

[ ... ]
I'm still not too sure how your scheme works. In order to know
whether to free or not, you have to access the "bits" written by
the other threads, which means some sort of synchronization is
necessary.

Yes, there's still synchronization, but it's relatively low contention
(as a rule) -- in particular, each of the reading threads is competing
with only the one writing thread, whereas with a reference count, all
the readers compete with each other for access to the reference count.
And of course, this could increase memory use enormously, if,
say, you had a couple of thousand threads. (One of the servers
I currently work on regularly runs with several hundred threads,
and the number is growing.)

Yes, with a huge number of threads the memory use would clearly be
higher -- OTOH, machines capable of reasonably supporting thousands of
threads (and having them run often enough to care about an optimization
like this) are also likely to be able to spare the memory to support it.

[ ... ]
Most modern machines have the necessary instructions to support
a non-locking atomic increment or decrement. When the reference
count scheme is used, this can be used instead of the mutexes;
with reference counting and swapping pointers, I'm pretty sure
that I could get the whole thing to work on a modern (v9 or
later) Sparc without a single mutex. It would take a little bit
of assembler, however. Even with relaxed memory order.

The architecture can make the locking more or less invisible, but in the
end, you still have a read/modify/write cycle that has to be atomic --
which means (at the very least) locking that memory location for long
enough to read, modify and write the data.

Interestingly, the layout I'm talking about will generally work
correctly even on an architecture with virtually _no_ support for atomic
operations -- modifying one bit in a word will normally be atomic
regardless, simply because it's essentially impossible to spread a write
of a single bit over more than one clock cycle.
 
