performance of freestore management

yonil

Over the years of using C++ I've noticed that freestore management
functions (malloc/free) become performance bottlenecks in complex
object-oriented libraries. This is usually because these functions
acquire a mutex lock on the heap. Since the software I'm writing is
targeted at a number of embedded platforms as well as the PC, it's
somewhat difficult to use anything but the standard implementation
that ships with the compiler.

I've noticed that the performance of even STL classes such as
std::list is bottlenecked by allocators. In VS.NET 2005 in particular,
std::allocator uses malloc even for small allocations, which
apparently has an overhead of over 30 bytes per block (in release
mode), so list<int> takes about 40 bytes per node. Ouch. Aside from
the mutex locking mentioned previously, wasting memory in this manner
also reduces performance because the cache is used much less
efficiently. RAM access is usually the next worst bottleneck, because
the gap between CPU power and memory bandwidth only seems to widen
over time.
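
If you want to see the per-node cost for yourself, a quick way is a
logging allocator that prints what std::list actually requests. A
minimal sketch (C++11 minimal-allocator style; VS2005 would need the
full C++03 boilerplate with rebind, and LoggingAlloc is my own name):

#include <cstdio>
#include <cstdlib>
#include <list>

// Logs every request the container makes before forwarding to malloc.
template <class T>
struct LoggingAlloc {
    typedef T value_type;
    LoggingAlloc() {}
    template <class U> LoggingAlloc(const LoggingAlloc<U>&) {}

    T* allocate(std::size_t n) {
        std::printf("allocate %u bytes\n",
                    static_cast<unsigned>(n * sizeof(T)));
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <class T, class U>
bool operator==(const LoggingAlloc<T>&, const LoggingAlloc<U>&) { return true; }
template <class T, class U>
bool operator!=(const LoggingAlloc<T>&, const LoggingAlloc<U>&) { return false; }

int main() {
    std::list<int, LoggingAlloc<int> > l;
    for (int i = 0; i < 3; ++i) l.push_back(i);  // one node-sized request each
    // Each printed size is the int payload plus two list pointers (plus
    // padding); whatever malloc adds per block on top is pure overhead.
}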

Since I'm actively attempting to apply useful design techniques such
as RAII, my code is increasingly rife with allocations of
resource-management objects. shared_ptrs also double the number of
freestore allocations, since each one additionally allocates a
reference counter.
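
One way to quantify (and halve) that cost is to count global operator
new calls. A sketch, with the caveat that make_shared postdates this
discussion (it arrived in Boost 1.36 and C++11), so treat it as a
forward-looking fix:

#include <cstdio>
#include <cstdlib>
#include <memory>
#include <new>

static int g_news = 0;

// Replace global operator new/delete to count freestore allocations.
void* operator new(std::size_t n) {
    ++g_news;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }

struct Widget { int x; };

int main() {
    g_news = 0;
    std::shared_ptr<Widget> a(new Widget);                  // object + counter
    std::printf("plain constructor: %d allocations\n", g_news);  // usually 2

    g_news = 0;
    std::shared_ptr<Widget> b = std::make_shared<Widget>(); // one fused block
    std::printf("make_shared:       %d allocations\n", g_news);  // usually 1
}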

I've found some remedy for these problems in
boost::fast_pool_allocator, in particular when it's instantiated with
null_mutex. This can virtually eliminate the above-mentioned problems
with STL containers and shared_ptr, but in a multithreaded program you
sometimes still can't get rid of the lock. Looking to the future, it
seems smart to design high-performance apps around multithreaded
architectures, given the developments in multi-core hardware. That
makes it ever more important to keep your program as lock-free as you
can.
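
For reference, this is roughly how I instantiate it for the
single-threaded case (a sketch; null_mutex sits in a Boost detail
namespace, so the exact path may vary between Boost versions - check
your copy of pool_alloc.hpp):

#include <boost/pool/pool_alloc.hpp>
#include <list>

// Fixed-size pool allocator with the locking compiled out entirely.
typedef boost::fast_pool_allocator<
    int,
    boost::default_user_allocator_new_delete,
    boost::details::pool::null_mutex> IntPoolAlloc;

int main() {
    std::list<int, IntPoolAlloc> l;   // nodes come from a per-size pool
    for (int i = 0; i < 1000; ++i)
        l.push_back(i);               // no heap lock, tiny per-node overhead
}   // freed nodes stay in the singleton pool for reuse until process exit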

Ironically, it seems it only gets harder to write high-performance
apps these days. I miss the times when you could focus on your
inner-loop calculations and lookup tables actually made things faster
- running time was much more closely related to the academic notion of
complexity than it is nowadays.

Maybe I'm just a little too concerned over these issues, I don't know.
 
peter koch

yonil said:
> Over the years of using C++ I've noticed that freestore management
> functions (malloc/free) become performance bottlenecks in complex
> object-oriented libraries. This is usually because these functions
> acquire a mutex lock on the heap.

I assume you have profiling data to back up that statement? This also
depends on the implementation, of course.

> Since the software I'm writing is targeted at a number of embedded
> platforms as well as the PC, it's somewhat difficult to use anything
> but the standard implementation that ships with the compiler.

Not at all. You are allowed to use your own allocator: write your
own, find a free one, or buy one. There are several out there.
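
For instance, a fixed-size free list already removes most of malloc's
per-block bookkeeping. A toy sketch of the idea (my own code, C++11
minimal-allocator style, not thread-safe and it never shrinks):

#include <cstdlib>
#include <list>

// Recycles single-object blocks on a free list instead of returning
// them to malloc. Larger requests fall through to malloc unchanged.
template <class T>
class FreeListAlloc {
    union Node { Node* next; char storage[sizeof(T)]; };
    static Node*& head() { static Node* h = 0; return h; }
public:
    typedef T value_type;
    FreeListAlloc() {}
    template <class U> FreeListAlloc(const FreeListAlloc<U>&) {}

    T* allocate(std::size_t n) {
        if (n != 1)
            return static_cast<T*>(std::malloc(n * sizeof(T)));
        if (Node* p = head()) {              // reuse a recycled block
            head() = p->next;
            return reinterpret_cast<T*>(p);
        }
        // sizeof(Node) >= sizeof(T), so one Node always holds one T
        return static_cast<T*>(std::malloc(sizeof(Node)));
    }
    void deallocate(T* p, std::size_t n) {
        if (n == 1) {                        // push onto the free list
            Node* node = reinterpret_cast<Node*>(p);
            node->next = head();
            head() = node;
        } else {
            std::free(p);
        }
    }
};
template <class T, class U>
bool operator==(const FreeListAlloc<T>&, const FreeListAlloc<U>&) { return true; }
template <class T, class U>
bool operator!=(const FreeListAlloc<T>&, const FreeListAlloc<U>&) { return false; }

int main() {
    std::list<int, FreeListAlloc<int> > l;   // node allocs hit the free list
    for (int i = 0; i < 100; ++i) l.push_back(i);
}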

> I've noticed that the performance of even STL classes such as
> std::list is bottlenecked by allocators.

This comes as a surprise to me. The first surprise is that you need
std::list at all - this class is rarely the right choice. The second
surprise is that the bottleneck is the allocator.

> In VS.NET 2005 in particular, std::allocator uses malloc even for
> small allocations, which apparently has an overhead of over 30 bytes
> per block (in release mode), so list<int> takes about 40 bytes per
> node. Ouch. Aside from the mutex locking mentioned previously,
> wasting memory in this manner also reduces performance because the
> cache is used much less efficiently. RAM access is usually the next
> worst bottleneck, because the gap between CPU power and memory
> bandwidth only seems to widen over time.

You should ask in a Microsoft group to get a quality response to the
questions posed above.

> Since I'm actively attempting to apply useful design techniques such
> as RAII, my code is increasingly rife with allocations of
> resource-management objects. shared_ptrs also double the number of
> freestore allocations, since each one additionally allocates a
> reference counter.
>
> I've found some remedy for these problems in
> boost::fast_pool_allocator, in particular when it's instantiated with
> null_mutex. This can virtually eliminate the above-mentioned problems
> with STL containers and shared_ptr, but in a multithreaded program
> you sometimes still can't get rid of the lock.

You could get rid of some locks by using lock-free structures where
appropriate. Apart from that, it is wisest to avoid intensive data
sharing when multithreading - but that is again off-topic.
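
One pattern that avoids the shared lock entirely is to give each
thread its own pool. A sketch using boost::pool (my example; the
caveat is that every block must be freed by the thread that allocated
it):

#include <boost/pool/pool.hpp>

struct Message { int id; double payload; };

// The pool lives on the worker's own stack, so no other thread ever
// touches it and no mutex is needed at all.
void worker_thread() {
    boost::pool<> local_pool(sizeof(Message));   // fixed-size chunks

    for (int i = 0; i < 10000; ++i) {
        Message* m = static_cast<Message*>(local_pool.malloc());
        if (!m) return;                          // out of memory
        m->id = i;
        m->payload = i * 0.5;
        local_pool.free(m);                      // back to this thread's pool
    }
}   // the pool destructor releases all its memory in one shot

int main() { worker_thread(); }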

/Peter
 
Phlip

yonil said:
> Since I'm actively attempting to apply useful design techniques such
> as RAII, my code is increasingly rife with allocations of
> resource-management objects. shared_ptrs also double the number of
> freestore allocations, since each one additionally allocates a
> reference counter.

Why not put objects on the stack, or inside other objects?

> Maybe I'm just a little too concerned over these issues, I don't know.

Did you profile your application and find allocation is a bottleneck?
 
yonil

Phlip said:
> Why not put objects on the stack, or inside other objects?

I do it whenever possible, though sometimes it just can't be helped
and I must use the heap. These days, whenever I'm concerned with
performance, I try to think of ways to minimize heap usage.

> Did you profile your application and find allocation is a bottleneck?

I've actually had trouble profiling my C++ app because it has
multiple threads, and the collector only tracks how much wall-clock
time is spent inside each function - not how much CPU time was
actually consumed running it. If you have a thread that waits on a
semaphore 99% of the time, any function on the call stack during the
wait appears extremely time-consuming to the profiler, which skews the
statistics. Context switching may create arbitrary peaks in the time
apparently spent inside all sorts of unimportant functions, and the
exact behaviour depends on the synchronization state of the program.
It's quite difficult for me to give exact numbers for how much CPU
time is wasted purely because of locks, or because of
allocation-related locks. I do see a lot of CPU % going to ntdll, and
that usually indicates a synchronization bottleneck.
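
One workaround is to ask Windows directly how much CPU time a stretch
of code consumed, via GetThreadTimes. A rough sketch (my helper; note
the counters advance only at scheduler ticks, roughly every 10-15 ms,
so it only makes sense for longer regions):

#include <windows.h>
#include <cstdio>

// CPU time (user + kernel) consumed so far by the calling thread, in
// 100-nanosecond units. Time spent blocked on a semaphore is excluded,
// unlike the wall-clock numbers the profiler reports.
static unsigned __int64 thread_cpu_time()
{
    FILETIME creation, exited, kernel, user;
    GetThreadTimes(GetCurrentThread(), &creation, &exited, &kernel, &user);
    ULARGE_INTEGER k, u;
    k.LowPart = kernel.dwLowDateTime;  k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;    u.HighPart = user.dwHighDateTime;
    return k.QuadPart + u.QuadPart;
}

int main()
{
    unsigned __int64 before = thread_cpu_time();
    // ... code under test, e.g. a burst of allocations ...
    unsigned __int64 after = thread_cpu_time();
    std::printf("CPU time: %I64u x 100ns\n", after - before);
}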

So yeah, much of what I said is speculation on my part. I think
perhaps the real problem is that I don't have a good profiling tool
for multithreaded apps :)
 
