a fast malloc/free implementation & benchmarks

Discussion in 'C Programming' started by orz, Mar 20, 2011.

    Scalable High Speed Heap Project

    =================================================
    1. What is the Scalable High Speed Heap Project?
    =================================================

    The Scalable High Speed Heap Project attempts to provide:

    A. an implementation of malloc/free (aka a memory manager, aka a heap)
    that performs well in all (or nearly all) cases, is available under an
    unrestrictive license, and does not suffer from any major flaws

    B. a set of benchmarks for memory managers that produces reproducible
    measures of performance for a wide variety of usage cases

    Note the word "attempts" there - this project still has a ways to go.
    See section #3 for more on what this project has already achieved, and
    where it is lacking at the moment.

    This project is written in C++, but the interface is exported as C so
    it should be easy to use from a variety of languages.
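
    As a quick illustration, a C program using such a heap looks just
    like one using malloc/free. The function names below (shsh_malloc,
    shsh_free) are placeholders of my own, not the project's actual
    exports - check the headers in the release for the real interface:

        /* Hypothetical drop-in usage; shsh_malloc/shsh_free are
           illustrative names, not the project's real exports. */
        #include <stdio.h>
        #include <stddef.h>
        #include <string.h>

        void *shsh_malloc(size_t size);
        void  shsh_free(void *ptr);

        int main(void)
        {
            char *buf = shsh_malloc(64);
            if (buf == NULL)
                return 1;
            strcpy(buf, "allocated from the replacement heap");
            puts(buf);
            shsh_free(buf);
            return 0;
        }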

    The most recent release of the Scalable High Speed Heap Project can be
    downloaded from sourceforge at:
    http://sourceforge.net/projects/shshp/files/

    =================================================
    2. Why is memory manager performance important?
    =================================================

    Dynamic memory allocation / deallocation happens very frequently in
    many programs. A large fraction of operations on strings and dynamic
    data structures (lists, trees, tables, etc) involve allocating or
    freeing heap memory. Because many programming languages and libraries
    go to considerable lengths to hide or automate various forms of
    complexity (much of which involves heap operations), you might not
    even be aware of the majority of heap operations occurring in your
    code.

    Given that, one might expect the authors of compilers, standard
    libraries, and OSes to put a great deal of effort into optimizing
    their memory managers. But some don't. And even those that do
    optimize their memory managers don't always optimize them for the
    right thing. A change in allocation patterns can produce a 100x
    difference in performance on many common memory managers.

    =================================================
    3. How far along is the Scalable High Speed Heap Project?
    =================================================

    The benchmarks produce reproducible measures of performance that show
    a nuanced picture of each heap's behavior. But they are missing many
    known problem cases, and quite possibly a huge number of not yet
    known problem cases. Their biggest single failing at the moment is
    that they measure only the performance of small allocation sizes -
    4 bytes to 259 bytes.

    The heap included achieves some significant goals already - it is
    dramatically faster than any other heap tested on these benchmarks,
    and it is available under an unrestrictive license. However, it still
    has numerous flaws, including:
    - too prone to memory fragmentation
    - not really portable yet
    - too slow for larger allocations
    - never gives memory back to the operating system
    - not fast enough in 8+ thread situations
    - may be prone to threads piling up behind mutexes during system
    calls
    - no support for performance monitoring & debugging helpers

    In the next few releases I plan to improve its memory fragmentation
    behavior dramatically, and to make it give unused memory back to the
    OS when possible. After that I plan to make it scale better to very
    large thread counts (8-256), and to do large allocations quickly in
    addition to small allocations. I also hope to get it portable enough
    to run on almost any Unix-like environment in addition to Windows.

    =================================================
    4. How do various memory managers perform on the SHSHP
    benchmarks?
    =================================================

    For the long answer, see section #5. The short answer is this grid
    here:

    BENCHMARK SUMMARY:
    (heap name)        1-thread  2-thread  4-thread  8-thread
    XP default heap         7.8      12.3       9.0       7.4
    FastMM4                17.0       1.0       0.1       0.2
    nedmalloc              27.1      23.5       0.2       0.5
    tbbmalloc              26.7      33.7      15.5       8.9
    ThreadHeap4            40.2      58.9      35.9      24.3
    results are in millions of malloc-free pairs per second, measured on a
    dual-core Core2 @ 3.0 GHz

    The memory managers benchmarked here are:

    XP default heap: Most programs end up using this when they run on XP.
    Performance is generally poor. Best-case scalability is poor, but
    because worst-case scalability is decent these benchmarks consider its
    overall scalability to be good. Performance on real-world multi-
    threaded programs shows some major quirks that can drive performance
    down drastically, but these benchmarks miss that issue somehow.

    FastMM4: This is the default memory manager for programs made with
    some Borland tools. Performance is decent for single-threaded code,
    but bad for multi-threaded code. There are some odd performance
    quirks; if you don't hit any of those, then overall performance can
    be decent.

    nedmalloc: I'm not sure what platforms (if any) use this heap by
    default. Performance is generally pretty good, but there are some
    quirks that drastically reduce performance on the multi-threaded
    benchmarks.

    tbbmalloc: This is the heap included in the Thread Building Blocks
    project by Intel. Performance is generally pretty good. This heap
    does not natively support large allocation sizes, but passes those
    through to another heap (FastMM4 in at least some cases).

    ThreadHeap4: This heap is of my own design. The most recent version
    of this was written specifically to maximize performance on these
    benchmarks and be released as part of the Scalable High Speed Heap
    Project. This heap does not natively support large allocation sizes,
    but passes those through to another heap (the platform native heap at
    the moment). While performance is excellent on this heap (on small
    memory allocations anyway), other aspects of it are sub-par at the
    moment, though that will hopefully be fixed in the next few
    releases.

    Memory managers not benchmarked here include:

    tcmalloc / perftools: This heap is (made? endorsed?) by Google, and
    appears to be reasonably fast.

    jemalloc: I haven't managed to find or build a Win32 DLL of this yet.
    It is supposedly fairly fast.

    lockless: This heap is sold commercially... I won't be paying to test
    it. It is supposedly quite fast.

    Hoard: This heap seems to be available as GPL (not LGPL) or sold
    commercially, neither of which allows me to include it in this project
    without imposing significant restrictions on this project.

    =================================================
    5. Can you break down the individual benchmark scores on various
    memory managers?
    =================================================

    Here are the benchmark descriptions and results:

    BENCHMARK DESCRIPTIONS:

    overall: The overall category assumes that an equal number of
    malloc-free pairs are done in each benchmark's pattern. For
    single-threaded tests only the simple, isolated, and swing benchmarks
    are used, but for multithreaded tests all benchmarks are used.
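
    Running an equal number of pairs in each pattern means total time is
    dominated by the slowest patterns, so the overall figure works out to
    the harmonic mean of the per-benchmark rates. That is an inference on
    my part, but it matches the grids below; a minimal sketch:

        /* Equal pair counts per pattern => combined rate is the harmonic
           mean of the per-pattern rates (total time ~ sum of 1/rate). */
        #include <stdio.h>

        static double harmonic_mean(const double *rates, int n)
        {
            double time = 0.0;
            for (int i = 0; i < n; i++)
                time += 1.0 / rates[i];   /* time per pair, up to a constant */
            return n / time;
        }

        int main(void)
        {
            /* ThreadHeap4, single-threaded: simple, isolated, swing */
            double rates[] = { 46.8, 37.8, 37.4 };
            printf("overall = %.1f\n", harmonic_mean(rates, 3)); /* 40.2 */
            return 0;
        }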

    simple: In this benchmark each thread allocates a memory block, then
    frees it, over and over again. Allocation sizes vary randomly from 4
    to 259 bytes (uniform distribution), though consecutive allocations
    tend to be the same size.
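
    A minimal sketch of this pattern (the rule for how long a size
    repeats is my own guess; here a new size is picked only occasionally):

        /* Sketch of the "simple" pattern: allocate, then immediately
           free. The size-repetition rule is an assumption. */
        #include <stdlib.h>

        void simple_pattern(long iterations)
        {
            size_t size = 4 + rand() % 256;       /* uniform in [4, 259] */
            for (long i = 0; i < iterations; i++) {
                if (rand() % 16 == 0)             /* occasionally pick a new size */
                    size = 4 + rand() % 256;
                void *p = malloc(size);
                free(p);
            }
        }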

    isolated: In this benchmark each thread allocates and frees memory
    blocks in random patterns. A typical allocation will be freed after
    about 5.4 thousand more allocations happen, though it can be as few
    as 2.7 thousand allocations later. Allocation sizes vary randomly
    from 4 to 259 bytes (uniform distribution), though consecutive
    allocations tend to be the same size.
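
    One simple way to get this delayed-free behavior is a ring of
    outstanding blocks; the sketch below uses a fixed delay where the
    real benchmark randomizes it:

        /* Sketch of the "isolated" pattern: keep a few thousand blocks
           live, freeing each one thousands of allocations after it was
           made. */
        #include <stdlib.h>

        #define SLOTS 5400   /* approximates the ~5.4k-allocation lifetime */

        void isolated_pattern(long iterations)
        {
            static void *slot[SLOTS];
            size_t size = 4 + rand() % 256;       /* uniform in [4, 259] */
            for (long i = 0; i < iterations; i++) {
                if (rand() % 16 == 0)
                    size = 4 + rand() % 256;
                long j = i % SLOTS;               /* oldest outstanding block */
                free(slot[j]);                    /* free(NULL) is a no-op */
                slot[j] = malloc(size);
            }
            for (long j = 0; j < SLOTS; j++) {
                free(slot[j]);
                slot[j] = NULL;
            }
        }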

    swing: This is similar to the isolated benchmark, but the delay
    before each allocation is freed is longer, causing larger total
    amounts of memory to be allocated at once. Furthermore, once in a
    while this benchmark deallocates all allocations it has made so far
    and starts over, breaking the normal pattern. The overall effect is
    that memory allocated by any given thread tends to climb rapidly then
    fall rapidly, over and over again. Allocation sizes vary randomly
    from 4 to 259 bytes (uniform distribution), though consecutive
    allocations tend to be the same size.
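
    A sketch of the climb-and-fall shape this produces (block counts are
    illustrative, and the delayed frees of the isolated pattern are
    omitted for brevity):

        /* Sketch of the "swing" pattern: live memory climbs as blocks
           are held, then occasionally everything is freed at once. */
        #include <stdlib.h>

        #define HELD 50000   /* illustrative; more than "isolated" holds */

        void swing_pattern(long rounds)
        {
            void **held = malloc(HELD * sizeof *held);
            if (held == NULL)
                return;
            for (long r = 0; r < rounds; r++) {
                size_t size = 4 + rand() % 256;   /* uniform in [4, 259] */
                for (long i = 0; i < HELD; i++) {
                    if (rand() % 16 == 0)
                        size = 4 + rand() % 256;
                    held[i] = malloc(size);       /* live memory climbs... */
                }
                for (long i = 0; i < HELD; i++)
                    free(held[i]);                /* ...then falls all at once */
            }
            free(held);
        }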

    unidirectional: This test divides the threads up into pairs, where
    each pair has one thread allocate memory as fast as it can while the
    other thread deallocates it as fast as it can. The allocated memory
    is passed from the producing thread to the consuming thread by way of
    a lockless queue. This benchmark tends to produce more contention
    than any other benchmark, as it maximizes the amount of inter-thread
    communication required from the memory manager. Furthermore, some
    memory managers force each thread to different regions of memory in
    order to minimize cache line sharing (aka false sharing), a process
    which this benchmark can make more complicated. This benchmark can
    also penalize heaps that take different amounts of time to perform an
    allocation than a deallocation. Allocation sizes vary randomly from
    4 to 259 bytes (uniform distribution), though consecutive allocations
    tend to be the same size.
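
    A sketch of the producer/consumer handoff. The post doesn't include
    the benchmark's actual queue, so this stand-in uses a minimal
    single-producer single-consumer ring built on C11 atomics (pthreads
    here for the sketch; the measured benchmarks run on Windows):

        /* Sketch of the "unidirectional" pattern: one thread mallocs,
           its partner frees, blocks passing through a lockless
           single-producer single-consumer ring. */
        #include <stdatomic.h>
        #include <stdlib.h>
        #include <pthread.h>

        #define RING  1024                /* ring capacity (power of two) */
        #define PAIRS 1000000L            /* malloc-free pairs to run */

        static void *ring[RING];
        static atomic_long head, tail;    /* producer bumps head, consumer tail */

        static void *producer(void *arg)
        {
            (void)arg;
            for (long i = 0; i < PAIRS; i++) {
                void *p = malloc(4 + rand() % 256);   /* uniform in [4, 259] */
                while (atomic_load(&head) - atomic_load(&tail) == RING)
                    ;                                 /* ring full: spin */
                ring[i % RING] = p;
                atomic_fetch_add(&head, 1);           /* publish to consumer */
            }
            return NULL;
        }

        static void *consumer(void *arg)
        {
            (void)arg;
            for (long i = 0; i < PAIRS; i++) {
                while (atomic_load(&tail) == atomic_load(&head))
                    ;                                 /* ring empty: spin */
                free(ring[i % RING]);                 /* freed on the other thread */
                atomic_fetch_add(&tail, 1);
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t p, c;
            pthread_create(&p, NULL, producer, NULL);
            pthread_create(&c, NULL, consumer, NULL);
            pthread_join(p, NULL);
            pthread_join(c, NULL);
            return 0;
        }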

    bidirectional: This test divides the threads up into pairs. Each
    thread allocates memory, sends it to the other half of its pair, and
    deallocates the memory it receives from the other half of its pair.
    The allocated memory is passed from the producing thread to the
    consuming thread by way of a lockless queue. Some memory managers
    force each thread to different regions of memory in order to minimize
    cache line sharing (aka false sharing), a process which this
    benchmark can make more complicated. Allocation sizes vary randomly
    from 4 to 259 bytes (uniform distribution), though consecutive
    allocations tend to be the same size.

    chain: In this test, each thread does a small number of allocations
    and deallocations in a pattern broadly similar to the "isolated"
    benchmark. Then it creates a new thread and terminates itself. The
    new thread does the same thing, ad nauseam. This tends to stress
    heaps that spend a long time constructing or destructing per-thread
    data structures.
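
    A sketch of the thread handoff (pthreads shown here, and the chain
    length and per-thread workload are illustrative):

        /* Sketch of the "chain" pattern: each thread does a little heap
           work, spawns its successor, and exits, so the memory manager
           pays its per-thread setup/teardown costs over and over. */
        #include <stdint.h>
        #include <stdlib.h>
        #include <pthread.h>
        #include <semaphore.h>

        static sem_t chain_done;

        static void *link_thread(void *arg)
        {
            intptr_t remaining = (intptr_t)arg;
            void *blocks[64];

            /* Small burst of heap work, standing in for the
               "isolated"-like pattern described above. */
            for (int i = 0; i < 64; i++)
                blocks[i] = malloc(4 + rand() % 256);
            for (int i = 0; i < 64; i++)
                free(blocks[i]);

            if (remaining > 0) {          /* hand off to a fresh thread */
                pthread_t next;
                pthread_create(&next, NULL, link_thread,
                               (void *)(remaining - 1));
                pthread_detach(next);
            } else {
                sem_post(&chain_done);    /* last link: wake main */
            }
            return NULL;
        }

        int main(void)
        {
            sem_init(&chain_done, 0, 0);
            pthread_t first;
            pthread_create(&first, NULL, link_thread, (void *)(intptr_t)1000);
            pthread_detach(first);
            sem_wait(&chain_done);        /* wait for the whole chain */
            return 0;
        }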

    BENCHMARK RESULTS:
    results are in millions of malloc-free pairs per second, measured on a
    dual-core Core2 @ 3.0 GHz

    SINGLE-THREADED TESTS:
    (heap name)        simple  isol.  swing  overall
    platform default     10.2    9.4    5.5      7.8
    FastMM4              12.6   20.3   21.1     17.0
    nedmalloc            30.9   26.9   24.4     27.1
    tbbmalloc            35.4   22.8   24.7     26.7
    ThreadHeap4          46.8   37.8   37.4     40.2

    TWO-THREAD TESTS:
    (heap name)        simple  isol.  swing  unidir  bidir  chain  overall
    platform default     10.6   16.2    8.1    15.2   16.5   12.4     12.3
    FastMM4              11.5   21.0   21.4     0.4    0.3   13.6      1.0
    nedmalloc            48.4   52.0   47.8    10.4   42.2   13.3     23.5
    tbbmalloc            62.4   44.0   47.2    22.9   30.0   24.2     33.7
    ThreadHeap4          94.0   76.0   69.1    57.4   65.5   32.4     58.9

    FOUR-THREAD TESTS:
    (heap name)        simple  isol.  swing  unidir  bidir  chain  overall
    platform default     10.8   16.2    8.0     8.2    5.4   12.8      9.0
    FastMM4              23.8   22.1   21.7     0.0    0.0   14.1      0.1
    nedmalloc            61.6   53.0   48.0     0.0    2.6   13.0      0.2
    tbbmalloc            60.5   42.0   46.9     6.9    7.2   24.6     15.5
    ThreadHeap4          93.9   76.5   69.6    38.0   10.0   33.9     30.9

    EIGHT-THREAD TESTS:
    (heap name)        simple  isol.  swing  unidir  bidir  chain  overall
    platform default     10.6   16.2    8.0     3.7    5.6   12.6      7.4
    FastMM4              25.6   23.3   22.4     0.0    0.1   14.8      0.2
    nedmalloc            54.0   51.6   47.0     0.1    2.7   12.7      0.5
    tbbmalloc            62.7   42.0   46.5     2.7    4.9   24.5      8.9
    ThreadHeap4          93.4   75.3   69.6     8.0   18.5   32.7     24.3

    Each number in the grid represents the performance of a memory manager
    on a benchmark, in millions of malloc-free pairs per second. Each row
    corresponds to a memory manager (an implementation of malloc/free aka
    new/delete) while each column corresponds to a specific benchmark for
    measuring heap performance.
    orz, Mar 20, 2011