a fast malloc/free implementation & benchmarks

Discussion in 'C Programming' started by orz, Mar 20, 2011.

    Scalable High Speed Heap Project

    =================================================
    1. What is the Scalable High Speed Heap Project?
    =================================================

    The Scalable High Speed Heap Project attempts to provide:

    A. an implementation of malloc/free (aka a memory manager, aka a heap)
    that performs well in all (or nearly all) cases, is available under an
    unrestrictive license, and does not suffer from any major flaws

    B. a set of benchmarks for memory managers that produces reproducible
    measures of performance for a wide variety of usage cases

    Note the word "attempts" there - this project still has a ways to go.
    See section #3 for more on what this project has already achieved, and
    where it is lacking at the moment.

    This project is written in C++, but the interface is exported as C so
    it should be easy to use from a variety of languages.
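
    As a quick illustration, a C program using such a heap looks just
    like one using malloc/free. The function names below (shsh_malloc,
    shsh_free) are placeholders of my own, not the project's actual
    exports - check the headers in the release for the real interface:

        /* Hypothetical drop-in usage; shsh_malloc/shsh_free are
           illustrative names, not the project's real exports. */
        #include <stdio.h>
        #include <stddef.h>
        #include <string.h>

        void *shsh_malloc(size_t size);
        void  shsh_free(void *ptr);

        int main(void)
        {
            char *buf = shsh_malloc(64);
            if (buf == NULL)
                return 1;
            strcpy(buf, "allocated from the replacement heap");
            puts(buf);
            shsh_free(buf);
            return 0;
        }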

    The most recent release of the Scalable High Speed Heap Project can be
    downloaded from sourceforge at:
    http://sourceforge.net/projects/shshp/files/

    =================================================
    2. Why is memory manager performance important?
    =================================================

    Dynamic memory allocation / deallocation happens very frequently in
    many programs. A large fraction of operations on strings and dynamic
    data structures (lists, trees, tables, etc) involve allocating or
    freeing heap memory. Because many programming languages and libraries
    go to considerable lengths to hide or automate various forms of
    complexity (much of which involves heap operations), you might not
    even be aware of the majority of heap operations occurring in your
    code.

    Given that, one might expect the authors of compilers, standard
    libraries, and OSes to put a great deal of effort into optimizing
    their memory managers. But some don't. And even those that do
    optimize their memory managers don't always optimize them for the
    right thing. A change in allocation patterns can produce a 100x
    difference in performance on many common memory managers.

    =================================================
    3. How far along is the Scalable High Speed Heap Project?
    =================================================

    The benchmarks produce reproducible measures of performance that show
    a nuanced picture of each heap's behavior. But they are missing many
    known problem cases, and quite possibly a huge number of not yet
    known problem cases. Their biggest single failing at the moment is
    that they measure only the performance of small allocation sizes -
    4 bytes to 259 bytes.

    The heap included achieves some significant goals already - it is
    dramatically faster than any other heap tested on these benchmarks,
    and it is available under an unrestrictive license. However, it still
    has numerous flaws, including:
    - too prone to memory fragmentation
    - not really portable yet
    - too slow for larger allocations
    - never gives memory back to the operating system
    - not fast enough in 8+ thread situations
    - may be prone to threads piling up behind mutexes during system
    calls
    - no support for performance monitoring & debugging helpers

    In the next few releases I plan to improve its memory fragmentation
    behavior dramatically, and to make it give unused memory back to the
    OS when possible. After that I plan to make it scale better to very
    large thread counts (8-256), and to do large allocations quickly in
    addition to small allocations. I also hope to get it portable enough
    to run on almost any Unix-like environment in addition to Windows.

    =================================================
    4. How do various memory managers perform on the SHSHP
    benchmarks?
    =================================================

    For the long answer, see section #5. The short answer is this grid
    here:

    BENCHMARK SUMMARY:
    (heap name)        1-thread  2-thread  4-thread  8-thread
    XP default heap         7.8      12.3       9.0       7.4
    FastMM4                17.0       1.0       0.1       0.2
    nedmalloc              27.1      23.5       0.2       0.5
    tbbmalloc              26.7      33.7      15.5       8.9
    ThreadHeap4            40.2      58.9      35.9      24.3
    results are in millions of malloc-free pairs per second, measured on a
    dual-core Core2 @ 3.0 GHz

    The memory managers benchmarked here are:

    XP default heap: Most programs end up using this when they run on XP.
    Performance is generally poor. Best-case scalability is poor, but
    because worst-case scalability is decent these benchmarks consider its
    overall scalability to be good. Performance on real-world multi-
    threaded programs shows some major quirks that can drive performance
    down drastically, but these benchmarks miss that issue somehow.

    FastMM4: This is the default memory manager for programs made with
    some Borland tools. Performance is decent for single-threaded code,
    but bad for multi-threaded code. There are some odd performance
    quirks; if you don't hit any of those, then overall performance can
    be decent.

    nedmalloc: I'm not sure what platforms (if any) use this heap by
    default. Performance is generally pretty good, but there are some
    quirks that drastically reduce performance on the multi-threaded
    benchmarks.

    tbbmalloc: This is the heap included in the Thread Building Blocks
    project by Intel. Performance is generally pretty good. This heap
    does not natively support large allocation sizes, but passes those
    through to another heap (FastMM4 in at least some cases).

    ThreadHeap4: This heap is of my own design. The most recent version
    of this was written specifically to maximize performance on these
    benchmarks and be released as part of the Scalable High Speed Heap
    Project. This heap does not natively support large allocation sizes,
    but passes those through to another heap (the platform native heap at
    the moment). While performance is excellent on this heap (on small
    memory allocations anyway), other aspects of it are sub-par at the
    moment, though that will hopefully be fixed in the next few
    releases.

    Memory managers not benchmarked here include:

    tcmalloc / perftools: This heap is (made? endorsed?) by Google, and
    appears to be reasonably fast.

    jemalloc: I haven't managed to find or build a Win32 DLL of this yet.
    It is supposedly fairly fast.

    lockless: This heap is sold commercially... I won't be paying to test
    it. It is supposedly quite fast.

    Hoard: This heap seems to be available as GPL (not LGPL) or sold
    commercially, neither of which allows me to include it in this project
    without imposing significant restrictions on this project.

    =================================================
    5. Can you break down the individual benchmark scores on various
    memory managers?
    =================================================

    Here are the benchmark descriptions and results:

    BENCHMARK DESCRIPTIONS:

    overall: The overall category assumes that an equal number of
    malloc-free pairs are done in each benchmark's pattern. For
    single-threaded tests only the simple, isolated, and swing benchmarks
    are used, but for multithreaded tests all benchmarks are used.
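
    Running an equal number of pairs in each pattern means total time is
    dominated by the slowest patterns, so the overall figure works out to
    the harmonic mean of the per-benchmark rates. That is an inference on
    my part, but it matches the grids below; a minimal sketch:

        /* Equal pair counts per pattern => combined rate is the harmonic
           mean of the per-pattern rates (total time ~ sum of 1/rate). */
        #include <stdio.h>

        static double harmonic_mean(const double *rates, int n)
        {
            double time = 0.0;
            for (int i = 0; i < n; i++)
                time += 1.0 / rates[i];   /* time per pair, up to a constant */
            return n / time;
        }

        int main(void)
        {
            /* ThreadHeap4, single-threaded: simple, isolated, swing */
            double rates[] = { 46.8, 37.8, 37.4 };
            printf("overall = %.1f\n", harmonic_mean(rates, 3)); /* 40.2 */
            return 0;
        }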

    simple: In this benchmark each thread allocates a memory block, then
    frees it, over and over again. Allocation sizes vary randomly from 4
    to 259 bytes (uniform distribution), though consecutive allocations
    tend to be the same size.
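
    A minimal sketch of this pattern (the rule for how long a size
    repeats is my own guess; here a new size is picked only occasionally):

        /* Sketch of the "simple" pattern: allocate, then immediately
           free. The size-repetition rule is an assumption. */
        #include <stdlib.h>

        void simple_pattern(long iterations)
        {
            size_t size = 4 + rand() % 256;       /* uniform in [4, 259] */
            for (long i = 0; i < iterations; i++) {
                if (rand() % 16 == 0)             /* occasionally pick a new size */
                    size = 4 + rand() % 256;
                void *p = malloc(size);
                free(p);
            }
        }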

    isolated: In this benchmark each thread allocates and frees memory
    blocks in random patterns. A typical allocation will be freed after
    about 5.4 thousand more allocations happen, though it can be as few
    as 2.7 thousand allocations later. Allocation sizes vary randomly
    from 4 to 259 bytes (uniform distribution), though consecutive
    allocations tend to be the same size.
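
    One simple way to get this delayed-free behavior is a ring of
    outstanding blocks; the sketch below uses a fixed delay where the
    real benchmark randomizes it:

        /* Sketch of the "isolated" pattern: keep a few thousand blocks
           live, freeing each one thousands of allocations after it was
           made. */
        #include <stdlib.h>

        #define SLOTS 5400   /* approximates the ~5.4k-allocation lifetime */

        void isolated_pattern(long iterations)
        {
            static void *slot[SLOTS];
            size_t size = 4 + rand() % 256;       /* uniform in [4, 259] */
            for (long i = 0; i < iterations; i++) {
                if (rand() % 16 == 0)
                    size = 4 + rand() % 256;
                long j = i % SLOTS;               /* oldest outstanding block */
                free(slot[j]);                    /* free(NULL) is a no-op */
                slot[j] = malloc(size);
            }
            for (long j = 0; j < SLOTS; j++) {
                free(slot[j]);
                slot[j] = NULL;
            }
        }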

    swing: This is similar to the isolated benchmark, but the delay
    before each allocation is freed is longer, causing larger total
    amounts of memory to be allocated at once. Furthermore, once in a
    while this benchmark deallocates all allocations it has made so far
    and starts over, breaking the normal pattern. The overall effect is
    that memory allocated by any given thread tends to climb rapidly then
    fall rapidly, over and over again. Allocation sizes vary randomly
    from 4 to 259 bytes (uniform distribution), though consecutive
    allocations tend to be the same size.
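
    A sketch of the climb-and-fall shape this produces (block counts are
    illustrative, and the delayed frees of the isolated pattern are
    omitted for brevity):

        /* Sketch of the "swing" pattern: live memory climbs as blocks
           are held, then occasionally everything is freed at once. */
        #include <stdlib.h>

        #define HELD 50000   /* illustrative; more than "isolated" holds */

        void swing_pattern(long rounds)
        {
            void **held = malloc(HELD * sizeof *held);
            if (held == NULL)
                return;
            for (long r = 0; r < rounds; r++) {
                size_t size = 4 + rand() % 256;   /* uniform in [4, 259] */
                for (long i = 0; i < HELD; i++) {
                    if (rand() % 16 == 0)
                        size = 4 + rand() % 256;
                    held[i] = malloc(size);       /* live memory climbs... */
                }
                for (long i = 0; i < HELD; i++)
                    free(held[i]);                /* ...then falls all at once */
            }
            free(held);
        }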

    unidirectional: This test divides the threads up into pairs, where
    each pair has one thread allocate memory as fast as it can while the
    other thread deallocates it as fast as it can. The allocated memory
    is passed from the producing thread to the consuming thread by way of
    a lockless queue. This benchmark tends to produce more contention
    than any other benchmark, as it maximizes the amount of inter-thread
    communication required from the memory manager. Furthermore, some
    memory managers force each thread to different regions of memory in
    order to minimize cache line sharing (aka false sharing), a process
    which this benchmark can make more complicated. This benchmark can
    also penalize heaps that take different amounts of time to perform an
    allocation than a deallocation. Allocation sizes vary randomly from
    4 to 259 bytes (uniform distribution), though consecutive allocations
    tend to be the same size.
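
    A sketch of the producer/consumer handoff. The post doesn't include
    the benchmark's actual queue, so this stand-in uses a minimal
    single-producer single-consumer ring built on C11 atomics (pthreads
    here for the sketch; the measured benchmarks run on Windows):

        /* Sketch of the "unidirectional" pattern: one thread mallocs,
           its partner frees, blocks passing through a lockless
           single-producer single-consumer ring. */
        #include <stdatomic.h>
        #include <stdlib.h>
        #include <pthread.h>

        #define RING  1024                /* ring capacity (power of two) */
        #define PAIRS 1000000L            /* malloc-free pairs to run */

        static void *ring[RING];
        static atomic_long head, tail;    /* producer bumps head, consumer tail */

        static void *producer(void *arg)
        {
            (void)arg;
            for (long i = 0; i < PAIRS; i++) {
                void *p = malloc(4 + rand() % 256);   /* uniform in [4, 259] */
                while (atomic_load(&head) - atomic_load(&tail) == RING)
                    ;                                 /* ring full: spin */
                ring[i % RING] = p;
                atomic_fetch_add(&head, 1);           /* publish to consumer */
            }
            return NULL;
        }

        static void *consumer(void *arg)
        {
            (void)arg;
            for (long i = 0; i < PAIRS; i++) {
                while (atomic_load(&tail) == atomic_load(&head))
                    ;                                 /* ring empty: spin */
                free(ring[i % RING]);                 /* freed on the other thread */
                atomic_fetch_add(&tail, 1);
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t p, c;
            pthread_create(&p, NULL, producer, NULL);
            pthread_create(&c, NULL, consumer, NULL);
            pthread_join(p, NULL);
            pthread_join(c, NULL);
            return 0;
        }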

    bidirectional: This test divides the threads up into pairs. Each
    thread allocates memory, sends it to the other half of its pair, and
    deallocates the memory it receives from the other half of its pair.
    The allocated memory is passed from the producing thread to the
    consuming thread by way of a lockless queue. Some memory managers
    force each thread to different regions of memory in order to minimize
    cache line sharing (aka false sharing), a process which this
    benchmark can make more complicated. Allocation sizes vary randomly
    from 4 to 259 bytes (uniform distribution), though consecutive
    allocations tend to be the same size.

    chain: In this test, each thread does a small number of allocations
    and deallocations in a pattern broadly similar to the "isolated"
    benchmark. Then it creates a new thread and terminates itself. The
    new thread does the same thing, ad nauseam. This tends to stress
    heaps that spend a long time constructing or destructing per-thread
    data structures.
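
    A sketch of the thread handoff (pthreads shown here, and the chain
    length and per-thread workload are illustrative):

        /* Sketch of the "chain" pattern: each thread does a little heap
           work, spawns its successor, and exits, so the memory manager
           pays its per-thread setup/teardown costs over and over. */
        #include <stdint.h>
        #include <stdlib.h>
        #include <pthread.h>
        #include <semaphore.h>

        static sem_t chain_done;

        static void *link_thread(void *arg)
        {
            intptr_t remaining = (intptr_t)arg;
            void *blocks[64];

            /* Small burst of heap work, standing in for the
               "isolated"-like pattern described above. */
            for (int i = 0; i < 64; i++)
                blocks[i] = malloc(4 + rand() % 256);
            for (int i = 0; i < 64; i++)
                free(blocks[i]);

            if (remaining > 0) {          /* hand off to a fresh thread */
                pthread_t next;
                pthread_create(&next, NULL, link_thread,
                               (void *)(remaining - 1));
                pthread_detach(next);
            } else {
                sem_post(&chain_done);    /* last link: wake main */
            }
            return NULL;
        }

        int main(void)
        {
            sem_init(&chain_done, 0, 0);
            pthread_t first;
            pthread_create(&first, NULL, link_thread, (void *)(intptr_t)1000);
            pthread_detach(first);
            sem_wait(&chain_done);        /* wait for the whole chain */
            return 0;
        }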

    BENCHMARK RESULTS:
    results are in millions of malloc-free pairs per second, measured on a
    dual-core Core2 @ 3.0 GHz

    SINGLE-THREADED TESTS:
    (heap name)        simple  isol.  swing  overall
    platform default     10.2    9.4    5.5      7.8
    FastMM4              12.6   20.3   21.1     17.0
    nedmalloc            30.9   26.9   24.4     27.1
    tbbmalloc            35.4   22.8   24.7     26.7
    ThreadHeap4          46.8   37.8   37.4     40.2

    TWO-THREAD TESTS:
    (heap name)        simple  isol.  swing  unidir  bidir  chain  overall
    platform default     10.6   16.2    8.1    15.2   16.5   12.4     12.3
    FastMM4              11.5   21.0   21.4     0.4    0.3   13.6      1.0
    nedmalloc            48.4   52.0   47.8    10.4   42.2   13.3     23.5
    tbbmalloc            62.4   44.0   47.2    22.9   30.0   24.2     33.7
    ThreadHeap4          94.0   76.0   69.1    57.4   65.5   32.4     58.9

    FOUR-THREAD TESTS:
    (heap name)        simple  isol.  swing  unidir  bidir  chain  overall
    platform default     10.8   16.2    8.0     8.2    5.4   12.8      9.0
    FastMM4              23.8   22.1   21.7     0.0    0.0   14.1      0.1
    nedmalloc            61.6   53.0   48.0     0.0    2.6   13.0      0.2
    tbbmalloc            60.5   42.0   46.9     6.9    7.2   24.6     15.5
    ThreadHeap4          93.9   76.5   69.6    38.0   10.0   33.9     30.9

    EIGHT-THREAD TESTS:
    (heap name)        simple  isol.  swing  unidir  bidir  chain  overall
    platform default     10.6   16.2    8.0     3.7    5.6   12.6      7.4
    FastMM4              25.6   23.3   22.4     0.0    0.1   14.8      0.2
    nedmalloc            54.0   51.6   47.0     0.1    2.7   12.7      0.5
    tbbmalloc            62.7   42.0   46.5     2.7    4.9   24.5      8.9
    ThreadHeap4          93.4   75.3   69.6     8.0   18.5   32.7     24.3

    Each number in the grid represents the performance of a memory manager
    on a benchmark, in millions of malloc-free pairs per second. Each row
    corresponds to a memory manager (an implementation of malloc/free aka
    new/delete) while each column corresponds to a specific benchmark for
    measuring heap performance.
    orz, Mar 20, 2011