Malloc Query

Pushpa

Hi all,

Many users claim that the malloc() and free() provided by the language
library result in a considerable amount of overhead just from calling
them. Can we really eliminate that overhead by using user-defined
functions / a user-defined memory manager? Is there any significant
advantage in providing user-defined malloc() and free()? I want to
exclude realtime systems or embedded systems which are memory
constrained here. I am talking about general purpose applications. Can
a user-defined memory manager handle memory allocation and
de-allocation in a much better way than the OS would have done? Any
inputs/comments on the same are highly appreciated.

Thanks and Regards,
Pushpa
 
Ben Pfaff

Pushpa said:
Many users claim that the malloc() and free() provided by the language
library result in a considerable amount of overhead just from calling
them. Can we really eliminate that overhead by using user-defined
functions / a user-defined memory manager? Is there any significant
advantage in providing user-defined malloc() and free()?

According to an academic study I read some time ago, there is
rarely any significant advantage to special-purpose memory
allocators. In a quick online search, I didn't find a reference
to this paper, but perhaps someone else here has a copy of it.
 
Joe Pfeiffer

Pushpa said:
Hi all,

Many users claim that the malloc() and free() provided by the language
library result in a considerable amount of overhead just from calling
them. Can we really eliminate that overhead by using user-defined
functions / a user-defined memory manager? Is there any significant
advantage in providing user-defined malloc() and free()? I want to
exclude realtime systems or embedded systems which are memory
constrained here. I am talking about general purpose applications. Can
a user-defined memory manager handle memory allocation and
de-allocation in a much better way than the OS would have done? Any
inputs/comments on the same are highly appreciated.

I haven't come across the claim that they are especially expensive. You
really aren't going to do much better trying to write a general-purpose
memory allocator. Also, by the way, you seem to be making an assumption
that malloc() and free() are OS calls -- they aren't. They're C library
calls. They go on to call an OS space allocator (something in the brk()
family, last time I checked, which was years ago) if they need it.

Now, if you're doing something special, you can write a special-purpose
allocator/deallocator that's optimized for it. A long, long time ago I
wrote a program that required ridiculous numbers of memory allocations
and frees of one particular size object. I sped that program up a lot
by maintaining my own free list of those objects, and writing my own
allocator/deallocator for them, and then only calling malloc() when my
free space pool was empty. The kernel does something similar with its
slab allocator (though not on top of malloc()/free(), of course).
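
For readers who haven't seen the trick, here is a minimal sketch of that
kind of single-size free list; the names and the payload are made up for
illustration, not Joe's actual code.

#include <stdlib.h>

/* The one object type the program allocates in huge numbers. */
struct node {
    struct node *next_free;   /* meaningful only while the node is on the free list */
    double payload[4];        /* stand-in for the real contents */
};

static struct node *free_pool;     /* chain of nodes handed back by node_free() */

struct node *node_alloc(void)
{
    if (free_pool) {                        /* fast path: reuse a freed node */
        struct node *n = free_pool;
        free_pool = n->next_free;
        return n;
    }
    return malloc(sizeof(struct node));     /* pool empty: fall back to malloc() */
}

void node_free(struct node *n)
{
    n->next_free = free_pool;               /* push onto the pool; no free() call */
    free_pool = n;
}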
 
BGB

Pushpa said:
Hi all,

Many users claim that the malloc() and free() provided by the language
library result in a considerable amount of overhead just from calling
them. Can we really eliminate that overhead by using user-defined
functions / a user-defined memory manager? Is there any significant
advantage in providing user-defined malloc() and free()? I want to
exclude realtime systems or embedded systems which are memory
constrained here. I am talking about general purpose applications. Can
a user-defined memory manager handle memory allocation and
de-allocation in a much better way than the OS would have done? Any
inputs/comments on the same are highly appreciated.

it actually really depends...

if one doesn't have a good grasp on what they are doing, then a custom
allocator is usually a good deal worse than the native malloc()/free()
WRT overhead and/or performance.

however, special purpose allocators can do well for certain tasks, such
as being faster, or being more compact for tiny objects, but usually
there are trade-offs.


the main reason, in my case, I do a custom allocator is mostly for:
1, I use GC (often both a friend and an enemy at times);
2, I use dynamic typing, and dynamic type-checking and malloc are not
friends;
3, my heap tends to often be dominated by small objects (and so certain
special-cases, such as "cons cells" have special handling, and certain
other special types are not even in addressable memory, as the pointer
itself encodes the value);
....

for example: cons cells are 8 bytes on x86 (they only ever hold 2
pointers), so it makes sense not to waste extra bytes on things like
having a header or padding the object to a multiple of 16 bytes.

also, they can easily end up (in certain types of app, such as those
making lots of use of list-processing) being the single most common
data-type on the heap.
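
(To make the cons-cell point concrete, here is a tiny illustrative sketch
of a headerless two-word cell plus a low-bit "fixnum" tag, so small
integers never touch the heap at all. The layout and macros are mine, not
BGB's actual representation.)

#include <stdint.h>

/* A cons cell is just two value words: no size header, no padding to 16 bytes. */
typedef struct cons { uintptr_t car, cdr; } cons;

/* Low-bit tagging: if the least significant bit is 1, the word encodes a
   small integer in place; otherwise it is an ordinary (aligned) pointer.
   Negative fixnums assume an arithmetic right shift. */
#define IS_FIXNUM(v)     (((v) & 1u) != 0)
#define MAKE_FIXNUM(n)   ((((uintptr_t)(intptr_t)(n)) << 1) | 1u)
#define FIXNUM_VALUE(v)  ((intptr_t)(v) >> 1)

On a 32-bit x86 target this gives the 8-byte cell mentioned above.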

a secondary common data-type is strings, which if one is using
interned/immutable strings, then having compact storage (such as
byte-aligned string tables) may make a good amount of sense (as strings
are also one of those things which can easily eat *lots* of memory,
hence my personal belief that UTF-16 is rarely a sensible option vs
UTF-8 or similar). granted, in other cases, it also makes sense to have
strings in a form where they may either be mutable or are subject to
being freed (it does not make sense to intern character buffers or large
one-off strings, for example...).

but, the custom MM/GC is mostly for GC and dynamic type-checking, rather
than to try to save bytes or be faster (although it may do this in
certain cases).


but, if I were just doing "generic" stuff, then malloc would probably be
fine, and simply trying to use a custom malloc as some perceived way to
make allocating/freeing structs or arrays faster (or save a few bytes)
is, IMO, a clear case of premature optimization...


or such...
 
Lauri Alanko

I don't think there is very often need to shun the standard allocator
for performance reasons. Production malloc implementations are quite
mature and have reasonably good performance for the most common
scenarios.

It's worth noting that whenever you allocate and free some memory, you
are also likely to _do_ something with that memory, possibly quite a
lot. The performance overhead of malloc and free may often be
negligible compared to the actual computation that is done using the
memory.

Yet still, I can see some reasons for preferring custom allocators.

One is space efficiency. Because free() is not passed the size of the
object being freed, the allocator must somehow store this information,
and this may result in space overhead (depending on implementation
details). In particular, if you allocate a large number of very small
objects, the bookkeeping overhead may be significant. A custom
allocator can be written so that the size of an object need not be
stored in memory, if it can be known statically.

The other reason is to improve the quality of the code. Explicit calls
to free() for every allocated object can be tedious to write and it is
easy to get them wrong, resulting in memory leaks or duplicate frees
or whatnot. With a memory-pool-based allocator, you don't need to
free every object individually; instead you just free an entire
pool, along with all the objects allocated from that pool. This style
fits many programs, and it results in code that is cleaner (fewer
free() calls sprinkled everywhere) and more correct (you can't
accidentally forget to free a single object).
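
(A minimal sketch of such a pool, with made-up names and only basic error
handling and alignment, to show the shape of the idea:)

#include <stdlib.h>

/* A pool is a chain of large malloc()ed chunks.  Objects are carved out of
   the newest chunk; pool_free_all() releases every object in one sweep. */
struct pool_chunk { struct pool_chunk *next; size_t used, size; };
struct pool       { struct pool_chunk *chunks; };

void *pool_alloc(struct pool *p, size_t n)
{
    n = (n + 15) & ~(size_t)15;                /* round up so objects stay aligned */
    struct pool_chunk *c = p->chunks;
    if (c == NULL || c->size - c->used < n) {  /* need a fresh chunk */
        size_t sz = n > 4096 ? n : 4096;
        c = malloc(sizeof *c + sz);
        if (c == NULL)
            return NULL;
        c->next = p->chunks;
        c->used = 0;
        c->size = sz;
        p->chunks = c;
    }
    void *obj = (char *)(c + 1) + c->used;
    c->used += n;
    return obj;
}

void pool_free_all(struct pool *p)             /* one call frees everything */
{
    struct pool_chunk *c = p->chunks;
    while (c != NULL) {
        struct pool_chunk *next = c->next;
        free(c);
        c = next;
    }
    p->chunks = NULL;
}

A parser, for instance, can allocate all of its tree nodes from one pool
and drop the whole tree with a single pool_free_all() when it is done.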

Of course, sometimes this may not be enough and you may instead need
reference counting or even actual tracing garbage collection. But
again, using more specialized memory allocation systems should be
motivated by the structure of your programs, not by performance
reasons.


Lauri
 
Eric Sosman

Pushpa said:
Hi all,

Many users claim that the malloc() and free() provided by the language
library result in a considerable amount of overhead just from calling them.

Ah, yes, that well-known and reputable authority "many users."
Almost as reliable as "I've heard" and "some say," nearly as
infallible as "it seems." Well, I've heard that some of these
many users say it seems the Moon is made of green cheese, but even
so I'm not planning to head skyward with my Roquefort dressing
recipe.

If you can produce a claim that has substance, we'll perhaps
have something worth talking about. But if all you can come up
with is "many users" and "considerable," well, ...

Can we really eliminate that overhead by using user-defined functions /
a user-defined memory manager?

No. Since any memory manager, no matter by whom it is written
or with what design principles, will have some overhead, it follows
that the only way to eliminate overhead is to use no memory manager.

Is there any significant advantage in providing user-defined malloc()
and free()?

No. In fact, the attempt to do so yields undefined behavior.

Providing custom-written work-alikes for malloc() et al. may be
beneficial, *if* (1) you have information not available to the
implementors of the "stock" version(s), and (2) you are able to make
effective use of that extra information. Or, of course, if (3) you
are lots smarter than the stock implementors. (This is in fact a
possibility. I recall a vendor whose malloc() was essentially a
copy of the toy implementation in K&R. K&R's version was just fine
for the 64-256K memory spaces of the computers it ran on, but had
one or two teeny-tiny scalability problems with memories four
thousand times bigger...)

I am talking about general purpose applications. Can a user-defined
memory manager handle memory allocation and de-allocation in a much
better way than the OS would have done? Any inputs/comments on the same
are highly appreciated.

Content-free questions get content-free answers. Be more specific.
 
Michael Press

BGB <[email protected]> said:
it actually really depends...

if one doesn't have a good grasp on what they are doing, then a custom
allocator is usually a good deal worse than the native malloc()/free()
WRT overhead and/or performance.

however, special purpose allocators can do well for certain tasks, such
as being faster, or being more compact for tiny objects, but usually
there are trade-offs.


the main reason, in my case, I do a custom allocator is mostly for:
1, I use GC (often both a friend and an enemy at times);
2, I use dynamic typing, and dynamic type-checking and malloc are not
friends;
3, my heap tends to often be dominated by small objects (and so certain
special-cases, such as "cons cells" have special handling, and certain
other special types are not even in addressable memory, as the pointer
itself encodes the value);
...

for example: cons cells are 8 bytes on x86 (they only ever hold 2
pointers), so it makes sense not to waste extra bytes on things like
having a header or padding the object to a multiple of 16 bytes.

also, they can easily end up (in certain types of app, such as those
making lots of use of list-processing) being the single most common
data-type on the heap.

a secondary common data-type is strings, which if one is using
interned/immutable strings, then having compact storage (such as
byte-aligned string tables) may make a good amount of sense (as strings
are also one of those things which can easily eat *lots* of memory,
hence my personal belief that UTF-16 is rarely a sensible option vs
UTF-8 or similar). granted, in other cases, it also makes sense to have
strings in a form where they may either be mutable or are subject to
being freed (it does not make sense to intern character buffers or large
one-off strings, for example...).

but, the custom MM/GC is mostly for GC and dynamic type-checking, rather
than to try to save bytes or be faster (although it may do this in
certain cases).


but, if I were just doing "generic" stuff, then malloc would probably be
fine, and simply trying to use a custom malloc as some perceived way to
make allocating/freeing structs or arrays faster (or save a few bytes)
is, IMO, a clear case of premature optimization...


or such...

It is difficult, verging on impossible, to fully understand
your prose unless the reader already knows this stuff;
and you are responding to somebody who does not know.
The impediments to understanding are the ellipses.
Every ellipsis makes the reader stop and try to
extrapolate what you leave unsaid.
 
Nobody

Pushpa said:
Many users claim that the malloc() and free() provided by the language
library result in a considerable amount of overhead just from calling
them. Can we really eliminate that overhead by using user-defined
functions / a user-defined memory manager? Is there any significant
advantage in providing user-defined malloc() and free()? I want to
exclude realtime systems or embedded systems which are memory
constrained here. I am talking about general purpose applications. Can
a user-defined memory manager handle memory allocation and
de-allocation in a much better way than the OS would have done? Any
inputs/comments on the same are highly appreciated.

First, malloc() isn't part of the OS; it's part of the standard C library.
The application typically obtains memory from the OS as entire pages, via
e.g. sbrk() or mmap().

The efficiency problems with malloc() don't lie with a specific
implementation but with the interface. The application doesn't tell
malloc() anything beyond the amount of memory required. A custom allocator
can offer the application either more or less control, either of which may
translate to improved performance. E.g.

1. An allocator for objects of a specific size may be much simpler (and
thus faster) than a general-purpose allocator. The downside is that once
the allocator has claimed memory for its heap, any memory returned to that
heap may only be usable for subsequent allocations from that heap;
fragmentation may make it impossible to return memory to the global heap
or to the OS.

2. An allocator which allows the caller to provide additional information
about the nature of the data may be able to allocate memory for "related"
blocks in the same page, reducing the probability of cache misses at the
cost of making allocation and deallocation slower.

3. If the application can predict the lifetime of the data in advance, the
allocator can avoid allocating long-lived blocks around a short-lived
block, reducing heap fragmentation. In the extreme case, the heap becomes
a stack, with no fragmentation at all (see the sketch after this list).

4. If the application can tolerate the data being moved after allocation,
heap fragmentation can be eliminated.
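
To make point 3 concrete, here is a purely illustrative sketch of the
extreme case where the heap degenerates into a stack; the names and the
fixed region size are invented for the example.

#include <stddef.h>

/* A fixed region used as a stack: allocation bumps an offset, and a whole
   group of allocations is "freed" by rolling the offset back to a mark. */
static _Alignas(max_align_t) unsigned char region[1 << 20];
static size_t region_top;

void *stack_alloc(size_t n)
{
    n = (n + 15) & ~(size_t)15;          /* keep returned blocks aligned */
    if (sizeof region - region_top < n)
        return NULL;                     /* region exhausted */
    void *p = &region[region_top];
    region_top += n;
    return p;
}

size_t stack_mark(void)           { return region_top; }  /* remember a point */
void   stack_release(size_t mark) { region_top = mark; }  /* free all since it */

There is no per-block bookkeeping and no fragmentation, but it only pays
off when lifetimes really do nest like this.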
 
BartC

Ben Pfaff said:
According to an academic study I read some time ago, there is
rarely any significant advantage to special-purpose memory
allocators. In a quick online search, I didn't find a reference
to this paper, but perhaps someone else here has a copy of it.

I've just done a test where allocating and freeing large numbers of 4-byte
blocks took twenty times longer using malloc/free (and probably took 2-4
times as much space) than using a special allocator.

How significant this is depends on the application, but I don't think you
can just dismiss such an allocator.
 
Pushpa

BartC said:
I've just done a test where allocating and freeing large numbers of
4-byte blocks took twenty times longer using malloc/free (and probably
took 2-4 times as much space) than using a special allocator.

How significant this is depends on the application, but I don't think
you can just dismiss such an allocator.

Bartc: Are you sure of this? What I really wanted to ask is: is it
reasonable to plan on saving 50% of memory allocation time by writing
special functions for it? If the speedup can be 20X, this is amazing.

Kind Regards
Pushpa
 
Ben Bacarisse

BartC said:
I've just done a test where allocating and freeing large numbers of
4-byte blocks took twenty times longer using malloc/free (and probably
took 2-4 times as much space) than using a special allocator.

That's interesting, but of limited value unless you can give more
details. Ideally, can you post the code? That way it can be tried on
different systems with different C libraries.

At least say what the "special allocator" is. Relative speed-up must be
relative to something known, not something mysterious.

How significant this is depends on the application, but I don't think
you can just dismiss such an allocator.

But no one else can consider it either since we don't know what it is.
 
Uncle Steve

Ben Bacarisse said:
That's interesting, but of limited value unless you can give more
details. Ideally, can you post the code? That way it can be tried on
different systems with different C libraries.

At least say what the "special allocator" is. Relative speed-up must be
relative to something known, not something mysterious.

But no one else can consider it either since we don't know what it is.

It's probably something similar to what I use occasionally:

struct arena_head {
    int obj_size;
    int arena_size;    /* obj_size count */
    char *arena;
    int free;
    int free_list;
};

There's a function that mallocs the header and data area, and
initializes the free list, which in my case is a linked list of
integers stored in the first word of each arena slot. The allocator
comes in two varieties: one that will realloc() the arena
automatically when the free slots are exhausted, and one that just
returns -1 if the arena is exhausted.

#define arena_obj_addr(x, n) ((void *) &(x)->arena[(n) * (x)->obj_size])

int arena_qa(struct arena_head *x)
{
    int n;

    n = x->free_list;

    if (n != -1) {
        x->free_list = *((int *) arena_obj_addr(x, n));
        x->free--;
    }

    return n;
}

void arena_free(struct arena_head *p, int n)
{
    *((int *) arena_obj_addr(p, n)) = p->free_list;
    p->free_list = n;
    p->free++;

    return;
}
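
(For completeness, the arena-creation function described above might look
roughly like the following; the name arena_create and the fixed-size,
non-growing variant shown here are my guesses rather than the code
actually in use.)

#include <stdlib.h>

/* Allocate the header and the data area, then thread every slot onto the
   free list: each free slot's first int holds the index of the next slot.
   Assumes obj_size >= sizeof(int). */
struct arena_head *arena_create(int obj_size, int arena_size)
{
    struct arena_head *a = malloc(sizeof *a);
    if (a == NULL)
        return NULL;

    a->obj_size = obj_size;
    a->arena_size = arena_size;
    a->arena = malloc((size_t)obj_size * (size_t)arena_size);
    if (a->arena == NULL) {
        free(a);
        return NULL;
    }

    for (int n = 0; n < arena_size; n++)   /* chain slot n to n+1, last to -1 */
        *((int *) arena_obj_addr(a, n)) = (n + 1 < arena_size) ? n + 1 : -1;

    a->free = arena_size;
    a->free_list = 0;
    return a;
}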

I haven't measured its performance against malloc() just yet, but it
must be something like an order of magnitude faster than malloc. I
spent some time making it a little more flexible than the pared-down
version here, such as including an option to clear each slot
automatically when it is freed, and so on. In a debugger, it's fairly
easy to inspect the book-keeping, and you can output a diagnostic
message if you destroy the arena while there are still allocated slots,
which identifies memory leaks.

When I read up on the malloc() implementation in glibc (a few years
ago), it stated that the length of the allocation was stored in the
32-bit word immediately prior to the pointer returned by malloc (on
x86) and there were additional padding bytes, probably a maximum of
three on x86 (not sure how true this is today). So, a four-byte object
would require eight bytes of storage at minimum. When you're dealing
with many hundreds of thousands or even millions of allocations, that
might be the difference between keeping your data set in memory and
going into swap space. Even a 30% saving in memory, not to mention
speed, is a real win.

I would propose that this is a simple example of a 'better algorithm'.
malloc() is designed to handle the general case of arbitrary
allocations occurring with arbitrary ordering. When you know you're
going to use many objects of the same size, it is easy to adapt the
algorithm to this arrangement, and for this special case it is
fast and compact. What is the counterargument against using this kind
of structure?



Regards,

Uncle Steve
 
BartC

Ben Bacarisse said:
That's interesting, but of limited value unless you can give more
details. Ideally, can you post the code? That way it can tried on
different systems with different C libraries.

Nothing special, just a contrived example of a special allocator, in
this case (since I was short of time) just linearly allocating 100,000
ints one after another, then freeing them in the same order.
Nevertheless, malloc took much longer. The code I used is given below
(actually it wasn't written in C, which is why it's 1-based).
At least say what the "special allocator" is. Relative speed-up must be
relative to something known, not something mysterious.


But no one else can consider it either since we don't know what it is.

It depends on how much of an application is spent on these special
allocations. (In my code, I use special allocators for sizes up to 1KB, and
malloc/free beyond that.)

#include <stdio.h>
#include <stdlib.h>

/* Use 1 for malloc/free, or 0 for the simple allocator used here */
#define USEMALLOC 1

#define maxsize 100000
int data[maxsize+1];
int* freelist = 0;
int dataindex = 0;

int* allocint(void) {
    int *p;

#if USEMALLOC
    return malloc(sizeof (int));
#endif

    if (freelist) {
        p = freelist;
        freelist = (int*)*freelist;
        return p;
    }
    if (dataindex >= maxsize) {
        puts("Out of memory");
        exit(0);
    }
    return &data[++dataindex];
}

void freeint(int* p) {

#if USEMALLOC
    free(p);
    return;
#endif
    *p = (int)freelist;
    freelist = p;
}

int main(void) {
    int a = 0;
    int i, n;

    for (n = 1; n <= 100; ++n) {
        for (i = 1; i <= maxsize; ++i)
            allocint();

        for (i = 1; i <= maxsize; ++i)
            freeint(&data);
    }
}
 
Jorgen Grahn

Eric Sosman said:
On 5/17/2011 3:39 PM, Pushpa wrote: ....

Content-free questions get content-free answers. Be more specific.

What's so content-free about that? At least the part you quote here
is clearly stated, and the answer is of course "no, not unless the
standard malloc implementation is really bad, which is unlikely".

/Jorgen
 
BartC

Pushpa said:
BartC writes:
Bartc: Are you sure of this? What I really wanted to ask is: is it
reasonable to plan on saving 50% of memory allocation time by writing
special functions for it? If the speedup can be 20X, this is amazing.

Don't get too excited: if your code does nothing except allocate/free blocks
of memory, and most allocations are of a particularly simple type, then
you might get good speedups with a custom allocator.

In practice allocation patterns can be a bit more complex, and you can end
up wasting time doing what malloc/free already does more efficiently than
you can.

You need to do more tests to find out what the overheads currently
are. A 50% time saving is not worthwhile if allocation only takes up 1%
of runtime. (And remember that each malloc implementation will be
faster or slower than the next.)
 
Ike Naar

BartC said:
I've just done a test where allocating and freeing large numbers of 4-byte
blocks took twenty times longer using malloc/free (and probably took 2-4
times as much space) than using a special allocator.

How does the special allocator allocate its memory?
Does it use malloc?
 
BartC

Ike Naar said:
How does the special allocator allocate its memory?
Does it use malloc?

I'm talking about allocators that request large blocks via malloc() then
allocate smaller regions within that (although the code I posted elsewhere
just used a static allocation for the large block).
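
For illustration, a minimal sketch of that arrangement (the names, the
16-byte cell size and the block size are invented, and exhausted blocks
are never returned to the system): one malloc() call obtains a large
block, and small fixed-size cells are carved from it and recycled
through a free list.

#include <stddef.h>
#include <stdlib.h>

#define CELL_SIZE        16      /* bytes handed out per small allocation */
#define CELLS_PER_BLOCK  4096    /* cells carved from one malloc()ed block */

static unsigned char *block;     /* current big block obtained from malloc() */
static size_t next_cell;         /* next unused cell index in that block */
static void *cell_free_list;     /* recycled cells, chained through their first bytes */

void *cell_alloc(void)
{
    if (cell_free_list) {                        /* reuse a recycled cell */
        void *p = cell_free_list;
        cell_free_list = *(void **)p;
        return p;
    }
    if (block == NULL || next_cell == CELLS_PER_BLOCK) {
        block = malloc((size_t)CELL_SIZE * CELLS_PER_BLOCK);  /* one big request */
        if (block == NULL)
            return NULL;
        next_cell = 0;
    }
    return block + CELL_SIZE * next_cell++;
}

void cell_release(void *p)
{
    *(void **)p = cell_free_list;                /* chain the cell for reuse */
    cell_free_list = p;
}

As noted above, the test code posted earlier simply used a static array
for the large block; the carving and recycling work the same either way.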
 
Ben Bacarisse

BartC said:
Nothing special, just a contrived example of a special allocator, in
this case (since I was short of time) just linearly allocating 100,000
ints one after another, then freeing them in the same order.
Nevertheless, malloc took much longer. The code I used is given below
(actually it wasn't written in C, which is why it's 1-based).

It's still not possible for me to make any assessment of this evidence.
Your measurement was not in C, and the C program you posted crashes in
both the USEMALLOC case and the other one.

I am not saying you did not see the numbers you saw. What I *am* saying
is that it would be useful if we had numbers for specific special
allocation types (in C programs) that can be tested on various machines.
Without that, it's just another anecdote about how slow malloc is. It
contributes to the notion of what people seem to think is true but it
does not add to the data about what is actually the case.

<snip>
 
Ben Bacarisse

Uncle Steve said:
It's probably something similar to what I use occasionally:

Yes, similar but even simpler.

I haven't measured its performance against malloc() just yet, but it
must be something like an order of magnitude faster than malloc.

I am guessing you don't have a background in the sciences!

I would propose that this is a simple example of a 'better algorithm'.

I am glad that's in 'scare quotes'. I don't really want to get into a
discussion of whether a special-case algorithm is better than a general
one. It depends on your metrics, no doubt. What I'd really like is
some data, and even better than that would be some code I could try to
see how general the benefit is (across systems) for some specific
special-purpose allocator.
malloc() is designed to handle the general case of arbitrary
allocations occurring with arbitrary ordering. When you know you're
going to use many of the same size object it is easy to adapt the
algorithm to use this arrangement, and for this special case it is
fast and compact. What is the counterargument against using this kind
of structure?

(a) You have to write it and debug it. The code you posted had bugs,
but that's probably because it was pulled out of context to illustrate
the idea. (BTW, I've been around the block long enough that you could
just have said that it's a free list of fixed size objects).

(b) You might also have to port it (see the problem that BartC's code
had on my machine for example).

(c) Other people might have to learn about the interface (and maybe
persuade themselves of the implementation) if they come to maintain
code that uses it. Every C programmer knows malloc(...). Your code, for
example, does not allocate and free using pointers -- the functions
take and return an int. This is perfectly logical, but it will take
some getting used to for the maintainer.

These all boil down to "more code". The question remains, though, about
the benefits. It would be very helpful to have the speed benefit
quantified.
 
BartC

Ben Bacarisse said:
It's still not possible for me to make any assessment of this evidence.
Your measurement was not in C, and the C program you posted crashes in
both the USEMALLOC case and the other one.

OK, there was a mistake in the way test allocations are freed (although I
didn't see any crashes).

Fixing that, I still saw big factors (15x for gcc; in general a 5x-40x
slowdown).

I am not saying you did not see the numbers you saw. What I *am* saying
is that it would be useful if we had numbers for specific special
allocation types (in C programs) that can be tested on various machines.
Without that, it's just another anecdote about how slow malloc is. It
contributes to the notion of what people seem to think is true but it
does not add to the data about what is actually the case.

malloc() can be surprisingly fast. But, in my work on interpreters, using a
custom allocator for blocks below 1KB generally makes execution faster.

For example, executing a program implementing the Binary_Trees benchmark
from the Shootout game is about 50% faster than if malloc() were used for
everything. Sometimes, code can be much faster (such as 200% faster);
rarely, it can run slower. (An interpreter doesn't know what code mix it
will run.)

These numbers will vary if the malloc() implementation changes,
or a different program is interpreted. But my point is that these custom
allocators should be taken seriously and not just dismissed.
 
