[Slightly OT] Memory management and custom allocator

FtM

[xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
windows.programmer.win32]
Hello NG,
I have a very specific question, but first let me describe what I'm
facing, as I'm open to alternative general solutions..
I'm refactoring some old C code that deals with very basic data
structures (lists, priority queues, hashtables and so on) and aims to
be very portable. Regarding memory management, I have an internal
wrapper that, on *nix and Windows systems, maps the malloc/calloc/
realloc/free functions one-to-one to the default ones, but it's
possible to switch to another custom allocator that works pretty much
like a standard allocator (a big chunk of memory split/joined,
intended for embedded systems).
I would like to implement my custom allocator for the *nix and Windows
systems too (of course losing some portability for a performance gain),
and maybe implement something like the C++ allocators to improve the
memory management of the different algorithms (like rare large
allocations for vectors / frequent small allocations for lists), but
I have some questions:
- On these systems, to get decent performance, I was thinking of
allocating page-sized (and page-aligned) chunks so that small
allocations land in the same page, but how is it possible to request
memory with these characteristics? Will it help the OS memory manager?
(See the sketch below.)
- Does all of this make any sense? :)
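To be concrete, here is the kind of request I mean, as a minimal
sketch (assuming POSIX posix_memalign() and MSVC's _aligned_malloc();
chunk_alloc is an illustrative name, not part of my library):

    #include <stdlib.h>
    #ifdef _WIN32
    #include <malloc.h>   /* _aligned_malloc / _aligned_free */
    #endif

    static void *chunk_alloc(size_t size, size_t align)
    {
    #ifdef _WIN32
        return _aligned_malloc(size, align);  /* release with _aligned_free() */
    #else
        void *p = NULL;
        /* align must be a power of two and a multiple of sizeof(void *) */
        if (posix_memalign(&p, align, size) != 0)
            return NULL;
        return p;                             /* release with free() */
    #endif
    }

Passing the page size (sysconf(_SC_PAGESIZE) on *nix,
SYSTEM_INFO.dwPageSize on Windows) as align would give page-aligned
chunks.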
Thanks for any suggestion!
Ciao!

P.S. I don't really need a "performance boost", but I'm dealing with a
general, low-level, already-in-use library, and as long as I'm
refactoring the code I'd like to do it in the best possible way ;-)
 
Eric Sosman

[xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
windows.programmer.win32]
Hello NG,
I have a very specific question, but first let me describe what I'm
facing, as I'm open to alternative general solutions..
I'm refactoring some old C code that deals with very basic data
structures (lists, priority queues, hashtables and so on) and aims to
be very portable. Regarding memory management, I have an internal
wrapper that, on *nix and Windows systems, maps the malloc/calloc/
realloc/free functions one-to-one to the default ones, but it's
possible to switch to another custom allocator that works pretty much
like a standard allocator (a big chunk of memory split/joined,
intended for embedded systems).

This is frequently done, usually either to use an alternate memory
manager with extra debugging features or to substitute a manager whose
performance characteristics better match the needs of the program at
hand. However, the C language does not guarantee that it's possible to
do such substitution; "the library" is regarded as monolithic, and not
as something that can be replaced piecemeal. This allows the library
implementor to use internal and unadvertised features, "back doors" if
you like. (For example, exit() needs a "back door" to find and close
all the still-open FILE* streams.) It is at least possible that some
implementation's internals involve "back doors" to the memory manager,
so if you substituted a memory manager that implemented only the public
operations you'd break the library in strange ways.

That's mostly a "theoretical" consideration, though, perhaps because
replacing the memory manager is such an incredibly useful thing to want
to do. Still, keep in mind that it's not guaranteed to work.
I would like to implement my custom allocator for the *nix and Windows
systems too (of course losing some portability for a performance gain),
and maybe implement something like the C++ allocators to improve the
memory management of the different algorithms (like rare large
allocations for vectors / frequent small allocations for lists), but
I have some questions:
- On these systems, to get decent performance, I was thinking of
allocating page-sized (and page-aligned) chunks so that small
allocations land in the same page, but how is it possible to request
memory with these characteristics? Will it help the OS memory manager?
- Does all of this make any sense? :)

Not much, unless you are fond of re-inventing well-rounded wheels
for your own amusement. The things you mention are already common
practice in many Unixoid allocators; I have no specific knowledge of
Windows' allocators but there's no reason to think Redmond is unaware
of the state(s) of the art(s).
P.S. I don't really need a "performance boost", but I'm dealing with a
general, low-level, already-in-use library, and as long as I'm
refactoring the code I'd like to do it in the best possible way ;-)

For a "general low-level" library I think you'll be hard-pressed
just to match the performance of the O/S' own allocators, much less
improve on them. Things would be different for a library tuned to
the specific needs of a particular program and making use of program-
specific knowledge: Which allocations will be short- or long-lived,
which will be used together and would benefit from sharing a single
page, which data items are allocated and freed so frequently that they
merit their own cache, ... Large gains can (sometimes) be had by
exploiting knowledge of this kind, but a "general low-level" library
really can't have that kind of knowledge.

If you stay on the "general low-level" plane, you're just matching
wits with people who are paid to devote a lot of thought to aspects of
this particular problem -- which is why I say you're unlikely even to
do as well as they've already done. More engineer-decades have gone
into the existing allocators than you're likely to be able to devote
to replacing them. There are probably better targets for your effort.
 
FtM

[xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
windows.programmer.win32]
Hello NG,
I have a very specific question, but first let me describe what I'm
facing, as I'm open to alternative general solutions..
I'm refactoring some old C code that deals with very basic data
structures (lists, priority queues, hashtables and so on) and aims to
be very portable. Regarding memory management, I have an internal
wrapper that, on *nix and Windows systems, maps the malloc/calloc/
realloc/free functions one-to-one to the default ones, but it's
possible to switch to another custom allocator that works pretty much
like a standard allocator (a big chunk of memory split/joined,
intended for embedded systems).

     This is frequently done, usually either to use an alternate memory
manager with extra debugging features or to substitute a manager whose
performance characteristics better match the needs of the program at
hand.  However, the C language does not guarantee that it's possible to
do such substitution; "the library" is regarded as monolithic, and not
as something that can be replaced piecemeal.  This allows the library
implementor to use internal and unadvertised features, "back doors" if
you like.  (For example, exit() needs a "back door" to find and close
all the still-open FILE* streams.)  It is at least possible that some
implementation's internals involve "back doors" to the memory manager,
so if you substituted a memory manager that implemented only the public
operations you'd break the library in strange ways.

     That's mostly a "theoretical" consideration, though, perhaps since
replacing the memory manager is such an incredibly useful thing to want
to do.  Still, keep in mind that it's not guaranteed to work.

You are perfectly right; I optimistically hope that if something like
the memory allocation functions breaks, I'll notice it quickly and fix
it (replacing my allocator with the wrapper) :) Anyway, the "custom"
allocator is written for those simple systems that don't have a
dynamic memory allocator at all (and they are probably disappearing
from the market).
     Not much, unless you are fond of re-inventing well-rounded wheels
for your own amusement.  The things you mention are already common
practice in many Unixoid allocators; I have no specific knowledge of
Windows' allocators but there's no reason to think Redmond is unaware
of the state(s) of the art(s).


     For a "general low-level" library I think you'll be hard-pressed
just to match the performance of the O/S' own allocators, much less
improve on them.  Things would be different for a library tuned to
the specific needs of a particular program and making use of program-
specific knowledge: Which allocations will be short- or long-lived,
which will be used together and would benefit from sharing a single
page, which data items are allocated and freed so frequently that they
merit their own cache, ...  Large gains can (sometimes) be had by
exploiting knowledge of this kind, but a "general low-level" library
really can't have that kind of knowledge.

     If you stay on the "general low-level" plane, you're just matching
wits with people who are paid to devote a lot of thought to aspects of
this particular problem -- which is why I say you're unlikely even to
do as well as they've already done.  More engineer-decades have gone
into the existing allocators than you're likely to be able to devote
to replacing them.  There are probably better targets for your effort.

You exactly identified the origin of my doubts: if I go the way I
described, I'd like to give the library user the possibility to switch
or reuse more than one allocator, just like you can switch allocators
in C++. I know that I can't beat the general standard one, but using
an optimized (maybe third-party?) one only on some occasions and the
system's one otherwise seems reasonable to me.. doesn't it?
I'm asking here because I know it's a common problem that many of you
have faced and solved ;-)
Ciao!
 
Johann Klammer

FtM said:
[xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
windows.programmer.win32]
Hello NG,
I have a very specific question, but first let me describe what I'm
facing, as I'm open to alternative general solutions..
I'm refactoring some old C code that deals with very basic data
structures (lists, priority queues, hashtables and so on) and aims to
be very portable. Regarding memory management, I have an internal
wrapper that, on *nix and Windows systems, maps the malloc/calloc/
realloc/free functions one-to-one to the default ones, but it's
possible to switch to another custom allocator that works pretty much
like a standard allocator (a big chunk of memory split/joined,
intended for embedded systems).
I would like to implement my custom allocator for the *nix and Windows
systems too (of course losing some portability for a performance gain),
and maybe implement something like the C++ allocators to improve the
memory management of the different algorithms (like rare large
allocations for vectors / frequent small allocations for lists), but
I have some questions:

AFAIK the allocator in GNU/Linux libc already does this.
man mallopt will give you some information about its behavior.
Or read the source code.
No need to reinvent the wheel.
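A couple of the glibc-specific knobs, for instance (the values here
are arbitrary; other libcs differ):

    #include <malloc.h>

    int main(void)
    {
        /* Requests at or above this size go straight to mmap(). */
        mallopt(M_MMAP_THRESHOLD, 64 * 1024);
        /* Keep this much free memory before giving it back to the OS. */
        mallopt(M_TRIM_THRESHOLD, 256 * 1024);
        return 0;
    }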
- On these systems, to get decent performance, I was thinking of
allocating page-sized (and page-aligned) chunks so that small
allocations land in the same page, but how is it possible to request
memory with these characteristics? Will it help the OS memory manager?

linux libc uses sbrk and mmap... and possibly other syscalls

regards,
JK
 
FtM

AFAIK the allocator in GNU/Linux libc already does this.
man mallopt will give you some information about its behavior.

I've already looked into mallopt, but it doesn't seem fine-grained
enough for my needs... Do you know of some code or a project that uses
it, so I can take a look?
Or read the source code.
No need to reinvent the wheel.

That's the reason for asking before coding ;)
linux libc uses sbrk and mmap... and possibly other syscalls

Sure, sbrk() is the way to go for a low-level allocator, but the
question of how I can get a page-aligned segment remains: you can't
be sure of the OS's VA masking mechanism (or can you?). Anyway, I
could proceed by reading the code of a specific kernel, but it would
be totally useless if there isn't a sufficiently shared "philosophy"
underneath (read as: "a specific syscall").
I already use mmap() for persistent b-trees, but does it make sense to
use it without an underlying physical file?
Ciao!
 
Kaz Kylheku

I would like to implement my custom allocator for the *nix and Windows
systems too (of course losing some portability for a performance gain),
and maybe implement something like the C++ allocators to improve the
memory management of the different algorithms (like rare large
allocations for vectors / frequent small allocations for lists), but

What makes you think that your malloc doesn't already handle
such cases?

Say, have you profiled the code to see what fraction of the time is actually
spent executing the allocator code?

Can you answer the question: how much faster will the program be if
a magic allocator is substituted which requires zero cycles to execute?
I have some questions:
- On these systems, to get decent performance, I was thinking of
allocating page-sized (and page-aligned) chunks so that small
allocations land in the same page, but how is it possible to request
memory with these characteristics? Will it help the OS memory manager?

Not knowing these basics is an indicator that you need to study more if you
want to write memory allocators that beat malloc.
 
FtM

What makes you think that your malloc doesn't already handle
such cases?

Simply because it can't know whether I'm going to do 200 little
allocations or 1 big one. Even if it has the neatest heuristic ever,
the libc (or the kernel) exposing a simple, specific function/syscall
would do its job way better. Part of the question was about this
hypothetical function.
Say, have you profiled the code to see what fraction of the time is actually
spent executing the allocator code?

Can you answer the question: how much faster will the program be if
a magic allocator is substituted which requires zero cycles to execute?

Let's see... So why do the C++ STL implementations (or Boost, for
another example) do exactly what I'm asking about here? :)
Not knowing these basics is an indicator that you need to study more if you
want to write memory allocators that beat malloc.

I already said that I don't want to "beat malloc", just to find a
(probably platform-specific) way to facilitate its job.
Ciao!

P.S. I'm sorry if I look so.. childish(?).. to you but, believe
me, I've written some physical/virtual memory allocators for many
simpler systems, I know how systems work behind the scenes, and I
really don't want to start writing something like that: I'm just
asking how to deal with the two specific environments above while
masking the differences between the two and maintaining a standard
interface. I know that it's hard and that someone has already done it:
I'm asking *how* they did it (and maybe where!) :)
 
Johann Klammer

FtM wrote:
....
I already use mmap() for persistent b-trees, but does it make sense to
use it without an underlying physical file?
....
RTFM: MAP_ANONYMOUS
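A minimal example (MAP_ANONYMOUS is not in POSIX; some systems spell
it MAP_ANON, and very old ones want /dev/zero mapped instead;
anon_pages is an illustrative name):

    #include <sys/mman.h>
    #include <stddef.h>

    /* Ask the kernel for len bytes of zero-filled, page-aligned memory
       with no file behind it; release with munmap(p, len). */
    static void *anon_pages(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }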

However, the number of mmaps per process is limited on linux.

JK
 
Kaz Kylheku

FtM wrote:
...
...
RTFM: MAP_ANONYMOUS

However, the number of mmaps per process is limited on linux.

That is true, but now for some qualifying cluesticks:

- malloc uses mmap when sbrk runs into a collision, and also for
thread arenas and large allocations. So you don't avoid mmap by using
malloc (though you avoid making lots of little mmap requests).

- brk/sbrk are based on mmap! so there is no escaping from mmap that way.
So if you make lots of little page-sized brk increments, can you hit
the limit? It would seem so, but ...

- ... luckily, mmap will coalesce adjacent anonymous mappings which have
compatible flags so you will only run into the limit if you make a large
number of maps which do not touch or that do touch but have incompatible
properties.

- The default limit is only 65536, but in light of the above, it's
maybe not so terrible.

- If you still have a problem, you can easily tweak the maximum count:
become root and then:

sysctl -w vm.max_map_count=<your-desired-number>

Better yet, record it in /etc/sysctl.conf and run sysctl -p.
 
Eric Sosman

Simply because it can't know whether I'm going to do 200 little
allocations or 1 big one. Even if it has the neatest heuristic ever,
the libc (or the kernel) exposing a simple, specific function/syscall
would do its job way better. Part of the question was about this
hypothetical function.

Why do you think the kernel predicts your program's future behavior
better than malloc() does?
I already said that I don't want to "beat malloc", just to find a
(probably platform-specific) way to facilitate its job.

Both the kernel and malloc() have a hard time predicting your
program's behavior, but I'll risk a prediction for the New Year:

You will not invent a "general low-level" replacement that
outperforms malloc() et al.
[...] I'm just
asking how to deal with the two specific environments above while
masking the differences between the two and maintaining a standard
interface.

Easy: Use malloc() et al., right out of the box. Platform
specifics are already masked, system-specific dodges and tricks are
already employed, code is already tested and debugged to a degree
that you cannot possibly replicate.

There! Wasn't that easy?
 
Markus Wichmann

Simply because it can't know whether I'm going to do 200 little
allocations or 1 big one. Even if it has the neatest heuristic ever,
the libc (or the kernel) exposing a simple, specific function/syscall
would do its job way better. Part of the question was about this
hypothetical function.

malloc() is already optimized for exactly those cases, as they are by
far the most common ones: either you want a storage area whose size is
determined at run time, in which case you will probably make one big
allocation, or you are using fancy data structures other than dynamic
arrays (that includes, but is not limited to, linked lists, trees and
hashes), in which case chances are you are allocating each node
separately, so you get your n+1 little allocations.

Having seen the glibc, eglibc, dietlibc and uclibc allocators, I have
already seen that those are the preferred modes of operation. The
dietlibc allocator is especially easy to describe:

For one, it _never_ uses brk(). That syscall is just plain old and does
nothing mmap() couldn't do better. Then it has linked lists prepared
for objects of power-of-two sizes, starting at 16 and stopping at 2048.
It rounds the sum of the object header size and the object size up to
the next power of two (using some nifty eye-watering bit-mangling) and
puts it in such a linked list. If the list is too small, it is expanded
using mmap() or mremap().

If the object is bigger than 2048 bytes, it gets its own page(s)!
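(The bit-mangling in question is presumably the classic
round-up-to-the-next-power-of-two trick; a sketch for 32-bit sizes,
though I haven't checked that dietlibc uses exactly this form:

    static unsigned round_pow2(unsigned x)
    {
        /* Smear the highest set bit into every lower position, then
           add 1; e.g. 33 -> 63 -> 64. Assumes x > 0 and no overflow. */
        x--;
        x |= x >> 1;
        x |= x >> 2;
        x |= x >> 4;
        x |= x >> 8;
        x |= x >> 16;
        return x + 1;
    }

)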
Let's see... So why do the C++ STL implementations (or Boost, for
another example) do exactly what I'm asking about here? :)

Hmmm... I just had a look at the libstdc++ deployed with g++ 4.5.0 and
saw that, at least there, operator new calls malloc(). I also looked at
STLport and saw the same.

Also, you didn't answer the questions. Those are actually pretty good ones!
I already said that I don't want to "beat malloc", just to find a
(probably platform-specific) way to facilitate its job.
Ciao!

You are not making sense. You don't want to beat malloc but you still
want to write a malloc replacement? You'd willingly choose the inferior
implementation?
P.S. I'm sorry if I look so.. childish(?).. to you but, believe
me, I've written some physical/virtual memory allocators for many
simpler systems,

That's an entirely different can of worms. Believe me, memory
management that merely uses an OS is far easier than the memory
management the OS itself has to maintain.
I know how systems work behind the scenes, and I
really don't want to start writing something like that: I'm just
asking how to deal with the two specific environments above while
masking the differences between the two and maintaining a standard
interface. I know that it's hard and that someone has already done it:
I'm asking *how* they did it (and maybe where!) :)

You could look at the libc source code. glibc, eglibc, dietlibc and
uclibc are a good place to start for Linux. For Windows I'd recommend
cygwin.

If you are interested in the syscalls that allocate memory: For *nix
that's mmap() with MAP_ANONYMOUS or, failing that, /dev/zero. Failing
_that_, you could try to map a newly created and truncated file. Try to
stay away from brk(), it's marked obsolete.

For Windows, have a look at HeapAlloc(), VirtualAlloc() and
LocalAlloc(). BTW: The MSVC implementation somehow manages to make
access to _one_ byte after the allocated buffer fail with a crash. Good
job, as it makes finding off-by-one errors easier.
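A minimal VirtualAlloc() sketch (illustrative only; win_pages is a
made-up name, and it reserves and commits in one step):

    #include <windows.h>
    #include <stddef.h>

    /* Page-aligned, zero-filled memory straight from the Windows VM
       manager; release with VirtualFree(p, 0, MEM_RELEASE). */
    static void *win_pages(size_t len)
    {
        return VirtualAlloc(NULL, len, MEM_RESERVE | MEM_COMMIT,
                            PAGE_READWRITE);
    }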

But I still don't see your point. malloc() is meant to be portable! The
only thing you might need to work around is malloc(0), as that's
implementation-defined. Several allocators return NULL for it, but
several others do not; they just present you with an object you can't
access even the first byte of.

HTH,
Markus
 
BartC

Markus Wichmann said:
For one, it _never_ uses brk(). That syscall is just plain old and does
nothing mmap() couldn't do better. Then it has linked lists prepared
for objects of power-of-two sizes, starting at 16 and stopping at 2048.

That's exactly what I do in my own allocator! Except I stop at 1024.
Although I probably manage them very differently; for example, once a
small block size is allocated, say 64 bytes, it always stays that size.

And one thing about malloc() I don't like is the overhead of having to
remember the block sizes: if I allocate 16 bytes, malloc() seems to
reserve either 20 or, more often, 24 bytes on the few compilers I've
tried. That's a 50% overhead!

Also if you're being sensible and trying to keep to power-of-two
allocations, this 4- or 8-byte overhead is going to screw that up.

My allocators have never stored a block size, and it has never really
been a problem. Either you know the size already (a struct, for
example), or you need to keep track of it yourself, in which case, when
you have an allocation that can grow, it is very handy never to have to
keep running back to realloc().

As an illustration, take this simple example of a string growing from one to
ten million characters:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p;
    int i, len;

    p = malloc(1);
    *p = 0;
    len = 0;

    for (i = 1; i <= 10000000; ++i) {
        ++len;
        p = realloc(p, len + 1);
        p[len - 1] = '*';
        p[len] = 0;
        // puts(p);
    }
    printf("Length= %d\n", (int)strlen(p));
}

On three C compilers, this took about 12 seconds (on DMC, any length of
string resulted in a runtime of 0.4 seconds; a bit odd).

Using my strategy of avoiding any calls to malloc or realloc unless
absolutely necessary, and using that inside an interpreter, this bit of
script took about 0.1 seconds:

string a

a:=""
to 10 million do
a+:="*"
od
println "Length=",a.len

Of course, real C code wouldn't use such a naive strategy; it would also
seek to minimise malloc()/realloc() calls, but that's also an admission that
these calls have some overheads which it would be desirable to avoid. Hence
I can understand the OP's attempt to experiment with his own versions.
 
Eric Sosman

That's exactly what I do in my own allocator! Except I stop at 1024.
Although I probably manage them very differently; for example, once a
small block size is allocated, say 64 bytes, it always stays that size.

Sounds like a recipe for internal fragmentation, but YMMV.
And one thing about malloc() I don't like is the overhead of having to
remember the block sizes: if I allocate 16 bytes, malloc() seems to
reserve either 20 or, more often, 24 bytes on the few compilers I've
tried. That's a 50% overhead!

It's only significant if you have many such small allocations.
Even so, a malloc() implementation can find other ways to remember
the size than storing it explicitly. For example, malloc() might
set aside regions of memory dedicated to "small" allocations: if a
pointer aims at an address anywhere in [A1,A1+N1), [A2,A2+N2), ..., it
could simply be "known" to point at a 32-byte block. (The test can't
be done efficiently in portable C, but a malloc() implementation need
not be portable, nor written in C.)
Also if you're being sensible and trying to keep to power-of-two
allocations, this 4- or 8-byte overhead is going to screw that up.

What's "sensible" about allocating in powers of two? Why should
you allocate 128 bytes to hold a line of input, rather than 100 or
80 or 250? Programmers are numerologically superstitious.
My allocators have never stored a block size, and it has never really
been a problem. Either you know the size already (a struct, for
example), or you need to keep track of it yourself, in which case, when
you have an allocation that can grow, it is very handy never to have to
keep running back to realloc().

If your allocators never store block sizes, it follows that free()
cannot recycle a released block for re-use by a subsequent malloc(),
because it does not know whether the freed block is big enough to
satisfy the malloc() request. (Re-computing the size by examining the
addresses of nearby busy and free blocks counts as "storing the size"
in my book, just encoding it deviously.)

An allocator with a different API -- one whose free()-substitute
takes the block size as a parameter -- could recycle blocks. But now
you're no longer talking about the O.P.'s drop-in replacement for the
API as it stands.
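Such an API might look like this (a toy sketch; every name here is
hypothetical, and only blocks up to about 1KB get their own free lists):

    #include <stddef.h>
    #include <stdlib.h>

    /* The caller hands the size back at free time, so no per-block
       header is needed. Small blocks are recycled through per-size-
       class free lists; anything larger falls through to malloc/free. */
    typedef union slot { union slot *next; char pad[16]; } slot;

    #define NCLASSES 64                 /* class c holds blocks of c*16 bytes */
    static slot *freelist[NCLASSES];

    void *sized_alloc(size_t size)
    {
        size_t c = (size + 15) / 16;    /* round up to a 16-byte class */
        slot *s;
        if (c >= NCLASSES) return malloc(size);
        if ((s = freelist[c]) != NULL) {        /* recycle a freed block */
            freelist[c] = s->next;
            return s;
        }
        return malloc(c * 16);
    }

    void sized_free(void *p, size_t size)
    {
        size_t c = (size + 15) / 16;
        if (c >= NCLASSES) { free(p); return; }
        ((slot *)p)->next = freelist[c];        /* push onto the class's list */
        freelist[c] = (slot *)p;
    }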
As an illustration, take this simple example of a string growing from
one to ten million characters:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p;
    int i, len;

    p = malloc(1);
    *p = 0;
    len = 0;

    for (i = 1; i <= 10000000; ++i) {
        ++len;
        p = realloc(p, len + 1);

Side note: Never call realloc() this way in real code, because if
it returns NULL you'll have lost your only pointer to the original
block; you won't even be able to free() it any more.
        p[len - 1] = '*';
        p[len] = 0;
        // puts(p);
    }

With an allocator that does not store the block size, this will
necessarily take a long time, for several reasons:

- Since realloc() does not know how long the block at `p' is, it
cannot tell whether there's any slop at the end that the block
could grow into without moving. Therefore, every realloc() call
must allocate a new block and copy the old contents into it.

- Since realloc() does not know how long the old block is, it
must copy `len+1' bytes from the old block to the new, even
though that's more than the old block holds. (We'll assume a
non-portable implementation can get away with copying the non-
existent bytes.) Total amount copied: 2+...+10000001 bytes;
that's fifty terabytes.

- Since realloc() does not know how long the old block was, it
cannot re-use that memory to satisfy subsequent requests. Thus,
every call allocates a brand-new block and makes the old block
unavailable. Total memory allocated: 1+2+...+10000001 bytes.
50TB to store a 10MB string is an overhead of five hundred
million percent -- and you groused about a mere fifty???!
 
BartC

Eric Sosman said:

Sounds like a recipe for internal fragmentation, but YMMV.

That's what I once thought. I've used a similar scheme for years (blocks
could get smaller but not bigger), and originally prepared a
defragmentation algorithm, but I never needed it. (This was for managing
many small allocations in a scripting language.) It was self-sustaining.
What's "sensible" about allocating in powers of two? Why should
you allocate 128 bytes to hold a line of input, rather than 100 or
80 or 250? Programmers are numerologically superstitious.

More like habit. But if you allocate 4096 bytes thinking it will fit
neatly into one page of virtual memory, you will really need 4100 or
4104; not quite as tidy. (This is an issue with my allocator, which does
request power-of-two sizes and is implemented on top of malloc().)
An allocator with a different API -- one whose free()-substitute
takes the block size as a parameter -- could recycle blocks. But now
you're no longer talking about the O.P.'s drop-in replacement for the
API as it stands.

Yes, that's what I used; free() takes an extra parameter.
With an allocator that does not store the block size, this will
necessarily take a long time, for several reasons:
- Since realloc() does not know how long the block at `p' is, it
<snip>

You're now talking about a hypothetical realloc() that really has no
idea what the original block size was?

That wouldn't work anyway, as it would have no idea whether the block is
expanding or contracting, and there would be other issues such as those
you mention.

Such a function *has* to be able to determine the original size, whether
by storing it (as malloc/realloc presumably do), deducing it, or simply
being told. (Actually my scheme doesn't even have a realloc()-type
function; but with block sizes known, it would be trivial to write.)
 
FtM

That's exactly what I do in my own allocator! Except I stop at 1024.
Although I probably manage them very differently; for example, once a
small block size is allocated, say 64 bytes, it always stays that size.

Ok, thank you all for the suggestions (which basically are "use malloc
and don't bother"). I know that the allocation functions are great as
they are: I'm using them right now and I have no problems at all! I
was only trying to carry over the code from the embedded part, code
that I already have; that's it. Moreover, I already have some handlers
that help me debug the algorithms..

Anyway, regarding the allocator descriptions that Markus and BartC
gave, that's what I was thinking of and partially already doing; let's
see if this makes more sense to you:
- first of all, I'm dealing with memory allocation only *inside*
my library, so I don't need to build a general-purpose allocator.
- inside, I'll have three different "objects": the first one is for
little allocations, the second one for the others, and the third one
for the medium/big allocations that will possibly need realloc(). The
third allocator follows the "normal" strategy (and could simply use
the system calls, for that matter), while the first and second could
simply have some preallocated blocks and a vector marking which are in
use, skipping the natural overhead of a general-purpose allocator.
(A sketch of what I mean follows.)
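For the fixed-size pools, something as simple as this (a sketch; the
sizes and names are made up):

    #include <stddef.h>

    #define BLOCK_SIZE 32            /* every block in this pool is 32 bytes */
    #define NBLOCKS    1024

    static unsigned char pool[NBLOCKS][BLOCK_SIZE];
    static unsigned char used[NBLOCKS / 8];      /* one bit per block */

    static void *pool_alloc(void)
    {
        size_t i;
        for (i = 0; i < NBLOCKS; i++)
            if (!(used[i / 8] & (1u << (i % 8)))) {   /* first free slot */
                used[i / 8] |= (unsigned char)(1u << (i % 8));
                return pool[i];
            }
        return NULL;                                  /* pool exhausted */
    }

    static void pool_free(void *p)
    {
        size_t i = (size_t)(((unsigned char (*)[BLOCK_SIZE])p) - pool);
        used[i / 8] &= (unsigned char)~(1u << (i % 8));
    }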

What do you think about this approach (despite the fact that in the
end it probably gives no noticeable advantage)?
Ciao!
 
Joe keane

- Does all of this make any sense? :)

IMHE this comes up when

a) you are not pleased with the default functions' time or space
overhead
b) you like to control things
c) you want the 'memory' in memory-mapped files
d) you want to use shared memory
 
Joe keane

What's "sensible" about allocating in powers of two?

A) VM pages are powers of two, so if you have non-power-of-two objects
you waste space or take more VM pages for each object than you would like

B) cache lines are powers of two, so if you have non-power-of-two objects
you waste space or take more cache lines for each object than you would like
Programmers are numerologically superstitious.

I am numerologically superstitious, but you can also file this into the
"maybe someone has figured out something that we don't understand"
category.
 
BartC

[growing a string from empty to 10 million chars]

    char *p;
    int i, len;

    p = malloc(1);
    *p = 0;
    len = 0;

    for (i = 1; i <= 10000000; ++i) {
        ++len;
        p = realloc(p, len + 1);
        p[len - 1] = '*';
        p[len] = 0;
    }
On three C compilers, this took about 12 seconds (on DMC, any length of
string resulted in a runtime of 0.4 seconds; a bit odd).

Measuring the number of times the value of p changes: with the slow
compilers, this was respectively 2335, 2336, and 2337. With DMC, it was
just 46.

Clearly DMC overallocates more than the others (growing allocations by
a factor of sqrt(2), apparently).

(My own code, which just uses doubling, will call a memory allocator
only about 20 times, so it doesn't have the overhead of calling a
function like realloc() ten million times; hence it was quite quick
even though it was interpreted (but heavily optimised too..))

So with an application with certain memory usage patterns, one
implementation can dramatically outperform another. That is another
reason, even if you're not writing your own allocators, at least not
to use what's provided without question.
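In C, the doubling strategy looks something like this (a sketch, not
my actual code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t len = 0, cap = 16;     /* track the capacity separately */
        char *p = malloc(cap), *q;
        int i;

        if (!p) return 1;
        p[0] = 0;
        for (i = 1; i <= 10000000; ++i) {
            if (len + 2 > cap) {      /* grow geometrically: ~20 reallocs */
                q = realloc(p, cap * 2);
                if (!q) { free(p); return 1; }
                p = q;
                cap *= 2;
            }
            p[len++] = '*';
            p[len] = 0;
        }
        printf("Length= %d\n", (int)strlen(p));
        free(p);
        return 0;
    }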
 
BartC

- inside, I'll have three different "objects": the first one is for
little allocations, the second one for the others, and the third one
for the medium/big allocations that will possibly need realloc(). The
third allocator follows the "normal" strategy (and could simply use
the system calls, for that matter), while the first and second could
simply have some preallocated blocks and a vector marking which are in
use, skipping the natural overhead of a general-purpose allocator.

What do you think about this approach (despite the fact that in the
end it probably gives no noticeable advantage)?

I don't think there's enough information here; what sort of sizes will
the three classes of allocator handle?

Will any of them always be a fixed size? Will they need deallocating?
(Sometimes this isn't necessary, or the pool of memory from which
they're drawn will be freed as one block.) What strategy will be used
to grow the heap from which the blocks will come; in fact, does this
heap need to grow? Where does the memory come from in the first place?
Etc.
 
