[Slightly OT] Memory management and custom allocator

K

Keith Thompson

Eric Sosman said:
This is where you redesign your struct to use only 32 bytes, or find a
good use for another 20 bytes to make 64. [...]
Such structs are easier to index in arrays.

Nonsense. The `[]' operator works with equal ease on arrays
of any element size at all.
[...]

It's certainly possible that `[]' on an array of 64-byte elements
will be faster than `[]' on an array of 44-byte elements.
(It's also possible that it could have the same or even slower
performance.)

But padding a structure to make it 64 bytes for the purpose of making
array indexing faster is a classic case of micro-optimization, to
be attempted only if measurements show that it makes a significant
difference.
 
K

Kaz Kylheku

Eric Sosman said:
On 1/2/2012 9:07 AM, BartC wrote:
Such structs are easier to index in arrays.

Nonsense. The `[]' operator works with equal ease on arrays
of any element size at all.

Where an address calculation is needed, it's usually a multiply versus a
shift. Although turning 44 bytes/element into 64 just for that reason would
not be worth it. There should be other benefits to using 64.

Note that 44 is 32 + 8 + 4, and so 44x is (32x + 8x + 4x). Your CPU can
multiply by 44 just by doing three shifts and two additions, which can likely
be parallelized.
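
For illustration, here is a minimal C sketch of that decomposition (the
helper name mul44 is mine, not from the thread):

#include <stdio.h>

/* 44*x == 32*x + 8*x + 4*x == (x << 5) + (x << 3) + (x << 2) */
static unsigned mul44(unsigned x)
{
    return (x << 5) + (x << 3) + (x << 2);
}

int main(void)
{
    for (unsigned i = 0; i < 5; i++)
        printf("%u * 44 = %u (check: %u)\n", i, mul44(i), i * 44u);
    return 0;
}

In practice the compiler makes this choice for you; the hand-written shifts
are only there to make the arithmetic visible.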
 
J

Joe keane

If you have a forty-four-byte struct and allocate sixty-four
bytes to hold it, how have you avoided *either* sort of "waste?"

I think the context is where you're writing some dynamic array
[e.g. string or vector] and you have some control over the sequence of
capacities [at least that's what I thought]. Of course, you do not want
to call realloc on every operation; you want the progression to be
exponential. In such a case, powers of two are very logical.
 
E

Eric Sosman

Eric Sosman said:
This is where you redesign your struct to use only 32 bytes, or find a
good use for another 20 bytes to make 64. [...]
Such structs are easier to index in arrays.

Nonsense. The `[]' operator works with equal ease on arrays
of any element size at all.
[...]

It's certainly possible that `[]' on an array of 64-byte elements
will be faster than `[]' on an array of 44-byte elements.
(It's also possible that it could have the same or even slower
performance.)

Note that he didn't claim "faster" or "slower," but "easier."
The claim is, I repeat, nonsense.
But padding a structure to make it 64 bytes for the purpose of making
array indexing faster is a classic case of micro-optimization, to
be attempted only if measurements show that it makes a significant
difference.

It's probably not even micro-optimization; I'll stick with my
hypothesis that it's numerological superstition.

The original remark was

[...] if you're being sensible and trying to keep to
power-of-two allocations [...]

.... and I'm objecting to the implication that power-of-two sizes are
in any way "sensible" for their own sake. That's just bogus.
 
E

Eric Sosman

If you have a forty-four-byte struct and allocate sixty-four
bytes to hold it, how have you avoided *either* sort of "waste?"

I think the context is where you're writing some dynamic array
[e.g. string or vector] and you have some control over the sequence of
capacities [at least that's what I thought]. Of course, you do not want
to call realloc on every operation; you want the progression to be
exponential. In such a case, powers of two are very logical.

(Shrug.) No more logical than powers of three, or of three halves
(`size += size/2').
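
To make the comparison concrete, here is a minimal growth-helper sketch in C
(the function grow and its signature are my own illustration, not code from
the thread); the point both posters agree on is that capacity should grow
geometrically, whatever the factor:

#include <stdlib.h>

/* Grow a dynamic array's capacity geometrically so that appends cost
   amortized O(1); returns 0 and leaves the buffer untouched on failure. */
static int grow(void **pp, size_t *cap, size_t elem_size)
{
    size_t newcap = *cap ? *cap + *cap / 2 : 16;  /* the size += size/2 camp */
    /* or, for the power-of-two camp:  newcap = *cap ? *cap * 2 : 16;        */
    void *q = realloc(*pp, newcap * elem_size);
    if (q == NULL)
        return 0;
    *pp = q;
    *cap = newcap;
    return 1;
}

Either progression calls realloc() only O(log n) times over n appends; the
choice of factor trades a little speed against a little slack space.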

I don't know when the number Two starts to exercise its mystical
fascination for programmers, but the fascination certainly exists: You
find people reaching for powers of two "just because," with no actual
reason for the choice. Maybe it's when the neophyte first learns that
he can sometimes shift instead of multiplying and dividing, or that he
can use AND to calculate some remainders. This may strike the young
and credulous as evidence that Two is magical, and just as some folks
never get over the Tooth Fairy, others never get over the Two'th Fairy.

Yes, folks, yes, there certainly *are* circumstances where power-
of-two sizes are Just What The Doctor Ordered. But there's no good
reason to make everything in sight a power of two! If you allocate
128 array elements to represent the 88 keys of a piano, or sixteen
for the Twelve Days of Christmas, or thirty-two for your fingers and
toes, you're just being silly. Or superstitious.
 
K

Keith Thompson

Eric Sosman said:
Yes, folks, yes, there certainly *are* circumstances where power-
of-two sizes are Just What The Doctor Ordered. But there's no good
reason to make everything in sight a power of two! If you allocate
128 array elements to represent the 88 keys of a piano, or sixteen
for the Twelve Days of Christmas, or thirty-two for your fingers and
toes, you're just being silly. Or superstitious.

Or polydactyl.
 
B

BartC

Yes, folks, yes, there certainly *are* circumstances where power-
of-two sizes are Just What The Doctor Ordered. But there's no good
reason to make everything in sight a power of two! If you allocate
128 array elements to represent the 88 keys of a piano, or sixteen
for the Twelve Days of Christmas, or thirty-two for your fingers and
toes, you're just being silly. Or superstitious.

Odd how the C standard seems preoccupied with such numbers then (or with
powers-of-two-minus-one, the same thing):

(from 5.2.4.1:)

"- 127 nesting levels of blocks
- 63 nesting levels of conditional inclusion
- 12 pointer, array, and function declarators (in any combinations)
- 63 nesting levels of parenthesized declarators within a full declarator
- 63 nesting levels of parenthesized expressions within a full expression
- 63 significant initial characters in an internal identifier or a macro
- 31 significant initial characters in an external identifier
- 4095 external identifiers in one translation unit
- 511 identifiers with block scope declared in one block
- 4095 macro identifiers simultaneously defined in one preprocessing translation unit
- 127 parameters in one function definition
- 127 arguments in one function call
- 127 parameters in one macro definition
- 127 arguments in one macro invocation
- 4095 characters in a logical source line
- 4095 characters in a character string literal or wide string literal
- 65535 bytes in an object (in a hosted environment only)
- 15 nesting levels for #included files
- 1023 case labels for a switch statement
- 1023 members in a single structure or union
- 1023 enumeration constants in a single enumeration
- 63 levels of nested structure"

(But I'm not sure how that 12 got in there, on the third line; a typo?)
 
8

88888 Dihedral

On Tuesday, January 3, 2012, 5:36:45 AM UTC+8, Bart wrote:
Eric Sosman said:
On 1/2/2012 9:07 AM, BartC wrote:
Such structs are easier to index in arrays.

Nonsense. The `[]' operator works with equal ease on arrays
of any element size at all.

Where an address calculation is needed, it's usually a multiply versus a
shift. Although turning 44 bytes/element into 64 just for that reason would
not be worth it. There should be other benefits to using 64.
As for straddling VM pages, denser packing wins over bloated "nicer"
packing any day. Consider again the example of a thousand structs: I
want to use 44KB, you're suggesting 64KB would be "nicer." Suppose an
8KB page: You want to use eight pages, I want to use six. Suppose a
64B cache line: You want to use a thousand cache lines, I want to use
six hundred eighty-eight.

You forgot the bit where I suggested trying to get it down to 32 bytes. So
some data structures could get smaller.

You would only expand one up to the next power-of-two if there was a net
benefit. Keeping 20 bytes unused out of 64 wouldn't be a benefit. Storing
something in there that would have needed its own space anyway might well
be.

This is OK. I remember that I wrote my own heap manager for digital
video and images a long time ago. Every instance holds millions of
true-color pixels and can be magnified up to 16 times in one dimension
for viewing.

It is not difficult at all, just a little bit tedious.
 
K

Kleuske

Yes, folks, yes, there certainly *are* circumstances where power-
of-two sizes are Just What The Doctor Ordered. But there's no good
reason to make everything in sight a power of two! If you allocate 128
array elements to represent the 88 keys of a piano, or sixteen for the
Twelve Days of Christmas, or thirty-two for your fingers and toes,
you're just being silly. Or superstitious.

Bravo! I couldn't have put it better than that.
 
J

Joe keane

And one thing about malloc() I don't like is the overhead of having to
remember the block sizes; if I allocate 16 bytes, malloc() seems to reserve
either 20 or, more often, 24 bytes on the few compilers I've tried. That's a
50% overhead!

Agreed. When I was doing memory allocation, if you asked for 16 bytes it
really was 16 bytes, plus some per-page control structures stage left.

The trick is that all the objects on the same 'page' (not the same as a VM
page; you can make it whatever size seems to work best) are the same size! So
you just need the page number; look it up in a hash table (don't worry, this
is very fast).
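
A rough sketch of the scheme Joe describes, as I understand it (the page
size, structure names, and hash layout here are my guesses, not his actual
code): every object on an allocator page has the same size, so the size of
any object can be recovered from its address alone.

#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u                 /* allocator page, not a VM page  */
#define NBUCKETS  1024u

/* One record per allocator page: all objects on that page share a size. */
struct page_info {
    uintptr_t         page;             /* page number (address / size)   */
    size_t            objsize;          /* the size class for this page   */
    struct page_info *next;             /* hash-bucket chain              */
};

static struct page_info *buckets[NBUCKETS];

/* Recover an object's size from its address: hash the page number. */
static size_t size_of(const void *p)
{
    uintptr_t page = (uintptr_t)p / PAGE_SIZE;
    for (struct page_info *pi = buckets[page % NBUCKETS]; pi; pi = pi->next)
        if (pi->page == page)
            return pi->objsize;
    return 0;                           /* not managed by this allocator  */
}

Freeing then only needs to push the block onto the free list kept for that
page (or its size class), with no per-block header at all.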
 
B

BartC

Kleuske said:
Bravo! I couldn't have put it better than that.

For user applications, I also agree with all this; there shouldn't be all
these mysterious powers-of-two floating around for no good reason. (And if
you've ever been involved in writing user manuals, the less you have to
explain, the better.)

For higher level, rapid development languages, they are also largely
irrelevant.

But C is a somewhat lower-level language, and these powers of two do get
everywhere. You can completely ignore them if you wish, but it often pays to
be aware of them.

Anything that can only be described, or indexed, in N bits will have 2^N
possibilities; you can't get away from that. (And hardware seems to prefer N
itself as a power of two, so 8 bits in a byte is more common than 5, 7 or
11.)

And part of the context of this thread is memory allocation strategies,
where there might well be merit in a scheme where allocated blocks can only
change in size by a power of two (and which, if implemented too thinly on
top of malloc(), is affected by malloc()'s own overheads).

I don't think that is being obsessed with it.
 
J

Joe keane

That question is obviously plain wrong, because I *do* like the
allocator taking cycles if it means that, when memory is low, my
complex data structure resides in one single page and the system does
not have to hysterically swap from disk to deal with pathological
fragmentation.

IMHE, people who try to make the allocation 'as fast as possible' are
just chasing a squirrel up the wrong tree.

No sir, you want to make your clients' code as fast as possible.

Sometimes people recommend a class-based linked list ('look! it's two
instructions!'). In my onion, this is disastrous. Works great for toy
programs, but when you have lots-of-data and long-running combined, your
memory turns into an omelet.

You may as well send invitations to Mr. Cache Miss and Ms. TLB Fail...

If you have a 'fast' allocator that doesn't have the slightest idea of
locality, and a 'slow' allocator that takes some care of it, the
'slower' code will kick the 'faster' code's arse every time.
 
8

88888 Dihedral

On Tuesday, January 3, 2012, 9:14:46 PM UTC+8, Joe keane wrote:
IMHE, people who try to make the allocation 'as fast as possible' are
just chasing a squirrel up the wrong tree.

No sir, you want to make your clients' code as fast as possible.

Sometimes people recommend a class-based linked list ('look! it's two
instructions!'). In my onion, this is disastrous. Works great for toy
programs, but when you have lots-of-data and long-running combined, your
memory turns into an omelet.

You may as well send invitations to Mr. Cache Miss and Ms. TLB Fail...

If you have a 'fast' allocator that doesn't have the slightest idea of
locality, and a 'slow' allocator that takes some care of it, the
'slower' code will kick the 'faster' code's arse every time.

This is a system-level problem. With a huge DRAM space and plenty of free
memory remaining, the allocator can easily be designed for speed.

But if the DRAM is nearly fully allocated, then a good heap walker that can
swap pages to the hard disk is important for reliability; at that point the
speed of the allocator matters less than not degrading the whole system.
 
J

Joe keane

If you have a forty-four-byte struct and allocate sixty-four
bytes to hold it, how have you avoided *either* sort of "waste?"

If the cache line is 16 bytes, we know that each object will take only
three lines; the fourth is never used [it may be worse with the VM].
For this case 48 would work just as well (the compiler will do that for
you if it thinks 8-byte alignment is a good idea). But then if the
cache line is 32 bytes, same argument applies.
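
To see the 44-versus-48 case concretely, here is a hypothetical struct of my
own (not from the thread) whose members total 44 bytes; because one member
needs 8-byte alignment, a typical compiler pads the struct to 48 so that
array elements stay aligned.

#include <stdio.h>
#include <stdint.h>

struct rec {
    double  d;       /* 8 bytes; forces 8-byte alignment for the struct */
    int32_t a[9];    /* 36 bytes; 44 bytes of payload in total          */
};                   /* sizeof(struct rec) is typically 48              */

int main(void)
{
    printf("sizeof(struct rec) = %zu\n", sizeof(struct rec));
    return 0;
}

Whether that trailing padding helps or hurts depends, as this subthread
argues, on the cache-line size and the access pattern.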
 
J

Joe keane

I don't know when the number Two starts to exercise its mystical
fascination for programmers, but the fascination certainly exists: You
find people reaching for powers of two "just because," with no actual
reason for the choice.

You have 'struct x', you add padding, test it, it is faster. You have
'struct y', you add padding, test it, it is faster. You have 'struct
z', you add padding, test it, it is faster. You have 'struct w', you
add padding, test it, it is slower [oops doesn't always work].

You have 'struct a', and your boss says 'can't test! just guess!'.
What would you do?
 
K

Karl Malbrain

For one, it _never_ uses brk(). That syscall is just plain old and does
nothing mmap() couldn't do better. Then it has linked lists prepared for
objects of sizes of powers of two, starting at 16 and stopping at 2048.

That's exactly what I do in my own allocator! Except I stop at 1024. Although
I probably manage them very differently; for example, once a small block size
is allocated, say 64 bytes, it always stays that size.
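
As a rough sketch of that kind of power-of-two segregated free-list scheme
(the names, the class count, and the rounding helper are my own illustration,
not either poster's actual code):

#include <stddef.h>

#define MIN_BLOCK 16u
#define MAX_BLOCK 1024u
#define NCLASSES  7                 /* 16, 32, 64, 128, 256, 512, 1024 */

/* One free list per size class; a freed block is threaded back onto
   the list for its class and is reused at that size from then on.   */
struct free_block { struct free_block *next; };
static struct free_block *free_lists[NCLASSES];

/* Round a request up to the next power of two in [16, 1024] and
   return its class index, or -1 if the request is too large.        */
static int size_class(size_t n)
{
    size_t sz = MIN_BLOCK;
    int idx = 0;
    while (sz < n && sz < MAX_BLOCK) {
        sz <<= 1;
        idx++;
    }
    return (n <= sz) ? idx : -1;
}

Allocation pops from free_lists[size_class(n)] if that list is non-empty and
otherwise carves out a fresh block; free pushes the block back onto the same
list, which is why a block "always stays that size".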

And one thing about malloc() I don't like is the overhead of having to
remember the block sizes; if I allocate 16 bytes, malloc() seems to reserve
either 20 or, more often, 24 bytes on the few compilers I've tried. That's a
50% overhead!

Also if you're being sensible and trying to keep to power-of-two
allocations, this 4- or 8-byte overhead is going to screw that up.

My allocators have never stored a block size, and it has never really been a
problem. Either you will know the size (a struct, for example), or you need
to keep track of it yourself, in which case, when you have an allocation that
can grow, it is very handy never having to keep running back to realloc().

As an illustration, take this simple example of a string growing from one to
ten million characters:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p;
    int i, len;

    p = malloc(1);
    *p = 0;
    len = 0;

    /* deliberately naive: grow by one character per iteration */
    for (i = 1; i <= 10000000; ++i) {
        ++len;
        p = realloc(p, len + 1);
        p[len - 1] = '*';
        p[len] = 0;
//      puts(p);
    }

    printf("Length= %zu\n", strlen(p));
}

On three C compilers, this took about 12 seconds (on DMC, any length of
string resulted in a runtime of 0.4 seconds; a bit odd).

Using my strategy, of avoiding any calls to malloc or realloc unless
absolutely necessary, and using that inside an interpreter, this bit of
script took about 0.1 seconds:

string a

a:=""
to 10 million do
 a+:="*"
od
println "Length=",a.len

Of course, real C code wouldn't use such a naive strategy; it would also
seek to minimise malloc()/realloc() calls, but that's also an admission that
these calls have some overheads which it would be desirable to avoid. Hence
I can understand the OP's attempt to experiment with his own versions.

You can take a look at my bit-mapped memory allocator:
code.google.com/p/arena-memory-allocation
for an example of out-of-band storage of the block sizes.

Karl m
 
E

Eric Sosman

I don't know when the number Two starts to exercise its mystical
fascination for programmers, but the fascination certainly exists: You
find people reaching for powers of two "just because," with no actual
reason for the choice.

You have 'struct x', you add padding, test it, it is faster. You have
'struct y', you add padding, test it, it is faster. You have 'struct
z', you add padding, test it, it is faster. You have 'struct w', you
add padding, test it, it is slower [oops doesn't always work].

You have 'struct a', and your boss says 'can't test! just guess!'.
What would you do?

I'd leave it alone, because there are more important things
to attend to.
 
E

Eric Sosman

If you have a forty-four-byte struct and allocate sixty-four
bytes to hold it, how have you avoided *either* sort of "waste?"

If the cache line is 16 bytes, we know that each object will take only
three lines; the fourth is never used [it may be worse with the VM].
For this case 48 would work just as well (the compiler will do that for
you if it thinks 8-byte alignment is a good idea). But then if the
cache line is 32 bytes, same argument applies.

All I can say is that you and I have divergent notions of
efficiency. I'm thinking "I can run over the entire array with
just N*44/C cache misses; Joe would prefer N*64/C." That is, I
think you'll incur 46% more cache misses than I will. Tell me
again, please, what I have wasted by using 31% fewer cache line
loads than you would?

I'm also thinking "I need N*44/P pages for my array; Joe would
rather use N*64/P." That is, I think your working set size will
be 46% larger than mine, your likelihood of taking a page fault
will be 46% greater than mine. Tell me again, please, what have
I wasted by using 31% fewer memory pages?

It's an odd sort of "waste" that spends fewer resources.
 
B

BartC

You can take a look at my bit-mapped memory allocator:
code.google.com/p/arena-memory-allocation
for an example of out-of-band storage of the block sizes.

Thanks, I'll take a closer look later.

But a few things struck me immediately:

(1) It had a comment block at the beginning that actually said what the code
was and gave a brief description, instead of the usual GPL waffle.

(2) Its vfree() function requires the size to be supplied, just like mine
does (see the sketch below)! So I'm not alone in thinking the way C does it
is not the only way.

(3) It has plenty of powers-of-two sprinkled about. With the comments in
this thread I was starting to feel like some sort of dinosaur.
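
Since it keeps coming up: here is roughly what a sized-deallocation interface
looks like (these names and signatures are my own illustration, not Karl's
actual vfree() from the Google Code project, nor BartC's allocator): the
caller supplies the size on free, so the allocator never has to store it
in-band next to the block.

#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sized-allocation contract: the caller, who usually knows
   the size anyway (a struct, or a length it tracks itself), hands it
   back on free, so no per-block size header is needed.                 */
void *sized_alloc(size_t size);
void  sized_free(void *p, size_t size);

/* Trivial stand-in on top of malloc/free, just to show the shape of the
   contract; a real allocator would use the size to find the right arena
   or size-class page instead of ignoring it.                           */
void *sized_alloc(size_t size)
{
    return malloc(size);
}

void sized_free(void *p, size_t size)
{
    (void)size;          /* a real allocator would need this */
    free(p);
}

(C23 later added a free_sized() along these lines, long after this thread.)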
 
B

BartC

Eric Sosman said:
If you have a forty-four-byte struct and allocate sixty-four
bytes to hold it, how have you avoided *either* sort of "waste?"

If the cache line is 16 bytes, we know that each object will take only
three lines; the fourth is never used [it may be worse with the VM].
For this case 48 would work just as well (the compiler will do that for
you if it thinks 8-byte alignment is a good idea). But then if the
cache line is 32 bytes, same argument applies.

All I can say is that you and I have divergent notions of
efficiency. I'm thinking "I can run over the entire array with
just N*44/C cache misses; Joe would prefer N*64/C." That is, I
think you'll incur 46% more cache misses than I will. Tell me
again, please, what I have wasted by using 31% fewer cache line
loads than you would?

I think he means that the last 16 bytes of the 64 might never be accessed,
so will never be fetched from memory in the first place; there is no cache
miss.

(That does depend on how the 64 bytes are used; if the whole struct is
copied, the compiler might access all 64 bytes anyway.)

And if the cache blocks are 32 bytes, then it doesn't matter.
I'm also thinking "I need N*44/P pages for my array; Joe would
rather use N*64/P." That is, I think your working set size will
be 46% larger than mine, your likelihood of taking a page fault
will be 46% greater than mine. Tell me again, please, what have
I wasted by using 31% fewer memory pages?

No one was advocating wasting 20 bytes, but finding a use for them, perhaps
moving data there that would have been stored elsewhere.
It's an odd sort of "waste" that spends fewer resources.

If you had a heavily used struct, and it turned out it had 65 bytes, that
wouldn't bother you any? You wouldn't try and get it down to 64?
 
