B
Bjarke Hammersholt Roune
I want to allocate dynamically on the stack since it is faster, but
that requires platform dependent code and the stack can overflow. So
instead I want to write a memory allocator that works as stack
allocation does including in speed yet is backed by memory from the
heap. I want to be able to replace fast (but terrible!) code like
void foo(const size_t size) {
if (size > HugeConstant)
throw no_can_do_exception;
int hugeArray[HugeConstant];
HugeStruct hugeStruct;
// do computations on hugeArray and hugeStruct
}
or fast (but terrible!) interfaces like
void foo(const size_t size, int* scratchBuffer) {
// do computations using scratchBuffer as temporary memory
}
with something like:
void foo(const size_t size) {
ScopeArrayAlloc<int> hugeArray(size);
ScopeAlloc<HugeStruct> hugeStruct;
// do computations on hugeArray and hugeStruct
}
ScopeAlloc should have these properties:
1) Be nearly as fast as stack allocation.
2) Be type-safe and unable to cause memory leaks or double deletes
3) Live on the heap and take up only around sizeof(void*) bytes on
the stack
4) Allocate memory dynamically and allocate no more memory then
needed
6) Require no platform dependent code
7) Work with any class that could be allocated on the stack
8) Be able to allocate any amount of memory up to what is available
on the machine
Using new/delete inside ScopeAlloc could support all of these except
1. I'll now go into how I want to implement ScopeAlloc, but what I'd
really like is for there to be a Free software library already
available that does all of this. Do you know anything like that?
** Implementing ScopeAlloc
Here's how I'll implement ScopeAlloc. I have some questions about how
I might improve this plan.
All ScopeAlloc's are supported by a static arena allocator - it is the
same allocator for all types T. An arena allocator has a memory area
it allocates out of that is structured like this:
[ -- memory in use -- p -- memory not in use -- ]
Here p is a char* that points into the memory area, and an allocation
of X bytes simply returns p and increments p by X. It may be necessary
to allocate another memory block, so that has to be checked. Here's a
way to do that that doesn't work:
if (p + X < endOfMemoryBlock) ...
It doesn't work because p + X could be more than 1 element past the
end of the memory block which makes it undefined behavior to even
calculate p + X (?). Worse, even if it was defined behavior, there
could be an overflow which would make < do the wrong thing. Here's
something else that doesn't work:
if (bytesAllocated + X < memoryBlockCapacity) ...
Here bytesAllocated + X can overflow. What I'm thinking of now is to
have ScopeAlloc internally use a function like this on a global
allocator to get its memory:
inline char* alloc(const size_t X) {
if (X < remainingCapacity)
allocateMoreMemory(X)
remainingCapacity -= X;
char* const buffer = p;
p += X;
return p;
}
Here allocateMoreMemory would allocate a new block of
2*max(lastBlockSize, X) bytes which would have the effect of making
the condition X < remainingCapacity almost always false. So the cost
here is one very predictable branch and two additions (counting a
subtraction as an addition).
This ignores alignment. ScopeAlloc would have to adjust X to preserve
alignment by rounding up to the proper alignment. To align to
alignment Y, that is to round up to the nearest multiple of Y, I would
calculate (notice no branches)
X = Y * ((X - 1) / Y + 1)
For single objects this would all be compile-time constant, so the
compiler would probably optimize it all away, so no problem. If X is a
dynamic size, then at least Y would be a compile-time constant that is
a power of two, so I would expect the compiler to replace the
multiplication and division by shifts. So the cost would be 4
additions (counting shifts as additions). Maybe a compile-time
evaluatable branch could be used to skip the alignment adjustment at
compile-time if the size of the object that we are making an array of
is already a multiple of Y.
** Question: Is there a better way to do the alignment calculation?
Now for deallocation of a buffer of size X. We can't just do "p -= X;"
since maybe the object being deallocated was allocated on a previous
block of memory to the one that p is pointing into. That has to be
detected, so there must be a branch (grumble grumble). One way to deal
with that would be to check this:
if (p == startOfMemoryblock) {
// Adjust member variables to now allocate
// out of the previous block
}
The condition works since allocations and deallocations are required
to be in reverse order so the only way that we could have gotten to
deallocating into the previous block is if p is at the start of the
current block. The big problem with this approach is that if the
memory use of the program hovers around the end of a block then we are
going to be adjusting lots of fields for lots of allocations and
deallocations to move allocation from one block to the next and then
back. Even worse, the branches that detect when to move to another
block will become poorly predictable which greatly increases the cost
of those branches. Also, memory use will become split across two or
more blocks decreasing locality of reference.
Another approach is to never move allocation into a previous block.
Instead we can let p stay where it is in that case. So the
deallocation function becomes just:
inline free(const size_t X) {
if (p != startOfMemoryBlock) {
p -= X;
remainingCapacity += X;
}
}
If we don't want to waste the memory of the previous blocks, we could
do something like this:
inline free(const size_t X) {
if (p != startOfMemoryBlock) {
p -= X;
remainingCapacity += X;
} else {
// Look into previous block and delete that
// block if it has become completely freed.
}
}
A nice thing here is that the else branch doesn't have to be that
efficient since it can't be taken that many times.
The last point is destruction. ScopeAlloc<T> knows the type of T, so
it can just call the destructor of T directly in the destructor of
ScopeAlloc. A ScopeArrayAlloc<T> could store the size of the array and
then run through a loop to destruct each element. However lots of
classes have do-nothing destructors and when using stack allocation or
even delete[] I'm sure that the compiler is aware of that so it emits
no code to call the destructors. I would like my code to have zero
overhead in that case as well.
Question: If I have a loop that calls do-nothing destructors, can I
generally trust compilers to optimize the loop away?
So the total cost for an allocate/deallocate pair is 2 well-predicted
branches and 4 additions. Add 4 additions for arrays of classes whose
size is not a multiple of the alignment. I'm guessing that dynamic
stack allocation has an overhead of 2 additions per allocate/
deallocate pair, has better cache behavior and dirties fewer
registers. Still, this is much better than new/delete in terms of
performance and it should have better cache behavior than object
pools. Also, unlike arenas the interface for ScopeAlloc has no
pitfalls that is not shared by normal stack allocation.
Any suggestions for improvements or links to similar ideas?
Cheers
Bjarke Hammersholt Roune
www.broune.com
that requires platform dependent code and the stack can overflow. So
instead I want to write a memory allocator that works as stack
allocation does including in speed yet is backed by memory from the
heap. I want to be able to replace fast (but terrible!) code like
void foo(const size_t size) {
if (size > HugeConstant)
throw no_can_do_exception;
int hugeArray[HugeConstant];
HugeStruct hugeStruct;
// do computations on hugeArray and hugeStruct
}
or fast (but terrible!) interfaces like
void foo(const size_t size, int* scratchBuffer) {
// do computations using scratchBuffer as temporary memory
}
with something like:
void foo(const size_t size) {
ScopeArrayAlloc<int> hugeArray(size);
ScopeAlloc<HugeStruct> hugeStruct;
// do computations on hugeArray and hugeStruct
}
ScopeAlloc should have these properties:
1) Be nearly as fast as stack allocation.
2) Be type-safe and unable to cause memory leaks or double deletes
3) Live on the heap and take up only around sizeof(void*) bytes on
the stack
4) Allocate memory dynamically and allocate no more memory then
needed
6) Require no platform dependent code
7) Work with any class that could be allocated on the stack
8) Be able to allocate any amount of memory up to what is available
on the machine
Using new/delete inside ScopeAlloc could support all of these except
1. I'll now go into how I want to implement ScopeAlloc, but what I'd
really like is for there to be a Free software library already
available that does all of this. Do you know anything like that?
** Implementing ScopeAlloc
Here's how I'll implement ScopeAlloc. I have some questions about how
I might improve this plan.
All ScopeAlloc's are supported by a static arena allocator - it is the
same allocator for all types T. An arena allocator has a memory area
it allocates out of that is structured like this:
[ -- memory in use -- p -- memory not in use -- ]
Here p is a char* that points into the memory area, and an allocation
of X bytes simply returns p and increments p by X. It may be necessary
to allocate another memory block, so that has to be checked. Here's a
way to do that that doesn't work:
if (p + X < endOfMemoryBlock) ...
It doesn't work because p + X could be more than 1 element past the
end of the memory block which makes it undefined behavior to even
calculate p + X (?). Worse, even if it was defined behavior, there
could be an overflow which would make < do the wrong thing. Here's
something else that doesn't work:
if (bytesAllocated + X < memoryBlockCapacity) ...
Here bytesAllocated + X can overflow. What I'm thinking of now is to
have ScopeAlloc internally use a function like this on a global
allocator to get its memory:
inline char* alloc(const size_t X) {
if (X < remainingCapacity)
allocateMoreMemory(X)
remainingCapacity -= X;
char* const buffer = p;
p += X;
return p;
}
Here allocateMoreMemory would allocate a new block of
2*max(lastBlockSize, X) bytes which would have the effect of making
the condition X < remainingCapacity almost always false. So the cost
here is one very predictable branch and two additions (counting a
subtraction as an addition).
This ignores alignment. ScopeAlloc would have to adjust X to preserve
alignment by rounding up to the proper alignment. To align to
alignment Y, that is to round up to the nearest multiple of Y, I would
calculate (notice no branches)
X = Y * ((X - 1) / Y + 1)
For single objects this would all be compile-time constant, so the
compiler would probably optimize it all away, so no problem. If X is a
dynamic size, then at least Y would be a compile-time constant that is
a power of two, so I would expect the compiler to replace the
multiplication and division by shifts. So the cost would be 4
additions (counting shifts as additions). Maybe a compile-time
evaluatable branch could be used to skip the alignment adjustment at
compile-time if the size of the object that we are making an array of
is already a multiple of Y.
** Question: Is there a better way to do the alignment calculation?
Now for deallocation of a buffer of size X. We can't just do "p -= X;"
since maybe the object being deallocated was allocated on a previous
block of memory to the one that p is pointing into. That has to be
detected, so there must be a branch (grumble grumble). One way to deal
with that would be to check this:
if (p == startOfMemoryblock) {
// Adjust member variables to now allocate
// out of the previous block
}
The condition works since allocations and deallocations are required
to be in reverse order so the only way that we could have gotten to
deallocating into the previous block is if p is at the start of the
current block. The big problem with this approach is that if the
memory use of the program hovers around the end of a block then we are
going to be adjusting lots of fields for lots of allocations and
deallocations to move allocation from one block to the next and then
back. Even worse, the branches that detect when to move to another
block will become poorly predictable which greatly increases the cost
of those branches. Also, memory use will become split across two or
more blocks decreasing locality of reference.
Another approach is to never move allocation into a previous block.
Instead we can let p stay where it is in that case. So the
deallocation function becomes just:
inline free(const size_t X) {
if (p != startOfMemoryBlock) {
p -= X;
remainingCapacity += X;
}
}
If we don't want to waste the memory of the previous blocks, we could
do something like this:
inline free(const size_t X) {
if (p != startOfMemoryBlock) {
p -= X;
remainingCapacity += X;
} else {
// Look into previous block and delete that
// block if it has become completely freed.
}
}
A nice thing here is that the else branch doesn't have to be that
efficient since it can't be taken that many times.
The last point is destruction. ScopeAlloc<T> knows the type of T, so
it can just call the destructor of T directly in the destructor of
ScopeAlloc. A ScopeArrayAlloc<T> could store the size of the array and
then run through a loop to destruct each element. However lots of
classes have do-nothing destructors and when using stack allocation or
even delete[] I'm sure that the compiler is aware of that so it emits
no code to call the destructors. I would like my code to have zero
overhead in that case as well.
Question: If I have a loop that calls do-nothing destructors, can I
generally trust compilers to optimize the loop away?
So the total cost for an allocate/deallocate pair is 2 well-predicted
branches and 4 additions. Add 4 additions for arrays of classes whose
size is not a multiple of the alignment. I'm guessing that dynamic
stack allocation has an overhead of 2 additions per allocate/
deallocate pair, has better cache behavior and dirties fewer
registers. Still, this is much better than new/delete in terms of
performance and it should have better cache behavior than object
pools. Also, unlike arenas the interface for ScopeAlloc has no
pitfalls that is not shared by normal stack allocation.
Any suggestions for improvements or links to similar ideas?
Cheers
Bjarke Hammersholt Roune
www.broune.com