How to choose buffer size?


Raj Pashwar

Hello friends:

In this program I'm writing I want to pre-allocate space for an
indeterminate (but potentially large) number of instances of a
given structure. And if this amount of memory proves insufficient,
I'll add another large chunk with realloc. Etc.

So somewhere in my code I'll have an initial pre-allocation that
looks like:

size_t size = INITIAL_SIZE;
foo *pile_o_foo;
int next_slot = 0;

pile_o_foo = (foo *)malloc(size);
/* error check */

/* ... */

while (size <= next_slot * sizeof foo) {
size *= 2;
pile_o_foo = (foo *)realloc(pile_o_foo, size);
/* error check */
}

/* ... */

Is there a way to choose a good value for INITIAL_SIZE? Should I
just pick a random number? 101? 1001? 5091? Or should I
pick a nice power of 2, like 1024 or 4096? Or is
there a more intelligent way to go about this?

Best regards
 


James Kuyper

Hello friends:

In this program I'm writing I want to pre-allocate space for an
indeterminate (but potentially large) number of instances of a
given structure. And if this amount of memory proves insufficient,
I'll add another large chunk with realloc. Etc.

So somewhere in my code I'll have an initial pre-allocation that
looks like:

size_t size = INITIAL_SIZE;
foo *pile_o_foo;
int next_slot = 0;

pile_o_foo = (foo *)malloc(size);

The cast is unnecessary: it specifies explicitly precisely the same
conversion that would have occurred implicitly if there had been no
cast. That is reason enough to remove it. Necessary casts are almost
always dangerous, while conversions that can happen implicitly are
generally much safer. Therefore, experienced coders look at all casts as
potential problems; removing unnecessary casts creates fewer
distractions, allowing you to concentrate on the truly dangerous ones.

In addition, if you use C90 and forget to #include <stdlib.h>, this cast
will usually suppress the error message you would otherwise get because
there is no prototype for malloc() in scope. The behavior of such
code is undefined - it might work with any given implementation of C,
but when you port it to a different implementation there's a good chance
it will fail catastrophically.
/* error check */

/* ... */

while (size <= next_slot * sizeof foo) {
size *= 2;
pile_o_foo = (foo *)realloc(pile_o_foo, size);
/* error check */
}

/* ... */

Is there a way to choose a good value for INITIAL_SIZE? Should I
just pick a random number? 101? 1001? 5091? Or should I
pick a nice power of 2, like 1024 or 4096? Or is
there a more intelligent way to go about this?

First of all, INITIAL_SIZE should be an exact multiple of sizeof(foo).
The easiest way to do this is to keep track of the number of instances
of foo, rather than the number of bytes. Then you can write:

pile_o_foo = malloc(num_instances * sizeof *pile_o_foo);

Secondly, while it's true that there are many contexts where a size that
is a power of 2 has efficiency advantages, that is, paradoxically, a
reason why the size that you use here should NOT be a power of 2. Why?
because many implementations of malloc() add a small amount to your
allocation, to store information to be used by the malloc() family of
functions about the allocation. They then round up the allocation to the
next multiple of a fixed block size. That size is usually a power of 2,
or a small integer multiple of one. Therefore, by making your allocation
exactly a power of 2, or even a small multiple of a power of two,
there's a good chance that malloc will allocate slightly more than a
power of 2 bytes, thereby wasting an entire memory block to hold just a
few more bytes of information.

If you knew the size that was added, and the block size, you could
choose your sizes to be an exact multiple of a block size AFTER the
addition; at the cost of tying your code to a particular implementation.
You're better off not worrying about such things.

For the same reason, multiplying by 2 every time you need more memory
will sooner or later cause exactly the same problem. I've seen analyses
that ignore this issue, and end up concluding that sqrt(2) is the ideal
growth factor, but if you use that value exactly, you'll have the same
problem on every other growth step (You can get an average growth factor
of exactly sqrt(2) by alternating between 3/2 and 4/3). If you round
sqrt(2) up to 1.5, then you'll avoid the problem entirely. To avoid
unnecessary conversions to floating point types, use

num_instances = (num_instances*3)/2;

As China Blue said: the exact value of INITIAL_SIZE is not very
important. It would only be important if it's MUCH smaller (by several
orders of magnitude) than the typical size that will eventually be needed.
 

Eric Sosman

Hello friends:

In this program I'm writing I want to pre-allocate space for an
indeterminate (but potentially large) number of instances of a
given structure. And if this amount of memory proves insufficient,
I'll add another large chunk with realloc. Etc.

So somewhere in my code I'll have an initial pre-allocation that
looks like:

size_t size = INITIAL_SIZE;
foo *pile_o_foo;
int next_slot = 0;

pile_o_foo = (foo *)malloc(size);
/* error check */

As an aside: It would probably be less error-prone to keep
track of the size of the allocation in terms of the `foo' things
it contains, rather than in terms of the number of bytes they
occupy. You'd then have something like

size_t count = INITIAL_COUNT;
...
pile_o_foo = malloc(count * sizeof *pile_o_foo);

Two notes about that final line: First, there's no need to cast
the result of malloc(), although the cast is harmless. Second, it's
safer to write `sizeof *pile_o_foo' than `sizeof(foo)', since this
way you cannot possibly write `sizeof(foobar)' by mistake.
/* ... */

while (size <= next_slot * sizeof foo) {

Aside: You're missing some parentheses.
size *= 2;
pile_o_foo = (foo *)realloc(pile_o_foo, size);
/* error check */

Yet another aside: Don't do `p = realloc(p,...)', because if
realloc() returns NULL you'll have overwritten your pointer to the
original area. If "error check" is just going to terminate the
program abruptly that may be tolerable, but if not it may have
leaked the old memory: You can no longer even free() it since the
pointer to it is gone.
}

/* ... */

Is there a way to choose a good value for INITIAL_SIZE? Should I
just pick a random number? 101? 1001? 5091? Or should I
pick a nice power of 2, like 1024 or 4096? Or is
there a more intelligent way to go about this?

INITIAL_SIZE (or INITIAL_COUNT) depends on "domain-specific
knowledge," and there's no single prescription that works for all
domains. If each `foo' holds the quantum numbers for a single
electron and you're reading one atom's worth, you have prior knowledge
that ten dozen will suffice for all elements yet discovered. If the
atoms are all involved in organic chemistry, electron counts will be
almost always less than two dozen. But if each `foo' represents one
IP packet on an undersea fibre, it's hard to put a useful ceiling on
how large the array might grow.

So: First, apply domain-specific knowledge to the extent that you
have it. If `foo' is a maid of honor at your former roommate's best
friend's second cousin's wedding, INITIAL_COUNT might well be four.
Or even eight: The penalty for overestimating is modest.

Second, if the only domain-specific knowledge you can summon is
"It's probably not HUGE" (for some unstated value of HUGE), you might
as well start INITIAL_COUNT at forty-two and chance the consequences.
You're balancing space wastage against copying time, and for non-HUGE
populations neither will be HUGE.

Third, if all you know is "It might be almost anything, including
REALLY REALLY HUGE," consider other data structures. A linked list of
"buckets" of a thousand `foo' instances each, for example: Yes, it's
harder to traverse (but not *that* much harder), but you avoid needing
to copy and re-copy and re-re-copy all those larger and larger arrays
as you're accumulating the data.

One other tiny practicality: When allocating a big blob of memory,
it is probably best to ask for slightly less than a power of two. This
is because nearly all malloc() implementations inflate your request by
a few bytes to add their own metadata, and some implementations round
the sum up to the next power of two. So if you ask for 1024 bytes and
malloc() adds eight, you'll use as much memory as if you'd asked for
2040 bytes. (But note the word "tiny" above: Concerns of this kind
only come into play if you're allocating VERY MANY such blocks, or
VERY LARGE ones. Also, the internals of malloc() implementations vary
a great deal from platform to platform, and too much pandering to the
peculiarities of a few is vanity and vexation of spirit.)
 

Ike Naar

[...] To avoid
unnecessary conversions to floating point types, use

num_instances = (num_instances*3)/2;

Caveat: this will not increase num_instances if it equals 1.
 

James Kuyper

[...] To avoid
unnecessary conversions to floating point types, use

num_instances = (num_instances*3)/2;

Caveat: this will not increase num_instances if it equals 1.

If it makes sense to have an initial value of 1 for num_instances, you
can either have special case handling for num_instances == 1, or use
(num_instances*3+1)/2. If it makes sense to have an initial value of 0
for num_instances, you could use special handling or ((num_instances+1)*3)/2.
 

Paul N

  while (size <= next_slot * sizeof foo) {
    size *= 2;
    pile_o_foo = (foo *)realloc(pile_o_foo, size);
    /* error check */
  }

This looks wrong to me - you know how much you need but (if next_slot
* sizeof foo is more than twice size) you are realloc'ing a smaller
amount first. There seems no point to this. Try something like:

  if (size <= next_slot * sizeof foo) {
    while (size <= next_slot * sizeof foo) size *= 2;
    new_pile_o_foo = (foo *)realloc(pile_o_foo, size);
    /* error check */
  }
 


Eric Sosman

[...] To avoid
unnecessary conversions to floating point types, use

num_instances = (num_instances*3)/2;

Caveat: this will not increase num_instances if it equals 1.

Because of that, and to make the calculation very slightly less
vulnerable to overflow, I use

num_instances += num_instances / 2 + 10;

(for suitable values of 2 and 10). Works even if num_instances is
initially zero.
 


Malcolm McLean

On Friday, 20 April 2012 at 23:05:23 UTC+1, Raj Pashwar wrote:
Is there a way to choose a good value for INITIAL_SIZE? Should I
just pick a random number? 101? 1001? 5091? Or should I
pick a nice power of 2, like 1024 or 4096? Or is
there a more intelligent way to go about this?
When allocating a buffer, choose an initial size which is a) a trivial amount of memory, and b) likely to fulfil trivial requests.

b) isn't always possible to define, since you might not know the distribution of likely requests. But what's technically known as the log-normal distribution is very common in real situations: you have some companies with one or two employees, a large number with ten to twenty, then a long right tail with a small number of individual companies with very high employee counts. So the initial buffer should hold about 20.

Then to grow, I normally add a percentage plus a small constant. This means that you avoid excessive reallocations at the beginning, but don't need to ask for too much when the buffer gets large. 10% is a good choice if memory is precious, 50% for the more common case where it's quite cheap but not of negligible value.
 
