# Algorithm for growing arrays

Discussion in 'Java' started by Chris, Aug 8, 2006.

1. ### Chris (Guest)

This is an ill-defined question, but here goes:

My app uses lots of int arrays. We usually don't know how large the
array will ultimately be when we first start populating it. So we grow
the array as necessary, which involves creating a new, larger array and
copying the data over from the old one.

The question is, when we create a larger array, how large should it be?
Assume that all arrays will eventually grow to a random size, unknowable
in advance. Assume also that we want to optimize for both memory and speed.

At one extreme, we could grow each array an element at a time, that is,
the new size would be the old size + 1. This is memory efficient, but
slow and generates a lot of garbage. At the other extreme, we could
allocate huge arrays at the beginning and double them each time we
needed more space. This would be fast because we would seldom need to
extend the arrays, but very memory inefficient.

Just for reference, java.util.ArrayList uses this formula internally:

int newCapacity = (oldCapacity * 3)/2 + 1;
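For a raw int array, that same growth policy can be sketched in a few lines. This `GrowableIntArray` class and its starting capacity are invented here for illustration; it is not part of any library:

```java
// Sketch of a growable int array using ArrayList's 3/2 growth formula.
// Class name and starting capacity are invented for illustration.
class GrowableIntArray {
    private int[] data = new int[10];
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            // Same formula as java.util.ArrayList (as of 2006)
            int newCapacity = (data.length * 3) / 2 + 1;
            int[] bigger = new int[newCapacity];
            System.arraycopy(data, 0, bigger, 0, size);  // copy old contents
            data = bigger;
        }
        data[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException();
        return data[index];
    }

    int size() { return size; }
}
```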

Chris, Aug 8, 2006

2. ### Patricia Shanahan (Guest)

Chris wrote:
> This is an ill-defined question, but here goes:
>
> My app uses lots of int arrays. We usually don't know how large the
> array will ultimately be when we first start populating it. So we grow
> the array as necessary, which involves creating a new, larger array and
> copying the data over from the old one.
>
> The question is, when we create a larger array, how large should it be?
> Assume that all arrays will eventually grow to a random size, unknowable
> in advance. Assume also that we want to optimize for both memory and speed.

Do you have any data about the distribution of the sizes? Do you have a
"cost" function that combines memory and speed costs? How many bytes of
memory are you prepared to waste to avoid copying n bytes of existing array?

If you have that sort of data, you may be able to work out an optimum.

If not, I suggest assuming the new size will be a linear function of the
existing size, give yourself some way to change the coefficients, and
experiment in the working program.
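That suggestion might look like the following sketch, where the two coefficients are constructor parameters so they can be tuned by experiment. The class and parameter names here are invented:

```java
// Hypothetical pluggable resize policy: newCapacity = multiplier * oldCapacity + addend.
// The coefficients are constructor parameters so they can be changed and measured.
class LinearGrowthPolicy {
    private final double multiplier;  // e.g. 1.5
    private final int addend;         // e.g. 1

    LinearGrowthPolicy(double multiplier, int addend) {
        this.multiplier = multiplier;
        this.addend = addend;
    }

    int nextCapacity(int oldCapacity) {
        return (int) (oldCapacity * multiplier) + addend;
    }
}
```

With `multiplier = 1.5` and `addend = 1` this reproduces the ArrayList formula quoted above.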

Patricia

Patricia Shanahan, Aug 8, 2006

3. ### Steve W. Jackson (Guest)

Chris wrote:

> This is an ill-defined question, but here goes:
>
> My app uses lots of int arrays. We usually don't know how large the
> array will ultimately be when we first start populating it. So we grow
> the array as necessary, which involves creating a new, larger array and
> copying the data over from the old one.
>
> The question is, when we create a larger array, how large should it be?
> Assume that all arrays will eventually grow to a random size, unknowable
> in advance. Assume also that we want to optimize for both memory and speed.
>
> At one extreme, we could grow each array an element at a time, that is,
> the new size would be the old size + 1. This is memory efficient, but
> slow and generates a lot of garbage. At the other extreme, we could
> allocate huge arrays at the beginning and double them each time we
> needed more space. This would be fast because we would seldom need to
> extend the arrays, but very memory inefficient.
>
>
> Just for reference, java.util.ArrayList uses this formula internally:
>
> int newCapacity = (oldCapacity * 3)/2 + 1;

If you've already looked into what ArrayList uses, then I think a rather
obvious question is why not use an ArrayList to begin with?

Using ArrayList<Integer> might not be as flexible as an array of int,
since Integer is immutable. That's easily rectified with a mutable
holder class, and you get the automatic ability to grow for free. The
number of available but unused entries isn't really relevant, since
they don't contain data until you put something there.

= Steve =
--
Steve W. Jackson
Montgomery, Alabama

Steve W. Jackson, Aug 8, 2006
4. ### Matt Rose (Guest)

Chris wrote:
> This is an ill-defined question, but here goes:
>
> My app uses lots of int arrays. We usually don't know how large the
> array will ultimately be when we first start populating it. So we grow
> the array as necessary, which involves creating a new, larger array and
> copying the data over from the old one.
>
> The question is, when we create a larger array, how large should it be?
> Assume that all arrays will eventually grow to a random size, unknowable
> in advance. Assume also that we want to optimize for both memory and speed.
>
> At one extreme, we could grow each array an element at a time, that is,
> the new size would be the old size + 1. This is memory efficient, but
> slow and generates a lot of garbage. At the other extreme, we could
> allocate huge arrays at the beginning and double them each time we
> needed more space. This would be fast because we would seldom need to
> extend the arrays, but very memory inefficient.
>
>
> Just for reference, java.util.ArrayList uses this formula internally:
>
> int newCapacity = (oldCapacity * 3)/2 + 1;

caveat: I cannot condone premature optimisation... but it can be fun!

It depends on the usage of these arrays, but under some circumstances it
might pay to create a new array each time you run out of space without
copying the data across: just keep a reference in some sort of ordered
data structure (possibly another array, as long as you're willing to
cope with running out of space in that one), then when you need to
access all the data, iterate over all your arrays. This gets more
complex if the data might arrive in unpredictably sized chunks.

Of course, at some point you might have to give in and declare that the
extra man-hours to maintain your own malloc will cost more than the
hardware required to just use an API List implementation...

Matt

Matt Rose, Aug 8, 2006
5. ### Luc Mercier (Guest)

As long as the growth is exponential, the amortized complexity of adding
a new element (i.e., the average over many additions of new elements)
is constant. That means the time required to copy the array into a
bigger one has little influence on the runtime, compared to an array of
the right size.

Luc Mercier, Aug 8, 2006
6. ### Luc Mercier (Guest)

Luc Mercier wrote:
> As long as the growth is exponential, the amortized complexity of adding
> a new element (i.e., the average over many additions of new elements)
> is constant. That means the time required to copy the array into a
> bigger one has little influence on the runtime, compared to an array of
> the right size.

(Sorry, that was not finished)

You're saying ArrayList uses a factor of 3/2. For performance, at least
from a theoretical point of view, you could just as well choose a factor
of 1.1 or 2, and the runtime performance would be similar.

But if you choose LINEAR growth instead of EXPONENTIAL growth (i.e.,
newLength = oldLength + constant), then the amortized time of adding an
element becomes linear, so it will be MUCH slower.

Look at any introduction to amortized complexity analysis for more details.
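The difference can be checked by counting element copies under each policy. This small simulation (class and method names invented) counts only array stores, not allocation time:

```java
import java.util.function.IntUnaryOperator;

// Simulation: count how many element copies are performed while appending
// n elements one at a time, for a given capacity-growth rule.
class GrowthCopyCount {
    static long copiesWithGrowth(int n, IntUnaryOperator nextCapacity) {
        long copies = 0;
        int capacity = 1;
        for (int size = 0; size < n; size++) {
            if (size == capacity) {                    // array is full: grow
                copies += size;                        // copy all existing elements
                capacity = nextCapacity.applyAsInt(capacity);
            }
        }
        return copies;
    }

    public static void main(String[] args) {
        int n = 100_000;
        // Exponential growth: total copies stay O(n).
        long doubling = copiesWithGrowth(n, c -> c * 2);
        // Linear growth: total copies blow up to O(n^2).
        long linear = copiesWithGrowth(n, c -> c + 10);
        System.out.println("doubling:   " + doubling + " copies");
        System.out.println("grow by 10: " + linear + " copies");
    }
}
```

For n = 100,000 the doubling policy performs on the order of n copies in total, while the grow-by-10 policy performs on the order of n^2/20, several thousand times more.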

Luc Mercier, Aug 8, 2006
7. ### Lasse Reichstein Nielsen (Guest)

Luc Mercier <> writes:

> As long as the growth is exponential, the amortized complexity of
> adding a new element (i.e., the average over many additions of new
> elements) is constant. That means the time required to copy the
> array into a bigger one has little influence on the runtime, compared
> to an array of the right size.

It is true that the amortized cost is constant, but if the exponent is
close to 1.0, it can be a big constant, and the copying will have a
significant influence on runtime.

For example, if you double the array each time, the amortized cost
of an insertion is three element copies (paying for its own insertion
and the first copying of itself and one existing element).

If you only grow the array by 10% each time, the amortized cost of an
insertion will be 11 element copies, i.e., almost four times slower
than for doubling (if we only count array stores as complexity, and
not, e.g., time spent in memory allocation).

/L
--
Lasse Reichstein Nielsen -
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'

Lasse Reichstein Nielsen, Aug 8, 2006
8. ### bugbear (Guest)

Chris wrote:
> This is an ill-defined question, but here goes:
>
> My app uses lots of int arrays. We usually don't know how large the
> array will ultimately be when we first start populating it. So we grow
> the array as necessary, which involves creating a new, larger array and
> copying the data over from the old one.
>
> The question is, when we create a larger array, how large should it be?
> Assume that all arrays will eventually grow to a random size, unknowable
> in advance. Assume also that we want to optimize for both memory and speed.
>
> At one extreme, we could grow each array an element at a time, that is,
> the new size would be the old size + 1. This is memory efficient, but
> slow and generates a lot of garbage. At the other extreme, we could
> allocate huge arrays at the beginning and double them each time we
> needed more space. This would be fast because we would seldom need to
> extend the arrays, but very memory inefficient.
>

you could implement the List interface over a linked list
of blocks of (say) 1000 objects each.

This structure can be grown without generating garbage or
copying overhead.

Random access on such a structure is still possible, and the
space overhead is 1000 times lower than that of a pure linked list.

Clearly an iterator would be very efficient.
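A bare-bones sketch of that idea (the `ChunkedIntList` name and the fixed block size of 1000 are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an int list stored as a chain of fixed-size blocks.
// Growing just appends a fresh block; existing data is never copied.
class ChunkedIntList {
    private static final int BLOCK_SIZE = 1000;
    private final List<int[]> blocks = new ArrayList<>();
    private int size = 0;

    void add(int value) {
        if (size % BLOCK_SIZE == 0) {
            blocks.add(new int[BLOCK_SIZE]);   // grow without copying old data
        }
        blocks.get(size / BLOCK_SIZE)[size % BLOCK_SIZE] = value;
        size++;
    }

    int get(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException();
        return blocks.get(index / BLOCK_SIZE)[index % BLOCK_SIZE];
    }

    int size() { return size; }
}
```

Note that the `blocks` list itself still grows by copying, but it is 1000 times smaller than the data, so that cost is usually negligible.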

BugBear

bugbear, Aug 9, 2006
9. ### Chris Uppal (Guest)

Chris wrote:

> Assume that all arrays will eventually grow to a random size, unknowable
> in advance. Assume also that we want to optimize for both memory and
> speed.

Under those assumptions the problem is insoluble. Unless you have, at minimum,
some knowledge of the statistical distribution of final sizes, you cannot
optimise for both space and speed simultaneously.

-- chris

Chris Uppal, Aug 9, 2006
10. ### Robert Klemme (Guest)

On 09.08.2006 11:36, Chris Uppal wrote:
> Chris wrote:
>
>> Assume that all arrays will eventually grow to a random size, unknowable
>> in advance. Assume also that we want to optimize for both memory and
>> speed.

>
> Under those assumptions the problem is insoluble. Unless you have, at minimum,
> some knowledge of the statistical distribution of final sizes, you cannot
> optimise for both space and speed simultaneously.

Just to throw in another idea: a BitSet could be an efficient solution
as well, but this of course depends on the semantics and usage patterns.
Just my 0.02 EUR...

Kind regards

robert

Robert Klemme, Aug 9, 2006
11. ### Chris Uppal (Guest)

bugbear wrote:

> you could implement the List interface over a linked list
> of blocks of (say) 1000 objects each.

This is (obviously) a sensible suggestion. I just want to use it to illustrate
how any implementation corresponds to a set of (possibly unstated) assumptions
about the distribution of final sizes -- which the OP claimed was unknown.

In this case one assumption is that sizes won't cluster around values
n + 1000 * m
for small n, where n and m are integers. In particular, if the arrays'
final sizes tend to be < 10 (say), then the structure will consume
excessive space.

That suggests a refinement (with its own hidden assumptions): use the
start-small and keep doubling technique up to some limit (say 1000), and then
put that block onto the linked list and start again with another small array.

The assumption here, of course, is that small final sizes are reasonably
common, and that sizes just over a multiple of 1000 are reasonably common too.
If the latter assumption is incorrect, then we'd only use the start-small
technique for the first chunk; all subsequent ones would start out
pre-allocated to the maximum size.

A further refinement would be to use an array to hold the blocks instead of a
linked list -- that would allow constant-time random access (if that's needed).
It might be worth ensuring that the blocks' size was a power of two, so that
indexing would not require a division operation. One could use the start small
and keep doubling technique to grow that array as necessary, or one could even
use the same data-structure (recursively) to manage it.
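With power-of-two blocks, the division and modulus needed for indexing reduce to a shift and a mask. A minimal sketch, with the class name and a block size of 1024 chosen here for illustration:

```java
// Indexing into power-of-two-sized blocks without division:
// the block number is a right shift, the offset within the block is a mask.
class PowerOfTwoIndexing {
    static final int SHIFT = 10;               // log2 of the block size
    static final int BLOCK_SIZE = 1 << SHIFT;  // 1024
    static final int MASK = BLOCK_SIZE - 1;    // 1023

    static int blockIndex(int index) {
        return index >>> SHIFT;                // same as index / 1024
    }

    static int offsetInBlock(int index) {
        return index & MASK;                   // same as index % 1024
    }
}
```

Shifts and masks are typically much cheaper than integer division, which is the same reason java.util.HashMap keeps its capacity a power of two internally.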

Needless to say, all these "refinements" add complexity to the main code path,
and without the knowledge that the OP claimed not to have (access patterns and
size distributions), it's impossible even to guess whether they'd be
improvements.

-- chris

Chris Uppal, Aug 9, 2006
12. ### bugbear (Guest)

Chris Uppal wrote:
> bugbear wrote:
>
>> you could implement the List interface over a linked list
>> of blocks of (say) 1000 objects each.
>
> This is (obviously) a sensible suggestion. I just want to use it to illustrate...

And you did.

Excellent article.

BugBear

bugbear, Aug 9, 2006