Algorithm for growing arrays

Chris

This is an ill-defined question, but here goes:

My app uses lots of int arrays. We usually don't know how large the
array will ultimately be when we first start populating it. So we grow
the array as necessary, which involves creating a new, larger array and
copying the data over from the old one.

The question is, when we create a larger array, how large should it be?
Assume that all arrays will eventually grow to a random size, unknowable
in advance. Assume also that we want to optimize for both memory and speed.

At one extreme, we could grow each array an element at a time, that is,
the new size would be the old size + 1. This is memory efficient, but
slow and generates a lot of garbage. At the other extreme, we could
allocate huge arrays at the beginning and double them each time we
needed more space. This would be fast because we would seldom need to
extend the arrays, but very memory inefficient.

What's a good tradeoff?

Just for reference, java.util.ArrayList uses this formula internally:

int newCapacity = (oldCapacity * 3)/2 + 1;
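
Just so it's concrete, here is the sort of grow-and-copy helper I mean (a bare-bones sketch, the class name is made up, and it uses the ArrayList-style formula purely as an example):

public final class IntArrayBuilder {
    private int[] data = new int[10];   // arbitrary starting capacity
    private int size = 0;

    public void add(int value) {
        if (size == data.length) {
            // Grow: allocate a larger array and copy the old contents across.
            int newCapacity = (data.length * 3) / 2 + 1;
            data = java.util.Arrays.copyOf(data, newCapacity);
        }
        data[size++] = value;
    }

    public int get(int index) {
        return data[index];
    }

    public int size() {
        return size;
    }
}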
 
Patricia Shanahan

Chris said:
This is an ill-defined question, but here goes:

My app uses lots of int arrays. We usually don't know how large the
array will ultimately be when we first start populating it. So we grow
the array as necessary, which involves creating a new, larger array and
copying the data over from the old one.

The question is, when we create a larger array, how large should it be?
Assume that all arrays will eventually grow to a random size, unknowable
in advance. Assume also that we want to optimize for both memory and speed.

Do you have any data about the distribution of the sizes? Do you have a
"cost" function that combines memory and speed costs? How many bytes of
memory are you prepared to waste to avoid copying n bytes of existing array?

If you have that sort of data, you may be able to work out an optimum
strategy in advance.

If not, I suggest making the new size a linear function of the
existing size, giving yourself some way to change the coefficients, and
experimenting in the working program.
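
For example, something along these lines (just a sketch, with invented names) would let you change the coefficients in one place and experiment:

public final class GrowthPolicy {
    private final double a;   // multiplicative coefficient, e.g. 1.5
    private final int b;      // additive coefficient, e.g. 1

    public GrowthPolicy(double a, int b) {
        this.a = a;
        this.b = b;
    }

    // newCapacity = a * oldCapacity + b, but never less than what is required.
    public int newCapacity(int oldCapacity, int required) {
        return Math.max((int) (a * oldCapacity) + b, required);
    }
}

// e.g.  GrowthPolicy policy = new GrowthPolicy(1.5, 1);
//       int next = policy.newCapacity(data.length, size + 1);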

Patricia
 
Steve W. Jackson

Chris said:
This is an ill-defined question, but here goes:

My app uses lots of int arrays. We usually don't know how large the
array will ultimately be when we first start populating it. So we grow
the array as necessary, which involves creating a new, larger array and
copying the data over from the old one.

The question is, when we create a larger array, how large should it be?
Assume that all arrays will eventually grow to a random size, unknowable
in advance. Assume also that we want to optimize for both memory and speed.

At one extreme, we could grow each array an element at a time, that is,
the new size would be the old size + 1. This is memory efficient, but
slow and generates a lot of garbage. At the other extreme, we could
allocate huge arrays at the beginning and double them each time we
needed more space. This would be fast because we would seldom need to
extend the arrays, but very memory inefficient.

What's a good tradeoff?

Just for reference, java.util.ArrayList uses this formula internally:

int newCapacity = (oldCapacity * 3)/2 + 1;

If you've already looked into what ArrayList uses, then I think a rather
obvious question is why not use an ArrayList to begin with?

Using ArrayList<Integer> might not be as flexible as an array of int
since Integer is immutable. That's easily rectified with a mutable
class instead of Integer. Otherwise, ArrayList already has all you need
plus the automatic ability to grow. The number of available but unused
entries isn't really relevant since they don't contain data until you
put something there.
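
For what it's worth, with autoboxing it's about as simple as this (just a sketch):

import java.util.ArrayList;
import java.util.List;

public class ArrayListDemo {
    public static void main(String[] args) {
        List<Integer> values = new ArrayList<Integer>();
        values.add(42);             // the int is autoboxed to an Integer
        int first = values.get(0);  // and unboxed back to an int
        System.out.println(first);
        // If you can estimate a final size, new ArrayList<Integer>(estimate)
        // avoids some of the early re-allocations.
    }
}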

= Steve =
 
Matt Rose

Chris said:
This is an ill-defined question, but here goes:

My app uses lots of int arrays. We usually don't know how large the
array will ultimately be when we first start populating it. So we grow
the array as necessary, which involves creating a new, larger array and
copying the data over from the old one.

The question is, when we create a larger array, how large should it be?
Assume that all arrays will eventually grow to a random size, unknowable
in advance. Assume also that we want to optimize for both memory and speed.

At one extreme, we could grow each array an element at a time, that is,
the new size would be the old size + 1. This is memory efficient, but
slow and generates a lot of garbage. At the other extreme, we could
allocate huge arrays at the beginning and double them each time we
needed more space. This would be fast because we would seldom need to
extend the arrays, but very memory inefficient.

What's a good tradeoff?

Just for reference, java.util.ArrayList uses this formula internally:

int newCapacity = (oldCapacity * 3)/2 + 1;

caveat: I cannot condone premature optimisation... but it can be fun!

It depends on how these arrays are used, but under some circumstances it
might pay to create a new array each time you run out of space without
copying the data across: just keep a reference to each array in some
sort of ordered data structure (possibly another array, as long as
you're willing to cope with running out of space in that one), and when
you need to access all the data, just iterate over all your arrays.
This gets more complex if the data might arrive in unpredictably sized
chunks.
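
Something roughly like this, for example (a sketch only, names invented, and assuming values arrive one at a time):

import java.util.ArrayList;
import java.util.List;

public final class ChunkedIntList {
    private final List<int[]> full = new ArrayList<int[]>();  // retired, full arrays
    private int[] current = new int[1024];
    private int usedInCurrent = 0;

    public void add(int value) {
        if (usedInCurrent == current.length) {
            full.add(current);                    // keep a reference, copy nothing
            current = new int[current.length];    // (or pick a bigger size here)
            usedInCurrent = 0;
        }
        current[usedInCurrent++] = value;
    }

    // "Just iterate over all your arrays" -- e.g. to sum every value.
    public long sum() {
        long total = 0;
        for (int[] chunk : full) {
            for (int v : chunk) {
                total += v;
            }
        }
        for (int i = 0; i < usedInCurrent; i++) {
            total += current[i];
        }
        return total;
    }
}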

Of course, at some point you might have to give in and declare that the
extra man-hours spent maintaining your own malloc will cost more than the
hardware required to just use an API List implementation...

Matt
 
Luc Mercier

As long as the growth is exponential, the amortized complexity of adding
a new element --- i.e., the average over many additions of new elements ---
is constant. That means the time required to copy the array into a
bigger one has little influence on the runtime, compared to an array of
the right size.
 
Luc Mercier

Luc said:
As long as the growth is exponential, the amortized complexity of adding
a new element --- i.e., the average over many additions of new elements ---
is constant. That means the time required to copy the array into a
bigger one has little influence on the runtime, compared to an array of
the right size.

(Sorry, that was not finished)

You're saying ArrayList uses a factor of 3/2. For performance, at least
from a theoretical point of view, you could just as well choose a factor
of 1.1 or 2, and the runtime would be similar.

But if you choose LINEAR growth instead of EXPONENTIAL -- i.e.,
newLength = oldLength + constant -- then the amortized time of adding an
element becomes linear, so it will be MUCH slower.

Look at any introduction to amortized complexity analysis for more details.
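
In case it helps, here is a sketch of the standard argument (writing g for the growth factor, c for the additive constant, and N for the final size):

With exponential growth by a factor g > 1, the resizes performed while
reaching a final size N copy about

  \sum_{k \ge 1} \frac{N}{g^k} \;=\; \frac{N}{g-1}

elements in total, i.e. O(1) copies per insertion. With linear growth,
newLength = oldLength + c, the resizes copy about

  \sum_{j=1}^{N/c} j\,c \;\approx\; \frac{N^2}{2c}

elements in total, i.e. \Theta(N) copies per insertion.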
 
Lasse Reichstein Nielsen

Luc Mercier said:
As long as the growth is exponential, the amortized complexity of
adding a new element --- i.e., the average over many additions of new
elements --- is constant. That means the time required to copy the
array into a bigger one has little influence on the runtime, compared
to an array of the right size.

It is true that the amortized cost is constant, but if the growth factor
is close to 1.0, it can be a big constant, and the copying will have a
significant influence on runtime.

For example, if you double the array each time, the amortized cost
of an insertion is three element copies (paying for its own insertion
and the first copying of itself and one existing element).

If you only grow the array by 10% each time, the amortized cost of an
insertion will be 11 element copies, i.e., almost four times slower
than for doubling (if we only count array stores as complexity, and
not, e.g., time spent in memory allocation).
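
If you want to see the gap directly, a small simulation that counts the element copies performed at resize time (a sketch only; it does not count the insertion write itself, so the numbers come out a little lower than the per-insertion accounting above) should show roughly one copy per insertion for doubling and roughly ten for a factor of 1.1:

public class GrowthCostDemo {
    // Counts the element copies caused by resizing while appending n values.
    static long copiesFor(double factor, int n) {
        long copies = 0;
        int capacity = 10;
        int size = 0;
        for (int i = 0; i < n; i++) {
            if (size == capacity) {
                copies += size;  // every existing element is copied on resize
                capacity = Math.max((int) (capacity * factor), capacity + 1);
            }
            size++;
        }
        return copies;
    }

    public static void main(String[] args) {
        int n = 10000000;
        System.out.println("factor 2.0: " + (double) copiesFor(2.0, n) / n + " copies per insertion");
        System.out.println("factor 1.1: " + (double) copiesFor(1.1, n) / n + " copies per insertion");
    }
}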


/L
 
bugbear

Chris said:
This is an ill-defined question, but here goes:

My app uses lots of int arrays. We usually don't know how large the
array will ultimately be when we first start populating it. So we grow
the array as necessary, which involves creating a new, larger array and
copying the data over from the old one.

The question is, when we create a larger array, how large should it be?
Assume that all arrays will eventually grow to a random size, unknowable
in advance. Assume also that we want to optimize for both memory and speed.

At one extreme, we could grow each array an element at a time, that is,
the new size would be the old size + 1. This is memory efficient, but
slow and generates a lot of garbage. At the other extreme, we could
allocate huge arrays at the beginning and double them each time we
needed more space. This would be fast because we would seldom need to
extend the arrays, but very memory inefficient.

What's a good tradeoff?

Depending on your access patterns (not specified in your post)
you could implement the List interface over a linked list
of blocks of (say) 1000 objects each.

This structure can be grown without generating garbage or CPU load.

Random access on such a structure is (in this instance) 1000 times
quicker than on a "pure" linked list, and the link-space overhead is
1000 times lower than that of a pure list.

Clearly an iterator would be very efficient.
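
The core of it might look something like this (a sketch only, not the full List interface, and unchecked):

import java.util.LinkedList;

public final class BlockedIntList {
    private static final int BLOCK_SIZE = 1000;
    private final LinkedList<int[]> blocks = new LinkedList<int[]>();
    private int size = 0;

    public void add(int value) {
        if (size % BLOCK_SIZE == 0) {
            blocks.add(new int[BLOCK_SIZE]);   // a fresh block; nothing is copied
        }
        blocks.getLast()[size % BLOCK_SIZE] = value;
        size++;
    }

    public int get(int index) {
        // Walks the chain a block (1000 elements) at a time rather than
        // an element at a time, hence the factor-of-1000 speedup.
        int blockIndex = index / BLOCK_SIZE;
        for (int[] block : blocks) {
            if (blockIndex-- == 0) {
                return block[index % BLOCK_SIZE];
            }
        }
        throw new IndexOutOfBoundsException(String.valueOf(index));
    }

    public int size() {
        return size;
    }
}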

BugBear
 
Chris Uppal

Chris said:
Assume that all arrays will eventually grow to a random size, unknowable
in advance. Assume also that we want to optimize for both memory and
speed.

Under those assumptions the problem is insoluble. Unless you have, at minimum,
some knowledge of the statistical distribution of final sizes, you cannot
optimise for both space and speed simultaneously.

-- chris
 
Robert Klemme

Chris Uppal said:
Under those assumptions the problem is insoluble. Unless you have, at minimum,
some knowledge of the statistical distribution of final sizes, you cannot
optimise for both space and speed simultaneously.

Just to throw in another idea, a BitSet could be an efficient solution
as well - but this of course depends on semantics and usage patterns.
Just my 0.02EUR...
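
For instance, if the int arrays are really being used as sets of non-negative values (a big "if" - it depends entirely on your semantics), then something like this sketch would do:

import java.util.BitSet;

public class BitSetDemo {
    public static void main(String[] args) {
        BitSet seen = new BitSet();              // grows automatically, one bit per value
        seen.set(42);
        System.out.println(seen.get(42));        // true
        System.out.println(seen.cardinality());  // 1 -- how many values are present
    }
}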

Kind regards

robert
 
Chris Uppal

bugbear said:
Depending on your access patterns (not specified in your post)
you could implement the List interface over a linked list
of blocks of (say) 1000 objects each.

This is (obviously) a sensible suggestion. I just want to use it to illustrate
how any implementation corresponds to a set of (possibly unstated) assumptions
about the distribution of final sizes -- which the OP claimed was unknown.

In this case one assumption is that sizes won't cluster around values
n + 1000 * m
for small n, where n and m are integers. In particular, if the arrays'
final sizes tend to be < 10 (say), then the structure will consume excessive
space.

That suggests a refinement (with its own hidden assumptions): use the
start-small and keep doubling technique up to some limit (say 1000), and then
put that block onto the linked list and start again with another small array.

The assumption here, of course, is that small final sizes are reasonably
common, and that sizes just over a multiple of 1000 are reasonably common too.
If the latter assumption is incorrect then we'd only use the start-small
technique for the first chunk; all subsequent ones would start out
pre-allocated to the maximum size.

A further refinement would be to use an array to hold the blocks instead of a
linked list -- that would allow constant-time random access (if that's needed).
It might be worth ensuring that the blocks' size was a power of two, so that
indexing would not require a division operation. One could use the start small
and keep doubling technique to grow that array as necessary, or one could even
use the same data-structure (recursively) to manage it.
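
For what it's worth, the indexing part might look roughly like this (a sketch only, ignoring the start-small refinement for the first chunk; all names are invented):

public final class ChunkedIntArray {
    private static final int BLOCK_SHIFT = 10;              // 1024 ints per block
    private static final int BLOCK_SIZE  = 1 << BLOCK_SHIFT;
    private static final int BLOCK_MASK  = BLOCK_SIZE - 1;

    private int[][] blocks = new int[4][];                  // outer array, doubled on demand
    private int size = 0;

    public void add(int value) {
        int block = size >>> BLOCK_SHIFT;
        if (block == blocks.length) {
            // Only the small outer array is copied; the blocks themselves never move.
            blocks = java.util.Arrays.copyOf(blocks, blocks.length * 2);
        }
        if (blocks[block] == null) {
            blocks[block] = new int[BLOCK_SIZE];
        }
        blocks[block][size & BLOCK_MASK] = value;
        size++;
    }

    public int get(int index) {
        // A shift and a mask instead of a division.
        return blocks[index >>> BLOCK_SHIFT][index & BLOCK_MASK];
    }

    public int size() {
        return size;
    }
}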

Needless to say, all these "refinements" add complexity to the main code path,
and without the knowledge that the OP claimed not to have (access patterns and
size distributions), it's impossible even to guess whether they'd be
improvements.

-- chris
 
