Algorithm for growing arrays

Discussion in 'Java' started by Chris, Aug 8, 2006.

  1. Chris

    Chris Guest

    This is an ill-defined question, but here goes:

    My app uses lots of int arrays. We usually don't know how large the
    array will ultimately be when we first start populating it. So we grow
    the array as necessary, which involves creating a new, larger array and
    copying the data over from the old one.

    The question is, when we create a larger array, how large should it be?
    Assume that all arrays will eventually grow to a random size, unknowable
    in advance. Assume also that we want to optimize for both memory and speed.

    At one extreme, we could grow each array an element at a time, that is,
    the new size would be the old size + 1. This is memory efficient, but
    slow and generates a lot of garbage. At the other extreme, we could
    allocate huge arrays at the beginning and double them each time we
    needed more space. This would be fast because we would seldom need to
    extend the arrays, but very memory inefficient.

    What's a good tradeoff?

    Just for reference, java.util.ArrayList uses this formula internally:

    int newCapacity = (oldCapacity * 3)/2 + 1;
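
    In other words, something like this (a minimal sketch of the kind of
    grow-and-copy code I mean; the class name and starting capacity are
    arbitrary, the growth formula is ArrayList's):

    public class IntBuffer {
        private int[] data = new int[10];
        private int size = 0;

        public void add(int value) {
            if (size == data.length) {
                // same growth formula ArrayList uses internally
                int newCapacity = (data.length * 3) / 2 + 1;
                int[] bigger = new int[newCapacity];
                System.arraycopy(data, 0, bigger, 0, size);
                data = bigger;
            }
            data[size++] = value;
        }

        public int get(int index) {
            return data[index];
        }

        public int size() {
            return size;
        }
    }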
     
    Chris, Aug 8, 2006
    #1

  2. Chris wrote:
    > This is an ill-defined question, but here goes:
    >
    > My app uses lots of int arrays. We usually don't know how large the
    > array will ultimately be when we first start populating it. So we grow
    > the array as necessary, which involves creating a new, larger array and
    > copying the data over from the old one.
    >
    > The question is, when we create a larger array, how large should it be?
    > Assume that all arrays will eventually grow to a random size, unknowable
    > in advance. Assume also that we want to optimize for both memory and speed.


    Do you have any data about the distribution of the sizes? Do you have a
    "cost" function that combines memory and speed costs? How many bytes of
    memory are you prepared to waste to avoid copying n bytes of existing array?

    If you have that sort of data, you may be able to work out an optimum
    strategy in advance.

    If not, I suggest assuming the new size will be a linear function of the
    existing size, giving yourself some way to change the coefficients, and
    experimenting in the working program.
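
    Something like this, say, where the coefficient values are only
    placeholders to experiment with, not recommendations:

    // Tunable growth policy: newSize = a * oldSize + b.
    public class GrowthPolicy {
        private final double a;   // multiplicative coefficient
        private final int b;      // additive coefficient

        public GrowthPolicy(double a, int b) {
            this.a = a;
            this.b = b;
        }

        public int nextCapacity(int oldCapacity) {
            // always grow by at least one element
            return Math.max(oldCapacity + 1, (int) (a * oldCapacity) + b);
        }
    }

    // e.g. new GrowthPolicy(1.5, 1) roughly reproduces ArrayList's formula,
    // while new GrowthPolicy(1.0, 64) grows linearly in 64-element steps.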

    Patricia
     
    Patricia Shanahan, Aug 8, 2006
    #2

  3. Chris wrote:

    > This is an ill-defined question, but here goes:
    >
    > My app uses lots of int arrays. We usually don't know how large the
    > array will ultimately be when we first start populating it. So we grow
    > the array as necessary, which involves creating a new, larger array and
    > copying the data over from the old one.
    >
    > The question is, when we create a larger array, how large should it be?
    > Assume that all arrays will eventually grow to a random size, unknowable
    > in advance. Assume also that we want to optimize for both memory and speed.
    >
    > At one extreme, we could grow each array an element at a time, that is,
    > the new size would be the old size + 1. This is memory efficient, but
    > slow and generates a lot of garbage. At the other extreme, we could
    > allocate huge arrays at the beginning and double them each time we
    > needed more space. This would be fast because we would seldom need to
    > extend the arrays, but very memory inefficient.
    >
    > What's a good tradeoff?
    >
    > Just for reference, java.util.ArrayList uses this formula internally:
    >
    > int newCapacity = (oldCapacity * 3)/2 + 1;


    If you've already looked into what ArrayList uses, then I think a rather
    obvious question is why not use an ArrayList to begin with?

    Using ArrayList<Integer> might not be as flexible as an array of int
    since Integer is immutable. That's easily rectified with a mutable
    class instead of Integer. Otherwise, ArrayList already has all you need
    plus the automatic ability to grow. The number of available but unused
    entries isn't really relevant since they don't contain data until you
    put something there.
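
    For example, nothing beyond the standard API is needed (autoboxing
    does the int <-> Integer conversion on Java 5 and later):

    import java.util.ArrayList;
    import java.util.List;

    public class ArrayListDemo {
        public static void main(String[] args) {
            // the list grows itself as needed
            List<Integer> values = new ArrayList<Integer>();
            for (int i = 0; i < 1000; i++) {
                values.add(i);             // boxed automatically
            }
            int first = values.get(0);     // unboxed automatically
            System.out.println(first + ", size=" + values.size());
        }
    }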

    = Steve =
    --
    Steve W. Jackson
    Montgomery, Alabama
     
    Steve W. Jackson, Aug 8, 2006
    #3
  4. Chris

    Matt Rose Guest

    Chris wrote:
    > This is an ill-defined question, but here goes:
    >
    > My app uses lots of int arrays. We usually don't know how large the
    > array will ultimately be when we first start populating it. So we grow
    > the array as necessary, which involves creating a new, larger array and
    > copying the data over from the old one.
    >
    > The question is, when we create a larger array, how large should it be?
    > Assume that all arrays will eventually grow to a random size, unknowable
    > in advance. Assume also that we want to optimize for both memory and speed.
    >
    > At one extreme, we could grow each array an element at a time, that is,
    > the new size would be the old size + 1. This is memory efficient, but
    > slow and generates a lot of garbage. At the other extreme, we could
    > allocate huge arrays at the beginning and double them each time we
    > needed more space. This would be fast because we would seldom need to
    > extend the arrays, but very memory inefficient.
    >
    > What's a good tradeoff?
    >
    > Just for reference, java.util.ArrayList uses this formula internally:
    >
    > int newCapacity = (oldCapacity * 3)/2 + 1;


    caveat: I cannot condone premature optimisation... but it can be fun!

    It depends on the usage of these arrays, but under some circumstances it
    might pay to create a new array each time you run out of space but not
    copy the data across; just keep a reference in some sort of ordered data
    structure (possibly another array, as long as you're willing to cope with
    running out of space in that one), then when you need to access all the
    data just iterate over all your arrays. This gets more complex if the
    data might arrive in unpredictably sized chunks.
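
    Roughly like this (just a sketch; the class name and chunk size are
    arbitrary, and it assumes elements arrive one at a time):

    import java.util.ArrayList;
    import java.util.List;

    // Append-only int storage that never copies old data: new fixed-size
    // chunks are added as needed, and reads walk the chunk list.
    public class ChunkedIntList {
        private static final int CHUNK_SIZE = 1000;
        private final List<int[]> chunks = new ArrayList<int[]>();
        private int size = 0;

        public void add(int value) {
            if (size % CHUNK_SIZE == 0) {
                chunks.add(new int[CHUNK_SIZE]);   // grow without copying
            }
            chunks.get(size / CHUNK_SIZE)[size % CHUNK_SIZE] = value;
            size++;
        }

        public int get(int index) {
            return chunks.get(index / CHUNK_SIZE)[index % CHUNK_SIZE];
        }

        public int size() {
            return size;
        }
    }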

    Of course, at some point you might have to give in and declare that the
    extra man-hours to maintain your own malloc will cost more than the
    hardware required to just use an API List implementation...

    Matt
     
    Matt Rose, Aug 8, 2006
    #4
  5. Chris

    Luc Mercier Guest

    As long as the growth is exponential, the amortized complexity of adding
    a new element --- i.e., the average over many additions of new elements ---
    is constant. That means the time required to copy the array into a
    bigger one has little influence on the runtime, compared to an array of
    the right size.
     
    Luc Mercier, Aug 8, 2006
    #5
  6. Chris

    Luc Mercier Guest

    Luc Mercier wrote:
    > As long as the growth is exponential, the amortized complexity of adding
    > a new element --- i.e., the average over many additions of new elements ---
    > is constant. That means the time required to copy the array into a
    > bigger one has little influence on the runtime, compared to an array of
    > the right size.


    (Sorry, that was not finished)

    You're saying ArrayList uses a factor of 3/2. For performance, at least
    from a theoretical point of view, you could just as well choose a factor
    of 1.1 or 2, and the runtime performance would be similar.

    But if you choose LINEAR growth instead of EXPONENTIAL -- i.e.,
    newLength = oldLength + constant -- then the amortized time of
    adding an element becomes linear, so it will be MUCH slower.

    Look at any introduction to amortized complexity analysis for more details.
     
    Luc Mercier, Aug 8, 2006
    #6
  7. Luc Mercier writes:

    > As long as the growth is exponential, the amortized complexity of
    > adding a new element --- i.e., the average over many additions of new
    > elements --- is constant. That means the time required to copy the
    > array into a bigger one has little influence on the runtime, compared
    > to an array of the right size.


    It is true that the amortized cost is constant, but if the exponent is
    close to 1.0, it can be a big constant, and the copying will have a
    significant influence on runtime.

    For example, if you double the array each time, the amortized cost
    of an insertion is three element copies (paying for its own insertion
    and the first copying of itself and one existing element).

    If you only grow the array by 10% each time, the amortized cost of an
    insertion will be 11 element copies, i.e., almost four times slower
    than for doubling (if we only count array stores as complexity, and
    not, e.g., time spent in memory allocation).
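
    A quick way to see those constants is simply to count the element
    copies caused by resizing for different growth factors. A throwaway
    sketch (the starting capacity and insert count are arbitrary, and the
    exact numbers depend on where you stop):

    public class CopyCounter {
        public static void main(String[] args) {
            int inserts = 10000000;
            for (double factor : new double[] {2.0, 1.5, 1.1}) {
                long copies = 0;
                int capacity = 16;
                for (int size = 0; size < inserts; size++) {
                    if (size == capacity) {
                        copies += size;   // the resize copies every existing element
                        capacity = Math.max(capacity + 1, (int) (capacity * factor));
                    }
                }
                System.out.printf("factor %.1f: %.2f copies per insertion%n",
                                  factor, (double) copies / inserts);
            }
        }
    }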


    /L
    --
    Lasse Reichstein Nielsen -
    DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
    'Faith without judgement merely degrades the spirit divine.'
     
    Lasse Reichstein Nielsen, Aug 8, 2006
    #7
  8. Chris

    bugbear Guest

    Chris wrote:
    > This is an ill-defined question, but here goes:
    >
    > My app uses lots of int arrays. We usually don't know how large the
    > array will ultimately be when we first start populating it. So we grow
    > the array as necessary, which involves creating a new, larger array and
    > copying the data over from the old one.
    >
    > The question is, when we create a larger array, how large should it be?
    > Assume that all arrays will eventually grow to a random size, unknowable
    > in advance. Assume also that we want to optimize for both memory and speed.
    >
    > At one extreme, we could grow each array an element at a time, that is,
    > the new size would be the old size + 1. This is memory efficient, but
    > slow and generates a lot of garbage. At the other extreme, we could
    > allocate huge arrays at the beginning and double them each time we
    > needed more space. This would be fast because we would seldom need to
    > extend the arrays, but very memory inefficient.
    >
    > What's a good tradeoff?


    Depending on your access patterns (not specified in your post)
    you could implement the List interface over a linked list
    of blocks of (say) 1000 objects each.

    This structure can be grown without generating garbage or incurring any
    copying cost.

    Random access on such a structure is (in this instance)
    1000 times quicker than on a "pure" linked list, and the link
    space overhead is 1000 times lower than for a pure list.

    Clearly an iterator would be very efficient.

    BugBear
     
    bugbear, Aug 9, 2006
    #8
  9. Chris

    Chris Uppal Guest

    Chris wrote:

    > Assume that all arrays will eventually grow to a random size, unknowable
    > in advance. Assume also that we want to optimize for both memory and
    > speed.


    Under those assumptions the problem is insoluble. Unless you have, at minimum,
    some knowledge of the statistical distribution of final sizes, you cannot
    optimise for both space and speed simultaneously.

    -- chris
     
    Chris Uppal, Aug 9, 2006
    #9
  10. On 09.08.2006 11:36, Chris Uppal wrote:
    > Chris wrote:
    >
    >> Assume that all arrays will eventually grow to a random size, unknowable
    >> in advance. Assume also that we want to optimize for both memory and
    >> speed.

    >
    > Under those assumptions the problem is insoluble. Unless you have, at minimum,
    > some knowledge of the statistical distribution of final sizes, you cannot
    > optimise for both space and speed simultaneously.


    Just to throw in another idea, a BitSet could be an efficient solution
    as well - but this of course depends on semantics and usage patterns.
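
    For instance, if the ints really represent membership of smallish
    non-negative values rather than an ordered sequence, the standard
    BitSet already does the growing:

    import java.util.BitSet;

    public class BitSetDemo {
        public static void main(String[] args) {
            // BitSet grows itself and uses about one bit per possible value
            BitSet seen = new BitSet();
            seen.set(42);
            seen.set(100000);
            System.out.println(seen.get(42));        // true
            System.out.println(seen.cardinality());  // 2
        }
    }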
    Just my 0.02EUR...

    Kind regards

    robert
     
    Robert Klemme, Aug 9, 2006
    #10
  11. Chris

    Chris Uppal Guest

    bugbear wrote:

    > Depending on your access patterns (not specified in your post)
    > you could implement the List interface over a linked list
    > of blocks of (say) 1000 objects each.


    This is (obviously) a sensible suggestion. I just want to use it to illustrate
    how any implementation corresponds to a set of (possibly unstated) assumptions
    about the distribution of final sizes -- which the OP claimed was unknown.

    In this case one assumption is that sizes won't cluster around values
    n + 1000 * m
    for small n, and where n and m are integers. In particular, if the array's
    final sizes tend to be < 10 (say) then the structure will consume excessive
    space.

    That suggests a refinement (with its own hidden assumptions): use the
    start-small and keep doubling technique up to some limit (say 1000), and then
    put that block onto the linked list and start again with another small array.

    The assumption here, of course, is that small final sizes are reasonably
    common, and that sizes just over a multiple of 1000 are reasonably common too.
    If the latter assumption is incorrect then we'd only use the start-small
    technique for the first chunk; all subsequent ones would start out
    pre-allocated to the maximum size.

    A further refinement would be to use an array to hold the blocks instead of a
    linked list -- that would allow constant-time random access (if that's needed).
    It might be worth ensuring that the blocks' size was a power of two, so that
    indexing would not require a division operation. One could use the start small
    and keep doubling technique to grow that array as necessary, or one could even
    use the same data-structure (recursively) to manage it.
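
    A partial sketch of just the power-of-two indexing and the
    block-directory idea (it leaves out the start-small-and-double
    refinement, and all names are made up):

    public class BlockedIntList {
        private static final int SHIFT = 10;               // blocks of 1 << 10 = 1024 ints
        private static final int BLOCK_SIZE = 1 << SHIFT;
        private static final int MASK = BLOCK_SIZE - 1;

        private int[][] blocks = new int[4][];             // small directory, doubled when full
        private int size = 0;

        public void add(int value) {
            int blockIndex = size >>> SHIFT;
            if (blockIndex == blocks.length) {
                // only the directory is ever copied, never the data blocks
                int[][] bigger = new int[blocks.length * 2][];
                System.arraycopy(blocks, 0, bigger, 0, blocks.length);
                blocks = bigger;
            }
            if (blocks[blockIndex] == null) {
                blocks[blockIndex] = new int[BLOCK_SIZE];
            }
            blocks[blockIndex][size & MASK] = value;
            size++;
        }

        public int get(int index) {
            // constant-time random access: a shift and a mask, no division
            return blocks[index >>> SHIFT][index & MASK];
        }

        public int size() {
            return size;
        }
    }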

    Needless to say, all these "refinements" add complexity to the main code path,
    and without the knowledge that the OP claimed not to have (access patterns and
    size distributions), it's impossible even to guess whether they'd be
    improvements.

    -- chris
     
    Chris Uppal, Aug 9, 2006
    #11
  12. Chris

    bugbear Guest

    Chris Uppal wrote:
    > bugbear wrote:
    >
    >
    >>Depending on your access patterns (not specified in your post)
    >>you could implement the List interface over a linked list
    >>of blocks of (say) 1000 objects each.

    >
    >
    > This is (obviously) a sensible suggestion. I just want to use it to illustrate...


    And you did.

    Excellent article.

    BugBear
     
    bugbear, Aug 9, 2006
    #12