'ArrayList' allowing more than 2G entries.

Discussion in 'Java' started by Tom McGlynn, Jun 18, 2009.

  1. Tom McGlynn

    Tom McGlynn Guest

    Occasionally I need to store information in arrays which can get
    larger than 2G. Since I can't make an array that long, I'd like to
    have a backup in these large cases where I use something that looks
    like an ArrayList except that it allows >2G entries. So it won't be a
    real ArrayList, e.g., size() will return a long, but should look very
    similar. This isn't too difficult to implement, but I was curious if
    anyone knew of existing libraries that might already have addressed
    this issue.

    Thanks,
    Tom McGlynn
     
    Tom McGlynn, Jun 18, 2009
    #1

  2. Lord Zoltar

    Lord Zoltar Guest

    On Jun 18, 2:10 pm, Tom McGlynn <> wrote:
    > Occasionally I need to store information in arrays which can get
    > larger than 2G.  Since I can't make an array that long, I'd like to
    > have a backup in these large cases where I use something that looks
    > like an ArrayList except that it allows >2G entries.  So it won't be a
    > real ArrayList, e.g., size() will return a long, but should look very
    > similar.  This isn't too difficult to implement, but I was curious if
    > anyone knew of existing libraries that might already have addressed
    > this issue.
    >
    > Thanks,
    > Tom McGlynn


    2G... do you mean 2 billion?
     
    Lord Zoltar, Jun 18, 2009
    #2

  3. markspace

    markspace Guest

    Lord Zoltar wrote:

    >
    > 2G... do you mean 2 billion?



    I'm sure from the context of his post he means greater than
    Integer.MAX_VALUE.
     
    markspace, Jun 18, 2009
    #3
  4. Tom McGlynn

    Tom McGlynn Guest

    On Jun 18, 4:50 pm, Patricia Shanahan <> wrote:
    > Tom McGlynn wrote:
    > > Occasionally I need to store information in arrays which can get
    > > larger than 2G. Since I can't make an array that long, I'd like to
    > > have a backup in these large cases where I use something that looks
    > > like an ArrayList except that it allows >2G entries. So it won't be a
    > > real ArrayList, e.g., size() will return a long, but should look very
    > > similar. This isn't too difficult to implement, but I was curious if
    > > anyone knew of existing libraries that might already have addressed
    > > this issue.

    >
    > If you decide to do something ArrayList-like, be careful about the data
    > structure design and algorithms. ArrayList's algorithms for insert and
    > delete may not be suitable.
    >
    > If you do not need ArrayList-like properties, just an array, consider a
    > memory mapped temporary file.
    >
    > Patricia


    You're right that I'm mostly interested in the array aspects. The
    sizes rarely change once the arrays are allocated (or I'd have been
    using ArrayList for the more tractable sizes too). I had wanted to
    emulate the ArrayList interface so that the code would seem idiomatic
    Java but I wouldn't typically be resizing. Essentially
    all I really need is:

    LongList ll = new LongList(longSize);
    val = ll.get(longIndex);
    ll.put(longIndex, val);

    With regard to using temporary files: Curiously -- I've never gotten
    a decent explanation from the Sysadmins on this -- the work systems I
    mostly run on have very small /tmp areas (~500 MB) so I often have
    much more space in memory than in the temporary file area. Most of
    the time I could ask the user where to put the backing file though...
    Spud seems to indicate that that won't really work anyway. I'll take
    a look.

    With regard to some of the comments about the size of all of this:
    Generally these are arrays of primitives or simple FlyWeights where
    the state of the object is externalized to one or two primitive
    arrays. So the number of actual Java objects that needs to be created
    can be quite small and I can run into limits even when the total
    storage requirement only slightly exceeds 2 GB. As the commentators
    note, though, it's easily possible for my requirements to exceed the
    memory sizes that I can expect a machine to have. True enough, but I
    do want to be able to use the memory I do have. Machines with ~10 GB of
    physical memory aren't too hard to come by. It looks like 4 GB is now
    pretty standard for a $500 machine at BestBuy.

    The best solution for me would be for Java 7 to come out with long
    array indices -- but I don't believe that's in the cards.

    Regards,
    Tom McGlynn
     
    Tom McGlynn, Jun 19, 2009
    #4
  5. Tom McGlynn <> wrote:
    > Essentially all I really need is:
    > LongList ll = new LongList(longSize);
    > val = ll.get(longIndex);
    > ll.put(longIndex, val);


    If all you need is put and get, then make a small class that
    contains an array of arrays, and use the upper n bits for
    indexing the outer array and the lower n bits for indexing
    the inner ones (lazily creating inner arrays on "put" only).

    I would make the size (2^n) of the inner arrays much smaller than
    2GB: perhaps even just 32 million (2^25) items or so. That
    way you don't waste too much on allocating only full inner
    arrays, and from your description, the resulting 50-bit
    maximum size may still be large enough (a million billion
    items) for practical purposes on machines in the foreseeable
    future. (No "640kB is large enough" reference here.)
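
    A minimal sketch of that scheme, assuming long elements and a 2^25
    chunk size (the class name and details are illustrative):

    class ChunkedLongArray {
        private static final int SHIFT = 25;           // 2^25 elements per inner array
        private static final int CHUNK = 1 << SHIFT;
        private static final int MASK = CHUNK - 1;
        private final long[][] chunks;

        ChunkedLongArray(long size) {
            // Allocate only the outer array up front.
            chunks = new long[(int)((size + CHUNK - 1) >>> SHIFT)][];
        }

        long get(long index) {
            long[] chunk = chunks[(int)(index >>> SHIFT)];
            return chunk == null ? 0L : chunk[(int)(index & MASK)];
        }

        void put(long index, long val) {
            int outer = (int)(index >>> SHIFT);
            if (chunks[outer] == null) {
                chunks[outer] = new long[CHUNK];       // lazily created on "put" only
            }
            chunks[outer][(int)(index & MASK)] = val;
        }
    }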

    PS: I think such a class is faster self-written, reviewed and
    tested, than a foreign library evaluated for usability.
     
    Andreas Leitgeb, Jun 19, 2009
    #5
  6. On 19.06.2009 15:21, Tom McGlynn wrote:
    > You're right that I'm mostly interested in the array aspects. The
    > sizes rarely change once the arrays are allocated (or I'd have been
    > using ArrayList for the more tractable sizes too). I had wanted to
    > emulate the ArrayList interface so that the code would seem idiomatic
    > Java but I wouldn't typically be resizing. Essentially
    > all I really need is:
    >
    > LongList ll = new LongList(longSize);
    > val = ll.get(longIndex);
    > ll.put(longIndex, val);


    Do you really need to fill the array to that size or do you need the
    largish array in order to allow for long indexes? In the latter case
    with sparse filling you might simply use a Map<Long,YourType>.

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Jun 19, 2009
    #6
  7. On Fri, 19 Jun 2009 06:21:53 -0700, Tom McGlynn wrote:

    > You're right that I'm mostly interested in the array aspects. The sizes
    > rarely change once the arrays are allocated (or I'd have been using
    > ArrayList for the more tractable sizes too). I had wanted to emulate
    > the ArrayList interface so that the code would seem idiomatic Java but I
    > wouldn't typically be resizing. Essentially all I really need is:
    >
    > LongList ll = new LongList(longSize); val = ll.get(longIndex);
    > ll.put(longIndex, val);
    >

    Some questions:
    - when you say a 2G array, is that the number of elements or
    the total data size?
    - do the elements have a key?
    - is the index also a 'key' and if so, how sparse is the array?
    - is the array always guaranteed to fit into the process memory
    without paging?


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
     
    Martin Gregorie, Jun 19, 2009
    #7
  8. markspace

    markspace Guest

    Robert Klemme wrote:

    > Do you really need to fill the array to that size or do you need the
    > largish array in order to allow for long indexes? In the latter case
    > with sparse filling you might simply use a Map<Long,YourType>.


    Interesting idea! Untested:


    import java.util.HashMap;

    public class LongList<T> {

        private static final int BUF_SIZE = 2048;

        private HashMap<Long,T[]> store = new HashMap<Long,T[]>();

        public T get( long index ) {
            T[] buf = store.get( index / BUF_SIZE );
            if( buf == null ) {
                return null;
            }
            return buf[ (int)(index % BUF_SIZE) ];
        }

        @SuppressWarnings("unchecked")
        public void put( long index, T value ) {
            T[] buf = store.get( index / BUF_SIZE );
            if( buf == null ) {
                buf = (T[]) new Object[BUF_SIZE];
                store.put( index / BUF_SIZE, buf );
            }
            buf[ (int)(index % BUF_SIZE) ] = value;
        }
    }
     
    markspace, Jun 19, 2009
    #8
  9. Tom McGlynn

    Tom McGlynn Guest

    On Jun 19, 12:48 pm, Martin Gregorie <>
    wrote:
    > On Fri, 19 Jun 2009 06:21:53 -0700, Tom McGlynn wrote:
    > > You're right that I'm mostly interested in the array aspects. The sizes
    > > rarely change once the arrays are allocated (or I'd have been using
    > > ArrayList for the more tractable sizes too). I had wanted to emulate
    > > the ArrayList interface so that the code would seem idiomatic Java but I
    > > wouldn't typically be resizing. Essentially all I really need is:

    >
    > > LongList ll = new LongList(longSize); val = ll.get(longIndex);
    > > ll.put(longIndex, val);

    >
    > Some questions:
    > - when you say a 2G array, is that the number of elements or
    > the total data size?


    The number of elements.
    > - do the elements have a key?


    I'm not quite sure what you mean by this... Generally I'm only
    interested in accessing the data through the index.

    > - is the index also a 'key' and if so, how sparse is the array?


    The arrays are dense, i.e., I will have an element for every index
    less than the maximum.

    > - is the array always guaranteed to fit into the process memory
    > without paging?
    >


    Hmmm... In the underlying system no. I'm basically writing code to
    interpret a particular file format and the format makes no real limit
    on the size of the files. However I support both streaming and
    in-memory access to the data. At some point streaming will be
    required for all users. Right now for many users that point is being
    set because certain arrays are reaching the 2 GB limit even though a
    larger array would fit within their memory if it could be constructed.
    The in-memory modes are much easier for the user and more flexible.
    So in the context of what I'm trying to do it may be a reasonable
    assumption that paging is not an issue. If their system thrashes for
    in-memory access, then they've passed the limit where they need to go
    to the streaming mode.


    Regards,
    Tom McGlynn
     
    Tom McGlynn, Jun 19, 2009
    #9
  10. Roedy Green

    Roedy Green Guest

    On Thu, 18 Jun 2009 11:10:02 -0700 (PDT), Tom McGlynn
    <> wrote, quoted or indirectly quoted
    someone who said :

    >Occasionally I need to store information in arrays which can get
    >larger than 2G. Since I can't make an array that long, I'd like to
    >have a backup in these large cases where I use something that looks
    >like an ArrayList except that it allows >2G entries. So it won't be a
    >real ArrayList, e.g., size() will return a long, but should look very
    >similar. This isn't too difficult to implement, but I was curious if
    >anyone knew of existing libraries that might already have addressed
    >this issue.


    If you are doing any inserting and deleting, the slides on a giant
    64-bit JNI array would be prohibitive. I thought of using a HashMap,
    but it is just array inside too, and would have the same size limit
    problems.

    For this, you need to figure out exactly which properties of ArrayList
    you need, then look for a minimal Collection that satisfies those
    needs. You may not need all the bells of a long ArrayList.

    Even Collection has the int limit on size(). Looks like Sun was not
    thinking ahead to 64-bit Collections.

    I suspect you will end up cooking up your own that is implemented
    inside with [][]

    It's really rather odd to have a 64-bit JVM without a 64-bit indexed
    array.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    If everyone lived the way people do in Vancouver, we would need three more entire planets to support us.
    ~ Guy Dauncey
     
    Roedy Green, Jun 19, 2009
    #10
  11. On 19.06.2009 18:57, markspace wrote:
    > Robert Klemme wrote:
    >
    >> Do you really need to fill the array to that size or do you need the
    >> largish array in order to allow for long indexes? In the latter case
    >> with sparse filling you might simply use a Map<Long,YourType>.

    >
    > Interesting idea! Untested:
    >
    >
    > public class LongList<T> {
    >
    >     private static final int BUF_SIZE = 2048;
    >
    >     private HashMap<Long,T[]> store = new HashMap<Long,T[]>();
    >
    >     public T get( long index ) {
    >         T[] buf = store.get( index / BUF_SIZE );
    >         if( buf == null ) {
    >             return null;
    >         }
    >         return buf[ (int)(index % BUF_SIZE) ];
    >     }
    >
    >     public void put( long index, T value ) {
    >         T[] buf = store.get( index / BUF_SIZE );
    >         if( buf == null ) {
    >             buf = (T[]) new Object[BUF_SIZE];
    >             store.put( index / BUF_SIZE, buf );
    >         }
    >         buf[ (int)(index % BUF_SIZE) ] = value;
    >     }
    > }


    I rather meant to use the map directly. This is likely only feasible if
    the total number of elements is less than maxint, i.e. a sparse collection.

    Your approach is similar to some deque implementations I have seen in
    the C++ world. It can also be implemented with an ArrayList as the
    first level collection instead of a map. I would use the ArrayList as
    the first level because it is more memory efficient (the array in a
    HashMap is usually larger than the number of elements because of the
    way hash maps are built) while having similar performance characteristics.

    Btw, I'd make the chunk size configurable (e.g. via a constructor
    argument) because that makes your class more flexible.
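
    For instance, a sketch along those lines (the class name is
    illustrative; the first level grows on demand):

    import java.util.ArrayList;

    public class ChunkedList<T> {
        private final int chunkSize;                   // configurable via the constructor
        private final ArrayList<T[]> store = new ArrayList<T[]>();

        public ChunkedList(int chunkSize) {
            this.chunkSize = chunkSize;
        }

        @SuppressWarnings("unchecked")
        public void put(long index, T value) {
            int outer = (int)(index / chunkSize);
            while (store.size() <= outer) {            // grow the first level on demand
                store.add((T[]) new Object[chunkSize]);
            }
            store.get(outer)[(int)(index % chunkSize)] = value;
        }

        public T get(long index) {
            int outer = (int)(index / chunkSize);
            if (outer >= store.size()) {
                return null;
            }
            return store.get(outer)[(int)(index % chunkSize)];
        }
    }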

    But when reading your other posting it seems that for the file case
    (i.e. non streaming) you should consider memory mapped IO.

    http://java.sun.com/javase/6/docs/api/java/nio/MappedByteBuffer.html
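
    Note that a single MappedByteBuffer is itself limited to 2^31-1
    bytes, so a file larger than that needs one mapping per region. A
    minimal sketch (names and window size are illustrative; it assumes
    8-byte elements, so with a window that is a multiple of 8 no value
    straddles a boundary):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    class MappedLongFile {
        private static final long WINDOW = 1L << 30;   // 1 GB per mapping, a multiple of 8
        private final MappedByteBuffer[] windows;

        MappedLongFile(String path) throws Exception {
            FileChannel ch = new RandomAccessFile(path, "r").getChannel();
            long len = ch.size();
            windows = new MappedByteBuffer[(int)((len + WINDOW - 1) / WINDOW)];
            for (int i = 0; i < windows.length; i += 1) {
                long pos = i * WINDOW;
                windows[i] = ch.map(FileChannel.MapMode.READ_ONLY, pos,
                                    Math.min(WINDOW, len - pos));
            }
        }

        long getLong(long byteOffset) {
            // Pick the window, then index within it with an int offset.
            return windows[(int)(byteOffset / WINDOW)]
                    .getLong((int)(byteOffset % WINDOW));
        }
    }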

    We should probably learn more about your usage patterns. Are those
    elements read in order of appearance? Do you frequently have to jump
    back and forth? How often is each entry accessed? Is there a
    particular locality to accesses etc.?

    Also, I am wondering what you do in the streaming case: your huge array
    is accessed by index. But for real streaming processing you can only
    access elements sequentially or at most keep n elements in memory. If
    you need full indexed access in the streaming case, you will have to read
    the whole stream anyway and store it somewhere, so you could cover this
    eventually by memory mapped IO as well.

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Jun 19, 2009
    #11
  12. Roedy Green

    Roedy Green Guest

    On Fri, 19 Jun 2009 10:32:15 -0700 (PDT), Tom McGlynn
    <> wrote, quoted or indirectly quoted
    someone who said :

    >So in the context of what I'm trying to do it may be a reasonable
    >assumption that paging is not an issue. If their system thrashes for
    >in-memory access, then they've passed the limit where they need to go
    >to the streaming mode.


    This kicks me into codger mode, unable to resist retelling a tale from
    the 60s. Back then, CDC made the big scientific machines and they
    had something called ECS (extended core storage), basically RAM
    slower than regular RAM.

    At BC Hydro we were solid IBM, but wanted to run some of the CDC
    software. I was given the job of emulating ECS. I wrote an LRU
    paging scheme to disk. To everyone's surprise, even mine, it was much
    faster than REAL ECS. It turned out the locality was good, and the
    real RAM was sufficiently faster than ECS that, even with the
    emulation overhead, it still beat the real thing.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    If everyone lived the way people do in Vancouver, we would need three more entire planets to support us.
    ~ Guy Dauncey
     
    Roedy Green, Jun 19, 2009
    #12
  13. Tom Anderson

    Tom Anderson Guest

    On Fri, 19 Jun 2009, Tom McGlynn wrote:

    >> Tom McGlynn wrote:
    >>> Occasionally I need to store information in arrays which can get
    >>> larger than 2G. Since I can't make an array that long, I'd like to
    >>> have a backup in these large cases where I use something that looks
    >>> like an ArrayList except that it allows >2G entries. So it won't be a
    >>> real ArrayList, e.g., size() will return a long, but should look very
    >>> similar. This isn't too difficult to implement, but I was curious if
    >>> anyone knew of existing libraries that might already have addressed
    >>> this issue.

    >
    > With regard to some of the comments about the size of all of this:
    > Generally these are arrays of primitives or simple FlyWeights where the
    > state of the object is externalized to one or two primitive arrays. So
    > the number of actual Java objects that needs to be created can be quite
    > small and I can run into limits even when the total storage requirement
    > only slightly exceeds 2 GB.


    Something to think about would be whether you can store each entry in
    less space than you currently do. In the case of flyweights, you're
    currently spending 32 bits on each entry, but if there are only 256
    (say) distinct objects being pointed to, you only actually need 8 bits.
    Rather than using pointers, you could put your values in an array, and
    store unsigned byte indices into it. Mutatis mutandis for other
    numbers of flyweights.

    This doesn't mean you can suddenly use ArrayList, but if you do roll your
    own solution, it might be a way to make it more compact.
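
    A sketch of that encoding (the names are illustrative): a small
    palette holds the distinct flyweights, and the big array stores one
    unsigned byte per entry as an index into it.

    class PaletteList<T> {
        private final T[] palette;                 // at most 256 distinct flyweights
        private final byte[] indices;              // one unsigned byte per entry

        PaletteList(T[] palette, int size) {
            if (palette.length > 256) {
                throw new IllegalArgumentException("palette too large for a byte index");
            }
            this.palette = palette;
            this.indices = new byte[size];         // combine with chunking to pass 2G entries
        }

        T get(int i) {
            return palette[indices[i] & 0xFF];     // mask to read the byte as unsigned
        }

        void put(int i, T value) {
            // A linear scan is fine for a small palette; a map would do for bigger ones.
            for (int p = 0; p < palette.length; p += 1) {
                if (palette[p].equals(value)) {
                    indices[i] = (byte) p;
                    return;
                }
            }
            throw new IllegalArgumentException("value not in palette");
        }
    }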

    There's a whole field of research on 'succinct data structures', which are
    structures that use as little space as possible to store their contents.
    Although I don't think they have anything terribly interesting to
    contribute in the case of lists, they might be worth looking into.

    tom

    --
    Operate all mechanisms!
     
    Tom Anderson, Jun 20, 2009
    #13
  14. On Fri, 19 Jun 2009 10:32:15 -0700, Tom McGlynn wrote:

    > On Jun 19, 12:48 pm, Martin Gregorie <>
    > wrote:
    >> On Fri, 19 Jun 2009 06:21:53 -0700, Tom McGlynn wrote:
    >> > You're right that I'm mostly interested in the array aspects. The
    >> > sizes rarely change once the arrays are allocated (or I'd have been
    >> > using ArrayList for the more tractable sizes too). I had wanted to
    >> > emulate the ArrayList interface so that the code would seem idiomatic
    >> > Java but I wouldn't typically be resizing. Essentially all I really
    >> > need is:

    >>
    >> > LongList ll = new LongList(longSize); val = ll.get(longIndex);
    >> > ll.put(longIndex, val);

    >>
    >> Some questions:
    >> - when you say a 2G array, is that the number of elements or
    >> the total data size?

    >
    > The number of elements.


    OK, so it's a BIG array.

    >> - do the elements have a key?

    >
    > I'm not quite sure what you mean by this... Generally I'm only
    > interested accessing the data through the index.
    >

    I was trying to find out whether access to the array was random or
    sequential. Sounds like it's sequential.

    >> - is the index also a 'key' and if so, how sparse is the array?

    >
    > The arrays are dense, i.e., I will have an element for every index less
    > than the maximum.
    >

    OK.

    >> - is the array always guaranteed to fit into the process memory
    >> without paging?
    >>
    >>

    > Hmmm... In the underlying system no. I'm basically writing code to
    > interpret a particular file format and the format makes no real limit on
    > the size of the files. However I support both streaming and in-memory
    > access to the data. At some point streaming will be required for
    > all users. Right now for many users that point is being set because
    > certain arrays are reaching the 2 GB limit even though a larger array
    > would fit within their memory if it could be constructed. The in-memory
    > modes are much easier for the user and more flexible. So in the context
    > of what I'm trying to do it may be a reasonable assumption that paging
    > is not an issue. If their system thrashes for in-memory access, then
    > they've passed the limit where they need to go to the streaming mode.
    >

    Does this mean there's a single copy in memory that a number of clients
    are accessing? It certainly sounds that way.

    I'm trying to understand how the access pattern maps onto the data. So
    far it seems as if a large population of clients are all reading the
    array sequentially without modifying it, but that all the data doesn't
    necessarily fit into memory and the time at which readers start is
    arbitrary.

    It's beginning to sound as though the array could be mapped to a page
    file and, as Roedy suggested, a Least Recently Used (LRU) paging
    algorithm would give optimum access speed for all clients. Long ago and
    far away I've seen LRU give surprisingly good results, but on a modern
    OS (particularly if you want a portable solution) the class containing
    the array will need to implement the paging algorithm itself, because
    modern OSen don't give the programmer enough (any?) control over the
    paging algorithm to be used on particular data structures.
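
    A compact way to get LRU behaviour in the class itself is
    LinkedHashMap's access order (a sketch; the page type and the way
    pages get loaded are illustrative):

    import java.util.LinkedHashMap;
    import java.util.Map;

    class PageCache extends LinkedHashMap<Long, byte[]> {
        private final int maxPages;

        PageCache(int maxPages) {
            super(16, 0.75f, true);                // true = access order, i.e. LRU
            this.maxPages = maxPages;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
            return size() > maxPages;              // evict the least recently used page
        }
    }

    On a miss the array class would load the page from the backing file
    and put() it here; a hit just touches the entry and keeps it young.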

    Come back VME/B - all is forgiven!


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
     
    Martin Gregorie, Jun 20, 2009
    #14
  15. markspace

    markspace Guest

    Robert Klemme wrote:

    >
    > I rather meant to use the map directly. This is likely only feasible if
    > the total number of elements is less than maxint, i.e. a sparse collection.



    My code is rather rough all the way around. Although I did intend it
    for contiguous rather than sparse allocation. I also like the idea of
    encapsulating behavior in a class -- it can be changed later with less
    hassle, at least -- rather than using a HashMap directly.

    Your points are well taken, but my code was just an off the cuff
    example. I have no idea if it's really a worthwhile thing to pursue.
     
    markspace, Jun 20, 2009
    #15
  16. On 20.06.2009 03:28, Martin Gregorie wrote:

    > It's beginning to sound as though the array could be mapped to a page file
    > and, as Roedy suggested, a Least Recently Used (LRU) paging algorithm
    > would give optimum access speed for all clients. Long ago and far away
    > I've seen LRU give surprisingly good results, but on a modern OS
    > (particularly if you want a portable solution) the class containing the
    > array will need to implement the paging algorithm itself because modern
    > OSen don't give the programmer enough (any?) control over the paging
    > algorithm to be used on particular data structures.


    What would be the advantage of your approach over using memory mapped IO
    other than that MMIO does not work for streaming? To me it sounds
    inefficient to first read a gigantic file only to then create a huge
    memory structure which is paged out. It seems a better approach is to
    memory map the original file and have a proxy structure which gives
    convenient access to the underlying data. That structure could well use
    LRU for caching an active subset of the data.

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Jun 20, 2009
    #16
  17. On Sat, 20 Jun 2009 17:49:21 +0200, Robert Klemme wrote:

    > On 20.06.2009 03:28, Martin Gregorie wrote:
    >
    >> It's beginning to sound as though the array could be mapped to a page
    >> file and, as Roedy suggested, a Least Recently Used (LRU) paging
    >> algorithm would give optimum access speed for all clients. Long ago and
    >> far away I've seen LRU give surprisingly good results, but on a modern
    >> OS (particularly if you want a portable solution) the class containing
    >> the array will need to implement the paging algorithm itself because
    >> modern OSen don't give the programmer enough (any?) control over the
    >> paging algorithm to be used on particular data structures.

    >
    > What would be the advantage of your approach over using memory mapped IO
    > other than that MMIO does not work for streaming? To me it sounds
    > inefficient to first read a gigantic file only to then create a huge
    > memory structure which is paged out.
    >

    Agreed. I didn't mean to suggest any particular implementation, just
    to point out that a shared paging implementation would be a good idea.

    > It seems a better approach is to
    > memory map the original file and have a proxy structure which gives
    > convenient access to the underlying data. That structure could well use
    > LRU for caching an active subset of the data.
    >

    Sure, why not? Sounds good to me. Actually, if access really is
    sequential by array index, sequential paging with look-ahead would be
    better than LRU, with a thread dedicated to keeping the cache full ahead
    of each consuming process.
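
    A sketch of that arrangement (names and the page size are
    illustrative; a real version would handle short reads and the final
    partial page): a producer thread reads fixed-size pages ahead of the
    consumer into a bounded queue.

    import java.io.FileInputStream;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class ReadAheadPager {
        private static final int PAGE_SIZE = 1 << 20;              // 1 MB pages
        private static final byte[] EOF = new byte[0];             // end-of-file sentinel
        private final BlockingQueue<byte[]> pages =
                new ArrayBlockingQueue<byte[]>(8);                 // at most 8 pages ahead

        ReadAheadPager(final String path) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        FileInputStream in = new FileInputStream(path);
                        try {
                            while (true) {
                                byte[] page = new byte[PAGE_SIZE];
                                int n = in.read(page);
                                if (n < 0) { pages.put(EOF); return; }
                                pages.put(page);                   // blocks when far enough ahead
                            }
                        } finally {
                            in.close();
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            }).start();
        }

        // Blocks until the next page is ready; returns null at end of file.
        byte[] nextPage() throws InterruptedException {
            byte[] p = pages.take();
            return p == EOF ? null : p;
        }
    }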


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
     
    Martin Gregorie, Jun 20, 2009
    #17
  18. On 20.06.2009 18:28, Martin Gregorie wrote:

    >> It seems a better approach is to
    >> memory map the original file and have a proxy structure which gives
    >> convenient access to the underlying data. That structure could well use
    >> LRU for caching an active subset of the data.
    >>

    > Sure, why not? Sounds good to me. Actually, if access really is
    > sequential by array index, sequential paging with look-ahead would be
    > better than LRU, with a thread dedicated to keeping the cache full ahead
    > of each consuming process.


    Absolutely! Now we only need to convince Tom to disclose the usage
    pattern... :)

    Cheers

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Jun 20, 2009
    #18
  19. Tom McGlynn

    Tom McGlynn Guest

    On Jun 20, 12:34 pm, Robert Klemme <> wrote:
    > On 20.06.2009 18:28, Martin Gregorie wrote:
    >
    > >> It seems a better approach is to
    > >> memory map the original file and have a proxy structure which gives
    > >> convenient access to the underlying data. That structure could well use
    > >> LRU for caching an active subset of the data.

    >
    > > Sure, why not? Sounds good to me. Actually, if access really is
    > > sequential by array index, sequential paging with look-ahead would be
    > > better than LRU, with a thread dedicated to keeping the cache full ahead
    > > of each consuming process.

    >
    > Absolutely! Now we only need to convince Tom to disclose the usage
    > pattern... :)
    >


    This all goes back to my first large Java project, writing a library
    to read and write FITS format files. FITS is the standard format for
    astronomical data and it has always allowed very large arrays and
    tables. However, a dozen or so years ago when I wrote the package,
    gigabyte files were very uncommon so that few if any users were
    inconvenienced by the circumlocutions that were needed to handle very
    large files. Now they are getting much more common and the limits
    on array sizes are beginning to pinch in a number of places: there are
    files with tables or arrays with dimensions of more than 2^31 and
    internal FITS structures that my implementation uses which need large
    byte arrays to implement.

    The library supports both streaming and random access to the internals
    of the data structures. Different users may use both or either.
    Inputs can be files, pipes, URLs or compressed files. So I need
    something that's pretty flexible -- I may or may not be able to do
    random access on the input.

    I very much appreciate the comments that everyone has made. While I'd
    hoped that I'd be able to piggy back on someone else's work, I think
    the discussion has given me some ideas for building my own... Right
    now I'm thinking of something like the following using Andreas' ideas
    from above. In a second iteration this should be easily adaptable to
    supporting backing files of some kind (perhaps using the input if
    that's feasible or using some external storage if not). Many of you
    suggested various ways of doing this kind of file backup and I suspect
    I'll need to use more than one. Then BigList might allow only a
    certain number of sublists to be in memory at a given time. That
    would make it possible to scale the code to much larger sizes than are
    available in memory, not just exceed the 2G array limit which was my
    original goal.


    Thanks to all,
    Tom McGlynn

    ---- Code sketch follows, not intended for compilation ---

    import java.util.ArrayList;
    import java.util.List;

    class BigList<T> {

        static final int PAGE_SIZE = 1 << 24;   // 16M entries per page

        private final List<SubList<T>> subLists;

        BigList(long size) {
            int listCount = (int)(size / PAGE_SIZE) + 1;
            subLists = new ArrayList<SubList<T>>(listCount);
            for (int i = 0; i < listCount; i += 1) {
                subLists.add(new NullList(i));
            }
        }

        void put(long index, T val) {
            subLists.get((int)(index / PAGE_SIZE)).put((int)(index % PAGE_SIZE), val);
        }

        T get(long index) {
            return subLists.get((int)(index / PAGE_SIZE)).get((int)(index % PAGE_SIZE));
        }

        interface SubList<T> {
            T get(int i);
            void put(int i, T val);
        }

        // Placeholder for a page that has never been written.
        class NullList implements SubList<T> {
            private final int index;
            NullList(int index) {
                this.index = index;
            }
            public T get(int i) {
                throw new IllegalStateException("Can't do get before put");
            }
            public void put(int i, T val) {
                // Replace the null list with a real list on first write.
                RealList list = new RealList(index);
                list.put(i, val);
                subLists.set(index, list);
            }
        }

        class RealList implements SubList<T> {
            private final int index;
            private final T[] data;
            @SuppressWarnings("unchecked")
            RealList(int index) {
                this.index = index;
                data = (T[]) new Object[PAGE_SIZE];
                // ... get the data for this page ...
            }
            public T get(int i) {
                return data[i];
            }
            public void put(int i, T val) {
                data[i] = val;
            }
        }
    }
     
    Tom McGlynn, Jun 22, 2009
    #19
  20. Daniel Pitts

    Daniel Pitts Guest

    Tom McGlynn wrote:
    > Occasionally I need to store information in arrays which can get
    > larger than 2G. Since I can't make an array that long, I'd like to
    > have a backup in these large cases where I use something that looks
    > like an ArrayList except that it allows >2G entries. So it won't be a
    > real ArrayList, e.g., size() will return a long, but should look very
    > similar. This isn't too difficult to implement, but I was curious if
    > anyone knew of existing libraries that might already have addressed
    > this issue.
    >
    > Thanks,
    > Tom McGlynn
    >

    Are you really storing that much in memory on a frequent occasion?
    The overhead of the array alone (2G references at 4 bytes each) would
    be 8GB of memory, plus every object in the array has a few bytes of
    overhead, so you're looking at a process that can only run on
    extremely high-end, 64-bit processors.

    Maybe you want a database instead?

    --
    Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
     
    Daniel Pitts, Jun 24, 2009
    #20