'ArrayList' allowing more than 2G entries.


Tom McGlynn

Occasionally I need to store information in arrays which can get
larger than 2G. Since I can't make an array that long, I'd like to
have a backup in these large cases where I use something that looks
like an ArrayList except that it allows >2G entries. So it won't be a
real ArrayList, e.g., size() will return a long, but should look very
similar. This isn't too difficult to implement, but I was curious if
anyone knew of existing libraries that might already have addressed
this issue.

Thanks,
Tom McGlynn
 

Lord Zoltar

Tom McGlynn said:
> Occasionally I need to store information in arrays which can get
> larger than 2G. Since I can't make an array that long, I'd like to
> have a backup in these large cases where I use something that looks
> like an ArrayList except that it allows >2G entries. So it won't be a
> real ArrayList, e.g., size() will return a long, but should look very
> similar. This isn't too difficult to implement, but I was curious if
> anyone knew of existing libraries that might already have addressed
> this issue.
>
> Thanks,
> Tom McGlynn

2G... do you mean 2 billion?
 

Tom McGlynn

Patricia said:
> If you decide to do something ArrayList-like, be careful about the data
> structure design and algorithms. ArrayList's algorithms for insert and
> delete may not be suitable.
>
> If you do not need ArrayList-like properties, just an array, consider a
> memory mapped temporary file.

You're right that I'm mostly interested in the array aspects. The
sizes rarely change once the arrays are allocated (or I'd have been
using ArrayList for the more tractable sizes too). I had wanted to
emulate the ArrayList interface so that the code would seem idiomatic
Java but I wouldn't typically be resizing. Essentially
all I really need is:

LongList ll = new LongList(longSize);
val = ll.get(longIndex);
ll.put(longIndex, val);

With regard to using temporary files: Curiously -- I've never gotten
a decent explanation from the Sysadmins on this -- the work systems I
mostly run on have very small /tmp areas (~500 MB) so I often have
much more space in memory than in the temporary file area. Most of
the time I could ask the user where to put the backing file though...
Spud seems to indicate that that won't really work anyway. I'll take
a look.

With regard to some of the comments about the size of all of this:
Generally these are arrays of primitives or simple FlyWeights where
the state of the object is externalized to one or two primitive
arrays. So the number of actual Java objects that needs to be created
can be quite small and I can run into limits even when the total
storage requirement only slightly exceeds 2 GB. As the commentators
note, though, it's easily possible for my requirements to exceed the
memory sizes that I can expect a machine to have. True enough but I
do want to be able to use the memory I do have. Machines with <~ 10 GB
physical memory aren't too hard to come by. It looks like 4 GB is now
pretty standard for a $500 machine at BestBuy.

The best solution for me would be for Java 7 to come out with long
array indices -- but I don't believe that's in the cards.

Regards,
Tom McGlynn
 

Andreas Leitgeb

Tom McGlynn said:
> Essentially all I really need is:
>
> LongList ll = new LongList(longSize);
> val = ll.get(longIndex);
> ll.put(longIndex, val);

If all you need is put and get, then make a small class that
contains an array of arrays, and use the upper n bits for
indexing the outer array and the lower n bits for indexing
the inner ones (lazily creating inner arrays on "put" only).

I would make the size (2^n) of the inner arrays much smaller than
2GB: perhaps even just 32M (2^25) items or so. That way you don't
waste too much by allocating only full inner arrays, and from your
description, the resulting 50 bits of maximum size may still be
large enough (a million billion items) for practical purposes on
machines in the foreseeable future. (no "640kB is large enough"
reference here)

PS: I think such a class is faster to write, review and test
yourself than a foreign library is to evaluate for usability.
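
For instance, a minimal sketch of that scheme for doubles (the element
type and all names are mine, untested):

class BigDoubleArray {

    private static final int SHIFT = 25;          // 2^25 entries per inner array
    private static final int CHUNK = 1 << SHIFT;
    private static final int MASK = CHUNK - 1;

    // Outer array; inner arrays are created lazily on "put".
    private final double[][] chunks;

    BigDoubleArray(long size) {
        chunks = new double[(int) ((size + CHUNK - 1) >>> SHIFT)][];
    }

    double get(long index) {
        double[] c = chunks[(int) (index >>> SHIFT)];   // upper bits pick the chunk
        return c == null ? 0.0 : c[(int) (index & MASK)];
    }

    void put(long index, double val) {
        int hi = (int) (index >>> SHIFT);
        if (chunks[hi] == null) {
            chunks[hi] = new double[CHUNK];             // allocate on first put only
        }
        chunks[hi][(int) (index & MASK)] = val;
    }
}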
 

Robert Klemme

Tom McGlynn said:
> You're right that I'm mostly interested in the array aspects. The
> sizes rarely change once the arrays are allocated (or I'd have been
> using ArrayList for the more tractable sizes too). I had wanted to
> emulate the ArrayList interface so that the code would seem idiomatic
> Java but I wouldn't typically be resizing. Essentially
> all I really need is:
>
> LongList ll = new LongList(longSize);
> val = ll.get(longIndex);
> ll.put(longIndex, val);

Do you really need to fill the array to that size or do you need the
largish array in order to allow for long indexes? In the latter case
with sparse filling you might simply use a Map<Long,YourType>.

Kind regards

robert
 

Martin Gregorie

Tom McGlynn said:
> You're right that I'm mostly interested in the array aspects. The sizes
> rarely change once the arrays are allocated (or I'd have been using
> ArrayList for the more tractable sizes too). I had wanted to emulate
> the ArrayList interface so that the code would seem idiomatic Java but I
> wouldn't typically be resizing. Essentially all I really need is:
>
> LongList ll = new LongList(longSize);
> val = ll.get(longIndex);
> ll.put(longIndex, val);

Some questions:
- when you say a 2G array, is that the number of elements or
the total data size?
- do the elements have a key?
- is the index also a 'key' and if so, how sparse is the array?
- is the array always guaranteed to fit into the process memory
without paging?
 

markspace

Robert said:
> Do you really need to fill the array to that size or do you need the
> largish array in order to allow for long indexes? In the latter case
> with sparse filling you might simply use a Map<Long,YourType>.

Interesting idea! Untested:


import java.util.HashMap;

public class LongList<T> {

    private static final int BUF_SIZE = 2048;

    private HashMap<Long, T[]> store = new HashMap<Long, T[]>();

    public T get(long index) {
        T[] buf = store.get(index / BUF_SIZE);
        if (buf == null) {
            return null;
        }
        return buf[(int) (index % BUF_SIZE)];
    }

    @SuppressWarnings("unchecked")
    public void put(long index, T value) {
        T[] buf = store.get(index / BUF_SIZE);
        if (buf == null) {
            buf = (T[]) new Object[BUF_SIZE];
            store.put(index / BUF_SIZE, buf);
        }
        buf[(int) (index % BUF_SIZE)] = value;
    }
}
 

Tom McGlynn

Martin Gregorie said:
> Some questions:
> - when you say a 2G array, is that the number of elements or
>   the total data size?

The number of elements.

> - do the elements have a key?

I'm not quite sure what you mean by this... Generally I'm only
interested in accessing the data through the index.

> - is the index also a 'key' and if so, how sparse is the array?

The arrays are dense, i.e., I will have an element for every index
less than the maximum.

> - is the array always guaranteed to fit into the process memory
>   without paging?

Hmmm... In the underlying system no. I'm basically writing code to
interpret a particular file format and the format makes no real limit
on the size of the files. However I support both streaming and
in-memory access to the data. At some point streaming will be
required for all users. Right now for many users that point is being
set because certain arrays are reaching the 2 GB limit even though a
larger array would fit within their memory could it be constructed.
The in-memory modes are much easier for the user and more flexible.
So in the context of what I'm trying to do it may be a reasonable
assumption that paging is not an issue. If their system thrashes for
in-memory access, then they've passed the limit where they need to go
to the streaming mode.


Regards,
Tom McGlynn
 

Roedy Green

Tom McGlynn said:
> Occasionally I need to store information in arrays which can get
> larger than 2G. Since I can't make an array that long, I'd like to
> have a backup in these large cases where I use something that looks
> like an ArrayList except that it allows >2G entries. So it won't be a
> real ArrayList, e.g., size() will return a long, but should look very
> similar. This isn't too difficult to implement, but I was curious if
> anyone knew of existing libraries that might already have addressed
> this issue.

If you are doing any inserting and deleting, the element slides on a
giant 64-bit JNI array would be prohibitively expensive. I thought of
using a HashMap, but it is just an array inside too, and would have
the same size limit problems.

For this, you need to figure out exactly which properties of ArrayList
you need, then look for a minimal Collection that satisfies those
needs. You may not need all the bells and whistles of a long ArrayList.

Even Collection has the int limit on size(). Looks like Sun was not
thinking ahead to 64-bit Collections.

I suspect you will end up cooking up your own that is implemented
inside with a [][].

It is really rather odd to have a 64-bit JVM without a 64-bit indexed
array.
--
Roedy Green Canadian Mind Products
http://mindprod.com

If everyone lived the way people do in Vancouver, we would need three more entire planets to support us.
~ Guy Dauncey
 

Robert Klemme

markspace said:
> Robert said:
>> Do you really need to fill the array to that size or do you need the
>> largish array in order to allow for long indexes? In the latter case
>> with sparse filling you might simply use a Map<Long,YourType>.
>
> Interesting idea! Untested:
>
> public class LongList<T> {
>
>     private static final int BUF_SIZE = 2048;
>
>     private HashMap<Long, T[]> store = new HashMap<Long, T[]>();
>
>     public T get(long index) {
>         T[] buf = store.get(index / BUF_SIZE);
>         if (buf == null) {
>             return null;
>         }
>         return buf[(int) (index % BUF_SIZE)];
>     }
>
>     public void put(long index, T value) {
>         T[] buf = store.get(index / BUF_SIZE);
>         if (buf == null) {
>             buf = (T[]) new Object[BUF_SIZE];
>             store.put(index / BUF_SIZE, buf);
>         }
>         buf[(int) (index % BUF_SIZE)] = value;
>     }
> }

I rather meant to use the map directly. This is likely only feasible if
the total number of elements is less than maxint, i.e. a sparse collection.

Your approach is similar to some deque implementations I have seen in
the C++ world. It can also be implemented with an ArrayList as the
first-level collection instead of a map. I would use the ArrayList as
the first level because it is more memory efficient (the array in a
HashMap is usually larger than the number of elements because of the
way hash maps are built) while having similar performance characteristics.

Btw, I'd make the chunk size configurable (i.e. via a constructor
argument) because that makes your class more flexible.
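
For illustration, a rough sketch of that variant, with the chunk size
passed to the constructor and an ArrayList as the first level (untested,
like the original; the names are mine):

import java.util.ArrayList;

public class ChunkedList<T> {

    private final int chunkSize;
    private final ArrayList<T[]> chunks = new ArrayList<T[]>();

    public ChunkedList(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    public T get(long index) {
        return chunks.get((int) (index / chunkSize))[(int) (index % chunkSize)];
    }

    @SuppressWarnings("unchecked")
    public void put(long index, T value) {
        int chunk = (int) (index / chunkSize);
        while (chunk >= chunks.size()) {
            chunks.add((T[]) new Object[chunkSize]);  // grow densely, a chunk at a time
        }
        chunks.get(chunk)[(int) (index % chunkSize)] = value;
    }
}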

But when reading your other posting it seems that for the file case
(i.e. non-streaming) you should consider memory mapped IO.

http://java.sun.com/javase/6/docs/api/java/nio/MappedByteBuffer.html
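
Note that a single MappedByteBuffer is limited to Integer.MAX_VALUE
bytes, so a file larger than 2 GB has to be mapped as several windows.
A rough sketch for 8-byte elements (names are mine, error handling
omitted):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedLongArray {

    // 1 GiB per mapping; a multiple of 8, so no element straddles two windows.
    private static final long WINDOW = 1L << 30;

    private final MappedByteBuffer[] windows;

    public MappedLongArray(String path, long bytes) throws Exception {
        FileChannel ch = new RandomAccessFile(path, "rw").getChannel();
        windows = new MappedByteBuffer[(int) ((bytes + WINDOW - 1) / WINDOW)];
        for (int i = 0; i < windows.length; i++) {
            long start = i * WINDOW;
            windows[i] = ch.map(FileChannel.MapMode.READ_WRITE, start,
                                Math.min(WINDOW, bytes - start));
        }
    }

    public long get(long element) {
        long offset = element * 8;
        return windows[(int) (offset / WINDOW)].getLong((int) (offset % WINDOW));
    }

    public void put(long element, long value) {
        long offset = element * 8;
        windows[(int) (offset / WINDOW)].putLong((int) (offset % WINDOW), value);
    }
}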

We should probably learn more about your usage patterns. Are those
elements read in order of appearance? Do you frequently have to jump
back and forth? How often is each entry accessed? Is there a
particular locality to accesses etc.?

Also, I am wondering what you do in the streaming case: your huge array
is accessed by index. But for real streaming processing you can only
access elements sequentially or at most keep n elements in memory. If
you need full indexed access in the streaming case you will have to read
the whole stream anyway and store it somewhere so you could cover this
eventually by memory mapped IO as well.

Kind regards

robert
 

Roedy Green

Tom McGlynn said:
> So in the context of what I'm trying to do it may be a reasonable
> assumption that paging is not an issue. If their system thrashes for
> in-memory access, then they've passed the limit where they need to go
> to the streaming mode.

This kicks me into codger mode, unable to resist retelling a tale from
the 60s. Back then, CDC made the big scientific machines and they
had something called ECS (Extended Core Storage), basically RAM
slower than regular RAM.

At BC Hydro we were solid IBM, but wanted to run some of the CDC
software. I was given the job of emulating ECS. I wrote an LRU
paging scheme to disk. To everyone's surprise, even mine, it was much
faster than REAL ECS. It turned out the locality was good, and the
real RAM was sufficiently faster than ECS that, even with the
emulation overhead, it still beat the real thing.
--
Roedy Green Canadian Mind Products
http://mindprod.com

If everyone lived the way people do in Vancouver, we would need three more entire planets to support us.
~ Guy Dauncey
 

Tom Anderson

Tom McGlynn said:
> With regard to some of the comments about the size of all of this:
> Generally these are arrays of primitives or simple FlyWeights where the
> state of the object is externalized to one or two primitive arrays. So
> the number of actual Java objects that needs to be created can be quite
> small and I can run into limits even when the total storage requirement
> only slightly exceeds 2 GB.

Something to think about would be whether you can store each entry in less
space than you currently do. In the case of flyweights, you're currently
spending 32 bits on each entry, but if there are only 216 (say) distinct
objects being pointed to, you only actually need 8 bits. Rather than using
pointers, you could put your values in an array, and store unsigned byte
indices into it. Mutatis mutandis for other numbers of flyweights.

This doesn't mean you can suddenly use ArrayList, but if you do roll your
own solution, it might be a way to make it more compact.
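
A minimal sketch of that packing, assuming at most 256 distinct
flyweights (the names are mine):

public class PackedFlyweights<T> {

    private final T[] palette;     // the few distinct flyweight instances
    private final byte[] indices;  // one byte per entry instead of one reference

    public PackedFlyweights(T[] palette, int size) {
        if (palette.length > 256) {
            throw new IllegalArgumentException("too many flyweights for a byte");
        }
        this.palette = palette;
        this.indices = new byte[size];
    }

    public T get(int i) {
        return palette[indices[i] & 0xFF];  // mask to read the byte as unsigned
    }

    public void set(int i, int paletteIndex) {
        indices[i] = (byte) paletteIndex;
    }
}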

There's a whole field of research on 'succinct data structures', which are
structures that use as little space as possible to store their contents.
Although I don't think they have anything terribly interesting to
contribute in the case of lists, they might be worth looking into.

tom
 

Martin Gregorie

Tom McGlynn said:
> The number of elements.

OK, so it's a BIG array.

> I'm not quite sure what you mean by this... Generally I'm only
> interested in accessing the data through the index.
I was trying to find out whether access to the array was random or
sequential. Sounds like it's sequential.

> The arrays are dense, i.e., I will have an element for every index less
> than the maximum.
OK.

> Hmmm... In the underlying system no. I'm basically writing code to
> interpret a particular file format and the format makes no real limit on
> the size of the files. However I support both streaming and in-memory
> access to the data. At some point streaming will be required for
> all users. Right now for many users that point is being set because
> certain arrays are reaching the 2 GB limit even though a larger array
> would fit within their memory could it be constructed. The in-memory
> modes are much easier for the user and more flexible. So in the context
> of what I'm trying to do it may be a reasonable assumption that paging
> is not an issue. If their system thrashes for in-memory access, then
> they've passed the limit where they need to go to the streaming mode.

Does this mean there's a single copy in memory that a number of clients
are accessing? It certainly sounds that way.

I'm trying to understand how the access pattern maps onto the data. So
far it seems as if a large population of clients are all reading the
array sequentially without modifying it, but that all the data doesn't
necessarily fit into memory and the time at which readers start is
arbitrary.

It's beginning to sound as though the array could be mapped to a page file
and, as Roedy suggested, a Least Recently Used (LRU) paging algorithm
would give optimum access speed for all clients. Long ago and far away
I've seen LRU give surprisingly good results, but on a modern OS
(particularly if you want a portable solution) the class containing the
array will need to implement the paging algorithm itself because modern
OSen don't give the programmer enough (any?) control over the paging
algorithm to be used on particular data structures.

Come back VME/B - all is forgiven!
 

markspace

Robert said:
> I rather meant to use the map directly. This is likely only feasible if
> the total number of elements is less than maxint, i.e. a sparse collection.


My code is rather rough all the way around. Although I did intend it
for contiguous rather than sparse allocation. I also like the idea of
encapsulating behavior in a class -- it can be changed later with less
hassle, at least -- rather than using a HashMap directly.

Your points are well taken, but my code was just an off-the-cuff
example. I have no idea if it's really a worthwhile thing to pursue.
 

Robert Klemme

Martin Gregorie said:
> It's beginning to sound as though the array could be mapped to a page file
> and, as Roedy suggested, a Least Recently Used (LRU) paging algorithm
> would give optimum access speed for all clients. Long ago and far away
> I've seen LRU give surprisingly good results, but on a modern OS
> (particularly if you want a portable solution) the class containing the
> array will need to implement the paging algorithm itself because modern
> OSen don't give the programmer enough (any?) control over the paging
> algorithm to be used on particular data structures.

What would be the advantage of your approach over using memory mapped IO
other than that MMIO does not work for streaming? To me it sounds
inefficient to first read a gigantic file only to then create a huge
memory structure which is paged out. It seems a better approach is to
memory map the original file and have a proxy structure which gives
convenient access to the underlying data. That structure could well use
LRU for caching an active subset of the data.
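
A minimal sketch of such a cache, using LinkedHashMap's access order
for the LRU policy (names are mine; writing evicted pages back is
omitted):

import java.util.LinkedHashMap;
import java.util.Map;

public class PageCache<P> extends LinkedHashMap<Long, P> {

    private final int maxPages;

    public PageCache(int maxPages) {
        super(16, 0.75f, true);  // accessOrder = true gives LRU iteration order
        this.maxPages = maxPages;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, P> eldest) {
        // Evict the least-recently-used page once the cache is full.
        return size() > maxPages;
    }
}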

Kind regards

robert
 

Martin Gregorie

Robert Klemme said:
> What would be the advantage of your approach over using memory mapped IO
> other than that MMIO does not work for streaming? To me it sounds
> inefficient to first read a gigantic file only to then create a huge
> memory structure which is paged out.

Agreed. I wasn't suggesting any particular implementation (I didn't
intend to), just pointing out that a shared paging implementation
would be a good idea.

> It seems a better approach is to
> memory map the original file and have a proxy structure which gives
> convenient access to the underlying data. That structure could well use
> LRU for caching an active subset of the data.

Sure, why not? Sounds good to me. Actually, if access really is
sequential by array index, sequential paging with look-ahead would be
better than LRU, with a thread dedicated to keeping the cache full ahead
of each consuming process.
 

Robert Klemme

Martin Gregorie said:
> Sure, why not? Sounds good to me. Actually, if access really is
> sequential by array index, sequential paging with look-ahead would be
> better than LRU, with a thread dedicated to keeping the cache full ahead
> of each consuming process.

Absolutely! Now we only need to convince Tom to disclose the usage
pattern... :)

Cheers

robert
 

Tom McGlynn

Robert Klemme said:
> Absolutely! Now we only need to convince Tom to disclose the usage
> pattern... :)

This all goes back to my first large Java project, writing a library
to read and write FITS format files. FITS is the standard format for
astronomical data and it has always allowed very large arrays and
tables. However, a dozen or so years ago when I wrote the package,
gigabyte files were very uncommon so that few if any users were
inconvenienced by the circumlocutions that were needed to handle very
large files. Now they are getting much more common and the limits
on array sizes are beginning to pinch in a number of places: there are
files with tables or arrays with dimensions of more than 2^31 and
internal FITS structures that my implementation uses which need large
byte arrays to implement.

The library supports both streaming and random access to the internals
of the data structures. Different users may use both or either.
Inputs can be files, pipes, URLs or compressed files. So I need
something that's pretty flexible -- I may or may not be able to do
random access on the input.

I very much appreciate the comments that everyone has made. While I'd
hoped that I'd be able to piggy back on someone else's work, I think
the discussion has given me some ideas for building my own... Right
now I'm thinking of something like the following using Andreas' ideas
from above. In a second iteration this should be easily adaptable to
supporting backing files of some kind (perhaps using the input if
that's feasible or using some external storage if not). Many of you
suggested various ways of doing this kind of file backup and I suspect
I'll need to use more than one. Then BigList might allow only a
certain number of sublists to be in memory at a given time. That
would make it possible to scale the code to much larger sizes than are
available in memory, not just exceed the 2G array limit which was my
original goal.


Thanks to all,
Tom McGlynn

---- Code sketch follows, not intended for compilation ---

import java.util.ArrayList;
import java.util.List;

class BigList<T> {

    static final int PAGE_SIZE = 16 * 1024 * 1024;  // 16M entries per page
    private final List<SubList<T>> subLists;

    BigList(long size) {
        int listCount = (int) (size / PAGE_SIZE) + 1;
        subLists = new ArrayList<SubList<T>>(listCount);
        for (int i = 0; i < listCount; i += 1) {
            subLists.add(new NullList(i));
        }
    }

    void put(long index, T val) {
        subLists.get((int) (index / PAGE_SIZE)).put((int) (index % PAGE_SIZE), val);
    }

    T get(long index) {
        return subLists.get((int) (index / PAGE_SIZE)).get((int) (index % PAGE_SIZE));
    }

    interface SubList<T> {
        T get(int i);
        void put(int i, T val);
    }

    // Placeholder for a page that has never been written to.
    class NullList implements SubList<T> {
        final int index;

        NullList(int index) {
            this.index = index;
        }

        public T get(int i) {
            throw new IllegalStateException("Can't do get before put");
        }

        public void put(int i, T val) {
            // Replace the null list with a real list on first write.
            RealList list = new RealList(index);
            list.put(i, val);
            subLists.set(index, list);
        }
    }

    class RealList implements SubList<T> {
        final int index;
        final T[] data;

        @SuppressWarnings("unchecked")
        RealList(int index) {
            this.index = index;
            data = (T[]) new Object[PAGE_SIZE];
            // ... get the data for this page ...
        }

        public T get(int i) {
            return data[i];
        }

        public void put(int i, T val) {
            data[i] = val;
        }
    }
}
 

Daniel Pitts

Tom said:
> Occasionally I need to store information in arrays which can get
> larger than 2G. Since I can't make an array that long, I'd like to
> have a backup in these large cases where I use something that looks
> like an ArrayList except that it allows >2G entries. So it won't be a
> real ArrayList, e.g., size() will return a long, but should look very
> similar. This isn't too difficult to implement, but I was curious if
> anyone knew of existing libraries that might already have addressed
> this issue.
>
> Thanks,
> Tom McGlynn

Are you really storing that much in memory on a frequent occasion?
The overhead of the array alone would be 8GB of memory, plus every
object in the array has a few bytes of overhead, so you're looking at a
process that can only run on extremely high-end, 64-bit processors.

Maybe you want a database instead?
 
