Creating a byte[] of long size


Eric Sosman

Historically, each memory size has gone through a sequence of stages:

1. Nobody will ever need more than X bytes.

2. Some people do need to run multiple jobs that need a total of more
than X bytes, but no one job could possibly need that much.

3. Some jobs do need more than X bytes, but no one data structure could
possibly need that much.

4. Some data structures do need more than X bytes.

Any particular reason to believe 32 bit addressing will stick at stage
3, and not follow the normal progression to stage 4?

None. But Java's int isn't going to grow wider, nor will the
type of an array's .length suddenly become non-int; too much code
would break. When Java reaches the 31-bit wall, I doubt it will
find any convenient door; Java's descendants may pass through, but
I think Java will remain stuck on this side.

In ten years, we'll all have jobs converting "legacy Java code"
to Sumatra.
 

Eric Sosman

long size = Integer.MAX_VALUE+1;
byte [] b = new byte[size];

-possible loss of precision

How can we make an array of long size?

Longs are 64 bits. So if you want a byte array big enough to contain
a long, you need new byte[8].

Arrays can only be indexed by ints, not longs. Even if they were,
even Bill Gates could not afford enough RAM for an array of bytes, one
for each possible long.

True, but not especially relevant: You'll hit the int limit long
before running out of dollars. $50US will buy more RAM than a Java
byte[] can use.
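
For concreteness, a minimal sketch of the usual workaround: chunk a
long-indexed byte store across several ordinary byte[] arrays and split the
index into a chunk number and an offset. The class name and chunk size here
are purely illustrative.

// Illustrative only: a long-indexed byte store built from ordinary byte[] chunks.
public class BigByteArray {
    private static final int CHUNK_BITS = 30;               // 1 GiB per chunk, well under the int limit
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final long CHUNK_MASK = CHUNK_SIZE - 1;

    private final byte[][] chunks;
    private final long length;

    public BigByteArray(long length) {
        this.length = length;
        int nChunks = (int) ((length + CHUNK_MASK) >>> CHUNK_BITS);
        chunks = new byte[nChunks][];
        for (int i = 0; i < nChunks; i++) {
            long remaining = length - ((long) i << CHUNK_BITS);
            chunks[i] = new byte[(int) Math.min(remaining, CHUNK_SIZE)];
        }
    }

    public long length() {
        return length;
    }

    public byte get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    public void set(long index, byte value) {
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }
}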
 

Tom McGlynn

I don't think the future for Java is anywhere near as bleak as you paint it.
....
The whole collections issue could be handled by creating a parallel
hierarchy based on java.util.long_collections (or something similar for
those who don't like separating words in package names). It would
replicate the class names in the java.util hierarchy, but with long
replacing int wherever necessary to remove the size limits. It could be
implemented, using arrays of arrays where necessary, without any JVM
changes.

An alternative would be to update the existing classes: overload methods
that take int arguments to allow longs, and add new methods
(e.g., longSize()) to parallel functions that return ints. Are there
discriminators between these two approaches? I don't think either
would be especially difficult to implement, assuming that we have
arrays with long indices.
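
For concreteness, a minimal sketch of what that parallel hierarchy might look
like; the names below are hypothetical, and nothing like this exists in
java.util today:

// Hypothetical interfaces mirroring java.util, with long wherever int capped the size.
interface LongCollection<E> extends Iterable<E> {
    long longSize();                 // long replaces int wherever the size limit mattered
    boolean isEmpty();
    boolean add(E e);
}

interface LongList<E> extends LongCollection<E> {
    E get(long index);
    E set(long index, E element);
    long indexOf(Object o);          // -1 if absent, as in java.util.List
}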

I agree that the issues at the Java API and language level are
probably trivial compared to the changes needed in the JVM.


Regards,
Tom McGlynn
 

BGB / cr88192

Eric Sosman said:
None. But Java's int isn't going to grow wider, nor will the
type of an array's .length suddenly become non-int; too much code
would break. When Java reaches the 31-bit wall, I doubt it will
find any convenient door; Java's descendants may pass through, but
I think Java will remain stuck on this side.

In ten years, we'll all have jobs converting "legacy Java code"
to Sumatra.

more likely, they would simply expand the field to long, although they would
have to do something about the 'implicit downcasting being an error' case
(maybe softening it to 'implicit downcasting is a warning').

alternatively, a special-case implicit downcast could be supported (such as
via a modifier flag or special type), but at the cost that it would throw an
exception if the long won't fit into an int.


another issue though is that current JBC depends somewhat on it being an
int, and so expanding it to long would break the bytecode (unless they did
something hacky, like making long-based array indexing be a method call).

invokestatic "java/lang/Array/_aiload_l([IJ)I"

that or JBC having to add lots of new opcodes, but from what I can tell,
this is largely avoided (it being apparently preferable to hack over method
calls in place of expanding the core instruction set).


or whatever...
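
For what it's worth, a helper with the ([IJ)I shape above might look roughly
like this in Java source terms; the name and the range check are illustrative
guesses at how such a hack could behave, not anything that exists:

// Hypothetical helper matching the ([IJ)I descriptor: an int[] plus a long index, returning an int.
// The array itself is still limited to int indices, so the long is checked and narrowed.
static int aiload_l(int[] array, long index) {
    if (index < 0 || index >= array.length) {
        throw new ArrayIndexOutOfBoundsException(Long.toString(index));
    }
    return array[(int) index];
}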
 

Boris Punk

Kevin McMurtrie said:
24GB of RAM is a standard server configuration this year. Even my
laptop has 8GB and can only run 64 bit Java. A Java array indexing
limit of 2147483647 is a growing problem, not a future problem.

Multiplexing to smaller arrays through a class isn't a great solution.
First, it's unlikely that an application needing a 2+ GB array can
tolerate the performance hit of not using an array directly. Some
critical JIT optimizations for memory caching and range checking won't
work because of the multiplexing logic. Second, such a class could not
be compatible with anything else because it can't support the Collection
design. Oracle can't define "Collection64 extends Collection" and be
done with it because such a design can not be compatible in Java.

Is it not as simple as assigning int as 64 bit and long as 128 bit in newer
versions?
 

Lew

Wayne said:
A reduction in the number of page faults.  There was an interesting article about
this topic in this month's Communications of the ACM, by Poul-Jenning Kamp, who
was one of the lead developers of the FreeBSD kernel.  He applied his insight
to a web proxy replacement for Squid called Varnish, and was able to replace
12 Squid machines with 3 Varnish ones.  It used a modified binary heap he called
a B-heap, which respected the page size of memory.  The article was titled
"You're doing It Wrong".  The message I came away with was, don't ignore the
fact that computers use paging when designing large data structures.  I was
thinking that lesson might apply to the OP's situation.

How big is a page for a Java program?

Be sure to account for use on multiple architectures with and without
"-XX:LargePageSizeInBytes=??".
 

Lew

Boris said:
Is it not as simple as assigning int as 64 bit and long as 128 bit in newer
versions

That depends. For whom would it be simple, and for whom not?

How much legacy code are you willing to break?

What about folks who suddenly experience performance degradation
because their existing structures suddenly don't fit in on-chip cache
any more, whose programs are suddenly twice as large, and for whom
memory accesses overall suddenly take twice as long?

Whether they need the extra capacity or not?

Much safer would be the suggestions upthread of LongCollection classes
and perhaps a new "longarray" type.

There's no question that larger structures will be needed sometimes,
but the need will not be as widespread as some here appear to think.
The pressure to have more RAM on a machine, where 640K used to be
thought plenty and is now laughably microscopic, had nothing to do
with the size of int, and not much to do with the size of individual
data structures; it came from the number of distinct structures and
the amount of logic contained within a program, and from the number
of programs that must run concurrently. Even with the clear need for
gigabytes to terabytes of RAM on machines today, pretty much none of
it yet goes to any single building-block data structure exceeding a
gigabyte or two.

Even today, as pointed out upthread, Java is perfectly capable of
handling data structures that exceed 2 GB in size, just not in a
single array or int-based structure (or in 32-bit mode). You have to
go with multi-dimensional arrays or Collection<Collection> types of
structures, but they will work.

I predict that even when the need for multi-gigabyte arrays becomes
widely acknowledged, the actual use cases for them will remain
quite specialized. I venture to say further that the OP doesn't need
one yet.
 

Ian Shef

(I suppose it's also worth pointing out that technically a List<E> can
hold more than 2^31-1 elements; the interface specification simply
stipulates that Integer.MAX_VALUE is returned by size() and -1 by
indexOf() and lastIndexOf() when the actual value is out of range of a
positive int).
<snip>

Of course, get(int index) becomes a problem and I have no idea what happens
with toArray().
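
A size() along those lines could simply clamp, as the List javadoc allows.
A minimal sketch, where ClampingBigList and longSize are made-up names:

// Sketch only: a list-like class that keeps its real element count in a long
// and clamps size() to Integer.MAX_VALUE, per the List specification.
abstract class ClampingBigList<E> extends java.util.AbstractList<E> {
    protected long longSize;

    public long longSize() {
        return longSize;
    }

    @Override
    public int size() {
        return (int) Math.min(longSize, Integer.MAX_VALUE);
    }
}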
 

Tom Anderson

Or, you could have BigList implement List but "lie" in its .size()
method, in somewhat the same way TreeSet "lies" about the Set contract.

How does TreeSet lie about the Set contract?

tom
 

Daniel Pitts

Patricia said:
The case I'm aware of involves a TreeSet with a Comparator that is not
consistent with the .equals methods of the TreeSet elements. The TreeSet
always goes by the Comparator results. That means the TreeSet could
contain elements a and b such that a.equals(b).

It is my opinion that (given perfect hindsight), the Collections API
should have included several interfaces for comparing items, not just one.

interface Ordering<T> {
    enum Order {
        LESSER, EQUAL, GREATER
    }

    Order compare(T left, T right);
}

interface Hasher<T> {
    long hash(T value);
}

interface Equivalence<T> {
    boolean equal(T left, T right);
}

Then, all the appropriate Collection code could use those interfaces.
There should also be the obvious default implementations.
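
As one example of an "obvious default" (illustrative names only): natural
ordering via Comparable, and equals()-based equivalence, expressed through
the interfaces above:

// Natural ordering expressed through the Ordering interface sketched above.
final class NaturalOrdering<T extends Comparable<? super T>> implements Ordering<T> {
    @Override
    public Ordering.Order compare(T left, T right) {
        int c = left.compareTo(right);
        if (c < 0) return Ordering.Order.LESSER;
        if (c > 0) return Ordering.Order.GREATER;
        return Ordering.Order.EQUAL;
    }
}

// equals()-based equivalence, the analogue of today's default behaviour.
final class EqualsEquivalence<T> implements Equivalence<T> {
    @Override
    public boolean equal(T left, T right) {
        return left == null ? right == null : left.equals(right);
    }
}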

Not to mention that iterators should have separate methods for advancing
and retrieving, and options for non-fail-fast for certain kinds of
collections (linked lists shouldn't invalidate any iterator unless the
item itself is removed).
 

Eric Sosman

It is my opinion that (given perfect hindsight), the Collections API
should have included several interfaces for comparing items, not just one.

interface Ordering<T> {
    enum Order {
        LESSER, EQUAL, GREATER
    }

    Order compare(T left, T right);
}

The only difference I see between this and Comparator<T> is
the use of the enum instead of an int. The advantages are not
immediately obvious. Meanwhile, instead of `comp(o1,o2) <= 0' one
would write `comp(o1,o2) != Order.GREATER', which reads even less
smoothly than `!(comp(o1,o2) > 0)'.
interface Hasher<T> {
    long hash(T value);
}

A 64-bit hashCode() would be of little use until you got to
more than 2^32 hash buckets. Just saying.
interface Equivalence<T> {
    boolean equal(T left, T right);
}

I don't get it: Why not just use equals()? I guess a class
could choose not to implement Equivalence at all (and thus make
itself unusable in whatever framework relies on Equivalence), but
is that an advantage? Also, you could get a compile-time error
instead of a run-time `false' for trying to call equal() on
references of dissimilar classes; again, where's the benefit?
Then, all the appropriate Collection code could use those interfaces.
There should also be the obvious default implementations.

It might be helpful to give some examples of the "appropriate"
uses, and of the "obvious" defaults. For example, how does a HashMap
make use of a key that implements Hasher? Does it reflect on each
key it's given and make a run-time choice between using hash() and
hashCode()? I don't get it ...
Not to mention that iterators should have separate methods for advancing
and retrieving,

I guess this would allow `iter.advance(666666)', with a possible
savings if the Iterator were backed by a disk file or something, where
you could just skip over two thirds of a million items without looking
at them. But, but, but!!! this only makes sense for a sequential data
structure like a List -- and for List you've already *got* access by
index without using an Iterator at all! I'm beginning to sound like a
broken record[*], I know, but what's the benefit?
and options for non-fail-fast for certain kinds of
collections (linked lists shouldn't invalidate any iterator unless the
item itself is removed).

Not so sure about that. The notion of "what's the next item?"
can certainly change if things are added or removed "near" the point
where an Iterator is active. Or, in the `iter.advance(666666)' case,
you may find yourself at an unexpected place if somebody else has
inserted 555555 items between where you are and where you're going ...

See also ListIterator.

[*] How many programmers nowadays have ever heard a broken record?
Or dialed a telephone, or seen a core? Gad, I'm feeling as old as
Methuselah.[**]

[**] Before your time, laddie, before your time.
 

Roedy Green

Is there no BigList/BigHash in Java?

The tricky thing is, you have no giant arrays to use in
implementation.

If you want BigHash, I could write you one for a fee. It would use
arrays of arrays for addressing inside.
 

Roedy Green

Arrays can only be indexed by ints, not longs. Even if they were,
even Bill Gates could not afford enough RAM for an array of bytes, one
for each possible long.

True, but not especially relevant: You'll hit the int limit long
before running out of dollars. $50US will buy more RAM than a Java
byte[] can use.

I don't think our two posts conflict.
 

Arne Vajhøj

To me, it is unlikely your system will run well if this one data structure
consumes 2G of memory. (You didn't really state the application or system;
certainly there are exceptions to the rule.) I would suggest you use a
more flexible system, where you keep the data on storage (disk) and use
memory as a cache. Perhaps an ArrayList of soft references would work well.
It might even be possible in your particular case to run a daemon thread
that pre-fetches items into the cache.

You mean reinvent what the virtual memory system is already
providing for free?
Keep in mind a modern general-purpose computer will use virtual memory,
typically with 4kiB pages. Any data structure larger than that will
likely end up swapped to disk anyway.

Absolutely not.

Lots of systems do not page much even with huge data.

Arne
 

Arne Vajhøj

A reduction in the number of page faults. There was an interesting article about
this topic in this month's Communications of the ACM, by Poul-Jenning Kamp,

Poul-Henning Kamp, often abbreviated PHK.

who was one of the lead developers of the FreeBSD kernel.

He still contributes to FreeBSD.
He applied his insight
to a web proxy replacement for Squid called Varnish, and was able to replace
12 Squid machines with 3 Varnish ones. It used a modified binary heap he called
a B-heap, which respected the page size of memory. The article was titled
"You're doing It Wrong". The message I came away with was, don't ignore the
fact that computers use paging when designing large data structures. I was
thinking that lesson might apply to the OP's situation.

I assume you are talking about this article:

http://queue.acm.org/detail.cfm?id=1814327

He is not suggesting any custom swap to disk or anything, but just
noting that it is beneficial to keep stuff together to minimize
paging.

In what may actually be an earlier version of the same story:

http://www.version2.dk/artikel/1320...dommen-kan-ikke-programmere-til-et-moderne-os

translated:

http://translate.google.com/transla...kke-programmere-til-et-moderne-os&sl=da&tl=en

he explicitly states (translated from Danish):

'A classic example is that you sit there moving things between disk and
memory all the time. But the operating system kernel already does that
itself. You just put it somewhere in the virtual memory and it does it
for you, and it is much better at it. But people are not aware of that,'
is the assessment from Poul-Henning Kamp.

Arne
 

Arne Vajhøj

Is it not as simple as assigning int as 64 bit and long as 128 bit in newer
versions?

No.

Lots of code would break.

Code that relies on overflow.

Code that relies on bit masks.

Arne
 

Arne Vajhøj

That depends. For whom would it be simple, and for whom not?

How much legacy code are you willing to break?

What about folks who suddenly experience performance degradation
because their existing structures suddenly don't fit in on-chip cache
any more, whose programs are suddenly twice as large, and for whom
memory accesses overall suddenly take twice as long?

Whether they need the extra capacity or not?

That type of optimization would break anyway over time.

Code lives a lot longer than hardware.

It is not realistic to assume identical or almost identical
performance characteristics over the lifetime of code.

It is one of the reasons why micro-optimization is rather pointless.
Even today, as pointed out upthread, Java is perfectly capable of
handling data structures that exceed 2 GB in size, just not in a
single array or int-based structure (or in 32-bit mode). You have to
go with multi-dimensional arrays or Collection<Collection> types of
structures, but they will work.

I predict that even when the need for multi-gigabyte arrays becomes
widely acknowledged, the actual use cases for them will remain
quite specialized. I venture to say further that the OP doesn't need
one yet.

I agree with that one.

Such huge arrays seem mostly a scientific computing wish.

Arne
 

Arne Vajhøj

Arne Vajhøj said:
No.

But you can have a List<List<X>>, which can then
store 4*10^18 X'es.

Or you could pretty easily build a class like

public class BigArray<T>
{
    T get(long index);
    void set(long index, T value);
}

backed by a two-dimensional array.[1] The reason I prefer to say "array"
rather than "List" is that random access into a sparse List is a bit
dicey, while arrays nicely let you set index 826727 even if you haven't
touched any of the earlier indices yet, and will tell you that the entry
at 6120584 is null, instead of throwing an exception.
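
A minimal sketch of such a BigArray, backed by a lazily allocated
two-dimensional array so untouched regions cost nothing and unset entries
read back as null; the chunk size and names are illustrative:

public class BigArray<T> {
    private static final int CHUNK_BITS = 20;                // 2^20 slots per chunk, illustrative
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final long CHUNK_MASK = CHUNK_SIZE - 1;

    private final Object[][] chunks;
    private final long length;

    public BigArray(long length) {
        this.length = length;
        this.chunks = new Object[(int) ((length + CHUNK_MASK) >>> CHUNK_BITS)][];
    }

    public long length() {
        return length;
    }

    @SuppressWarnings("unchecked")
    public T get(long index) {
        Object[] chunk = chunks[(int) (index >>> CHUNK_BITS)];
        return chunk == null ? null : (T) chunk[(int) (index & CHUNK_MASK)];
    }

    public void set(long index, T value) {
        int c = (int) (index >>> CHUNK_BITS);
        if (chunks[c] == null) {
            chunks[c] = new Object[CHUNK_SIZE];   // allocate a chunk only when first written
        }
        chunks[c][(int) (index & CHUNK_MASK)] = value;
    }
}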

Sure.

But the collections have the advantage of dynamic size.

I think the general trend is from arrays to collections.

With scientific computing as a major exception.

Arne
 

Arne Vajhøj

I don't think the future for Java is anywhere near as bleak as you paint
it.

The whole collections issue could be handled by creating a parallel
hierarchy based on java.util.long_collections (or something similar for
those who don't like separating words in package names). It would
replicate the class names in the java.util hierarchy, but with long
replacing int wherever necessary to remove the size limits. It could be
implemented, using arrays of arrays where necessary, without any JVM
changes.

To migrate a program to the new collections one would first change the
import statements to pick up the new packages, and then review all int
declarations to see if they should be long. Many of the ones that need
changing would show up as errors.

Collections is certainly solvable.
Arrays are a worse problem, requiring JVM changes. The size field
associated with an array would have to be long. There would also need to
be a new "field" longLength. Attempts to use arrayRef.length for an
array with more than Integer.MAX_VALUE elements would throw an
exception. arrayRef.length would continue to work for small arrays for
backwards compatibility.

I suspect Eclipse would have "Source -> Long Structures" soon after the
first release supporting this, and long before most programs would need
to migrate.

It is not a perfect solution.

When calling a library, some arrays would have to be marked
as @SmallArray to indicate that you cannot call it with a
big array, because the method calls length.

There may be other problems that I cannot think of.

Arne
 
