Creating a byte[] of long size

A

Arne Vajhøj

None. But Java's int isn't going to grow wider, nor will the
type of an array's .length suddenly become non-int; too much code
would break. When Java reaches the 31-bit wall, I doubt it will
find any convenient door; Java's descendants may pass through, but
I think Java will remain stuck on this side.

In ten years, we'll all have jobs converting "legacy Java code"
to Sumatra.

If Java gets 20 years as "it" and 20 years as "legacy", then
that would actually be more than OK.

Things evolve and sometimes it is better to start with a
blank sheet of paper.

64 bit array indexes, functions as a first class type, bigint and
bigdecimal as language types etc.

Arne
 
A

Arne Vajhøj

Arrays can only be indexed by ints, not longs. Even if they were,
even Bill Gates could not afford enough RAM for an array of bytes, one
for each possible long.

True, but not especially relevant: You'll hit the int limit long
before running out of dollars. $50US will buy more RAM than a Java
byte[] can use.

I don't think our two posts conflict.

They don't if we assume that your post was irrelevant for the
thread.

Arne
 
A

Arne Vajhøj

Only after the fact though. Unfortunately, it doesn't provide
composition-time formatting like that.

And after all that, I'm still not clear on why Arne's figures are in
base-10. AFAIK, you can fit a full 2^31-1 (Integer.MAX_VALUE) elements
in a list. So you could fit (2^31-1)^2 elements in a list of lists.
Possibly he's just using base-10 because the numbers are easier to deal
with when rounded like that?

I considered powers of 10 more readable than powers of 2.

Arne
 
M

Mike Schilling

Arne Vajhøj said:
What counts: "what is written in the spec" or "what
should have been written in the spec" ?

When the implementations all match the latter, then the latter counts.
Honestly, the JLS isn't the Bible, and we don't have to pretend that the sun
goes around the earth.
 
M

Mike Schilling

In ten years, we'll all have jobs converting "legacy Java code"
to Sumatra.

"It was a class which is associated with the giant array of Sumatra, a
construct for which the world is not yet prepared."
 
L

Lew

Mike said:
When the implementations all match the latter, then the latter counts.
Honestly, the JLS isn't the Bible, and we don't have to pretend that the
sun goes around the earth.

Actually it's just as valid to say the Sun revolves around the Earth as the
other way, it's just that the math is so much easier heliocentrically.

There was a young lady from Bright
who traveled faster than light.
She set out one day
in a relative way
and returned the previous night.
 
C

ClassCastException

Only after the fact though. Unfortunately, it doesn't provide
composition-time formatting like that.

And after all that, I'm still not clear on why Arne's figures are in
base-10. AFAIK, you can fit a full 2^31-1 (Integer.MAX_VALUE) elements
in a list. So you could fit (2^31-1)^2 elements in a list of lists.

Which would be 2^62 - 2^32 + 1.
 
C

ClassCastException

A 64-bit hashCode() would be of little use until you got to
more than 2^32 hash buckets. Just saying.

Gets us back to the original topic. :)
I don't get it: Why not just use equals()? I guess a class
could choose not to implement Equivalence at all (and thus make itself
unusable in whatever framework relies on Equivalence), but is that an
advantage? Also, you could get a compile-time error instead of a
run-time `false' for trying to call equal() on references of dissimilar
classes; again, where's the benefit?


It might be helpful to give some examples of the "appropriate"
uses, and of the "obvious" defaults. For example, how does a HashMap
make use of a key that implements Hasher? Does it reflect on each key
it's given and make a run-time choice between using hash() and
hashCode()? I don't get it ...

Note that those interfaces specify methods with an "extra" parameter
each. They're like Comparator versus compareTo/Comparable.

The purpose is clear: so a HashMap could be given, optionally, a
Hasher<K> to use in place of the keys' own hashCode methods and an
Equivalence<K> to use in place of the keys' own equals methods.
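
Just to pin the idea down, the two interfaces I have in mind would look
something like this (purely hypothetical, of course; nothing like them is
in the JDK today, and the shapes simply match the usage below):

// Hypothetical interfaces:
public interface Hasher<K> {
    long hash (K x);           // used in place of the key's own hashCode()
}

public interface Equivalence<K> {
    boolean equal (K x, K y);  // used in place of the key's own equals()
}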

One obvious benefit is that you get rid of IdentityHashMap by folding
that functionality into plain HashMap. Instead of a separate class, you'd
get an identity hash map with

new HashMap<K,V>(new Hasher<K>() {
    public long hash (K x) {
        return System.identityHashCode(x);
    }
}, new Equivalence<K>() {
    public boolean equal (K x, K y) {
        return x == y;
    }
});

or with canned instances of IdentityHasher and IdentityEquivalence
provided by the library.

With this, you would also be able to get identity WeakHashMaps and the
like; by separating the "how strong is the reference" aspect into one
class and the "how is identity decided" aspect into another, you avoid a
combinatorial explosion and possible lacunae of capability (right now we
have no WeakIdentityHashMap, in particular).

You'd also be able to reduce some of the clumsier uses of HashMap to
HashSet. Picture a

class Record {
    public final int id;
    public final String name;
    public final String address;
}

with the obvious equality semantics (all fields equal) and constructor
added.

Now throw in an Equivalence and a Hasher that use only the record's id
field.
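
Concretely, still using the hypothetical interfaces sketched above,
something like:

Hasher<Record> idHasher = new Hasher<Record>() {
    public long hash (Record r) {
        return r.id;
    }
};

Equivalence<Record> idEquivalence = new Equivalence<Record>() {
    public boolean equal (Record a, Record b) {
        return a.id == b.id;
    }
};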

So maybe you keep a change log for an individual person as a
List<Record>, chronological:

id 0001
name Jane Herman
address 1600 Pennsylvania Avenue

id 0001
name Jane Herman
address 18 Wisteria Lane

id 0001
name Jane Thurston
address 18 Wisteria Lane

OK, so she got voted out of office, then got married, or something like
that.

Of course you might want to throw a jumble of records in a Set and have
different ones of the above count as different.

But you might also want a record of the current state of affairs. Given a
HashSet implementation that can use a supplied Hasher and Equivalence the
way TreeSet can use an externally-supplied Comparator, and that also has
the semantics that adding an element that equals an already-present
element replaces that element with the new one, you can update the 0001
record simply by putting a more recent one into this set -- if it already
has a 0001 record, the provided Hasher and Equivalence will lead to the
new one replacing that one.

So in some contexts you can treat records identically only if they're
actually identical; in others if they have the same id; all without
monkeying with an explicit id-to-record HashMap or suchlike.

Another way to achieve this last, though, is to have a KeyExtractor<T>
interface that you implement in this case to return the id field of a
Record and a HashSet implementation that uses the object itself as the
key in its internal HashMap if no KeyExtractor is specified during
construction, and uses the supplied KeyExtractor otherwise. This is
actually closer to the conceptual truth of what you're doing in a case
like this: keying on the id field in a particular HashSet. The
implementation would be something like

public class HashSet<T> {
    private HashMap<Object,T> data = new HashMap<Object,T>();
    private KeyExtractor<T> ke = new KeyExtractor<T>() {
        public Object getKey (T val) {
            return val;
        }
    };

    ...

    public T put (T newElem) {
        Object key = ke.getKey(newElem);
        T oldElem = data.get(key);
        data.put(key, newElem);
        return oldElem;
    }
}

whereas the Hasher/Equivalence version would just pass the Hasher and
Equivalence to the HashMap constructor when initializing data, and would
not have the key local in put, just newElem.
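
In other words, roughly this (leaning on the same hypothetical HashMap
constructor used earlier, which does not exist in the real JDK):

public class HashSet<T> {
    private HashMap<T,T> data;

    public HashSet (Hasher<T> hasher, Equivalence<T> equivalence) {
        // hypothetical constructor taking a Hasher and an Equivalence
        data = new HashMap<T,T>(hasher, equivalence);
    }

    public T put (T newElem) {
        return data.put(newElem, newElem);
    }
}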

The really interesting thing is that we don't really need to wait for any
hypothetical future Sun (Oracle?) update to do some of this; KeyExtractor
and the above variation of HashSet can be implemented now, perhaps
calling the latter RecordMap instead as it acts as a map from key fields
of records of some sort to whole records, in the typical case, and in
fact you probably do also want to do lookups of whole records by just the
keys. And you might sometimes want to hold the records via weak or soft
references, e.g. to make it a cache. In that case you want to allow
specifying two more things, a type of reference to use (enum
ReferenceType {STRONG; SOFT; WEAK;} with default STRONG) and an optional
ExpensiveGetter that defaults to return null but can be replaced with one
whose expensiveGet() does something like, say, retrieve disk records.
Then get() calls expensiveGet() on not-found and if expensiveGet()
doesn't throw or return null, does a put() before returning the result.
You can throw in another type parameter, too:

public class RecordMap <K, V, E extends Exception> {
    private ExpensiveGetter<K,V,E> eg = new ExpensiveGetter<K,V,E>() {
        public V expensiveGet (K key) throws E {
            return null;
        }
    };

    private HashMap<K,Object> data = new HashMap<K,Object>();

    public enum ReferenceType {
        STRONG {
            public Object wrap (Object o) {
                return o;
            }
        },
        SOFT {
            public Object wrap (Object o) {
                return new SoftReference<Object>(o);
            }
        },
        WEAK {
            public Object wrap (Object o) {
                return new WeakReference<Object>(o);
            }
        };
        public abstract Object wrap (Object o);
    }

    private ReferenceType referenceType = ReferenceType.STRONG;

    ...

    @SuppressWarnings("unchecked")
    public V get (K key) throws E {
        Object result = data.get(key);
        if (result instanceof Reference) result = ((Reference<?>) result).get();
        if (result != null) return (V) result;
        result = eg.expensiveGet(key);
        if (result == null) return null;
        put(key, (V) result);
        return (V) result;
    }

    public void put (K key, V val) {
        data.put(key, referenceType.wrap(val));
    }
}

This is a bit messy but it's just a quick draft. It doesn't actually
implement Map because it doesn't quite fit the Map contract in a few
places (and making it do so would be difficult, particularly since get
seems to have to be able to throw exceptions). You might want to change
ExpensiveGetter to a more general BackingSource that provides both get and
put methods; puts write through to the real backing store whenever
performed as well as writing to the RecordMap in memory, making a
RecordMap with a non-default BackingSource a cache backed by something in
a two-way fashion.
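
Roughly, with the name and shape invented here just to pin the idea down:

public interface BackingSource<K, V, E extends Exception> {
    V get (K key) throws E;              // what ExpensiveGetter.expensiveGet did
    void put (K key, V value) throws E;  // write-through to the real backing store
}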

I may be a bit rusty on the syntax of giving enum constants behavior,
too. Clearly in this case that's the right thing to do, from an OO
perspective, rather than having a switch clause in the put method that
could get out of synch if someone decided to add PHANTOM to the thing for
whatever reason or a future JDK added more Reference types that
influenced GC policy in as-yet-unforeseen ways.
 
C

ClassCastException

Collections is certainly solvable.


It is not a perfect solution.

When calling a library some arrays would have to be marked as
@SmallArray to indicate that you can not call with a big array, because
the method calls length.

IMO this is barking up the wrong tree. Changing existing arrays to use
long lengths is going to break a ton of stuff and be very difficult to
pull off without a LOT of headaches.

So what should be done is to introduce a parallel *new* data structure
that is a long array and is treated as a different family of types to the
existing array types. You'd create one by using a long constant in the
square brackets: new Foo[29954683548976828345678L]. If you wanted to you
could make a "long" array new Foo [3L] that would be a long array in
terms of type compatibility while not actually being long; so you could
mix arrays of shorter-than-2^31 and longer arrays in the same code if you
had to. The Arrays class would have long-array versions of its methods
and methods to convert short to long arrays.

The trickier part is that we'd also need a bunch of new type names; Foo[]
would have to remain "a short array of Foo" so we'd need to allow, say,
Foo[L] or some such notation to mean "a long array of Foo" when an array
type needed to be specified. (The Arrays method signatures then tend to
have type parameters and argument overloads for T[] and T[L], and of
course T[L] makeLongArray <T> (T[] shortArray).)

The supersized BigCollection classes could be made before these changes,
using hierarchical array structures under the hood, and later have their
innards retrofit to use long arrays.
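
By "hierarchical array structures" I just mean something like an array of
arrays under the hood; a minimal sketch of the idea (BigList, BigArrayList
and the chunk size are all invented here, not anything official):

public interface BigList<T> {
    long size();
    T get (long index);
    void set (long index, T value);
}

public class BigArrayList<T> implements BigList<T> {
    private static final int CHUNK = 1 << 20;  // arbitrary chunk size
    private final Object[][] chunks;
    private final long size;

    public BigArrayList (long size) {
        this.size = size;
        int nChunks = (int) ((size + CHUNK - 1) / CHUNK);
        chunks = new Object[nChunks][];
        for (int i = 0; i < nChunks; i++) {
            long remaining = size - (long) i * CHUNK;
            chunks[i] = new Object[(int) Math.min(CHUNK, remaining)];
        }
    }

    public long size() { return size; }

    @SuppressWarnings("unchecked")
    public T get (long index) {
        return (T) chunks[(int) (index / CHUNK)][(int) (index % CHUNK)];
    }

    public void set (long index, T value) {
        chunks[(int) (index / CHUNK)][(int) (index % CHUNK)] = value;
    }
}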

As for numerics using arrays, if you really need fast numerics you might
want to contemplate simply going native, as long as you wrap whole
lengthy computations in JNI rather than each little individual step;
otherwise the overhead of going down and back up through JNI all the time
will ruin performance. The downside is you lose easy portability. At some
point Java needs a good numerics library that has many cross-platform
versions and takes advantage of SIMD, pipelining, and other CPU-
enhancement tricks on the appropriate architectures. Probably this means
a kind of added language and compilers that can make a DLL implementing
JNI methods plus a .class for you out of source code with Java method
declarations that contain expressions in a subset of FORTRAN, or
something of the sort, or even just a "native math enabled compiler" that
will turn Java methods in your source code into native methods that meet
certain criteria involving basically only doing arithmetic on primitives.

Actually that might be too limiting. Really you'd need some sort of
metacompiler or templating facility. Situations like that make me want to
use Lisp macros instead of just Java, so I can have higher-level source
code that still converts into primitive-arithmetic code after
macroexpansion and can then be eligible to become optimized native math
code.

Actually, what's *really* needed is for the JVM JIT to really take
advantage of the host CPU's numerics features. The problem is that by the
time the JIT is optimizing assembly any large-scale features of the
problem (e.g., that it's doing vector sums) that could inform such
optimizations have dissolved into a low-level representation where it
can't see the forest for the trees.

Nonetheless I've seen impressive performance from JITted code on -server
class machines, especially if the source was Clojure with macros used to
reduce high-level concepts at compile time into a tight arithmetic loop
or whatever. The results are comparable to a basic stab at coding the
arithmetic loop in assembly, e.g. 7-10ns per iteration for a few fpu
mults and adds with some compare-and-tests on a GHz CPU, the kind of
speed you'd get if the loop compiled down to just the obvious FPU
instructions with fairly efficient register use but no fancy SIMD/MMX/
etc. feature use or GPU use. Java gets the same speed with the FP loop
coded in Java in the obvious way; what macros get you is the ability to
have parameters in that loop that affect its structure in various ways
and if they're set at compile time the loop's as fast as if it were
simple. A good javac implementation might get you equivalent performance
if ifs that have compile-time-constant false expressions compile away to
just the else clause or nothing and ones with compile-time-constant true
expressions become just the then clause or nothing. With Lisp eval and
JIT, though, you get the same even if some parts of the loop aren't known
until runtime, which AFAICT is pretty much impossible in plain Java.

OK, rambling a bit. The upshot is that the JIT is the place that applies
the most leverage to optimizing numerics, since it could do so across all
JVM hosted languages and not just Java. The language might better support
this if, among other things, it supported long arrays. Floating-point
types larger than double and more efficient bignum types would also go a
long way. One problem with making a more efficient bignum type is that
there's no way in Java to check if an integer operation left the carry
bit set in the CPU, so you have to make do with 31- or 63-bit chunks in
your bignums and explicit tests of the high bit everywhere. The latter's
the bigger performance killer; if Java had an "if carry" you could use
immediately following an integer arithmetic expression, you could do
things like

newLow = low1 + low2;
if-carry {
    newHigh = high1 + high2 + 1;
} else {
    newHigh = high1 + high2;
}

with the compiler arranging it that low1 + low2 is done and stored in
some register; then the carry bit is tested; etc.

Better yet,

newLow = low1 + low2;
newHigh = high1 + high2 + carry;

would be nice! This could compile and JIT into very efficient assembly on
architectures that provide explicit add-without-first-clearing-carry
instructions as well as ordinary adds, and similarly for other arithmetic
operations, providing all the building blocks to assemble bignums right
on the chip.
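
For contrast, here's the sort of explicit test you're stuck with today if
you do use full 64-bit chunks, using the usual unsigned-compare trick (just
a sketch; add128 is a name I made up, not any real API):

// Add two 128-bit values held as (low, high) pairs of longs.
static long[] add128 (long low1, long high1, long low2, long high2) {
    long newLow = low1 + low2;
    // unsigned "newLow < low1" means the low-word add wrapped around
    long carry = ((newLow ^ Long.MIN_VALUE) < (low1 ^ Long.MIN_VALUE)) ? 1L : 0L;
    long newHigh = high1 + high2 + carry;
    return new long[] { newLow, newHigh };
}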

Of course, the most efficient bignum implementation will also depend on
the largest one-cycle-arithmetic word size the CPU supports. Maybe it's
best if bignums are special-case library classes instead. The existing
BigInteger and BigDecimal that are base-10 would be kept for
compatibility, and new BigInt and BigFloat classes added that are binary
and maximum-speed, with a group of architecture-selected native method
implementations provided with them and the one appropriate to the current
host hardware selected on the fly by the JIT the first time a bignum
native method got called.

Of course, then this new functionality should be made available to all
JNI users: the ability to supply several versions of the native code for
different architectures, labeled in some manner, for the JIT to select.
When code that calls the native method runs the first time, the most
appropriate one will be selected and the calling method will immediately
be JITted to an optimized assembly form that calls the specific,
appropriate native method for the CPU, so subsequent calls to that
calling method will not have to repeat the test-and-selection process (on
the presumption, valid for the foreseeable future, that the CPU
architecture will not change in the middle of a single program run -- but
if, in the future, program runs can be hibernated and then resumed on
changed hardware, all JIT caches will have to be invalidated on such
occasions anyway).

So, final conclusion:
* Add a parallel collection library that allows long-indexed collections
(size, indexing methods, etc. return long). Add RandomAccessIterator to
existing Iterator and ListIterator that allows indexed-sized forward and
backward jumps. Add a RandomAccessList interface that ArrayList
implements and that provides a RandomAccessIterator. Let the new
collection interfaces add an exception type parameter that can be
thrown by the methods, so specific implementations can be backed by disk
files and throw IOException or by databases and throw SQLException,
while the basic in-memory ones would have this type parameter set to
RuntimeException to indicate no additional checked exceptions get
thrown. RandomAccessFile would be retrofit to implement
RandomAccessList<byte>.
* Long arrays would be a good idea, but add them as new, parallel data
structures to the existing arrays so as not to add too many
compatibility headaches. The supersized collection implementations would
be internally reworked to exploit the new long arrays to make them more
efficient.
* JIT should be improved to optimize the use of long arrays.
* JIT should be improved to allow native method calls whose callers get
JITted to start calling an architecture-optimized version of the native
method selected at JIT time from among several provided alternatives
based on the actual host hardware at JIT time.
* JNI toolkit should provide a way to generate such alternative version
sets of native methods.
* JIT should invalidate code caches on any session-save-and-restore, if a
future JVM ever adds such a capability. Just save the
session with the cache empty, or else save hardware info with session
and invalidate if CPU arch is changed on restore.
* BigInt and BigFloat should be added to standard library, with efficient
multi-architecture native method alternative-sets for the major
arithmetic operations and specifyable binary precision in bits. (The
actual precision will be the next larger multiple of N bits, with N
usually either 32 or 64. You ask for a minimum precision and you get
the most efficient precision that's no lower.) Possibly add BigFixed, a
fixed-point non-integer type. BigInteger and BigDecimal don't change
from their present base-10 forms, again to avoid compatibility problems.
* Possibly add support for arrays that store contiguous blocks of records
of dissimilar primitive types, also to aid numerics. E.g. an array of
float float float int, float float float int blocks that gets stored as
a contiguous memory block. This might be implemented by adding a
primitiverecord class type to go along with class, interface, and enum,
which has pass-by-value semantics and can only contain primitive,
primitive array, and primitiverecord instance members. A
primitiverecord type is not a reference type! And it cannot contain any
references as instance members! Perhaps it shouldn't be allowed to have instance
methods or constructors, either; all fields public and zeroed at
instance creation. Arrays of a primitiverecord type store the records as
contiguous blocks. Disallow "char" and "byte" to discourage creating
imitation-legacy COBOLish code storing non-numeric data or rigid,
brittle binary file formats; allow float, double, int, long, and
possibly allow enums. Allow whatever static members.
* In the further future, possibly add a sophisticated higher-level
numerics library that uses the above.

The above changes, taken over time and in the order specified, would help
transition Java to 64-bit architectures and ever larger applications,
data sets, and machines, as well as gaining it some respectability as a
language for performing numeric calculations, overall making it better
suited for portably implementing the very large simulations that will be
increasingly important in the future in engineering, climate science, and
numerous other fields of endeavor.
 
C

ClassCastException

If Java gets 20 years as "it" and 20 years as "legacy", then that would
actually be more than OK.

Things evolve and sometimes it is better to start with a blank sheet of
paper.

64 bit array indexes, functions as a first class type, bigint and
bigdecimal as language types etc.

Clojure has all of this already except 64 bit array indexes and runs on
the JVM.

Clojure doesn't even have arrays, though, unless you drop down to Java to
use Java's arrays. Clojure's collections are built on Java's arrays and
collections, so some limits might start kicking in when you got to 2^32
elements; I'm not sure how they behave if they get that big.

A *real* future-proof language would, of course, have bigint array
indexes. :)
 
C

ClassCastException

Poul-Henning Kamp often abbreviated PHK.


He still contributes to FreeBSD.


I assume you are talking about this article:

http://queue.acm.org/detail.cfm?id=1814327

He is not suggesting any custom swap to disk or anything, but just
noting that it is beneficial to keep stuff together to minimize paging.

Does the new G1 collector do that? That is, are the collection spaces:

* 2^n pages in size, for some n and
* aligned on a 2^n-page boundary?

For example, collection spaces that take up 8 pages, and span contiguous
blocks of 8 pages starting at multiples of 8, so pages 0-7 could be one
space, 8-15 another, and so on.

If so, G1 will play nicer with paging than otherwise.

More interesting from a GC developer standpoint is the possibility of
grouping data that's used together. That could mean one of two things:

1. Trying to minimize the sum of the squares of the "lengths" of
references, where the length of a reference is the collection space
number of the space containing the reference minus that of the space
containing the referent, absolute value, and "collection space number"
numbers collection spaces consecutively in virtual memory order of
location.
2. Actually observing patterns of access, such as object X and object Y
tend to go long periods unused and then both tend to get referenced
within a short time of each other. Objects that are accessed together
get preferentially put close together, preferably in the same
collection space. Thus they're likely to be paged in and out together.
Plus they are likely to have similar lifespans, so both will probably
become garbage together. This makes both the paging system and the
GC's jobs easier.

Now alternative 2 we *almost* get for free with G1, since objects that
will tend to be used together also tend to be *born* together and such
objects will *usually* end up sharing one collection space (which
presumably starts out as a TLAB, then fills up, then is allowed to decay
into garbage over time). The trick is avoiding splitting them up later
when rearranging a few live objects to make a space be pure-garbage so it
can be recycled as a TLAB or whatever. Just trying to put the few live
objects from such a space into the *same* other space rather than
scattering them about the heap might suffice. If G1 keeps a partly-filled
space as a "old object tenuring space" and puts all such objects in it
until it's full, then moves on to another, that'll do it, and it will
cause a kind of self-organizing generational-spaces pattern to emerge
within the heap that will work fairly nicely with the OS's paging; long-
lived application global objects will accumulate to a few such spaces,
for instance. I think G1 *is* designed to behave somewhat like this, but
keeping the paging system in mind suggests how it might be possible to
fine-tune it. For example, one-page collection spaces for allocating all
objects no larger than, say, 64 bytes. (Most objects are this small.
What's left is mostly large arrays, and most of these will typically be
the innards of large HashMaps, HashSets, and ArrayLists. Large
collections tend to be long-lived or even application-global. You might
want a special-case optimization here to keep any Collection object that
directly references a large array together with that array in a single
big collection space instead of separating them. Collections into which
large arrays are placed as elements pose a bit of a problem for this, but
ought to be rare.)

What can programmers do by themselves, right now?

1. Investigate GC and paging performance as a function of your choice of
data structures and their sizes. Test if a roll-your-own disk-backed
structure performs better than keeping it in memory and letting the OS
page it out, particularly if it's infrequently accessed but then
accessed in bursts.

2. Try also breaking up large data structures into ones that ought to
span only a few pages each and that keep things that are accessed
together in one substructure together (a small sketch follows after
this list).

3. Breaking the problem itself up, parallelizing it, might make the data
structures smaller. The smaller problems can be run serially,
minimizing heap size, or multithreaded.

4. Prefer immutable objects whenever possible. All modern garbage
collectors are far more efficient with new objects referencing old
ones than with old objects referencing new ones. Old objects
referencing new ones obviously contain mutable fields.
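
As a rough illustration of point 2 (the class and field names here are
invented): instead of one giant flat map, group entries so each group
lives in its own small submap, which then tends to get paged and collected
as a unit:

import java.util.HashMap;
import java.util.Map;

class OrderIndex {
    static class Order {
        final long id;
        final String description;
        Order (long id, String description) {
            this.id = id;
            this.description = description;
        }
    }

    // one small submap per customer, instead of one huge flat map
    private final Map<Long, Map<Long, Order>> byCustomer =
            new HashMap<Long, Map<Long, Order>>();

    void add (long customerId, Order order) {
        Map<Long, Order> forCustomer = byCustomer.get(customerId);
        if (forCustomer == null) {
            forCustomer = new HashMap<Long, Order>();
            byCustomer.put(customerId, forCustomer);
        }
        forCustomer.put(order.id, order);
    }
}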

In fact, all of this points to a clear division between two fundamentally
different classes of object as far as GC and paging treatment is
concerned:

On the one hand, we have the ordinary object. It's typically small, with
only a few fields, and often these are likewise small ordinary objects
down to a small depth before you hit primitives. It's often immutable.
It's usually young and usually not long for this world. Its reference-
type fields, if it even has any, tend to refer to older objects. It plays
very nicely with garbage collection and if the GC doesn't touch garbage,
only live objects, it plays very nicely with paging too.

On the other, we have the large long-lived collection. It's typically a
HashFoo or ArrayList, less commonly a TreeFoo or LinkedList. It usually
contains a large array that's a big, contiguous block of memory full of
references. These tend to include lots of references to much-younger
objects. It tends to slow the GC down a lot because of these references,
and it plays poorly with paging, as the Collection object itself, the
array, and the elements get scattered all over the heap. The occasional
replacement of the array with a bigger array certainly doesn't help.

Interestingly, this suggests an alternative answer to our array issues
that may seem very counterintuitive at first. Instead of adding long
arrays across the board, consider that arrays really serve two purposes.

On the one hand, arrays of primitives serve various numeric and data-
storage purposes (including backing Strings and being IO buffers). On the
other, arrays of reference types mostly seem to serve as the backing
structures of Collections, or else as poor-man's Collections themselves.
These are two separate uses with two separate purposes.

Arrays of primitives divide into three subgroups: char arrays backing
Strings, byte array IO buffers, and numeric arrays used for numerics. The
third has the strongest case for needing to be able to grow huge and
retain high performance, so let's let arrays of int, double, long, and
float get huge.

Strings are really a peculiar form of immutable List that doesn't
actually implement the List interface and should be treated as oddball
Collections.

IO buffers never should be very large, as a rule, except for some
applications involving buffering media during playback to smooth over
hiccups in disk (or network!) traffic and meet the hard realtime
requirements of media playback to avoid visible stuttering, pausing,
glitching, skipping, framerate fluctuation, and other such artifacts that
degrade the user's enjoyment of the media. Byte arrays with the existing
size limit seem to serve these purposes well enough.

And now, Collections. Collection access is usually not too too speed
critical. A few interface call overheads and the like are generally
accepted with their use, after all.

Do Collections really need to be backed by large single arrays? Or would
performance remain adequate if they were backed by smaller arrays that
sometimes formed treelike structures?

Performance with an array-tree obviously gets a factor of K log N
multiplied into the space and time overhead, but the constant K is very
small if the maximum single array size is even 1000 cells, for a few KB
of references and a total array size in memory that lets it fit inside a
single 4KB page on a typical system. In fact, with 1000 cells before
splitting, K ends up being 1/log 1000, so somewhat smallish. Or put
another way, the tree's depth is the overhead factor, and for collections
of 1 billion elements, almost half the present de facto size limit, this
overhead factor is just 3 (log base 1000 of 10^9 = 9/3 = 3), which is
quite tolerable in code where we are working at a fairly high level of
abstraction and tend not to need blazing speed.

So it might actually be viable to stop using arrays of references (and
arrays of char) larger than 1000, and the resulting data structures might
be only a little slower to use (and only when bigger than 1000 elements).
But they can also become a lot friendlier to the GC and paging system.
For one thing, no single contiguous huge array needs to have room found
for it by the GC. If only large arrays of non-char primitives seem to be
needed, these can be special cased by the GC (for instance, they cannot
source references to other objects).

Everything else in the heap is then small enough to fit in small
collection spaces and in single OS paging system pages, including the
bits of Collections and larger Strings. (The substring aliasing problem
with GC also is ameliorated. Small substrings (<= 1000 characters) of
large Strings would usually be able to get away with only holding a
reference to a single at-most-1000-element array of char, and the rest of
the time to two such arrays plus an at-most-1000-element array of
references.)

There's still the matter of organizing the bits of a Collection (or
String) to optimize GC and paging. Collection-backing arrays will still
be full of references to younger objects, for example. But if neighboring
objects in a collection tend to have similar lifetimes and to be used
together, the array segment referencing them and the objects themselves
could perhaps be kept together.

Ultimately it might be useful to make the GC distinguish among three
broad categories of objects: large primitive arrays and Strings (big, no
outbound references); large Collections (big; many references to younger
objects); and everything else (mostly small; few references to younger
objects). It can then optimize for these cases in some way.

I think G1 will reduce the problems that come from Java GC interacting
with paging (such as GC hauling the entire heap image into memory at
once, causing a long pause as the OS pages gigs of data from disk).
Further improvements to GC and paging performance of Java may require
researching, particularly, how large Collections interact with GC and
paging. Large Collections have a host of features that combine to make
them impact these processes severely:

* They tend to be long-lived, but referring to younger objects, which
hurts the GC.
* They tend to contain a large, contiguous array that cannot be
fragmented, limiting the GC's options. (Using arrays of size only up
to 1000 cells in collection implementations would alleviate this.)
* If traversed or random-accessed, this array will have to be paged
entirely into memory. Objects accessed at similar times won't tend to
have similar hashCode values, so large HashMaps may perform poorly when
partially paged out. GC traversal during collection will force a full
page-in. Rehashing algorithms that jump all over the array looking for
a free cell will force a full page-in, though the load factor gimmick
somewhat reins in rehashing.
* The array is occasionally copied into a bigger array, meaning more RAM
juggling by the allocator and large contiguous chunks of garbage for the
GC to eat. This also makes the Collection object itself older than the
array AND the array older than many of its cell referents at a "typical"
time. The array growings are rare, especially at larger sizes, but the
reference-age issue is continual.
* The Collection object, its array, and the elements, and possibly other
objects referenced by the Collection object such as a Comparator, are
scattered about the heap. The array becomes garbage if the Collection
does, but they're not going to be in the same collection space. Likely
the contents also become garbage at this time, most of them anyway.
They also typically need to be paged in together, but are scattered.

There are two higher-level mitigation strategies possible.

* Programmers can consider replacing HashMaps and similar data structures
that become very large and become performance bottlenecks with more
hierarchical data structures that keep related, accessed-together and
similar-lifespan objects in one substructure. The entire substructure
and many of its parts may tend to be swapped in and out together and,
eventually, to become garbage together.

Taken to an extreme, the substructures might be made immutable using
Collections.unmodifiableFoo, at least those whose membership is going
to be constant or near-constant over their lifetime (if only near-
constant, it gets copied-with-mutation). This drastically cuts down on
old-to-young references; a small sketch of this follows after this list.

* New Collection classes can be made that have a tree-structure, and
possibly use not only small but *immutable* arrays and substructures
under the hood, again solving the old-to-young reference issue;
these could also help the programmer keep related (used-together and
will-die-together) objects in substructures. If the objects and the
substructure are all allocated at close to the same time, they may tend
to be close in RAM and in one collection space, making the whole lot GC
and page as one without pain.
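
To be concrete about the copied-on-mutation idea (just a sketch; the class
name is invented, while Collections.unmodifiableList is the real JDK method):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A near-constant substructure: read often, mutated rarely; every mutation
// builds a fresh copy, so a published list never has younger objects
// written into it after it is constructed.
class Snapshot<T> {
    private List<T> members = Collections.emptyList();

    List<T> members() {
        return members;   // unmodifiable; callers cannot mutate it
    }

    void add (T element) {
        List<T> copy = new ArrayList<T>(members);
        copy.add(element);
        members = Collections.unmodifiableList(copy);
    }
}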

I will now point out that Clojure has collections that internally use
tree structures and immutable arrays, so that all the "modifying"
operations return copies rather than mutate the original. The copies
share most of the original's internal structure, so the memory overhead
and copying costs of this aren't nearly as bad as it first sounds. So
using Clojure, using the clojure.lang collection classes from Java, or
making a new Java library of similar data structures and using that could
reduce paging and GC times in your applications.
 
T

Tom Anderson

The case I'm aware of involves a TreeSet with a Comparator, that is not
consistent with the .equals methods of the TreeSet elements. The TreeSet
always goes by the Comparator results. That means the TreeSet could
contain elements a and b such that a.equals(b).

True.

Though that feels more like "TreeSet requires its Comparator to be
consistent with equals" than "TreeSet lies about the Set contract".

If i write this:

class MakeAHashOfThings {
    public int h;
    public int hashCode() {
        return h;
    }
}

Set s = new HashSet();
MakeAHashOfThings o = new MakeAHashOfThings();
o.h = 1;
s.add(o);
o.h = 2;
s.add(o);

Is HashSet now breaking the Set contract?

A contract places obligations on both parties. The Set contract requires
the implementation not to contain multiple equal items. But the TreeSet
and HashSet contracts (and classes do constitute their own contracts,
which one must agree to in order to construct them) require the calling
code to use valid Comparators and hashCodes. If the calling code violates
the terms of the contract, then the whole thing is null and void anyway.

tom
 
T

Tom Anderson

Tom said:
On 7/9/2010 12:45 PM, Tom Anderson wrote:
On Thu, 8 Jul 2010, Eric Sosman wrote:

Or, you could have BigList implement List but "lie" in its .size()
method, in somewhat the same way TreeSet "lies" about the Set contract.

How does TreeSet lie about the Set contract?

The case I'm aware of involves a TreeSet with a Comparator, that is not
consistent with the .equals methods of the TreeSet elements. The TreeSet
always goes by the Comparator results. That means the TreeSet could
contain elements a and b such that a.equals(b).

Though that feels more like "TreeSet requires its Comparator to be
consistent with equals" than "TreeSet lies about the Set contract".

If i write [...] Is HashSet now breaking the Set contract?

But TreeSet does not do completely wild, uncontracted things given an
inconsistency, the way HashSet does given a hash that is inconsistent
with equals. TreeSet is perfectly usable given a Comparator that is
internally consistent, even if it is not consistent with equals.

Another way of looking at this is "TreeSet acts as though its Comparator
is consistent with equals.".

Well *another* way (actually, the first way again) of looking at it is
that TreeSet expects its Comparator to be consistent with equals, but,
like a world-weary Latin master, tolerates failure to meet its
expectations.

Or perhaps a more neutral, and complete, way to put it is:

Note that the ordering maintained by a set (whether or not an explicit
comparator is provided) must be consistent with equals if it is to
correctly implement the Set interface. (See Comparable or Comparator for
a precise definition of consistent with equals.) This is so because the
Set interface is defined in terms of the equals operation, but a TreeSet
instance performs all element comparisons using its compareTo (or
compare) method, so two elements that are deemed equal by this method
are, from the standpoint of the set, equal. The behavior of a set is
well-defined even if its ordering is inconsistent with equals; it just
fails to obey the general contract of the Set interface.

tom
 
A

Arne Vajhøj

Clojure has all of this already except 64 bit array indexes and runs on
the JVM.

Clojure doesn't even have arrays, though,

I don't think Clojure has what it takes to become a
mainstream language.

Arne
 
C

ClassCastException

The syntax has had 50 years to become popular without making it.

Er, Clojure's only been around for maybe 5 years, tops; I think rather
less than that.

If you mean Lisp syntax in general, Clojure has a bunch of improvements/
differences over older Lisps there, which may be enough to make a
difference. And its assets in other areas (compiles to fast, widely
portable bytecode, transactional memory/other concurrency features, FP
benefits, macros, incremental development and testing at a REPL) could
outweigh syntax issues.

Keep in mind that older Lisp environments mostly required, if not
specialized hardware, then at least rather esoteric software. Proprietary
virtual machines/interpreters, poor library/OS support, emacs, etc. The
typical Lisp environment of yore was totally unsuited to developing
desktop apps or much else than command-line tools, command-line
curiosities, and CS research papers. :)

Clojure on the other hand can a) generate decent desktop apps and b)
generate decent web apps, one a traditionally important area and the
other a huge current growth area. It can be used in applets (in the
browser's JVM) and server-side (and there are already Rails-like
frameworks based on it -- Rails will be its main competition there, in
fact).

Oh, and some people LIKE the syntax. It's very simple and regular. The
people that run screaming from perl (it's line noise, I tell you! LINE
NOISE!) and find Java verbose (how many f*@!ing times must I repeat
"WeakIdentityHashMap<String,Foobar>" on this one line??? ARRRRGH) could
easily grow to love Clojure's syntax.

Really, though, having a REPL is *alone* a major feature that I miss
badly when developing in Java and similar languages. Testing Java code is
*painful*: you write a short method to do X and then you either write
version 0.1 of the program with many unimplemented features but the full
core functionality before first running any of it and ARGH! Why that f&#!
ing NPE? How the #&%! can THIS possibly be null! AAAAAGH! or you have to
spend more time writing test boilerplate than writing the actual program
(and half of THAT is boilerplate, too). A three line method to do some
iteration over an ArrayList generating a total of some sort? Ten lines of
test case, including boilerplate, setup, teardown, etc. etc. etc., plus a
couple of lines that actually call the method being tested. In a language
with a REPL? Three lines of method and then two lines at the REPL, one to
create a sample list to use as test input and one to run the method.
Maybe another one to run the method on an empty-list literal or whatever
to check the obvious corner case. Find a bug? Edit the method, recompile,
rerun the last two commands at that REPL.

Oh, and in Clojure, summing over a list becomes a one-liner: (reduce + 0
the-list). There's a kind of while loop in Clojure but you almost never
need it because it's often much, MUCH easier to express what you're doing
in terms of mapping and reducing operations over sequences of some sort,
and even then that while loop's only there because JVMs as a rule don't
do tail call optimization.

It does not seem likely that it will now.

They said that about 3D movies after almost exactly 50 years had elapsed
since the previous round of experiments in 3D cinematography flopped
horribly. Then James Cameron's big 3D FXtravaganza turned into a box-
office juggernaut. Who's laughing now?
 
L

Lew

ClassCastException said:
Testing Java code is
*painful*: you write a short method to do X and then you either write
version 0.1 of the program with many unimplemented features but the full
core functionality before first running any of it and ARGH! Why that f&#!
ing NPE? How the #&%! can THIS possibly be null! AAAAAGH! or you have to
spend more time writing test boilerplate than writing the actual program
(and half of THAT is boilerplate, too). A three line method to do some
iteration over an ArrayList generating a total of some sort? Ten lines of
test case, including boilerplate, setup, teardown, etc. etc. etc., plus a
couple of lines that actually call the method being tested.

Nonsense. Nice hyperbole, though.

For one thing, NPE is a programmer error, nearly always easily preventable.
And it takes but a moment to realize how the freak it can be null. So either
you're exaggerating or you're not so very good at Java programming and
shouldn't be drawing comparisons about languages you don't know so well.

So AAAAAGHH! stop trying to start a Language War.
 
A

Arne Vajhøj

Er, Clojure's only been around for maybe 5 years, tops; I think rather
less than that.

If you mean Lisp syntax in general, Clojure has a bunch of improvements/
differences over older Lisps there, which may be enough to make a
difference. And its assets in other areas (compiles to fast, widely
portable bytecode, transactional memory/other concurrency features, FP
benefits, macros, incremental development and testing at a REPL) could
outweigh syntax issues.

Keep in mind that older Lisp environments mostly required, if not
specialized hardware, then at least rather esoteric software. Proprietary
virtual machines/interpreters, poor library/OS support, emacs, etc. The
typical Lisp environment of yore was totally unsuited to developing
desktop apps or much else than command-line tools, command-line
curiosities, and CS research papers. :)

Clojure on the other hand can a) generate decent desktop apps and b)
generate decent web apps, one a traditionally important area and the
other a huge current growth area. It can be used in applets (in the
browser's JVM) and server-side (and there are already Rails-like
frameworks based on it -- Rails will be its main competition there, in
fact).

Unless these features are tied to the syntax, people will prefer
a language with those features and a more standard syntax.

Oh, and some people LIKE the syntax.

Sure. But not enough to make the language a success.

They said that about 3D movies after almost exactly 50 years had elapsed
since the previous round of experiments in 3D cinematography flopped
horribly. Then James Cameron's big 3D FXtravaganza turned into a box-
office juggernaut. Who's laughing now?

Probably James Cameron.

But that is not particularly relevant to the topic.

Arne
 
C

ClassCastException

Nonsense. Nice hyperbole, though.

For one thing, NPE is a programmer error, nearly always easily
preventable. And it takes but a moment to realize how the freak it can
be null.

Well, pardon me for being more humorous than exactingly precise in a MADE-
UP EXAMPLE.
 
