supercalifragilisticexpialadiamaticonormalizeringe
Each String instance has the following fields:
private final char value[];
private final int offset;
private final int count;
private int hash;
There are 12 bytes in addition to the char array. The offset and count
fields allow quick sub-string construction, and hash is used to cache
the hashCode result.
Oh, geez, even *more* overhead. And let's not forget the array has its
own separate object header and length field!
The array may be shared by several String objects.
It usually won't be. Really, how often does anyone use .substring except
for a very short-lived object that usually is fed directly into
StringBuilder.append() or something that calls that under the hood, or
else to an I/O write operation?
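For anyone who hasn't looked at how the offset and count fields enable that sharing, here is a minimal sketch. SharedString and its sharesArrayWith() helper are hypothetical names made up for illustration; this is not the actual java.lang.String source, just the same idea in miniature.

```java
// Hypothetical illustration class, NOT the real java.lang.String.
final class SharedString {
    private final char[] value;  // backing array, possibly shared with other instances
    private final int offset;    // index in value of this string's first character
    private final int count;     // number of characters in this string

    SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // Constant time: no characters are copied, the new object
    // simply points into the same backing array.
    SharedString substring(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new IndexOutOfBoundsException();
        }
        return new SharedString(value, offset + begin, end - begin);
    }

    // Demo-only helper: true if both objects use the same array.
    boolean sharesArrayWith(SharedString other) {
        return value == other.value;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

The flip side of this design is that a short substring keeps the whole original array reachable for as long as the substring lives.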
In general, many trade-offs in Java, not just the decision to make every
object capable of being a lock, assume that other considerations are
more important than minimizing memory use. For example, caching the hash
code pays four bytes per String in order to have a hash code that
depends on the entire string, without paying the cost of calculating it
repeatedly when a String is used as a hash table key.
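To make the trade-off concrete, here is a rough sketch of lazy hash caching. CachedHashKey and its computations counter are hypothetical and exist only for this illustration. The 31-based polynomial is the formula documented for String.hashCode(); the rest is an assumption about how such a cache might look.

```java
import java.util.Arrays;

// Hypothetical illustration class, not JDK code.
final class CachedHashKey {
    private final char[] value;
    private int hash;      // cached hash; 0 doubles as "not computed yet"
    int computations;      // demo-only counter of full hash computations

    CachedHashKey(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            computations++;
            for (char c : value) {
                h = 31 * h + c;   // same polynomial as String.hashCode()
            }
            hash = h;             // the four bytes per object being paid for
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CachedHashKey
            && Arrays.equals(value, ((CachedHashKey) o).value);
    }
}
```

A string whose hash happens to be 0 gets recomputed on every call in this sketch, since 0 is also the "not computed" marker.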
Funnily enough, using four characters (if there are that many, else the
whole string) from near the middle of the string would probably work
nearly as well, even for the fairly common cases of many strings sharing
a common prefix, suffix, or both. Strings with highly regular middles
and variable ends are not very common by contrast. And what does that
require?
int mid = length >> 1; // emphasizing that a cheap shift op works
int start = Math.max(mid - 2, 0);
int end = Math.min(mid + 2, length);
int hash = 0;
int fct = 1;
for (int i = start; i < end; ++i) {
    hash += fct * content[i]; // content is the string's char array
    fct *= 256;
}
For the common case of Latin-1 strings this turns the characters there
into the hash bytes directly. Throw in some Unicode characters outside
Latin-1 and it gets a bit more interesting, as each such character may
affect two bytes of the hash, except the last one of the four, whose
high byte is shifted out of the 32-bit int.
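Here is the same idea wrapped up as something you can compile and poke at. MidHash and midHash() are names invented for this sketch, and it takes a CharSequence rather than reaching into a char array; otherwise the arithmetic is unchanged.

```java
// Hypothetical helper for experimenting with the middle-four-characters hash.
final class MidHash {
    static int midHash(CharSequence content) {
        int length = content.length();
        int mid = length >> 1;
        int start = Math.max(mid - 2, 0);
        int end = Math.min(mid + 2, length);
        int hash = 0;
        int fct = 1;
        for (int i = start; i < end; ++i) {
            hash += fct * content.charAt(i); // each char lands one byte higher
            fct *= 256;
        }
        return hash;
    }
}
```

For "abcd" the four Latin-1 characters become the four bytes of the result, lowest byte first, so midHash("abcd") is 0x64636261. Two strings that differ only in the middle, such as "prefix-AAAA-suffix" and "prefix-BBBB-suffix", hash differently, while strings that differ only at the ends, such as "abcd" and "xxabcdyy", collide. That is the weakness being traded for the constant cost.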
Of course, they could also have used a smarter caching strategy. When is
hash caching useful? When the string's in a hash map and going to be
looked up in it frequently. But this turns into two subcases:
1. The string already in the hash map is the same *object* as the
string used for lookup.
2. The strings are not the same object, though they have the same
content.
In the latter case, the string passed to get() is obviously not interned
and is probably being constructed anew each time, likely from I/O reads.
Caching its hash is useless since it's going to be GC'd and recreated
sans cached hash. In the former case, the string probably *is* interned,
in which case the smart place for the hash cache is in the *string
interning table* rather than in the individual string objects,
particularly if you could arrange the under-the-hood implementation to
use an int[] to hold *all* the hashes instead of separate int fields all
over the system.
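A very rough user-level sketch of what keeping the hashes in the interning table might look like follows. Everything here is hypothetical: InternTableSketch, the integer ids it hands out, and hashOf() are invented for illustration. A real VM would do this under the hood, and this version still leans on an ordinary HashMap (and therefore on String's own cached hash) just to find existing entries.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: hashes live in one int[] owned by the table,
// not in a field on every string object.
final class InternTableSketch {
    private final Map<String, Integer> ids = new HashMap<>();
    private String[] strings = new String[16];
    private int[] hashes = new int[16];   // hashes[id] is the hash of strings[id]
    private int size;

    // Returns a small integer id; equal strings always get the same id.
    int intern(String s) {
        Integer existing = ids.get(s);
        if (existing != null) {
            return existing;
        }
        if (size == strings.length) {      // grow both arrays together
            strings = Arrays.copyOf(strings, size * 2);
            hashes = Arrays.copyOf(hashes, size * 2);
        }
        strings[size] = s;
        hashes[size] = s.hashCode();       // computed once, at interning time
        ids.put(s, size);
        return size++;
    }

    String stringOf(int id) {
        checkId(id);
        return strings[id];
    }

    // Constant-time lookup of the stored hash, no recomputation.
    int hashOf(int id) {
        checkId(id);
        return hashes[id];
    }

    private void checkId(int id) {
        if (id < 0 || id >= size) {
            throw new IndexOutOfBoundsException("unknown id " + id);
        }
    }
}
```

Whether this actually saves memory depends entirely on how the table itself is implemented; the sketch only shows the shape of the idea, not a measured win.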
If, for your purposes, minimal memory use is very important, you may
want to consider other languages with other trade-offs.
And here I thought they were trying to heavily push Java for use on
mobile phones and other devices with limited memory.