new String ( byte[] , encoding ) under the hood

Roedy Green

I was curious how new String ( byte[], encoding ) could guess the
correct size of the buffer to convert into String.

It makes an estimate based on the number of bytes times the maximum number
of chars per byte, an attribute of the encoding. This will be slightly
on the high side if there are any multibyte chars, but accurate for
Latin-1. It then decodes, and calls trim, which uses System.arraycopy to
get a char[] of the right size. The new String then does another
System.arraycopy.
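
A rough sketch of that sizing-and-trimming sequence, written against the public java.nio.charset API. The class and method names (DecodeSketch, estimateChars, decode) are made up for illustration, and the JDK's internal converter code may well differ in detail:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DecodeSketch {
    // Upper bound on the chars needed: byte count times the charset's maxCharsPerByte.
    static int estimateChars(int byteCount, Charset cs) {
        return (int) Math.ceil(byteCount * (double) cs.newDecoder().maxCharsPerByte());
    }

    // Decode into an oversized buffer, trim it, then build the String.
    static String decode(byte[] bytes, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder();
        char[] oversized = new char[estimateChars(bytes.length, cs)]; // first char[]
        CharBuffer out = CharBuffer.wrap(oversized);
        decoder.decode(ByteBuffer.wrap(bytes), out, true);
        decoder.flush(out);
        char[] trimmed = Arrays.copyOf(oversized, out.position());    // second char[] (the "trim" copy)
        return new String(trimmed);                                   // String copies once more
    }

    public static void main(String[] args) {
        byte[] utf8 = "na\u00efve".getBytes(StandardCharsets.UTF_8);  // 6 bytes, 5 chars
        System.out.println("estimate: " + estimateChars(utf8.length, StandardCharsets.UTF_8));
        System.out.println("decoded:  " + decode(utf8, StandardCharsets.UTF_8));
    }
}
```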

You leave in your wake the original byte[], two char[] and the string.

Going the other way String -> byte uses similar logic, but the buffer
size is not so fortunate. For UTF-8 it makes the conservative
assumption each char might need 3 bytes, making the buffer 3 times
bigger than it needs to be in the ordinary case.
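
To see how far off the worst-case estimate can be, here is a small sketch that compares it with the actual encoded length. It uses CharsetEncoder.maxBytesPerChar() from the public API; EncodeEstimate and worstCaseBytes are illustrative names, not anything in the JDK:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeEstimate {
    // Conservative buffer size: char count times the charset's maxBytesPerChar.
    static int worstCaseBytes(String s, Charset cs) {
        return (int) Math.ceil(s.length() * (double) cs.newEncoder().maxBytesPerChar());
    }

    public static void main(String[] args) {
        String s = "plain ASCII text";
        // For text like this, every char encodes to a single UTF-8 byte,
        // so the worst-case figure is far larger than what is really needed.
        System.out.println("worst case: " + worstCaseBytes(s, StandardCharsets.UTF_8));
        System.out.println("actual:     " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```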

Sun could streamline these operations to cut out the
intermediate objects.

Here's an idea. Why not allow strings, char arrays, etc. to be
temporarily too big? They would be logically sized. Only on the next GC
do the objects get pruned to size if need be. You would save a lot of
copying and object creation just to get arrays of precisely the correct
size. There would be a method to prune an array to size that just
logically chopped it and marked it for later true pruning. Most of the
time though such objects will soon be discarded, and you then get away
without ever doing the copy.
 
Stefan Schulz

Here's an idea. Why not allow strings and char arrays etc to
temporarily be too big. They are logically sized. Only on the next GC
do the objects get pruned to size if need be. You would save a lot of
copying and new object creating just to get arrays the precise correct
size. There would be a method to prune an array to size that just
logically chopped it and marked it for later true pruning. Most of the
time though such objects will soon be discarded, and you then get away
without ever doing the copy.

I would mainly object that doing things this way would make the garbage
collector do work it is not normally supposed to do, that is, the
housekeeping of the String class. The garbage collector is supposed to
reclaim unreachable objects. It is pretty good at that. It is not
supposed to do much else.

If this truly and absolutely does become the bottleneck of your
application, I would suggest doing it "by hand" in a more efficient way
(for example, re-using one char[] as decode buffer).
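
One way the "by hand" approach might look, as a minimal sketch: a decoder object that keeps one scratch CharBuffer and only reallocates when an input would not fit. ReusableDecoder is a hypothetical helper, it is not thread-safe, and the final toString() still copies the chars into the new String:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ReusableDecoder {
    private final CharsetDecoder decoder;
    private CharBuffer scratch = CharBuffer.allocate(256); // reused across calls

    ReusableDecoder(Charset cs) {
        // Replace bad input instead of reporting it, as new String(byte[], Charset) does.
        decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }

    String decode(byte[] bytes) {
        int needed = (int) Math.ceil(bytes.length * (double) decoder.maxCharsPerByte());
        if (scratch.capacity() < needed) {
            scratch = CharBuffer.allocate(needed); // grow only when the input demands it
        }
        scratch.clear();
        decoder.reset();
        decoder.decode(ByteBuffer.wrap(bytes), scratch, true);
        decoder.flush(scratch);
        scratch.flip();
        return scratch.toString(); // one copy, into the String itself
    }

    public static void main(String[] args) {
        ReusableDecoder d = new ReusableDecoder(StandardCharsets.UTF_8);
        System.out.println(d.decode("first".getBytes(StandardCharsets.UTF_8)));
        System.out.println(d.decode("second".getBytes(StandardCharsets.UTF_8)));
    }
}
```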
 
Chris Uppal

Roedy said:
Here's an idea. Why not allow strings and char arrays etc to
temporarily be too big. They are logically sized. Only on the next GC
do the objects get pruned to size if need be. You would save a lot of
copying and new object creating just to get arrays the precise correct
size. There would be a method to prune an array to size that just
logically chopped it and marked it for later true pruning. Most of the
time though such objects will soon be discarded, and you then get away
without ever doing the copy.

Some GCed languages do allow you to change the size of array objects,
and that /could/ be implemented in the way you describe (though I'm not
sure it'd be worth it). Off the top of my head, I cannot think of a
persuasive reason why Java does not allow dynamic resizing of arrays.

-- chris
 
Mike Schilling

Roedy Green said:
I was curious how new String ( byte[], encoding ) could guess the
correct size of the buffer to convert into String.

It makes an estimate based on number of bytes times the max number of
chars per byte, an attribute of the encoding. This will be slightly
on the high side if there are any multibyte chars, but accurate for
Latin-1. It then decodes, and calls trim to System.arraycopy to get an
char[] the right size. The new String then does another
System.arraycopy.

Why doesn't it just create the String via new String(char[], int, int),
which would eliminate the extra copy? Better still would be a String
constructor that takes an array of character arrays and a total length
(say, new String(char[][], int)), to eliminate the need to allocate a big
contiguous character array in the first place. Instead, smaller buffers
(say, 4K) could be allocated as required. String always has to copy all the
characters to make an immutable char array, so the place to achieve savings
would be before this.
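
For reference, the existing new String(char[], int, int) constructor already lets caller code skip the trim step: it copies only the requested range out of an oversized buffer. A small sketch (OffsetCopy and fromPrefix are illustrative names); whether the JDK's own decoding path could use it the same way is a separate question:

```java
class OffsetCopy {
    // Build a String from the first `count` chars of a possibly oversized buffer.
    static String fromPrefix(char[] buf, int count) {
        return new String(buf, 0, count); // copies the range; no separate trimmed array needed
    }

    public static void main(String[] args) {
        char[] buf = new char[16];        // deliberately larger than the content
        "abc".getChars(0, 3, buf, 0);
        String s = fromPrefix(buf, 3);
        buf[0] = 'X';                     // later changes to the buffer do not affect the String
        System.out.println(s);            // prints "abc"
    }
}
```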
 
Roedy Green

Some GCed languages do allow you to change the size of array objects,
and that /could/ be implemented in the way you describe (though I'm not
sure it'd be worth it). Off the top of my head, I cannot think of a
persuasive reason why Java does not allow dynamic resizing of arrays.

Resizing down should be easier than resizing up. With down you don't HAVE
to change any allocation right away. Resizing down is very common. Resizing
up is usually done with ArrayList, where you allocate new RAM and copy.
 
Roedy Green

String always has to copy all the
characters to make an immutable char array, the place to achieve savings
would be before this.

I wonder if there could be some way to hand off a char array to be
inserted in a string. The problem is ensuring nobody holds onto a
reference to it.

That way you could avoid the copy. It gets quite silly how much
copying goes on to do the simplest things.

There needs to be some low-level, perhaps even hardware, mechanism to
hand off a chunk of RAM in such a way that the original owner can no longer
meddle with it, and perhaps not even see it. It might be done by
remapping a VM page to another address.
 
Stefan Schulz

Roedy said:
There needs to be some low level, perhaps even hardware mechanism to
hand off a chunk of RAM in such a way the original owner can no longer
meddle with it, and perhaps not even see it. It might be done by
remapping a vm page to another address.

This is something you generally cannot truly control, unless you
remap the entire page to be read-only afterwards (which is, by the way,
far below Java's threshold when it comes to OS-specific features).

Also, as much as it makes me sound like a spoilsport: memory copies are
cheap. I would be extremely surprised if the string copying made up
even 1% of your typical application's time.

To phrase things a bit differently: Before worrying about how you
handle your strings, worry about the performance of your XML parser,
your database driver and GUI. ;)
 
Mike Schilling

Why doesn't it just create the String via new String(char[], int, int),
which would eliminate the extra copy? Better still would be a String
constructor that takes an array of character arrays and a total length,
(say, new String(char[][], int)) to eliminate the need to allocate a big
contiguous character array in the first place. Instead, smaller buffers
(say, 4K) could be allocated as required. String always has to copy all
the characters to make an immutable char array, the place to achieve
savings would be before this.

Let me refine this. What I'd really like to see is

new String(Reader rdr, int maxLength)

Create a string from characters read from rdr, with the length of the
resulting string being the minimum of

1. The number of characters that can be read from rdr, and
2. maxLength

Likewise

new String(InputStream strm, String encoding, int maxBytes)

This would eliminate the need to move all of the characters into a
contiguous array to be copied to a second contiguous array. It would be of
great help in, for instance, XML parsers, where text that needs to be put
into a string quite possibly crosses buffer boundaries.
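
Since no such constructor exists, the closest thing today is a static helper. The sketch below shows the intended semantics of the Reader variant only; ReaderStrings.readString is a hypothetical name, and unlike the proposed constructor it still goes through a StringBuilder, so it does not save the final copy:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderStrings {
    // Read at most maxLength chars from rdr, stopping early at end of stream.
    static String readString(Reader rdr, int maxLength) {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[4096];    // small reusable chunk, as suggested in the thread
        try {
            while (sb.length() < maxLength) {
                int want = Math.min(chunk.length, maxLength - sb.length());
                int n = rdr.read(chunk, 0, want);
                if (n < 0) {
                    break;                // reader exhausted before maxLength was reached
                }
                sb.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(readString(new StringReader("hello world"), 5)); // prints "hello"
        System.out.println(readString(new StringReader("hi"), 10));         // prints "hi"
    }
}
```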
 
Chris Uppal

I said:
Some GCed languages do allow you to change the size of array objects,
and that /could/ be implemented in the way you describe (though I'm not
sure it'd be worth it). Off the top of my head, I cannot think of a
persuasive reason why Java does not allow dynamic resizing of arrays.

Having thought more about it, I suspect that the problem is that there is no
safe /and/ fast way of allowing multi-threaded access to an array which can
change size.

If you get the threading wrong for accessing an array of fixed size, then all
that happens is that your application reads the wrong data. With a resizable
array, if the code that is checking the array bounds uses a stale size value,
then you can break security and/or crash the JVM. The latter possibilities --
not unreasonably -- are considered unreasonable ;-)

-- chris
 
Chris Uppal

Mike said:
new String(Reader rdr, int maxLength)

One problem with that is that it creates a dependency from the -- very core --
String class to the -- rather non-core -- IO classes. I think that would be
undesirable, even though your proposed methods otherwise make a lot of sense.

(Incidentally, I've just realised that similar thinking may underlie the
absence of a form of String.split() which takes a compiled regexp rather than a
String.)
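
For what it's worth, the compiled-regexp form of split does exist, just on the other side of the dependency: java.util.regex.Pattern has a split(CharSequence) method. A short sketch (PrecompiledSplit and fields are illustrative names):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compile once and reuse, instead of passing the regex as a String on every call.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    static String[] fields(CharSequence line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("a , b,c"))); // prints "[a, b, c]"
    }
}
```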

It's a bit awkward really. It would be nice to have such things, but when (a)
Java lacks "open" classes, and (b) String is declared final, there's not a lot
of room for manoeuvre.

It would be nice to create alternative kinds of String with different internal
implementations -- such as using a UTF-32 or UTF-8 encoding internally, or
using a variable-sized collection of char[] arrays to hold their data. Sadly
we cannot. I don't think that making String final was such a silly idea /at
the time/ but in retrospect I think it was an unfortunate choice.

-- chris
 
Robert Klemme

Chris said:
One problem with that is that it creates a dependency from the --
very core -- String class to the -- rather non-core -- IO classes.
I think that would be undesirable, even though your proposed methods
otherwise make a lot of sense.

Totally agree.

(Incidentally, I've just realised that similar thinking may underlie
the absence of a form of String.split() which takes a compiled regexp
rather than a String.)

Yeah, likely.

It's a bit awkward really. It would be nice to have such things, but
when (a) Java lacks "open" classes, and (b) String is declared final,
there's not a lot of room for manoeuvre.

That's a euphemism. :)

It would be nice to create alternative kinds of String with different
internal implementations -- such as using a UTF-32 or UTF-8 encoding
internally, or using a variable-sized collection of char[] arrays to
hold their data. Sadly we cannot. I don't think that making String
final was such a silly idea /at the time/ but in retrospect I think
it was an unfortunate choice.

I'm not so sure. After all, what's the advantage of subclassing String
when there's CharSequence? Granted, it came in quite late and quite a few
methods accept only String arguments instead of CharSequence. But
personally I have never run into a situation where I actually wished I had a
String with these properties. YMMV though.

Kind regards

robert
 
Chris Uppal

Robert said:
It would be nice to create alternative kinds of String with different
internal implementations -- such as using a UTF-32 or UTF-8 encoding
internally, or using a variable-sized collection of char[] arrays to
hold their data. Sadly we cannot. I don't think that making String
final was such a silly idea /at the time/ but in retrospect I think
it was an unfortunate choice.

I'm not so sure. After all, what's the advantage of subclassing String
when there's CharSequence? Granted, it came in quite late and quite some
methods provide only for String arguments instead of CharSequence.

Agreed, up to a point. My reservations being: (A) CharSequence is little used
in practice, and I don't see much chance of that changing. (B) "CharSequence"
is a silly name for wide use; "String" is the only reasonable name. (C) The
CharSequence interface is too narrow -- it doesn't correspond to the abstract
API of a String.

(Which, now I come to think of it, is quite a long list of reservations --
perhaps I shouldn't have started by saying "Agreed" ;-)

-- chris
 
Robert Klemme

Chris said:
Robert said:
It would be nice to create alternative kinds of String with
different internal implementations -- such as using a UTF-32 or
UTF-8 encoding internally, or using a variable-sized collection of
char[] arrays to hold their data. Sadly we cannot. I don't think
that making String final was such a silly idea /at the time/ but in
retrospect I think it was an unfortunate choice.

I'm not so sure. After all, what's the advantage of subclassing
String when there's CharSequence? Granted, it came in quite late
and quite some methods provide only for String arguments instead of
CharSequence.

Agreed, up to a point. My reservations being: (A) CharSequence is
little used in practice, and I don't see much chance of that
changing.

I guess that's a problem of education caused by the late arrival of
CharSequence.

(B) "CharSequence" is a silly name for wide use; "String"
is the only reasonable name.

Well, String is definitely much better although, given the circumstances,
I find CharSequence describes pretty much what it's about - again, the
late arrival...

(C) The CharSequence interface is too
narrow -- it doesn't correspond to the abstract API of a String.

I on the other hand think that String's interface is bloated. There's a
lot of stuff in there that probably doesn't belong there. CharSequence
contains really the basic stuff for an immutable string, but String
contains a lot of string processing that IMHO doesn't necessarily belong
there (split() for example, maybe concat() and replace*(), and other
methods, some of which have been flagged deprecated). But it's certainly
debatable. While for example a general indexOf algorithm could well be
put into some class as a static method which relies solely on CharSequence,
implementations for String are likely much more efficient if implemented
in class String itself. Library design isn't easy... :)
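
A minimal sketch of such a general indexOf, written as a static method that relies only on CharSequence. CharSequences is a hypothetical utility class, and the naive search here makes no attempt at the efficiency a String-internal version could achieve:

```java
import java.nio.CharBuffer;

class CharSequences {
    // Naive search using only length() and charAt(), so it works on any CharSequence.
    static int indexOf(CharSequence haystack, CharSequence needle) {
        int n = haystack.length();
        int m = needle.length();
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            while (j < m && haystack.charAt(i + j) == needle.charAt(j)) {
                j++;
            }
            if (j == m) {
                return i;   // all m chars matched starting at i
            }
        }
        return -1;          // not found
    }

    public static void main(String[] args) {
        System.out.println(indexOf("hello world", "world"));          // prints 6
        System.out.println(indexOf(new StringBuilder("abc"), "d"));   // prints -1
        System.out.println(indexOf(CharBuffer.wrap("xyz"), "z"));     // prints 2
    }
}
```
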
(Which, now I come to think of it, is quite a long list of
reservations -- perhaps I shouldn't have started by saying "Agreed"
;-)

LOL

Kind regards

robert
 
Googmeister

Chris said:
I don't think that making String final was such a silly idea /at
the time/ but in retrospect I think it was an unfortunate choice.

The Java designers wanted String to be immutable, and this is
probably the reason it is final. Same with Integer, etc.

But maybe I am wrong, since StringBuilder and StringBuffer
are also final, even though they are mutable.
 
Mike Schilling

Chris Uppal said:
One problem with that is that it creates a dependency from the -- very
core -- String class to the -- rather non-core -- IO classes. I think that
would be undesirable, even though your proposed methods otherwise make a
lot of sense.

As opposed to System.out and Exception.printStackTrace(Writer) :)

Java implementations must, by license, implement java.io every bit as fully
as java.lang.
 
Mike Schilling

Googmeister said:
The Java designers wanted String to be immutable, and this is
probably the reason it is final. Same with Integer, etc.

But maybe I am wrong, since StringBuilder and StringBuffer
are also final, even though they are mutable.

They are (or were pre-1.5 at least) intimate enough with String that making
them non-final would have introduced holes into String's immutability.
 
Thomas Hawtin

Mike said:
They are (or were pre-1.5 at least) intimate enough with String that making
them non-final would have introduced holes into String's immutability.

That doesn't explain why StringBuilder is final.

There are other security reasons for requiring final. Say I had a
security conscious class and I allowed some method that took a
StringBuilder and appended some private objects to it. Now imagine
someone malicious comes along and overrides
StringBuilder.append(Object). Malicious code now has access to my
sensitive object. Presumably that is why ObjectOutputStream.writeObject
is final, and why writeObjectOverride and the auditSubclass nonsense were
introduced.
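
The attack is easy to sketch if you imagine StringBuilder were not final. Everything below is hypothetical: OpenBuilder stands in for a non-final StringBuilder, LeakyBuilder is the malicious subclass, and Vault is the security-conscious class that only ever meant to expose the key's string form:

```java
import java.util.ArrayList;
import java.util.List;

class SubclassLeak {
    // Hypothetical non-final stand-in for StringBuilder (the real class is final).
    static class OpenBuilder {
        private final StringBuilder text = new StringBuilder();

        OpenBuilder append(Object o) {
            text.append(o);
            return this;
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Hostile subclass: keeps a reference to every object passed through append.
    static class LeakyBuilder extends OpenBuilder {
        final List<Object> captured = new ArrayList<>();

        @Override
        OpenBuilder append(Object o) {
            captured.add(o);
            return super.append(o);
        }
    }

    // Trusting class: hands its private object to whatever builder it is given.
    static class Vault {
        private final Object secretKey = new Object() {
            @Override
            public String toString() {
                return "****";
            }
        };

        void describe(OpenBuilder out) {
            out.append(secretKey);
        }

        boolean isKey(Object o) {
            return o == secretKey;
        }
    }

    public static void main(String[] args) {
        Vault vault = new Vault();
        LeakyBuilder leaky = new LeakyBuilder();
        vault.describe(leaky);
        System.out.println("text seen:    " + leaky);
        System.out.println("key captured: " + vault.isKey(leaky.captured.get(0)));
    }
}
```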

Tom Hawtin
 
Roedy Green

I on the other hand think that String's interface is bloated.

So where should such methods go?

What are the disadvantages of putting them on String?

The coding convenience of having all the methods at hand on String is a
big plus.
 
Roedy Green

But maybe I am wrong, since StringBuilder and StringBuffer
are also final, even though they are mutable.

Two thoughts. When reading code, you have no doubts about what a
String or StringBuilder is up to. As soon as you take off the final,
all bets are off. This is too big a temptation for unmaintainable
code. You need something solid and unshifting for your foundations.
If you want your own String or StringBuilder you can write your own,
cannibalise even, which then is clearly different. They are not that
complicated underneath.

The other reason is speed. It is highly convenient for HotSpot to
know that code is final, not just temporarily final until some dynamic
class loads and upsets the apple cart. Imagine what havoc a custom
StringBuilder class being loaded could do to all the finely optimised
HotSpot native code for StringBuilder. It also allows special tuning
for String and StringBuilder, knowing it will not be meddled with by
overriding.
 
Thomas Hawtin

Roedy said:
The other reason is speed. It is highly convenient for hotspot to
know that code is final, not just temporarily final until some dynamic
class loads and upsets the apple cart. Imagine what havoc a custom
StringBuilder class being loaded could do to all the finely optimised
HotSpot native code for StringBuilder. It also allows special tuning
for String and StringBuilder, knowing it will not be meddled with by
overriding.

Actually, HotSpot ignores the final flag (although a few methods are
recognised as intrinsics, which may make a difference). And it's quite
happy to inline code called through interfaces.

Tom Hawtin
 
