new String ( byte[] , encoding ) under the hood

Discussion in 'Java' started by Roedy Green, Jan 14, 2006.

  1. Roedy Green

    Roedy Green Guest

    I was curious how new String ( byte[], encoding ) could guess the
    correct size of the buffer to convert into String.

    It makes an estimate based on the number of bytes times the maximum
    number of chars per byte, an attribute of the encoding. This will be
    slightly on the high side if there are any multibyte chars, but accurate
    for Latin-1. It then decodes, and calls trim, which uses System.arraycopy
    to get a char[] of the right size. The new String then does another
    System.arraycopy.

    You leave in your wake the original byte[], two char[] and the string.
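
    The estimate, decode, trim and copy steps described above can be sketched
    with the public java.nio.charset API. This is only an illustration, not
    the JDK's internal code; the class DecodeSizing and its method names are
    hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;

// Hypothetical helper that mirrors the steps described in the post.
class DecodeSizing {
    /** Upper bound on chars produced: byte count times the decoder's maxCharsPerByte. */
    static int worstCaseChars(Charset cs, int byteCount) {
        return (int) Math.ceil(byteCount * (double) cs.newDecoder().maxCharsPerByte());
    }

    static String decode(byte[] bytes, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // First char[]: sized from the worst-case estimate.
        char[] oversized = new char[worstCaseChars(cs, bytes.length)];
        CharBuffer out = CharBuffer.wrap(oversized);
        decoder.decode(ByteBuffer.wrap(bytes), out, true);
        decoder.flush(out);
        // Second char[]: trimmed to the number of chars actually written.
        char[] trimmed = Arrays.copyOf(oversized, out.position());
        // The String constructor copies the chars once more.
        return new String(trimmed);
    }
}
```

    For a six-byte UTF-8 input that contains one two-byte character, the
    sketch reserves six chars and uses five.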

    Going the other way String -> byte uses similar logic, but the buffer
    size is not so fortunate. For UTF-8 it makes the conservative
    assumption each char might need 3 bytes, making the buffer 3 times
    bigger than it needs to be in the ordinary case.
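
    The encoding direction can be checked the same way. A minimal sketch,
    assuming the UTF-8 encoder reports 3 bytes per char as described above;
    EncodeSizing and its methods are hypothetical names.

```java
import java.nio.charset.Charset;

// Hypothetical helper comparing the worst-case estimate with the real size.
class EncodeSizing {
    /** Upper bound on bytes produced: char count times the encoder's maxBytesPerChar. */
    static int worstCaseBytes(Charset cs, int charCount) {
        return (int) Math.ceil(charCount * (double) cs.newEncoder().maxBytesPerChar());
    }

    /** Number of bytes the string really needs in the given charset. */
    static int actualBytes(String s, Charset cs) {
        return s.getBytes(cs).length;
    }
}
```

    For plain ASCII text such as "hello", the estimate is three times the
    number of bytes actually produced.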

    Sun could streamline these operations to cut out the
    intermediate objects.

    Here's an idea. Why not allow strings and char arrays etc to
    temporarily be too big. They are logically sized. Only on the next GC
    do the objects get pruned to size if need be. You would save a lot of
    copying and new object creating just to get arrays the precise correct
    size. There would be a method to prune an array to size that just
    logically chopped it and marked it for later true pruning. Most of the
    time though such objects will soon be discarded, and you then get away
    without ever doing the copy.
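
    Java has no GC-time pruning of this kind, but the "logically sized" half
    of the idea can be approximated today with a CharBuffer that wraps an
    oversized array. A minimal sketch; LogicalSize and logicalView are
    hypothetical names.

```java
import java.nio.CharBuffer;

// Hypothetical helper: a view that exposes only the used prefix of an array.
class LogicalSize {
    /** Wraps the array without copying; only the first 'used' chars are visible. */
    static CharBuffer logicalView(char[] oversized, int used) {
        return CharBuffer.wrap(oversized, 0, used);
    }
}
```

    The view is a CharSequence, so it can be passed to many APIs as-is. A
    copy happens only if something calls toString() on it.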


    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Jan 14, 2006
    #1


  2. > Here's an idea. Why not allow strings and char arrays etc to
    > temporarily be too big. They are logically sized. Only on the next GC
    > do the objects get pruned to size if need be. You would save a lot of
    > copying and new object creating just to get arrays the precise correct
    > size. There would be a method to prune an array to size that just
    > logically chopped it and marked it for later true pruning. Most of the
    > time though such objects will soon be discarded, and you then get away
    > without ever doing the copy.


    I would mainly object that doing things this way would make the garbage
    collector do work it is not normally supposed to do, that is, the
    housekeeping of the String class. The garbage collector is supposed to
    reclaim unreachable objects. It is pretty good at that. It is not
    supposed to do much else.

    If this truly and absolutely does become the bottleneck of your
    application, I would suggest doing it "by hand" in a more efficient way
    (for example, re-using one char[] as decode buffer).
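
    One way the "by hand" approach might look is sketched below. It keeps a
    single scratch buffer between calls and grows it only when needed.
    ReusableDecoder is a hypothetical class and is not thread-safe.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical decoder wrapper that reuses one char buffer across calls.
class ReusableDecoder {
    private final CharsetDecoder decoder;
    private CharBuffer scratch = CharBuffer.allocate(256);

    ReusableDecoder(Charset cs) {
        decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }

    String decode(byte[] bytes) {
        int needed = (int) Math.ceil(bytes.length * (double) decoder.maxCharsPerByte());
        if (scratch.capacity() < needed) {
            scratch = CharBuffer.allocate(needed);   // grow only when too small
        }
        scratch.clear();
        decoder.reset();
        decoder.decode(ByteBuffer.wrap(bytes), scratch, true);
        decoder.flush(scratch);
        scratch.flip();
        return scratch.toString();                   // the one unavoidable copy
    }
}
```

    Only the final String is allocated per call; the oversized intermediate
    array is shared.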
     
    Stefan Schulz, Jan 14, 2006
    #2

  3. Chris Uppal

    Chris Uppal Guest

    Roedy Green wrote:

    > Here's an idea. Why not allow strings and char arrays etc to
    > temporarily be too big. They are logically sized. Only on the next GC
    > do the objects get pruned to size if need be. You would save a lot of
    > copying and new object creating just to get arrays the precise correct
    > size. There would be a method to prune an array to size that just
    > logically chopped it and marked it for later true pruning. Most of the
    > time though such objects will soon be discarded, and you then get away
    > without ever doing the copy.


    Some GCed languages do allow you to change the size of array objects,
    and that /could/ be implemented in the way you describe (though I'm not
    sure it'd be worth it). Off the top of my head, I cannot think of a
    persuasive reason why Java does not allow dynamic resizing of arrays.

    -- chris
     
    Chris Uppal, Jan 14, 2006
    #3
  4. "Roedy Green" <> wrote in
    message news:...
    >I was curious how new String ( byte[], encoding ) could guess the
    > correct size of the buffer to convert into String.
    >
    > It makes an estimate based on number of bytes times the max number of
    > chars per byte, an attribute of the encoding. This will be slightly
    > on the high side if there are any multibyte chars, but accurate for
    > Latin-1. It then decodes, and calls trim, which uses System.arraycopy to
    > get a char[] of the right size. The new String then does another
    > System.arraycopy.


    Why doesn't it just create the String via new String(char[], int, int),
    which would eliminate the extra copy? Better still would be a String
    constructor that takes an array of character arrays and a total length
    (say, new String(char[][], int)) to eliminate the need to allocate a big
    contiguous character array in the first place. Instead, smaller buffers
    (say, 4K) could be allocated as required. String always has to copy all the
    characters to make an immutable char array, so the place to achieve savings
    would be before this.
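
    No such constructor exists, but its intended behaviour can be sketched as
    a static helper. ChunkedStrings.fromChunks is a hypothetical name, and
    unlike a real constructor inside String it still goes through a
    StringBuilder, so it shows the API shape rather than the saving.

```java
// Hypothetical stand-in for the proposed new String(char[][], int).
class ChunkedStrings {
    /** Concatenates the first totalLength chars found across the chunks. */
    static String fromChunks(char[][] chunks, int totalLength) {
        StringBuilder sb = new StringBuilder(totalLength);
        int remaining = totalLength;
        for (char[] chunk : chunks) {
            if (remaining == 0) {
                break;
            }
            int n = Math.min(chunk.length, remaining);
            sb.append(chunk, 0, n);
            remaining -= n;
        }
        if (remaining != 0) {
            throw new IllegalArgumentException("chunks hold fewer than totalLength chars");
        }
        return sb.toString();
    }
}
```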
     
    Mike Schilling, Jan 14, 2006
    #4
  5. Roedy Green

    Roedy Green Guest

    On 14 Jan 2006 13:10:06 GMT, "Chris Uppal"
    <-THIS.org> wrote, quoted or indirectly
    quoted someone who said :

    >
    >Some GCed languages do allow you to change the size of array objects,
    >and that /could/ be implemented in the way you describe (though I'm not
    >sure it'd be worth it). Off the top of my head, I cannot think of a
    >persuasive reason why Java does not allow dynamic resizing of arrays.


    Resize down should be easier than resize up. With down you don't HAVE
    to change any allocation right away. Resize down is very common. Resize
    up is usually done with ArrayList, where you allocate new RAM and copy.
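
    As things stand, both directions cost an allocation plus a copy, because
    a Java array's length is fixed at creation. A minimal sketch; Resize is a
    hypothetical helper.

```java
// Hypothetical helper: the only way to "resize" a Java array is to copy it.
class Resize {
    /** Returns a new array of newLength holding as much of src as fits. */
    static char[] resize(char[] src, int newLength) {
        char[] dst = new char[newLength];
        System.arraycopy(src, 0, dst, 0, Math.min(src.length, newLength));
        return dst;
    }
}
```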
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Jan 14, 2006
    #5
  6. Roedy Green

    Roedy Green Guest

    On Sat, 14 Jan 2006 19:34:43 GMT, "Mike Schilling"
    <> wrote, quoted or indirectly quoted
    someone who said :

    > String always has to copy all the
    >characters to make an immutable char array, the place to achieve savings
    >would be before this.


    I wonder if there could be some way to hand off a char array to be
    inserted in a string. The problem is ensuring nobody holds onto a
    reference to it.

    That way you could avoid the copy. It gets quite silly how much
    copying goes on to do the simplest things.
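
    The reason for the copy is easy to demonstrate: if String kept the
    caller's array, later writes to that array would change the String. A
    minimal sketch; DefensiveCopy is a hypothetical class.

```java
// Hypothetical demo of why new String(char[]) must copy its argument.
class DefensiveCopy {
    static String copyThenMutate() {
        char[] chars = {'c', 'a', 't'};
        String s = new String(chars);   // String takes its own copy here
        chars[0] = 'b';                 // caller still holds and mutates the array
        return s;                       // unaffected by the write above
    }
}
```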

    There needs to be some low-level, perhaps even hardware, mechanism to
    hand off a chunk of RAM in such a way the original owner can no longer
    meddle with it, and perhaps not even see it. It might be done by
    remapping a vm page to another address.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Jan 14, 2006
    #6
  7. Roedy Green wrote:
    > There needs to be some low-level, perhaps even hardware, mechanism to
    > hand off a chunk of RAM in such a way the original owner can no longer
    > meddle with it, and perhaps not even see it. It might be done by
    > remapping a vm page to another address.


    This is something you generally cannot truly control, unless you
    remap the entire page to be read-only afterwards (which is, by the way,
    far below Java's threshold when it comes to OS-specific features).

    Also, as much as it makes me sound like a spoilsport: memory copies are
    cheap. I would be extremely surprised if the string copying made up
    even 1% of your typical application's time.

    To phrase things a bit differently: Before worrying about how you
    handle your strings, worry about the performance of your XML parser,
    your database driver and GUI. ;)
     
    Stefan Schulz, Jan 15, 2006
    #7
  8. "Mike Schilling" <> wrote in message
    news:nncyf.1003$...

    >
    > Why doesn't it just create the String via new String(char[], int, int),
    > which would eliminate the extra copy? Better still would be a String
    > constructor that takes an array of character arrays and a total length,
    > (say, new String(char[][], int)) to eliminate the need to allocate a big
    > contiguous character array in the first place. Instead, smaller buffers
    > (say, 4K) could be allocated as required. String always has to copy all
    > the characters to make an immutable char array, the place to achieve
    > savings would be before this.


    Let me refine this. What I'd really like to see is

    new String(Reader rdr, int maxLength)

    Create a string from characters read from rdr, with the length of the
    resulting string being the minimum of

    1. The number of characters that can be read from rdr, and
    2. maxLength

    Likewise

    new String(InputStream strm, String encoding, int maxBytes)

    This would eliminate the need to move all of the characters into a
    contiguous array to be copied to a second contiguous array. It should be of
    great help in, for instance, XML parsers, where text that needs to be put
    into a string quite possibly crosses buffer boundaries.
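
    The semantics of the proposed Reader constructor can be sketched as a
    static helper. ReaderStrings.readString is a hypothetical name; being
    outside String, it cannot avoid the intermediate copies the proposal is
    meant to remove, so it only illustrates the length rule.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the proposed new String(Reader, int).
class ReaderStrings {
    /** Reads at most maxLength chars, stopping early if the reader runs out. */
    static String readString(Reader rdr, int maxLength) throws IOException {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be >= 0");
        }
        StringBuilder sb = new StringBuilder(Math.min(maxLength, 4096));
        char[] chunk = new char[4096];
        while (sb.length() < maxLength) {
            int want = Math.min(chunk.length, maxLength - sb.length());
            int n = rdr.read(chunk, 0, want);
            if (n < 0) {
                break;                  // reader exhausted before maxLength
            }
            sb.append(chunk, 0, n);
        }
        return sb.toString();
    }

    /** Convenience overload for in-memory text, mainly for trying it out. */
    static String readString(String text, int maxLength) {
        try {
            return readString(new StringReader(text), maxLength);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```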
     
    Mike Schilling, Jan 16, 2006
    #8
  9. Chris Uppal

    Chris Uppal Guest

    I wrote:

    > Some GCed languages do allow you to change the size of array objects,
    > and that /could/ be implemented in the way you describe (though I'm not
    > sure it'd be worth it). Off the top of my head, I cannot think of a
    > persuasive reason why Java does not allow dynamic resizing of arrays.


    Having thought more about it, I suspect that the problem is that there is no
    safe /and/ fast way of allowing multi-threaded access to an array which can
    change size.

    If you get the threading wrong for accessing an array of fixed size, then all
    that happens is that your application reads the wrong data. If the code that
    is checking the array bounds uses a stale size value, then you can break
    security and/or crash the JVM. The latter possibilities -- not unreasonably --
    are considered unreasonable ;-)

    -- chris
     
    Chris Uppal, Jan 16, 2006
    #9
  10. Chris Uppal

    Chris Uppal Guest

    Mike Schilling wrote:

    > new String(Reader rdr, int maxLength)


    One problem with that is that it creates a dependency from the -- very core --
    String class to the -- rather non-core -- IO classes. I think that would be
    undesirable, even though your proposed methods otherwise make a lot of sense.

    (Incidentally, I've just realised that similar thinking may underlie the
    absence of a form of String.split() which takes a compiled regexp rather than a
    String.)
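
    The dependency runs the other way in the existing API: the regex package
    offers the split on a compiled pattern. A minimal sketch; the class name
    SplitWithPattern and the comma pattern are illustrative choices.

```java
import java.util.regex.Pattern;

// Splitting with a precompiled pattern via java.util.regex.Pattern.split.
class SplitWithPattern {
    // Compiled once and reused, instead of recompiling on every call.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    static String[] fields(CharSequence line) {
        return COMMA.split(line);
    }
}
```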

    It's a bit awkward really. It would be nice to have such things, but when (a)
    Java lacks "open" classes, and (b) String is declared final, there's not a lot
    of room for manoeuvre.

    It would be nice to create alternative kinds of String with different internal
    implementations -- such as using a UTF-32 or UTF-8 encoding internally, or
    using a variable-sized collection of char[] arrays to hold their data. Sadly
    we cannot. I don't think that making String final was such a silly idea /at
    the time/ but in retrospect I think it was an unfortunate choice.

    -- chris
     
    Chris Uppal, Jan 16, 2006
    #10
  11. Chris Uppal wrote:
    > Mike Schilling wrote:
    >
    >> new String(Reader rdr, int maxLength)

    >
    > One problem with that is that it creates a dependency from the --
    > very core -- String class to the -- rather non-core -- IO classes.
    > I think that would be undesirable, even though your proposed methods
    > otherwise make a lot of sense.


    Totally agree.

    > (Incidentally, I've just realised that similar thinking may underlie
    > the absence of a form of String.split() which takes a compiled regexp
    > rather than a String.)


    Yeah, likely.

    > It's a bit awkward really. It would be nice to have such things, but
    > when (a) Java lacks "open" classes, and (b) String is declared final,
    > there's not a lot of room for manoeuvre.


    That's a euphemism. :)

    > It would be nice to create alternative kinds of String with different
    > internal implementations -- such as using a UTF-32 or UTF-8 encoding
    > internally, or using a variable-sized collection of char[] arrays to
    > hold their data. Sadly we cannot. I don't think that making String
    > final was such a silly idea /at the time/ but in retrospect I think
    > it was an unfortunate choice.


    I'm not so sure. After all, what's the advantage of subclassing String
    when there's CharSequence? Granted, it came in quite late and quite some
    methods provide only for String arguments instead of CharSequence. But
    personally I have never run into a situation where I actually wished I had
    a String with these properties. YMMV though.
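
    As an illustration of the CharSequence route, here is a minimal sketch of
    a string-like type that stores its data in several small char[] chunks.
    ChunkedChars is a hypothetical class; it can be handed to any API that
    accepts CharSequence, but not to one that insists on String.

```java
// Hypothetical CharSequence backed by fixed-size chunks instead of one array.
final class ChunkedChars implements CharSequence {
    private final char[][] chunks;
    private final int chunkSize;
    private final int length;

    ChunkedChars(CharSequence source, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.length = source.length();
        this.chunks = new char[(length + chunkSize - 1) / chunkSize][];
        for (int c = 0; c < chunks.length; c++) {
            int start = c * chunkSize;
            int n = Math.min(chunkSize, length - start);
            chunks[c] = new char[n];
            for (int i = 0; i < n; i++) {
                chunks[c][i] = source.charAt(start + i);
            }
        }
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return chunks[index / chunkSize][index % chunkSize];
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return toString().substring(start, end);   // simple, not copy-free
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (char[] chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString();
    }
}
```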

    Kind regards

    robert
     
    Robert Klemme, Jan 16, 2006
    #11
  12. Chris Uppal

    Chris Uppal Guest

    Robert Klemme wrote:

    > > It would be nice to create alternative kinds of String with different
    > > internal implementations -- such as using a UTF-32 or UTF-8 encoding
    > > internally, or using a variable-sized collection of char[] arrays to
    > > hold their data. Sadly we cannot. I don't think that making String
    > > final was such a silly idea /at the time/ but in retrospect I think
    > > it was an unfortunate choice.

    >
    > I'm not so sure. After all, what's the advantage of subclassing String
    > when there's CharSequence? Granted, it came in quite late and quite some
    > methods provide only for String arguments instead of CharSequence.


    Agreed, up to a point. My reservations being: (A) CharSequence is little used
    in practice, and I don't see much chance of that changing. (B) "CharSequence"
    is a silly name for wide use; "String" is the only reasonable name. (C) The
    CharSequence interface is too narrow -- it doesn't correspond to the abstract
    API of a String.

    (Which, now I come to think of it, is quite a long list of reservations --
    perhaps I shouldn't have started by saying "Agreed" ;-)

    -- chris
     
    Chris Uppal, Jan 16, 2006
    #12
  13. Chris Uppal wrote:
    > Robert Klemme wrote:
    >
    >>> It would be nice to create alternative kinds of String with
    >>> different internal implementations -- such as using a UTF-32 or
    >>> UTF-8 encoding internally, or using a variable-sized collection of
    >>> char[] arrays to hold their data. Sadly we cannot. I don't think
    >>> that making String final was such a silly idea /at the time/ but in
    >>> retrospect I think it was an unfortunate choice.

    >>
    >> I'm not so sure. After all, what's the advantage of subclassing
    >> String when there's CharSequence? Granted, it came in quite late
    >> and quite some methods provide only for String arguments instead of
    >> CharSequence.

    >
    > Agreed, up to a point. My reservations being: (A) CharSequence is
    > little used in practice, and I don't see much chance of that
    > changing.


    I guess that's a problem of education caused by the late arrival of CS.

    > (B) "CharSequence" is a silly name for wide use; "String"
    > is the only reasonable name.


    Well, String is definitely much better although, given the circumstances,
    I find CharSequence describes pretty much what it's about - again, the
    late arrival...

    > (C) The CharSequence interface is too
    > narrow -- it doesn't correspond to the abstract API of a String.


    I on the other hand think that String's interface is bloated. There's a
    lot of stuff in there that probably doesn't belong there. CharSequence
    contains really the basic stuff for an immutable string, but String
    contains a lot of string processing that IMHO doesn't necessarily belong
    there (split() for example, maybe concat() and replace*(), and other
    methods which have in part been flagged deprecated). But it's certainly
    debatable. While for example a general indexOf algorithm could well be
    put into some class as static method which solely relies on CharSequence,
    implementations for String are likely much more efficient if implemented
    in class String itself. Library design isn't easy... :)
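
    Such a general indexOf might look like the sketch below. It uses only
    length() and charAt(), so it works for any CharSequence, at the price of
    a virtual call per character. CharSequences is a hypothetical utility
    class and the search is the naive algorithm.

```java
// Hypothetical utility: a search that relies solely on the CharSequence API.
class CharSequences {
    /** Index of the first occurrence of needle in haystack, or -1 if absent. */
    static int indexOf(CharSequence haystack, CharSequence needle) {
        int n = haystack.length();
        int m = needle.length();
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            while (j < m && haystack.charAt(i + j) == needle.charAt(j)) {
                j++;
            }
            if (j == m) {
                return i;
            }
        }
        return -1;
    }
}
```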

    > (Which, now I come to think of it, is quite a long list of
    > reservations -- perhaps I shouldn't have started by saying "Agreed"
    > ;-)


    LOL

    Kind regards

    robert
     
    Robert Klemme, Jan 16, 2006
    #13
  14. Googmeister

    Googmeister Guest

    Chris Uppal wrote:

    > I don't think that making String final was such a silly idea /at
    > the time/ but in retrospect I think it was an unfortunate choice.


    The Java designers wanted String to be immutable, and this is
    probably the reason it is final. Same with Integer, etc.

    But maybe I am wrong, since StringBuilder and StringBuffer
    are also final, even though they are mutable.
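
    The claim about which classes are final is easy to check with reflection.
    A minimal sketch; FinalCheck is a hypothetical helper.

```java
import java.lang.reflect.Modifier;

// Hypothetical helper: reports whether a class is declared final.
class FinalCheck {
    static boolean isFinal(Class<?> type) {
        return Modifier.isFinal(type.getModifiers());
    }
}
```

    It returns true for String, Integer, StringBuilder and StringBuffer alike.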
     
    Googmeister, Jan 16, 2006
    #14
  15. "Chris Uppal" <-THIS.org> wrote in message
    news:43cb902d$0$87296$...
    > Mike Schilling wrote:
    >
    >> new String(Reader rdr, int maxLength)

    >
    > One problem with that is that it creates a dependency from the -- very
    > core --
    > String class to the -- rather non-core -- IO classes. I think that would
    > be
    > undesirable, even though your proposed methods otherwise make a lot of
    > sense.


    As opposed to System.out and Exception.printStackTrace(Writer) :)

    Java implementations must, by license, implement java.io every bit as fully
    as java.lang.
     
    Mike Schilling, Jan 16, 2006
    #15
  16. "Googmeister" <> wrote in message
    news:...
    >
    > Chris Uppal wrote:
    >
    >> I don't think that making String final was such a silly idea /at
    >> the time/ but in retrospect I think it was an unfortunate choice.

    >
    > The Java designers wanted String to be immutable, and this is
    > probably the reason it is final. Same with Integer, etc.
    >
    > But maybe I am wrong, since StringBuilder and StringBuffer
    > are also final, even though they are mutable.


    They are (or were pre-1.5 at least) intimate enough with String that making
    them unfinal would introduce holes into String's immutability.
     
    Mike Schilling, Jan 16, 2006
    #16
  17. Mike Schilling wrote:
    > "Googmeister" <> wrote in message
    > news:...
    >
    >>
    >>But maybe I am wrong, since StringBuilder and StringBuffer
    >>are also final, even though they are mutable.

    >
    > They are (or were pre-1.5 at least) intimate enough with String that making
    > them unfinal would introduce holes into String's immutability.


    That doesn't explain why StringBuilder is final.

    There are other security reasons for requiring final. Say I had a
    security conscious class and I allowed some method that took a
    StringBuilder and appended some private objects to it. Now imagine
    someone malicious comes along and overrides
    StringBuilder.append(Object). Malicious code now has access to my
    sensitive object. Presumably that is why ObjectOutputStream.writeObject
    is final, and why writeObjectOverride and the auditSubclass nonsense were
    introduced.
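
    The attack can be sketched with a stand-in class, since the real
    StringBuilder cannot be subclassed. Everything below (OpenBuilder,
    SnoopingBuilder, KeyRing, FinalityDemo) is hypothetical and exists only
    to show what a non-final builder would permit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical non-final builder, standing in for an "unfinal" StringBuilder.
class OpenBuilder {
    private final StringBuilder text = new StringBuilder();

    OpenBuilder append(Object o) {
        text.append(o);      // intended behaviour: keep only the text form
        return this;
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

// Malicious subclass: keeps a reference to every object passed in.
class SnoopingBuilder extends OpenBuilder {
    final List<Object> captured = new ArrayList<Object>();

    @Override
    OpenBuilder append(Object o) {
        captured.add(o);
        return super.append(o);
    }
}

// Security-conscious class that believes it only exposes text.
class KeyRing {
    private final List<String> keys = new ArrayList<String>(Arrays.asList("k1", "k2"));

    void describe(OpenBuilder out) {
        out.append("keys: ").append(keys);
    }

    int keyCount() {
        return keys.size();
    }
}

class FinalityDemo {
    /** Runs the attack and reports how many keys the ring still holds. */
    static int keysLeftAfterAttack() {
        KeyRing ring = new KeyRing();
        SnoopingBuilder snoop = new SnoopingBuilder();
        ring.describe(snoop);
        for (Object o : snoop.captured) {
            if (o instanceof List) {
                ((List<?>) o).clear();   // tamper with the private list
            }
        }
        return ring.keyCount();
    }
}
```

    With the real, final StringBuilder the same describe() call would hand
    over only the list's text, not the list itself.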

    Tom Hawtin
    --
    Unemployed English Java programmer
    http://jroller.com/page/tackline/
     
    Thomas Hawtin, Jan 16, 2006
    #17
  18. Roedy Green

    Roedy Green Guest

    On Mon, 16 Jan 2006 14:52:30 +0100, "Robert Klemme" <>
    wrote, quoted or indirectly quoted someone who said :

    >I on the other hand think that String's interface is bloated.


    So where should such methods go?

    What are the disadvantages of putting them on String?

    The coding convenience of having all the methods at hand on String is a
    big plus.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Jan 16, 2006
    #18
  19. Roedy Green

    Roedy Green Guest

    On 16 Jan 2006 06:02:27 -0800, "Googmeister" <>
    wrote, quoted or indirectly quoted someone who said :

    >
    >But maybe I am wrong, since StringBuilder and StringBuffer
    >are also final, even though they are mutable.


    Two thoughts. When reading code, you have no doubts about what a
    String or StringBuilder is up to. As soon as you take off the final,
    all bets are off. This is too big a temptation for unmaintainable
    code. You need something solid and unshifting for your foundations.
    If you want your own String or StringBuilder you can write your own,
    cannibalise even, which then is clearly different. They are not that
    complicated underneath.

    The other reason is speed. It is highly convenient for HotSpot to
    know that code is final, not just temporarily final until some dynamic
    class loads and upsets the apple cart. Imagine what havoc a custom
    StringBuilder class being loaded could do to all the finely optimised
    HotSpot native code for StringBuilder. It also allows special tuning
    for String and StringBuilder, knowing it will not be meddled with by
    overriding.

    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Jan 16, 2006
    #19
  20. Roedy Green wrote:
    >
    > The other reason is speed. It is highly convenient for HotSpot to
    > know that code is final, not just temporarily final until some dynamic
    > class loads and upsets the apple cart. Imagine what havoc a custom
    > StringBuilder class being loaded could do to all the finely optimised
    > HotSpot native code for StringBuilder. It also allows special tuning
    > for String and StringBuilder, knowing it will not be meddled with by
    > overriding.


    Actually, HotSpot ignores the final flag (although a few methods are
    recognised as intrinsics, which may make a difference). And it's quite
    happy to inline code called through interfaces.

    Tom Hawtin
    --
    Unemployed English Java programmer
    http://jroller.com/page/tackline/
     
    Thomas Hawtin, Jan 16, 2006
    #20
