Slow parsing of byte array to String

Discussion in 'Java' started by Eric Razny, Jan 24, 2006.

  1. Eric Razny

    Eric Razny Guest

    Hi.
    I've got a problem parsing pre-saved datas (bytes storing UTF-16BE string)
    as String.

    A part of the code is as :

    byte buf[]=new byte[L_LENGTH];
    //[snap] ... buf is filled with raw datas, some parts are UTF-16BE strings
    String S=new String(buf, 20, 54, "UTF-16BE");


    Of course it works well but it's painfully slow and usenet and web search
    was unsuccessfull.

    Does someone has a trick to do the job faster?

    Thanks,

    Eric
     
    Eric Razny, Jan 24, 2006
    #1
    1. Advertising

  2. Eric Razny wrote:
    > I've got a problem parsing pre-saved datas (bytes storing UTF-16BE string)
    > as String.
    >
    > A part of the code is as :
    >
    > byte buf[]=new byte[L_LENGTH];
    > //[snap] ... buf is filled with raw datas, some parts are UTF-16BE strings
    > String S=new String(buf, 20, 54, "UTF-16BE");
    >
    >
    > Of course it works well but it's painfully slow and usenet and web search
    > was unsuccessfull.
    >
    > Does someone has a trick to do the job faster?


    It depends on how you define "the job". I have trouble believing that
    one invocation of the constructor you are using takes a
    human-discernible amount of time, so you must be doing many such
    decodings. That being the case, it is unclear how you are certain that
    your problem is in that constructor, as opposed to anywhere else in the
    data processing procedure.

    Have you profiled the application? It is not worth your time or mine to
    speculate on how to speed it up without knowing precisely where all the
    cycles are going. If you *have* profiled it then we can guide you
    better based on the full results, whether or not it turns out to be the
    String constructor that is eating all the time.

    --
    John Bollinger
     
    John C. Bollinger, Jan 24, 2006
    #2
    1. Advertising

  3. Eric Razny

    RiCaRdO Guest

    Are you using the + operator on Strings?
    Lots of string concatenation causes the Java garbage collector to work
    over time.
    try using a StringBuffer instead?


    John C. Bollinger wrote:
    > Eric Razny wrote:
    > > I've got a problem parsing pre-saved datas (bytes storing UTF-16BE string)
    > > as String.
    > >
    > > A part of the code is as :
    > >
    > > byte buf[]=new byte[L_LENGTH];
    > > //[snap] ... buf is filled with raw datas, some parts are UTF-16BE strings
    > > String S=new String(buf, 20, 54, "UTF-16BE");
    > >
    > >
    > > Of course it works well but it's painfully slow and usenet and web search
    > > was unsuccessfull.
    > >
    > > Does someone has a trick to do the job faster?

    >
    > It depends on how you define "the job". I have trouble believing that
    > one invocation of the constructor you are using takes a
    > human-discernible amount of time, so you must be doing many such
    > decodings. That being the case, it is unclear how you are certain that
    > your problem is in that constructor, as opposed to anywhere else in the
    > data processing procedure.
    >
    > Have you profiled the application? It is not worth your time or mine to
    > speculate on how to speed it up without knowing precisely where all the
    > cycles are going. If you *have* profiled it then we can guide you
    > better based on the full results, whether or not it turns out to be the
    > String constructor that is eating all the time.
    >
    > --
    > John Bollinger
    >
     
    RiCaRdO, Jan 24, 2006
    #3
  4. Eric Razny

    Chris Uppal Guest

    John C. Bollinger wrote:

    > It depends on how you define "the job". I have trouble believing that
    > one invocation of the constructor you are using takes a
    > human-discernible amount of time, [...]


    Especially since the conversion from UTF16 (BE or LE) is as near trivial as
    charset decoding can possibly get.

    I second John's request for more information. If you don't have profiling data
    then you must have some other timing information. What is it ?

    -- chris
     
    Chris Uppal, Jan 24, 2006
    #4
  5. Eric Razny

    Eric Razny Guest

    Le Tue, 24 Jan 2006 09:30:27 +0000, Chris Uppal a écrit :

    > John C. Bollinger wrote:
    >
    >> It depends on how you define "the job". I have trouble believing that
    >> one invocation of the constructor you are using takes a
    >> human-discernible amount of time, [...]

    >
    > Especially since the conversion from UTF16 (BE or LE) is as near trivial
    > as charset decoding can possibly get.
    >
    > I second John's request for more information. If you don't have profiling
    > data then you must have some other timing information. What is it ?


    John, Chris, thanks for your replies.

    I have parsed the application the more basically I can do : difference
    between getTimeInMillis() from end to begin of prog.

    I stripped out all the real stuff to keep only the parts I suspected to
    be the bottleneck of the application. The remaining code is :

    /////// CODE //////

    import java.io.FileInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.GregorianCalendar;

    public class Test{
    public static void main(String args[]){

    int sum = 0;
    String fileName="/tmp/test.bin"; //file size is 22MB : 22645840
    GregorianCalendar start=new GregorianCalendar();
    try {


    FileInputStream fs = new FileInputStream(fileName);
    FileChannel fchan = null;
    fchan = fs.getChannel();
    int sz = (int)fchan.size();
    MappedByteBuffer mbuf = null;
    mbuf =
    fchan.map(FileChannel.MapMode.READ_ONLY,
    0, sz);

    byte tpb = 0;
    ByteBuffer ibuf = mbuf.asReadOnlyBuffer();
    byte tmpBuf[]=new byte[212];
    while (ibuf.hasRemaining()) {
    ibuf.get(tmpBuf);
    String Part1=new String(tmpBuf, 20, 54, "UTF-16BE"); // comment this line out for profiling
    String Part2=new String(tmpBuf, 74, 132, "UTF-16BE"); // comment this line out for profiling
    }

    }catch(Exception e){e.printStackTrace();}
    System.out.println("Duration in ms : "+(new GregorianCalendar().getTimeInMillis()-start.getTimeInMillis()));
    }
    }
    /////// END OF CODE //////

    Platform :
    java version "1.5.0_01"
    Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_01-b08)
    Java HotSpot(TM) Client VM (build 1.5.0_01-b08, mixed mode, sharing)
    Linux 2.6, debian sarge, via processor & chipset.

    If i launch the prog the result is :
    Duration in ms : 4618 (no significant differences between tests)

    Then, when i comment out the two lines flagged above the result is :
    Duration in ms : 383 (no significant differences between tests)

    The 106820 loops are processed faster on... faster computers :) but I must
    keep this platform. Since the 4+ seconds must be added to uncompressible
    other ones and a human is waiting for a result i'll appreciate any
    help to reduce this bottleneck.

    Eric

    PS : sorry for the 3 lines of more than 80 chars.
     
    Eric Razny, Jan 24, 2006
    #5
  6. Eric Razny

    Chris Uppal Guest

    Eric Razny wrote:

    > If i launch the prog the result is :
    > Duration in ms : 4618 (no significant differences between tests)
    >
    > Then, when i comment out the two lines flagged above the result is :
    > Duration in ms : 383 (no significant differences between tests)


    That's odd. I've just created a 22,645,840 byte test file which is just the
    string 0123456789 repeated enough times (as UTF16BE) to fill the file (since
    the conversion from UTF16BE is trivial it shouldn't matter what the actual
    characters are).

    With that, and using JDK 1.5.0, the test runs on this 1.5 GHz Win XP machine in
    about 1.1 seconds. If I comment out the string creation lines then it runs in
    about 0.46 seconds. (Or, if I run under JDK 1.4.2, in about 0.2 seconds)

    Trying again using JDK 1.4.2 on a 2.4 GHz SuSE 8 box takes around 1.3 and 0.2
    seconds respectively. Pretty much the same as 1..4.2 on the Windows box.

    The performance I'm seeing seems quite reasonable to me. I have no idea at all
    why the decoding is taking so much longer on your machine.

    If the choice of character encoding cannot change, and if the data is under
    your control (so you know it is always valid without checking) then you /might/
    be able to avoid the strange bottleneck by hardcoding your own UTF16-BE
    decoder. Something like the following:

    // hardwired UTF16-BE decoding
    // Note:
    // This DOES NOT CHECK for invalid UTF16 sequences
    // and so may allow security breaches.
    // This does not check that byteCount makes sense.
    // This HAS NOT BEEN TESTED on non-ASCII input.
    private String
    decodeUTF16BE(byte[] bytes, int startByte, int byteCount)
    {
    int charCount = byteCount / 2;
    StringBuilder builder = new StringBuilder(charCount);
    for (int i = 0; i < charCount; i++)
    {
    byte low = bytes[startByte++];
    byte high = bytes[startByte++];
    int ch = ((low & 0xFF) << 8 ) | (high & 0xFF);
    builder.append((char)ch);
    }

    return builder.toString();
    }

    For me, that's about 2x faster than going through the system-supplied code (in
    part, I assume, because of the missing checks, and possibly also because I've
    got it wrong).

    Oh, BTW, you can get the current time in milliseconds straight from
    System.currentTimeMillis() without having to mess around with
    GregorianCalendars. You can also (since 1.5) get a finer-grained timer with
    System.nanoTime()

    -- chris
     
    Chris Uppal, Jan 24, 2006
    #6
  7. Eric Razny

    Eric Razny Guest

    Le Tue, 24 Jan 2006 15:25:15 +0000, Chris Uppal a écrit :

    [Snap Speed is better on your box than on mine]

    The box has a Via chipset at 600 Mhz :)
    Yours is simply faster, but the ratio still sounds bad to me.


    > If the choice of character encoding cannot change, and if the data is
    > under your control (so you know it is always valid without checking)


    That's the case. I guess the checking burns out most of the time.

    > then
    > you /might/ be able to avoid the strange bottleneck by hardcoding your own
    > UTF16-BE decoder. Something like the following:
    >
    > // hardwired UTF16-BE decoding
    > // Note:
    > // This DOES NOT CHECK for invalid UTF16 sequences // and so may allow
    > security breaches. // This does not check that byteCount makes sense. //
    > This HAS NOT BEEN TESTED on non-ASCII input. private String
    > decodeUTF16BE(byte[] bytes, int startByte, int byteCount) {
    > int charCount = byteCount / 2;
    > StringBuilder builder = new StringBuilder(charCount); for (int i = 0; i
    > < charCount; i++)
    > {
    > byte low = bytes[startByte++];
    > byte high = bytes[startByte++];
    > int ch = ((low & 0xFF) << 8 ) | (high & 0xFF);
    > builder.append((char)ch);
    > }
    > }
    > return builder.toString();
    > }
    > }
    > For me, that's about 2x faster than going through the system-supplied code
    > (in part, I assume, because of the missing checks, and possibly also
    > because I've got it wrong).


    Argh! This is the kind of thing I'm searching for... but it must still be
    usable on 1.4.2 JVM :(

    Many thanks anyway.


    > Oh, BTW, you can get the current time in milliseconds straight from
    > System.currentTimeMillis() without having to mess around with
    > GregorianCalendars. You can also (since 1.5) get a finer-grained timer
    > with System.nanoTime()


    Yep, still a sequel of an old, old test with an automatic paste from my
    editor :)

    Eric.
     
    Eric Razny, Jan 24, 2006
    #7
  8. Eric Razny

    Chris Uppal Guest

    Eric Razny wrote:

    > Argh! This is the kind of thing I'm searching for... but it must still be
    > usable on 1.4.2 JVM :(


    Well, you could just change StringBuilder to StringBuffer, but -- at least for
    my quick test -- the synchronised append() method is slow enough to loose most
    of the gains from do-it-yourself UTF16 decoding.

    Probably the easiest thing, at least to try, is to build up the data in a
    char[] array instead:

    ...
    int charCount = byteCount / 2;
    char[] buffer = new char[charCount];
    for (int i = 0; i < charCount; i++)
    {
    byte low = bytes[startByte++];
    byte high = bytes[startByte++];
    int ch = ((low & 0xFF) << 8 ) | (high & 0xFF);
    buffer = (char)ch;
    }

    return new String(buffer);

    Somewhat to my surprise, that works (on my machine) rather faster than the
    version which uses a StringBuilder -- despite the extra array allocation and
    copy. I also tried using a statically allocated buffer, to avoid the extra
    allocation (but not the extra copy), but there didn't seem to be any
    performance edge over the simple version (and mutable static variables are not
    a good idea anyway, unless you are willing to synchronise all access to them).

    I don't know whether that depends (for it's performance) on optimisations that
    only a 1.5 JVM performs, but it'll work on 1.4 and is probably worth a quick
    test. BTW, are you using the -server flag ?

    -- chris
     
    Chris Uppal, Jan 24, 2006
    #8
  9. Eric Razny

    Eric Razny Guest

    Le Tue, 24 Jan 2006 19:38:08 +0000, Chris Uppal a écrit :

    > Well, you could just change StringBuilder to StringBuffer, but -- at least
    > for my quick test -- the synchronised append() method is slow enough to
    > loose most of the gains from do-it-yourself UTF16 decoding.


    I lost more than that in my case! The resulting program was slower than
    the original one.


    > Probably the easiest thing, at least to try, is to build up the data in a
    > char[] array instead:
    >
    > ...
    > int charCount = byteCount / 2;
    > char[] buffer = new char[charCount];
    > for (int i = 0; i < charCount; i++)
    > {
    > byte low = bytes[startByte++];
    > byte high = bytes[startByte++];
    > int ch = ((low & 0xFF) << 8 ) | (high & 0xFF); buffer = (char)ch;
    > }
    > }
    > return new String(buffer);


    Great! Wonderfull! Marvel.. ahem, ok, it works! :)

    Total time is now 1395, not anymore 4618. If we substract the fixed 383 ms
    (Yes, i know, there's a 100ms resolution timer) this give :
    1012 vs 4235. A 4x gain.

    > Somewhat to my surprise, that works (on my machine) rather faster than the
    > version which uses a StringBuilder -- despite the extra array allocation
    > and copy. I also tried using a statically allocated buffer, to avoid the
    > extra allocation (but not the extra copy), but there didn't seem to be any
    > performance edge over the simple version


    Tried the same thing on my computer with same result, little if any
    improvment. I didn't try side effects with "uncommon" UTF-16 chars but
    it's perfect for me as I know what can be stored in the file.

    > (and mutable static variables are
    > not a good idea anyway, unless you are willing to synchronise all access
    > to them).


    And loose even more time? :)

    > I don't know whether that depends (for it's performance) on optimisations
    > that only a 1.5 JVM performs, but it'll work on 1.4 and is probably worth
    > a quick test. BTW, are you using the -server flag ?


    For some distribution reasons no. But for this short code the -server
    flag results in a 100ms penality! It seems I need to "unlearn" some
    obvious things about java :)

    Many Thanks for you help.

    Eric
     
    Eric Razny, Jan 24, 2006
    #9
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Bharat Bhushan

    Appending byte[] to another byte[] array

    Bharat Bhushan, Aug 5, 2003, in forum: Java
    Replies:
    15
    Views:
    40,409
    Roedy Green
    Aug 5, 2003
  2. Kirby
    Replies:
    3
    Views:
    671
    Kirby
    Oct 8, 2004
  3. Replies:
    20
    Views:
    9,909
    licebmi
    Sep 7, 2009
  4. Tom McGlynn
    Replies:
    4
    Views:
    880
    Mark Space
    Apr 19, 2008
  5. Patricia Shanahan
    Replies:
    0
    Views:
    408
    Patricia Shanahan
    Apr 17, 2008
Loading...

Share This Page