Slow parsing of byte array to String


Eric Razny

Hi.
I've got a problem parsing pre-saved data (bytes storing UTF-16BE strings)
as String.

Part of the code is:

byte buf[] = new byte[L_LENGTH];
// [snip] ... buf is filled with raw data; some parts are UTF-16BE strings
String S = new String(buf, 20, 54, "UTF-16BE");


Of course it works, but it's painfully slow, and Usenet and web searches
were unsuccessful.

Does someone have a trick to do the job faster?

Thanks,

Eric
 

John C. Bollinger

Eric said:
I've got a problem parsing pre-saved data (bytes storing UTF-16BE strings)
as String.

Part of the code is:

byte buf[] = new byte[L_LENGTH];
// [snip] ... buf is filled with raw data; some parts are UTF-16BE strings
String S = new String(buf, 20, 54, "UTF-16BE");


Of course it works, but it's painfully slow, and Usenet and web searches
were unsuccessful.

Does someone have a trick to do the job faster?

It depends on how you define "the job". I have trouble believing that
one invocation of the constructor you are using takes a
human-discernible amount of time, so you must be doing many such
decodings. That being the case, it is unclear how you are certain that
your problem is in that constructor, as opposed to anywhere else in the
data processing procedure.

Have you profiled the application? It is not worth your time or mine to
speculate on how to speed it up without knowing precisely where all the
cycles are going. If you *have* profiled it then we can guide you
better based on the full results, whether or not it turns out to be the
String constructor that is eating all the time.
 

RiCaRdO

Are you using the + operator on Strings?
Lots of string concatenation causes the Java garbage collector to work
overtime.
Try using a StringBuffer instead?

Eric said:
I've got a problem parsing pre-saved data (bytes storing UTF-16BE strings)
as String.

Part of the code is:

byte buf[] = new byte[L_LENGTH];
// [snip] ... buf is filled with raw data; some parts are UTF-16BE strings
String S = new String(buf, 20, 54, "UTF-16BE");


Of course it works, but it's painfully slow, and Usenet and web searches
were unsuccessful.

Does someone have a trick to do the job faster?

John said:
It depends on how you define "the job". I have trouble believing that
one invocation of the constructor you are using takes a
human-discernible amount of time, so you must be doing many such
decodings. That being the case, it is unclear how you are certain that
your problem is in that constructor, as opposed to anywhere else in the
data processing procedure.

Have you profiled the application? It is not worth your time or mine to
speculate on how to speed it up without knowing precisely where all the
cycles are going. If you *have* profiled it then we can guide you
better based on the full results, whether or not it turns out to be the
String constructor that is eating all the time.
 

Chris Uppal

John said:
It depends on how you define "the job". I have trouble believing that
one invocation of the constructor you are using takes a
human-discernible amount of time, [...]

Especially since the conversion from UTF16 (BE or LE) is as near trivial as
charset decoding can possibly get.

I second John's request for more information. If you don't have profiling
data then you must have some other timing information. What is it?

-- chris
 

Eric Razny

On Tue, 24 Jan 2006 09:30:27 +0000, Chris Uppal wrote:
John said:
It depends on how you define "the job". I have trouble believing that
one invocation of the constructor you are using takes a
human-discernible amount of time, [...]

Especially since the conversion from UTF16 (BE or LE) is as near trivial
as charset decoding can possibly get.

I second John's request for more information. If you don't have profiling
data then you must have some other timing information. What is it?

John, Chris, thanks for your replies.

I have profiled the application in the most basic way I could: the difference
between getTimeInMillis() at the end and at the beginning of the program.

I stripped out all the real stuff to keep only the parts I suspected to
be the bottleneck of the application. The remaining code is:

/////// CODE //////

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.GregorianCalendar;

public class Test {
    public static void main(String args[]) {

        String fileName = "/tmp/test.bin"; // file size is 22MB : 22645840
        GregorianCalendar start = new GregorianCalendar();
        try {
            FileInputStream fs = new FileInputStream(fileName);
            FileChannel fchan = fs.getChannel();
            int sz = (int) fchan.size();
            MappedByteBuffer mbuf = fchan.map(FileChannel.MapMode.READ_ONLY, 0, sz);

            ByteBuffer ibuf = mbuf.asReadOnlyBuffer();
            byte tmpBuf[] = new byte[212];
            while (ibuf.hasRemaining()) {
                ibuf.get(tmpBuf);
                String Part1 = new String(tmpBuf, 20, 54, "UTF-16BE");  // comment this line out for profiling
                String Part2 = new String(tmpBuf, 74, 132, "UTF-16BE"); // comment this line out for profiling
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("Duration in ms : "
            + (new GregorianCalendar().getTimeInMillis() - start.getTimeInMillis()));
    }
}

/////// END OF CODE //////

Platform :
java version "1.5.0_01"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_01-b08)
Java HotSpot(TM) Client VM (build 1.5.0_01-b08, mixed mode, sharing)
Linux 2.6, debian sarge, via processor & chipset.

If I launch the prog the result is:
Duration in ms : 4618 (no significant differences between tests)

Then, when I comment out the two lines flagged above the result is:
Duration in ms : 383 (no significant differences between tests)

The 106820 loops are processed faster on... faster computers :) but I must
keep this platform. Since the 4+ seconds come on top of other, incompressible
delays, and a human is waiting for the result, I'd appreciate any help
reducing this bottleneck.

Eric

PS: sorry for the 3 lines of more than 80 chars.
 

Chris Uppal

Eric said:
If I launch the prog the result is:
Duration in ms : 4618 (no significant differences between tests)

Then, when I comment out the two lines flagged above the result is:
Duration in ms : 383 (no significant differences between tests)

That's odd. I've just created a 22,645,840 byte test file which is just the
string 0123456789 repeated enough times (as UTF-16BE) to fill the file (since
the conversion from UTF-16BE is trivial, it shouldn't matter what the actual
characters are).

With that, and using JDK 1.5.0, the test runs on this 1.5 GHz Win XP machine in
about 1.1 seconds. If I comment out the string creation lines then it runs in
about 0.46 seconds. (Or, if I run under JDK 1.4.2, in about 0.2 seconds)

Trying again using JDK 1.4.2 on a 2.4 GHz SuSE 8 box takes around 1.3 and 0.2
seconds respectively. Pretty much the same as 1.4.2 on the Windows box.

The performance I'm seeing seems quite reasonable to me. I have no idea at all
why the decoding is taking so much longer on your machine.

If the choice of character encoding cannot change, and if the data is under
your control (so you know it is always valid without checking) then you /might/
be able to avoid the strange bottleneck by hardcoding your own UTF16-BE
decoder. Something like the following:

// hardwired UTF16-BE decoding
// Note:
// This DOES NOT CHECK for invalid UTF16 sequences
// and so may allow security breaches.
// This does not check that byteCount makes sense.
// This HAS NOT BEEN TESTED on non-ASCII input.
private String decodeUTF16BE(byte[] bytes, int startByte, int byteCount)
{
    int charCount = byteCount / 2;
    StringBuilder builder = new StringBuilder(charCount);
    for (int i = 0; i < charCount; i++)
    {
        byte high = bytes[startByte++]; // big-endian: high-order byte first
        byte low = bytes[startByte++];
        int ch = ((high & 0xFF) << 8) | (low & 0xFF);
        builder.append((char) ch);
    }
    return builder.toString();
}

For me, that's about 2x faster than going through the system-supplied code (in
part, I assume, because of the missing checks, and possibly also because I've
got it wrong).
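[Editor's note] A quick way to sanity-check a hand-rolled decoder like the one above is to compare its output with the library decode on known input. A minimal, self-contained harness (the class name and the static variant of the method are mine, not from the thread) might look like:

```java
public class DecodeCheck {
    // Hand-rolled UTF-16BE decode in the spirit of the post above (no validity checks).
    static String decodeUTF16BE(byte[] bytes, int startByte, int byteCount) {
        int charCount = byteCount / 2;
        StringBuilder builder = new StringBuilder(charCount);
        for (int i = 0; i < charCount; i++) {
            byte high = bytes[startByte++]; // big-endian: high-order byte first
            byte low = bytes[startByte++];
            builder.append((char) (((high & 0xFF) << 8) | (low & 0xFF)));
        }
        return builder.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] raw = "0123456789".getBytes("UTF-16BE"); // 20 bytes of test data
        String viaLibrary = new String(raw, 0, raw.length, "UTF-16BE");
        String viaHandRolled = decodeUTF16BE(raw, 0, raw.length);
        System.out.println(viaLibrary.equals(viaHandRolled)); // prints true
    }
}
```

Running it with non-ASCII test strings as well would cover the case the comments above flag as untested.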

Oh, BTW, you can get the current time in milliseconds straight from
System.currentTimeMillis() without having to mess around with
GregorianCalendars. You can also (since 1.5) get a finer-grained timer with
System.nanoTime().
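[Editor's note] As a sketch, the GregorianCalendar timing in the test program could be replaced like this (the comment marks where the measured work would go):

```java
public class TimingDemo {
    public static void main(String[] args) {
        long startMs = System.currentTimeMillis(); // millisecond clock, all JDK versions
        long startNs = System.nanoTime();          // finer-grained timer, JDK 1.5+

        // ... the work being measured goes here ...

        System.out.println("Duration in ms : " + (System.currentTimeMillis() - startMs));
        System.out.println("Duration in ns : " + (System.nanoTime() - startNs));
    }
}
```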

-- chris
 

Eric Razny

On Tue, 24 Jan 2006 15:25:15 +0000, Chris Uppal wrote:

[snip: speed is better on your box than on mine]

The box has a Via chipset at 600 MHz :)
Yours is simply faster, but the ratio still sounds bad to me.

If the choice of character encoding cannot change, and if the data is
under your control (so you know it is always valid without checking)

That's the case. I guess the checking burns most of the time.

then you /might/ be able to avoid the strange bottleneck by hardcoding your
own UTF16-BE decoder. Something like the following:

// hardwired UTF16-BE decoding
// Note:
// This DOES NOT CHECK for invalid UTF16 sequences
// and so may allow security breaches.
// This does not check that byteCount makes sense.
// This HAS NOT BEEN TESTED on non-ASCII input.
private String decodeUTF16BE(byte[] bytes, int startByte, int byteCount)
{
    int charCount = byteCount / 2;
    StringBuilder builder = new StringBuilder(charCount);
    for (int i = 0; i < charCount; i++)
    {
        byte high = bytes[startByte++];
        byte low = bytes[startByte++];
        int ch = ((high & 0xFF) << 8) | (low & 0xFF);
        builder.append((char) ch);
    }
    return builder.toString();
}
For me, that's about 2x faster than going through the system-supplied code
(in part, I assume, because of the missing checks, and possibly also
because I've got it wrong).

Argh! This is the kind of thing I'm searching for... but it must still be
usable on a 1.4.2 JVM :(

Many thanks anyway.

Oh, BTW, you can get the current time in milliseconds straight from
System.currentTimeMillis() without having to mess around with
GregorianCalendars. You can also (since 1.5) get a finer-grained timer
with System.nanoTime().

Yep, it's a leftover from an old, old test, with an automatic paste from my
editor :)

Eric.
 

Chris Uppal

Eric said:
Argh! This is the kind of thing I'm searching for... but it must still be
usable on 1.4.2 JVM :(

Well, you could just change StringBuilder to StringBuffer, but -- at least for
my quick test -- the synchronised append() method is slow enough to lose most
of the gains from do-it-yourself UTF16 decoding.

Probably the easiest thing, at least to try, is to build up the data in a
char[] array instead:

...
int charCount = byteCount / 2;
char[] buffer = new char[charCount];
for (int i = 0; i < charCount; i++)
{
    byte high = bytes[startByte++]; // big-endian: high-order byte first
    byte low = bytes[startByte++];
    int ch = ((high & 0xFF) << 8) | (low & 0xFF);
    buffer[i] = (char) ch;
}

return new String(buffer);

Somewhat to my surprise, that works (on my machine) rather faster than the
version which uses a StringBuilder -- despite the extra array allocation and
copy. I also tried using a statically allocated buffer, to avoid the extra
allocation (but not the extra copy), but there didn't seem to be any
performance edge over the simple version (and mutable static variables are not
a good idea anyway, unless you are willing to synchronise all access to them).
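[Editor's note] Another 1.4-compatible variation worth trying (not benchmarked in this thread, so purely a sketch) is to create the UTF-16BE CharsetDecoder once and reuse it, which avoids the per-call charset lookup hidden inside new String(bytes, off, len, "UTF-16BE"). The class below is illustrative; note that a CharsetDecoder holds internal state, so the instance must not be shared between threads.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class ReusedDecoder {
    // One decoder instance, reused across calls (java.nio.charset exists since JDK 1.4).
    private final CharsetDecoder decoder = Charset.forName("UTF-16BE").newDecoder();
    // Output buffer sized for the records in this thread (at most 66 chars per field).
    private final CharBuffer out = CharBuffer.allocate(256);

    public String decode(byte[] bytes, int offset, int length) throws Exception {
        decoder.reset();
        out.clear();
        decoder.decode(ByteBuffer.wrap(bytes, offset, length), out, true);
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        ReusedDecoder d = new ReusedDecoder();
        byte[] raw = "0123456789".getBytes("UTF-16BE");
        System.out.println(d.decode(raw, 0, raw.length)); // prints 0123456789
    }
}
```

Unlike the hand-rolled loop, this keeps the library's validity checking, so it is a safer fallback if the input is ever not fully trusted.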

I don't know whether that depends (for its performance) on optimisations that
only a 1.5 JVM performs, but it'll work on 1.4 and is probably worth a quick
test. BTW, are you using the -server flag?

-- chris
 

Eric Razny

On Tue, 24 Jan 2006 19:38:08 +0000, Chris Uppal wrote:
Well, you could just change StringBuilder to StringBuffer, but -- at least
for my quick test -- the synchronised append() method is slow enough to
lose most of the gains from do-it-yourself UTF16 decoding.

I lost more than that in my case! The resulting program was slower than
the original one.

Probably the easiest thing, at least to try, is to build up the data in a
char[] array instead:

...
int charCount = byteCount / 2;
char[] buffer = new char[charCount];
for (int i = 0; i < charCount; i++)
{
    byte high = bytes[startByte++];
    byte low = bytes[startByte++];
    int ch = ((high & 0xFF) << 8) | (low & 0xFF);
    buffer[i] = (char) ch;
}

return new String(buffer);


Great! Wonderful! Marvel... ahem, OK, it works! :)

Total time is now 1395 ms, not 4618 any more. If we subtract the fixed 383 ms
(yes, I know, there's a 100 ms resolution timer) this gives:
1012 vs 4235. A 4x gain.
Somewhat to my surprise, that works (on my machine) rather faster than the
version which uses a StringBuilder -- despite the extra array allocation
and copy. I also tried using a statically allocated buffer, to avoid the
extra allocation (but not the extra copy), but there didn't seem to be any
performance edge over the simple version

Tried the same thing on my computer with the same result: little if any
improvement. I didn't try side effects with "uncommon" UTF-16 chars, but
it's perfect for me as I know what can be stored in the file.
(and mutable static variables are
not a good idea anyway, unless you are willing to synchronise all access
to them).

And lose even more time? :)
I don't know whether that depends (for its performance) on optimisations
that only a 1.5 JVM performs, but it'll work on 1.4 and is probably worth
a quick test. BTW, are you using the -server flag?

For some distribution reasons, no. But for this short code the -server
flag results in a 100 ms penalty! It seems I need to "unlearn" some
obvious things about Java :)

Many thanks for your help.

Eric
 
