NIO not so hot

Roedy Green

I did a benchmark to read a file of bytes in one I/O and then convert
it to a String:

// Using a random sample data file of 419,430,400 chars,
// 419,430,400 bytes UTF-8.
// RandomAccess 1.46 seconds
// InputStream  1.48 seconds
// NIO          1.56 seconds

NIO is great for grabbing bytes, but if you have to suck them out of
the buffer, it does a get() call on every byte.
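
Roughly the contrast, as a sketch (assumes bb has already been filled
and flipped by the caller):

import java.nio.ByteBuffer;

// Sketch: the per-byte pattern vs. a single bulk get().
static byte[] drain(ByteBuffer bb) {
    byte[] dst = new byte[bb.remaining()];
    // Slow variant: one get() call per byte.
    //   for (int i = 0; i < dst.length; i++) dst[i] = bb.get();
    // Fast variant: one bulk copy.
    bb.get(dst);
    return dst;
}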

The code is posted at http://mindprod.com/jgloss/nio.html

Robert Klemme

> I did a benchmark to read a file of bytes in one I/O and then convert
> it to a String:
>
> // Using a random sample data file of 419,430,400 chars,
> // 419,430,400 bytes UTF-8.
> // RandomAccess 1.46 seconds
> // InputStream  1.48 seconds
> // NIO          1.56 seconds
>
> NIO is great for grabbing bytes, but if you have to suck them out of
> the buffer, it does a get() call on every byte.

This is not true for all cases. For example, if the ByteBuffer and
CharBuffer have backing arrays, this method is invoked, which accesses
those arrays directly:
sun.nio.cs.UTF_8.Decoder.decodeArrayLoop(ByteBuffer, CharBuffer)

The code suffers from too much copying: in readFileAtOnceWithNIO() you
use a direct buffer, then need to copy it into a byte[] (which, by the
way, does not use an individual get() for every byte; see
java.nio.DirectByteBuffer.get(byte[], int, int)), and then you create
the String (which copies the data again, but that cannot be avoided).
If you use a heap byte buffer, one level of copying can be omitted,
because you can access the byte[] inside and create the String with
this constructor:

http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#String(byte[], int, int)
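
A minimal sketch of that shortcut, assuming the whole file fits into
one heap buffer (class and method names are made up):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class HeapBufferRead {
    static String readAll(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Heap buffer: backed by a byte[] we can hand to String directly.
            // Assumes the file is smaller than 2 GB.
            ByteBuffer bb = ByteBuffer.allocate((int) ch.size());
            int n;
            do {
                n = ch.read(bb);
            } while (n >= 0 && bb.hasRemaining());
            // No intermediate copy into a separate byte[] is needed.
            return new String(bb.array(), 0, bb.position(), StandardCharsets.UTF_8);
        }
    }
}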

However, the test is quite unrealistic since this is not how NIO is
usually used. The whole purpose of Buffer and subclasses is to read
data in chunks.

I have extended the recent test case for char decoding to include NIO.
For the NIO character decoding I created a version which does a rough
CRC calculation, so I was able to verify that my implementation read
all the characters in the proper order. You can find all the code here:

https://gist.github.com/rklemme/fad399b5e7cc4d3b6d0c

Kind regards

robert

Robert Klemme

PS: I just noticed some oddity with the ASCII file size, which is too
small. I need to check that.

The error was duplicate counting of chars during creation. This is
fixed now.

Robert Klemme

> The error was duplicate counting of chars during creation. This is
> fixed now.

I rearranged execution order a bit to group all read operations on one
file and all NIO reads with direct or heap buffer.

https://gist.github.com/rklemme/fad399b5e7cc4d3b6d0c#file-output-txt

My takeaways:

- IO and NIO have roughly the same performance for char decoding
if done properly (see the sketch below).
- Adding byte buffering to IO does not help, rather it makes
things slower.
- Reading into char[] with IO is more efficient than using
char buffering.
- NIO direct buffers are slower than heap buffers; they are
probably best used if the data does not need to reach
the Java heap (e.g. when copying a file to another file
or a socket).
- Best performance with NIO is with a multiple of memory page
size (or file system cluster size?).
- Decoding pure ASCII in UTF-8 is much more efficient
than random UTF-8 data (over 4 times faster per byte than
the mixed UTF-8 file).

ASCII 0.001525879 us/byte (and per char)
UTF-8 0.006485037 us/byte
UTF-8 0.019073486 us/char
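
For reference, "done properly" means decoding in chunks with a reused
CharsetDecoder instead of pulling individual bytes out of the buffer.
A rough sketch (buffer sizes are arbitrary assumptions, not the values
from the gist):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class ChunkedDecode {
    static long countChars(Path file) throws IOException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        // Heap buffers so the decoder can work on the backing arrays.
        ByteBuffer in = ByteBuffer.allocate(64 * 1024);
        CharBuffer out = CharBuffer.allocate(64 * 1024);
        long chars = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            boolean eof = false;
            while (!eof) {
                eof = ch.read(in) < 0;
                in.flip();
                // UTF-8 never produces more chars than bytes, so with
                // equal-sized buffers this single decode cannot overflow.
                dec.decode(in, out, eof);
                out.flip();
                chars += out.remaining();   // "process" the chunk
                out.clear();
                in.compact();               // keep a partial multi-byte tail
            }
            dec.flush(out);
            out.flip();
            chars += out.remaining();
        }
        return chars;
    }
}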

Cheers

robert

Roedy Green

> - Adding byte buffering to IO does not help, rather it makes
>   things slower.

That is what you would expect because NIO is doing its own byte
buffering, so an extra layer just gets in the way.

However for ordinary i/o I discovered allocating your space 50:50 to
the byte and char buffer was optimal.

Roedy Green

> This is not true for all cases. For example, if the ByteBuffer and
> CharBuffer have backing arrays, this method is invoked, which accesses
> those arrays directly:
> sun.nio.cs.UTF_8.Decoder.decodeArrayLoop(ByteBuffer, CharBuffer)

The code I was complaining about is ByteBuffer.get().

Copying is a bane of Java. Even with decodeArrayLoop to get you to a
CharBuffer, you still need at least one more copy to get a String.

I think the general principle is that NIO only works properly if you
can do all your work inside the buffers, without extracting the data
as a whole.

I have wondered if at a hardware level, CPUs might be designed that do
lazy copies of arbitrary hunks of bytes.

They might:

1. behave like String, giving you a reference to a read only copy.
2. do lazy copies in the background
3. when you actually attempt to change the underlying data, it then
actually makes the copy, or a copy of the part you are trying to
change.
4. let you request or relinquish read/write access.
5. have some sort of hardware that shovels 1024+ bytes around at a
time like a super GPU, possibly integrated with page mapping.

For example, new String(char[]) does a copy that I hope someday
will be avoided. new String would make a lazy copy of the char[]. If
nobody further modified the char[], the usual case, then the copy
would be free.

Roedy Green

> However, the test is quite unrealistic since this is not how NIO is
> usually used.

With this particular benchmark I was not trying to demonstrate the use
of NIO, but decide the optimal way to read a whole file of characters
at a time, something I do very often.

Even though I have written some code that at least functions using
NIO, I can't say I understand it. I primarily just glue together methods
based on the types of parameters and return types. I don't have an
overall picture of how it works or why it works, or what it is for, as
I do for ordinary I/O.

I just have a vague notion that if you keep your data in buffers, off
the Java heap, and ignore most of it, NIO will work faster than
ordinary I/O.

I would be happy to post some sample code, explanations etc. at
http://mindprod.com/jgloss/nio.html
if you are up to expounding on NIO.

Robert Klemme

> That is what you would expect because NIO is doing its own byte
> buffering, so an extra layer just gets in the way.

IO - not NIO!

> However for ordinary i/o I discovered allocating your space 50:50 to
> the byte and char buffer was optimal.

This is contrary to what my results show. Did you look at them or run
the tests yourself?

Regards

robert

Robert Klemme

> With this particular benchmark I was not trying to demonstrate the use
> of NIO, but decide the optimal way to read a whole file of characters
> at a time, something I do very often.

Why do you do that? Wouldn't that run the risk of using too much
memory? I mean, usually you want to extract information from the file

> I just have a vague notion that if you keep your data in buffers, off
> the Java heap, and ignore most of it, NIO will work faster than
> ordinary I/O.

You do not necessarily have to ignore it. But as long as you just do
raw IO (i.e. copying data from one place to the other) then direct
ByteBuffer seems to perform best.

> I would be happy to post some sample code, explanations etc. at
> http://mindprod.com/jgloss/nio.html
> if you are up to expounding on NIO.

Others have more time and experience to do that. NIO is more
complicated and offers more control for a greater variety of use cases.
If you just want to serially read a file using blocking IO, the old IO
is probably best - even performance-wise, as we have seen.

Kind regards

robert

Rupert Smith

> I did a benchmark to read a file of bytes in one I/O and then convert
> it to a String:
>
> // Using a random sample data file of 419,430,400 chars,
> // 419,430,400 bytes UTF-8.
> // RandomAccess 1.46 seconds
> // InputStream  1.48 seconds
> // NIO          1.56 seconds
>
> NIO is great for grabbing bytes, but if you have to suck them out of
> the buffer, it does a get() call on every byte.
>
> The code is posted at http://mindprod.com/jgloss/nio.html

Try using a direct byte buffer which has been pre-allocated. Direct
buffers are allocated outside the Java heap (using malloc()?), so the
allocation cost is high. They only really provide a performance boost
when re-used.

Also, if you dig into the internals you will find that a heap buffer reading a file or socket will copy bytes from a direct buffer anyway, and that Java does its own internal pooling/re-allocation of direct buffers.

Often benchmarks will say heap buffers are faster, because they
allocate a buffer, read some data, then allow the buffer to be garbage
collected. In the heap buffer case, the internal direct buffer pool is
being used. In the direct buffer case, a new one is being allocated
each time, which is slow.

I may be wrong but... are the byte get()/set() calls not trapped by some compiler intrinsics and optimized away?

I did a lot of performance testing around NIO while working for a
company that developed a FIX engine. Independent testing was carried
out by Intel, and we were every bit as fast as the best C++ engines
(once JIT compilation was done, anyway). Developing your own pooling
mechanism for direct buffers is definitely the way to go if you really
want to make your code as fast as possible. Allocation costs and memory
copying need to be avoided as much as possible. That said, zero copy IO
is still largely a myth.
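
A minimal sketch of such a pool (a hypothetical class, not the actual
engine; a production pool would also bound its size):

import java.nio.ByteBuffer;
import java.util.ArrayDeque;

final class DirectBufferPool {
    private final int bufferSize;
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();

    DirectBufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    synchronized ByteBuffer acquire() {
        ByteBuffer bb = free.pollFirst();
        // Pay the expensive allocateDirect() only on a pool miss.
        return bb != null ? bb : ByteBuffer.allocateDirect(bufferSize);
    }

    synchronized void release(ByteBuffer bb) {
        bb.clear();          // reset position/limit for the next user
        free.offerFirst(bb); // LIFO keeps recently used buffers warm
    }
}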

Rupert

Robert Klemme

> Try using a direct byte buffer which has been pre-allocated. Direct
> buffers are allocated outside the Java heap (using malloc()?), so the
> allocation cost is high. They only really provide a performance boost
> when re-used.

That of course depends on the usage scenario, e.g. the frequency of
allocation etc. If you serve long lasting connections the cost of
allocating and freeing a DirectByteBuffer is negligible and other
reasons may gain more weight in the decision to use a direct or heap
buffer (e.g. whether access to the byte[] can make things faster as I
assume is the case in my decoding tests posted upthread).

> Also, if you dig into the internals you will find that a heap buffer
> reading a file or socket will copy bytes from a direct buffer anyway,
> and that Java does its own internal pooling/re-allocation of direct
> buffers.

Can you point me to more information about this? Or are you referring
to OpenJDK's source code?

> Often benchmarks will say heap buffers are faster, because they
> allocate a buffer, read some data, then allow the buffer to be
> garbage collected.

I think heap byte buffers were faster in my tests (see upthread) not
because of allocation and GC (this was not included in the time
measurement) but rather because data would cross the boundary between
non-Java heap memory (where the bytes arrive from the OS) and the Java
heap less frequently because of the larger batches. If you have to
fetch individual bytes from a ByteBuffer off the Java heap you have to
make that transition much more frequently.

> In the heap buffer case, the internal direct buffer pool
> is being used. In the direct buffer case, a new one is being
> allocated each time, which is slow.

Can you point me to writing about that internal byte buffer pool in the
JRE? I could not find anything.

> I may be wrong but... are the byte get()/set() calls not trapped by
> some compiler intrinsics and optimized away?

DirectByteBuffer.get() contains a native call to fetch the byte - and I
don't think the JIT will optimize away native calls. The JRE just does
not have any insights into what JNI calls do.

> Allocation costs
> and memory copying need to be avoided as much as possible.

While I agree with the general tendency of that statement ("allocation
costs") I believe nowadays one needs to be very careful with such
statements. For example, if you share immutable data structures across
threads which require copying during manipulation, that allocation and
GC cost may very well be smaller than the cost of locking in a more
traditional approach. The correct answer is "it depends" all too often
- which is disappointing but eventually more helpful. :)

> That said, zero copy IO is still largely a myth.

I guess the best you can get is reading from a memory-mapped file and
writing those bytes directly to another channel, i.e. without those
bytes needing to enter the Java heap. Of course there is just a limited
set of use cases that fit this model.
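
One way to get close to that, sketched here, is
FileChannel.transferTo(), which can let the kernel move the bytes
without them ever entering the JVM heap:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Sketch: copy a whole file to another channel without the bytes
// passing through the Java heap.
static void transferAll(FileChannel src, WritableByteChannel dst)
        throws IOException {
    long pos = 0;
    long size = src.size();
    while (pos < size) {
        // transferTo() may move fewer bytes than requested, so loop.
        pos += src.transferTo(pos, size - pos, dst);
    }
}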

Cheers

robert

Rupert Smith

>> Try using a direct byte buffer which has been pre-allocated. Direct
>> buffers are allocated outside the Java heap (using malloc()?), so the
>> allocation cost is high. They only really provide a performance boost
>> when re-used.
>
> That of course depends on the usage scenario, e.g. the frequency of
> allocation etc. If you serve long lasting connections the cost of
> allocating and freeing a DirectByteBuffer is negligible and other
> reasons may gain more weight in the decision to use a direct or heap
> buffer (e.g. whether access to the byte[] can make things faster as I
> assume is the case in my decoding tests posted upthread).

The allocation cost is unfortunately not negligible. Allocation cost
within the heap is very low, because it is easy to do. Outside the
heap, a malloc() type algorithm can be considerably slower, because
free blocks may need to be searched for.

We could try ByteBuffer.allocate() in a loop and see.

If you are servicing a long running connection, create a direct buffer
big enough to handle it, and re-use it on subsequent reads. In the case
of the FIX engine I wrote, this model worked well, because FIX is ASCII
(that is, a price would be "1.234" as ASCII characters) and needs to be
decoded into binary. So I would read some ASCII into the buffer, then
decode it into a binary form, then get some more bytes into the buffer
once the original ones were consumed, and so on.
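
A sketch of that reuse pattern (the class and the decode step are
hypothetical, not the actual engine):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class FixReader {
    // Allocated once for the connection's lifetime, then reused.
    private final ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);

    void pump(SocketChannel ch) throws IOException {
        while (ch.read(buf) > 0) {
            buf.flip();
            consumeCompleteMessages(buf); // advances the position
            buf.compact(); // keep a partial message tail for the next read
        }
    }

    private void consumeCompleteMessages(ByteBuffer bb) {
        // Parse ASCII FIX fields into binary form here (omitted).
        bb.position(bb.limit());
    }
}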

> Can you point me to more information about this? Or are you referring
> to OpenJDK's source code?

Yes, I looked in the OpenJDK source code. You don't have to dig too far under socket.read() or socket.write() to find it.

> I think heap byte buffers were faster in my tests (see upthread) not
> because of allocation and GC (this was not included in the time
> measurement) but rather because data would cross the boundary between
> non-Java heap memory (where the bytes arrive from the OS) and the Java
> heap less frequently because of the larger batches. If you have to
> fetch individual bytes from a ByteBuffer off the Java heap you have to
> make that transition much more frequently.

As I say, this does seem to have been optimized, although I admit I am
a little unsure as to exactly how. It was certainly the case in 1.4 and
maybe 1.5 that heap buffer array [] access was faster, and get()/set()
was slow. I have seen benchmarks and run my own micro-benchmarks which
suggest that get()/set() is now every bit as fast as the array access.

> Can you point me to writing about that internal byte buffer pool in
> the JRE? I could not find anything.

> DirectByteBuffer.get() contains a native call to fetch the byte - and
> I don't think the JIT will optimize away native calls. The JRE just
> does not have any insights into what JNI calls do.

Exactly what I thought, yet it does seem to be optimized.

> While I agree with the general tendency of that statement ("allocation
> costs") I believe nowadays one needs to be very careful with such
> statements. For example, if you share immutable data structures across
> threads which require copying during manipulation, that allocation and
> GC cost may very well be smaller than the cost of locking in a more
> traditional approach.

Indeed. In some situations we used mutable data structures across
threads, which of course is dangerous if the programmer does not know
how to handle it, and difficult to get right even if they do.

Rupert

Rupert Smith

> Can you point me to writing about that internal byte buffer pool in
> the JRE? I could not find anything.

Take a look here:

http://grepcode.com/file/repository...sun.nio.ch.NativeDispatcher,java.lang.Object)

Line 179.

You can see:

179    static int read(FileDescriptor fd, ByteBuffer dst, long position,
180                    NativeDispatcher nd, Object lock)
181        throws IOException
182    {
183        if (dst.isReadOnly())
184            throw new IllegalArgumentException("Read-only buffer");
185        if (dst instanceof DirectBuffer)
186            return readIntoNativeBuffer(fd, dst, position, nd, lock);
187
188        // Substitute a native buffer
189        ByteBuffer bb = Util.getTemporaryDirectBuffer(dst.remaining());
190        try {
191            int n = readIntoNativeBuffer(fd, bb, position, nd, lock);
192            bb.flip();
193            if (n > 0)
194                dst.put(bb);
195            return n;
196        } finally {
197            Util.offerFirstTemporaryDirectBuffer(bb);
198        }
199    }

So when using a heap buffer, a temporary direct buffer is taken from a pool, read into, then the data is copied into the heap buffer.

Many benchmarks will do:

time this {
    ByteBuffer.allocateDirect();
    // Read some data into the buffer
}

time this {
    ByteBuffer.allocate();
    // Read some data into the buffer
}

And come to the conclusion that heap buffers are faster. But now that
we know every heap buffer IO operation uses a direct buffer under the
covers, how can heap buffer IO operations be faster?

If we do the pooling ourselves, we can find that direct buffers are faster.
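
A fairer micro-benchmark would hoist the allocation out of the timed
region, along these lines (file name, buffer size and iteration count
are placeholders):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public final class ReusedBufferBench {
    public static void main(String[] args) throws IOException {
        // Allocated once, outside the timed loop.
        ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024);
        try (FileChannel ch = FileChannel.open(Paths.get("data.bin"),
                                               StandardOpenOption.READ)) {
            long start = System.nanoTime();
            for (int i = 0; i < 100; i++) {
                ch.position(0);
                while (ch.read(direct) != -1) {
                    direct.flip();
                    // ... consume the bytes here ...
                    direct.clear();
                }
            }
            System.out.printf("%.3f ms%n",
                              (System.nanoTime() - start) / 1e6);
        }
    }
}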

Rupert

Robert Klemme

A remark upfront: Google Groups really screws up line breaks. Can you
please use a different text type or even a proper news reader?

>>> Try using a direct byte buffer which has been pre-allocated. Direct
>>> buffers are allocated outside the Java heap (using malloc()?), so
>>> the allocation cost is high. They only really provide a performance
>>> boost when re-used.
>>
>> That of course depends on the usage scenario, e.g. the frequency of
>> allocation etc. If you serve long lasting connections the cost of
>> allocating and freeing a DirectByteBuffer is negligible and other
>> reasons may gain more weight in the decision to use a direct or heap
>> buffer (e.g. whether access to the byte[] can make things faster as I
>> assume is the case in my decoding tests posted upthread).

> The allocation cost is unfortunately not negligible.

This is not what I said.

> Allocation cost within the heap is very low, because it is easy to
> do. Outside the heap, a malloc() type algorithm can be considerably
> slower, because free blocks may need to be searched for.

All true, but I did not question that at all.

> Yes, I looked in the OpenJDK source code. You don't have to dig too
> far under socket.read() or socket.write() to find it.

Thank you! I'll have a look once I find the time.

>> I think heap byte buffers were faster in my tests (see upthread) not
>> because of allocation and GC (this was not included in the time
>> measurement) but rather because data would cross the boundary between
>> non-Java heap memory (where the bytes arrive from the OS) and the
>> Java heap less frequently because of the larger batches. If you have
>> to fetch individual bytes from a ByteBuffer off the Java heap you
>> have to make that transition much more frequently.

> As I say, this does seem to have been optimized, although I admit I
> am a little unsure as to exactly how. It was certainly the case in
> 1.4 and maybe 1.5 that heap buffer array [] access was faster, and
> get()/set() was slow. I have seen benchmarks and run my own
> micro-benchmarks which suggest that get()/set() is now every bit as
> fast as the array access.

On a heap buffer, yes.

In case I did not mention it: I tested with OpenJDK 7.55 64 bit.

> Exactly what I thought, yet it does seem to be optimized.

I don't think so. I think my test showed the exact opposite. If you
believe differently, please point out where exactly I am missing
something and/or present a test which proves your point.

Cheers

robert

Rupert Smith

> I don't think so. I think my test showed the exact opposite. If you
> believe differently, please point out where exactly I am missing
> something and/or present a test which proves your point.

I have to admit it's been a while since I did some micro-benchmarking
around this. I still have the code I used, and you have got me
intrigued, so I will take another look. Thanks.

Rupert