NIO not so hot

Discussion in 'Java' started by Roedy Green, May 17, 2014.

  1. Roedy Green

    Roedy Green Guest

    I did a benchmark to read a file of bytes in one I/O then convert it
    to a String

    // Using a random sample data file of 419,430,400 chars
    // (419,430,400 bytes, UTF-8)
    // RandomAccess 1.46 seconds
    // InputStream  1.48 seconds
    // NIO          1.56 seconds

    NIO is great for grabbing bytes, but if you have to suck them out of
    the buffer, it does a get() call on every byte.
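    The bulk get(byte[]) avoids that per-byte cost. A minimal sketch of reading a
    whole file in one I/O and converting it to a String (the class and method
    names are illustrative, not Roedy's actual code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWholeFileNio {

    // Read an entire file in one pass via NIO, then convert to a String.
    // One bulk get(byte[]) replaces a get() call per byte.
    static String readAll(Path path) throws IOException {
        try (FileChannel ch = {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && >= 0) {
                // keep reading until the buffer is full or EOF
            }
            buf.flip();
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);                        // single bulk copy
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("nio-demo", ".txt");
        Files.write(tmp, "hello NIO".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(tmp));          // prints "hello NIO"
        Files.delete(tmp);
    }
}
```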

    The code is posted at
    Roedy Green, May 17, 2014

  2. This is not true for all cases. For example, if both ByteBuffer and
    CharBuffer are backed by an array, this method is invoked, which
    accesses those arrays directly:
    sun.nio.cs.UTF_8.Decoder.decodeArrayLoop(ByteBuffer, CharBuffer)
    The code suffers from too much copying: in readFileAtOnceWithNIO() you
    use a direct buffer, then need to copy it into a byte[] (which btw. does
    not use an individual get() for every byte, see
    java.nio.DirectByteBuffer.get(byte[], int, int)) and then you create the
    String (which copies the data again, but that cannot be avoided). If you
    use a heap byte buffer, one level of copying can be omitted, because you
    can access the byte[] inside and create the String with the
    String(byte[], int, int) constructor.
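    Robert's heap-buffer variant might look roughly like this sketch (the class
    and method names are made up; the point is reading into the accessible
    backing byte[] and building the String directly from it):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HeapBufferRead {

    // With a heap ByteBuffer the backing byte[] is directly accessible,
    // so the String can be built straight from it -- one copy fewer than
    // going through a direct buffer plus an intermediate byte[].
    static String readAll(Path path) throws IOException {
        try (FileChannel ch = {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());  // heap buffer
            while (buf.hasRemaining() && >= 0) {
                // fill the buffer
            }
            // String(byte[], offset, length, charset) copies once, unavoidably
            return new String(buf.array(), 0, buf.position(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("heap-demo", ".txt");
        Files.write(tmp, "heap buffer".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(tmp));   // prints "heap buffer"
        Files.delete(tmp);
    }
}
```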

    However, the test is quite unrealistic since this is not how NIO is
    usually used. The whole purpose of Buffer and subclasses is to read
    data in chunks.
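    Chunked decoding, the idiom the Buffer classes are built for, might be
    sketched like this (names and chunk size are illustrative; the chunk size
    must be at least 4 bytes so a UTF-8 sequence split across chunks can always
    complete):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChunkedDecode {

    // Decode a UTF-8 file chunk by chunk: read bytes into a small
    // ByteBuffer, feed them to a CharsetDecoder, append the chars.
    static String decodeInChunks(Path path, int chunkSize) throws IOException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(chunkSize);
        CharBuffer out = CharBuffer.allocate(chunkSize);
        StringBuilder sb = new StringBuilder();
        try (FileChannel ch = {
            boolean eof = false;
            while (!eof) {
                eof = == -1;
                in.flip();
                dec.decode(in, out, eof);   // endOfInput only on the last chunk
                out.flip();
                sb.append(out);
                out.clear();
                in.compact();               // keep bytes of a split character
            }
            dec.flush(out);
            out.flip();
            sb.append(out);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("chunk-demo", ".txt");
        Files.write(tmp, "grüße".getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeInChunks(tmp, 4));   // prints "grüße"
        Files.delete(tmp);
    }
}
```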

    I have extended the recent test case for char decoding to include NIO.
    Because NIO character decoding is harder to verify, I created a version
    which does a rough CRC calculation, so I was able to check that my
    implementation read all the characters in the proper order. You can
    find all the code here:

    Kind regards

    Robert Klemme, May 18, 2014

  3. Robert Klemme, May 18, 2014
  4. The error was counting chars twice during creation. This is fixed now.
    Robert Klemme, May 18, 2014
  5. I rearranged execution order a bit to group all read operations on one
    file and all NIO reads with direct or heap buffer.

    My takeaways:

    - IO and NIO have roughly the same performance for char decoding
    if done properly.
    - Adding byte buffering to IO does not help; rather, it makes
    things slower.
    - Reading into char[] with IO is more efficient than using
    char buffering.
    - NIO direct buffers are slower than heap buffers; they are
    probably best used if the data does not need to reach
    the Java heap (e.g. when copying a file to another file
    or a socket).
    - Best performance with NIO is with a multiple of memory page
    size (or file system cluster size?).
    - Decoding of pure ASCII in UTF-8 is much more efficient
    than random UTF-8 data (over 4 times faster per byte than
    the mixed UTF-8 file).

    ASCII 0.001525879 µs/byte (and char)
    UTF-8 0.006485037 µs/byte
    UTF-8 0.019073486 µs/char


    Robert Klemme, May 18, 2014
  6. Roedy Green

    Roedy Green Guest

    That is what you would expect because NIO is doing its own byte
    buffering, so an extra layer just gets in the way.

    However, for ordinary I/O I discovered that allocating your space 50:50
    between the byte and char buffers was optimal.
    Roedy Green, May 26, 2014
  7. Roedy Green

    Roedy Green Guest

    The code I was complaining about is ByteBuffer.get

    Copying is a bane of Java. Even with decodeArrayLoop to get you to a
    CharBuffer, you still need at least one more copy to get a String.

    I think the general principle is NIO only works properly if you can do
    all your work inside the buffers, without extracting it as a whole.
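    As an illustration of that principle, here is a sketch (the class and task
    are made up for this example) that does its work, counting line feeds,
    entirely inside the ByteBuffer, never extracting the contents into a
    byte[] or String:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class InBufferScan {

    // Count line feeds without copying the bytes out of the buffer.
    // The scan still touches every byte, but nothing leaves the buffer.
    static long countLines(Path path) throws IOException {
        long lines = 0;
        ByteBuffer buf = ByteBuffer.allocate(8192);
        try (FileChannel ch = {
            while ( >= 0) {
                buf.flip();
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') lines++;
                }
                buf.clear();
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("scan-demo", ".txt");
        Files.write(tmp, "one\ntwo\n".getBytes());
        System.out.println(countLines(tmp));   // prints 2
        Files.delete(tmp);
    }
}
```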

    I have wondered if at a hardware level, CPUs might be designed that do
    lazy copies of arbitrary hunks of bytes.

    They might:

    1. behave like String, giving you a reference to a read-only copy.
    2. do lazy copies in the background.
    3. when you actually attempt to change the underlying data, make the
    copy then, or a copy of just the part you are trying to change.
    4. let you request or relinquish read/write access.
    5. have some sort of hardware that shovels 1024+ bytes around at a
    time, like a super GPU, possibly integrated with page mapping.

    For example, new String(char[]) does a copy that I hope someday will be
    avoided: new String would make a lazy copy of the char[]. If nobody
    further modified the char[], the usual case, then the copy would be
    free.
    Roedy Green, May 26, 2014
  8. Roedy Green

    Roedy Green Guest

    With this particular benchmark I was not trying to demonstrate the use
    of NIO, but to decide on the optimal way to read a whole file of
    characters at a time, something I do very often.

    Even though I have written some code that at least functions using NIO
    I can't say I understand it. I primarily just glue together methods
    based on the types of parameters and return types. I don't have an
    overall picture of how it works or why it works, or what it is for, as
    I do for ordinary I/O.

    I just have a vague notion that if you keep your data in buffers, off
    the Java heap, and ignore most of it, NIO will work faster than
    ordinary I/O.

    I would be happy to post some sample code, explanations etc. at
    if you are up to expounding on NIO.
    Roedy Green, May 26, 2014
  9. IO - not NIO!
    This is contrary to what my results show. Did you look at them or run
    the tests yourself?


    Robert Klemme, May 26, 2014
  10. Why do you do that? Wouldn't that run the risk of using too much
    memory? I mean, usually you want to extract information from the file.
    You do not necessarily have to ignore it. But as long as you just do
    raw IO (i.e. copying data from one place to the other), then a direct
    ByteBuffer seems to perform best.
    Others have more time and experience to do that. NIO is more
    complicated and offers more control for a greater variety of use cases.
    If you just want to serially read a file using blocking IO the old IO
    is probably best - even performance wise, as we have seen.

    Kind regards

    Robert Klemme, May 26, 2014
  11. Rupert Smith

    Rupert Smith Guest

    Try using a direct byte buffer which has been pre-allocated. Direct buffers are allocated outside the Java heap (using malloc()?), so the allocation cost is high. They only really provide a performance boost when re-used.
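    Rupert's suggestion of pre-allocating a direct buffer and re-using it across
    reads might be sketched like this (the class name, the summing task, and
    the 64 KB size are illustrative choices, not code from the thread):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReusedDirectBuffer {

    // One direct buffer, allocated once and reused for every read,
    // so the malloc()-style allocation cost is paid only once.
    // Not thread safe: real code would pool buffers per thread.
    private static final ByteBuffer BUF = ByteBuffer.allocateDirect(64 * 1024);

    static long sumBytes(Path path) throws IOException {
        long sum = 0;
        try (FileChannel ch = {
            BUF.clear();
            while ( >= 0) {
                BUF.flip();
                while (BUF.hasRemaining()) sum += BUF.get() & 0xFF;
                BUF.clear();
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("direct-demo", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3});
        System.out.println(sumBytes(tmp));   // prints 6
        Files.delete(tmp);
    }
}
```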

    Also, if you dig into the internals you will find that a heap buffer reading a file or socket will copy bytes from a direct buffer anyway, and that Java does its own internal pooling/re-allocation of direct buffers.

    Often benchmarks will say heap buffers are faster, because they allocate a buffer, then read some data, then allow the buffer to be garbage collected. In the heap buffer case, the internal direct buffer pool is being used. In the direct buffer case, a new one is being allocated each time, which is slow.

    I may be wrong but... are the byte get()/set() calls not trapped by some compiler intrinsics and optimized away?

    I did a lot of performance testing around NIO working for a company that developed a FIX engine. Independent testing was carried out by Intel, and we were every bit as fast as the best C++ engines (once JIT compilation was done anyway). Developing your own pooling mechanism for direct buffers is definitely the way to go if you really want to make your code as fast as possible. Allocation costs and memory copying need to be avoided as much as possible. That said, zero copy IO is still largely a myth.

    Rupert Smith, May 31, 2014
  12. That of course depends on the usage scenario, e.g. the frequency of
    allocation etc. If you serve long lasting connections the cost of
    allocating and freeing a DirectByteBuffer is negligible and other
    reasons may gain more weight in the decision to use a direct or heap
    buffer (e.g. whether access to the byte[] can make things faster as I
    assume is the case in my decoding tests posted upthread).
    Can you point me to more information about this? Or are you referring
    to OpenJDK's source code?
    I think heap byte buffers were faster in my tests (see upthread) not
    because of allocation and GC (this was not included in the time
    measurement) but rather because data would cross the boundary between
    non Java heap memory (where they arrive from the OS) to Java heap more
    infrequently because of the larger batches. If you have to fetch
    individual bytes from a ByteBuffer off the Java heap you have to make
    the transition much more frequently.
    Can you point me to writing about that internal byte buffer pool in the
    JRE? I could not find anything.
    DirectByteBuffer.get() contains a native call to fetch the byte - and I
    don't think the JIT will optimize away native calls. The JRE just does
    not have any insights into what JNI calls do.
    While I agree with that general tendency of the statement ("allocation
    costs") I believe nowadays one needs to be very careful with these
    statements. For example, if you share immutable data structures across
    threads which require copying during manipulation that allocation and GC
    cost may very well be smaller than the cost of locking in a more
    traditional approach. The correct answer is "it depends" all too often -
    which is disappointing but eventually more helpful. :)
    I guess, the best you can get is reading from a memory mapped file and
    writing those bytes directly to another channel, i.e. without those
    bytes needing to enter the Java heap. Of course there is just a limited
    set of use cases that fit this model.
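    That model might be sketched like this: map the source file and write the
    mapped region straight to the destination channel, so the bytes never pass
    through a Java heap array (MappedCopy is an illustrative name, not code
    from the thread):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedCopy {

    // Copy a file by mapping the source and writing the mapped region
    // to the target channel; the bytes never enter a Java heap array.
    static void copy(Path src, Path dst) throws IOException {
        try (FileChannel in =;
             FileChannel out =,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            MappedByteBuffer map =, 0, in.size());
            while (map.hasRemaining()) {
                out.write(map);   // written directly from the mapping
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("src", ".bin");
        Path b = Files.createTempFile("dst", ".bin");
        Files.write(a, "mapped".getBytes());
        copy(a, b);
        System.out.println(new String(Files.readAllBytes(b)));   // prints "mapped"
        Files.delete(a);
        Files.delete(b);
    }
}
```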


    Robert Klemme, Jun 1, 2014
  13. Rupert Smith

    Rupert Smith Guest

    The allocation cost is unfortunately not negligible. Allocation cost within the heap is very low, because it is easy to do. Outside the heap, a malloc() type algorithm can be considerably slower, because free blocks may need to be searched for.

    We could try timing ByteBuffer.allocate() versus
    ByteBuffer.allocateDirect() in a loop and see.
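    The suggested experiment might look like this crude sketch (not a proper
    microbenchmark: no JMH, no warmup or dead-code control, so treat the
    numbers as rough at best):

```java
import java.nio.ByteBuffer;

public class AllocCostSketch {

    // Rough timing of heap vs direct buffer allocation in a loop.
    static long timeAlloc(boolean direct, int rounds, int size) {
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            ByteBuffer b = direct ? ByteBuffer.allocateDirect(size)
                                  : ByteBuffer.allocate(size);
            b.put(0, (byte) 1);   // touch the buffer so it is not optimized away
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int rounds = 1_000, size = 4096;   // small enough to stay well
                                           // under the direct memory limit
        System.out.printf("heap:   %d ns%n", timeAlloc(false, rounds, size));
        System.out.printf("direct: %d ns%n", timeAlloc(true, rounds, size));
    }
}
```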

    If you are servicing a long running connection, create a direct buffer big enough to handle it, and re-use it on subsequent reads. In the case of the FIX engine I wrote, this model worked well, because FIX is ASCII (that is, a price would be "1.234" as ASCII characters), and needs to be decoded into binary. So I would read some ASCII into the buffer, then decode into a binary form, then get some more bytes into the buffer once the original ones were consumed, and so on.
    Yes, I looked in the OpenJDK source code. You don't have to dig too far under or socket reads/writes to find it.
    As I say, this does seem to have been optimized, although I admit I am a little unsure as to exactly how. It was certainly the case in 1.4 and maybe 1.5 that heap buffer array [] access was faster, and get()/set() was slow. I have seen benchmarks and run my own micro-benchmarks which suggest that get()/set() is now every bit as fast as the array access.
    Exactly what I thought, yet it does seem to be optimized.
    Indeed. In some situations we used mutable data structures across threads, which of course is dangerous if the programmer does not know how to handle it, and difficult to get right even if they do.

    Rupert Smith, Jun 2, 2014
  14. Rupert Smith

    Rupert Smith Guest

    Take a look at sun.nio.ch.IOUtil.read(FileDescriptor, ByteBuffer, long,
    NativeDispatcher, Object) in the OpenJDK sources, line 179.

    You can see:

    179 static int read(FileDescriptor fd, ByteBuffer dst, long position,
    180                 NativeDispatcher nd, Object lock)
    181     throws IOException
    182 {
    183     if (dst.isReadOnly())
    184         throw new IllegalArgumentException("Read-only buffer");
    185     if (dst instanceof DirectBuffer)
    186         return readIntoNativeBuffer(fd, dst, position, nd, lock);
    187
    188     // Substitute a native buffer
    189     ByteBuffer bb = Util.getTemporaryDirectBuffer(dst.remaining());
    190     try {
    191         int n = readIntoNativeBuffer(fd, bb, position, nd, lock);
    192         bb.flip();
    193         if (n > 0)
    194             dst.put(bb);
    195         return n;
    196     } finally {
    197         Util.offerFirstTemporaryDirectBuffer(bb);
    198     }
    199 }

    So when using a heap buffer, a temporary direct buffer is taken from a pool, read into, then the data is copied into the heap buffer.

    Many benchmarks will do:

    time this {
        // Allocate a heap buffer
        // Read some data into the buffer
    }

    time this {
        // Allocate a direct buffer
        // Read some data into the buffer
    }

    And come to the conclusion that heap buffers are faster. But now that we know every heap buffer IO operation uses a direct buffer under the covers, how can heap buffer IO operations be faster?

    If we do the pooling ourselves, we can find that direct buffers are faster.
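    A do-it-yourself pool might be sketched like this (DirectBufferPool and its
    methods are made-up names; a production pool would also bound growth and
    handle sizing):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

public class DirectBufferPool {

    // A minimal pool: hand out pre-allocated direct buffers, take them back.
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;

    DirectBufferPool(int bufferSize, int count) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < count; i++) {
            free.push(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    // Reuse a pooled buffer when one is free; fall back to a fresh
    // allocation (the slow path) only when the pool is exhausted.
    synchronized ByteBuffer acquire() {
        return free.isEmpty() ? ByteBuffer.allocateDirect(bufferSize) : free.pop();
    }

    synchronized void release(ByteBuffer buf) {
        buf.clear();
        free.push(buf);
    }

    public static void main(String[] args) {
        DirectBufferPool pool = new DirectBufferPool(8192, 2);
        ByteBuffer b = pool.acquire();
        System.out.println(b.isDirect());   // prints true
        pool.release(b);
    }
}
```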

    Rupert Smith, Jun 2, 2014
  15. A remark upfront: Google Groups really screws up line breaks. Can you
    please use a different text type or even a proper news reader?

    This is not what I said.
    All true, but I did not question that at all.
    Thank you! I'll have a look once I find the time.
    On a heap buffer, yes.

    In case I did not mention it: I tested with OpenJDK 7u55, 64 bit.
    I don't think so. I think my test showed the exact opposite. If you
    believe differently please point out where exactly I am missing
    something. And / or present a test which proves your point.


    Robert Klemme, Jun 3, 2014
  16. Rupert Smith

    Rupert Smith Guest

    I have to admit it's been a while since I did some micro-benchmarking around this. I still have the code I used, and you have got me intrigued, so I will take another look. Thanks.

    Rupert Smith, Jun 6, 2014
