Byte Stream Vs Char Stream Buffer

Discussion in 'Java' started by Roedy Green, May 7, 2014.

  1. Roedy Green

    Roedy Green Guest

    When you do your lego thing to create a BufferedReader or
    BufferedWriter, you get to specify two buffer sizes, one in bytes for
    the byte stream and one in chars for the character stream.

    Given that you have N bytes to spend on buffers for this file, how
    should you allocate them?

    I have been told the best ratio is 32K bytes for every 24K chars. I
    can't see why that would be. I would have thought most of the space
    should go to the byte stream for physical I/O. Has anyone ever
    experimented?
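
    To be concrete, this is the sort of stack I mean; the file name and the
    two sizes are only placeholders, not a recommendation:

    import java.io.BufferedInputStream;
    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class TwoBufferStack
    {
        public static void main( String[] args ) throws IOException
        {
            // byte-side buffer, sized in bytes, feeds the physical reads;
            // char-side buffer, sized in chars, holds the decoded characters
            try ( BufferedReader br = new BufferedReader(
                    new InputStreamReader(
                            new BufferedInputStream(
                                    new FileInputStream( "sample.txt" ), 32 * 1024 ),
                            StandardCharsets.UTF_8 ),
                    16 * 1024 ) )
            {
                String line;
                while ( ( line = br.readLine() ) != null )
                {
                    // process the line here
                }
            }
        }
    }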
     
    Roedy Green, May 7, 2014
    #1

  2. I'm wondering, if your lego stack is higher than necessary. Are you
    placing BufferedWriter over BufferedOutputStream? I think you can
    put the BufferedOutputStream back into the lego bag in that case. ;)
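
    In other words, for output something like this should already be enough
    (file name and size invented for the example):

    import java.io.BufferedWriter;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;

    public class WriterStack
    {
        public static void main( String[] args ) throws IOException
        {
            // one buffer on the char side; the OutputStreamWriter already hands
            // the FileOutputStream reasonably large chunks, so an extra
            // BufferedOutputStream underneath rarely buys anything
            try ( BufferedWriter bw = new BufferedWriter(
                    new OutputStreamWriter(
                            new FileOutputStream( "out.txt" ), StandardCharsets.UTF_8 ),
                    32 * 1024 ) )
            {
                bw.write( "hello" );
                bw.newLine();
            }
        }
    }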
     
    Andreas Leitgeb, May 8, 2014
    #2

  3. Roedy Green

    Roedy Green Guest

    I think some experiments are in order.
     
    Roedy Green, May 9, 2014
    #3
  4. Roedy Green

    Roedy Green Guest

    Here are the results, using a file of 400 MB of random characters and
    32K bytes of buffer:

    HunkIO: 1.20 seconds  <-- does an all-at-once read
    BufferedReader backed with BufferedInputStream, ratio 0.50, buffsize 32768 bytes: 1.65 seconds
    plain BufferedReader, buffsize 32768 bytes: 1.76 seconds

    The bottom line: either use HunkIO when you have the RAM, or give 50%
    of your buffer space to the InputStream and 50% to the Reader. In other
    words, 100K of buffer space would mean a 50K InputStream buffer and a
    25K-char buffer for the Reader, because the Reader's buffer is measured
    in chars. I tried ratios from .1 to .9; 50% was best.

    There is a second problem with a plain BufferedReader:

    final BufferedReader ibr = new BufferedReader(
            new FileReader( sampleDataFile ), buffsize / 2 );

    There is no place to put the encoding.
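
    As an aside, since Java 7 java.nio.file.Files.newBufferedReader lets you
    name the encoding without building the stack yourself, though it gives
    you no say in the buffer sizes:

    // needs java.nio.file.Files and java.nio.charset.StandardCharsets
    final BufferedReader ibr = Files.newBufferedReader(
            sampleDataFile.toPath(), StandardCharsets.UTF_8 );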


    The "lego" (encapsulated) is what you should use for big files:

    double ratio = .5;

    final FileInputStream fis = new FileInputStream( sampleDataFile );
    final BufferedInputStream bis = new BufferedInputStream( fis,
    ( int ) ( buffsize * ( 1 - ratio ) ) ); /* buffsize in bytes */
    final InputStreamReader isr = new InputStreamReader( bis,
    UTF8Charset );
    final BufferedReader ibr = new BufferedReader( isr, ( int ) (
    buffsize * ratio / 2 /* in chars */ ) );

    I have an SSD. I will retest this on a hard drive.
     
    Roedy Green, May 10, 2014
    #4
  5. I find this a fairly unspecific description of the test you did. We do
    not know what your test did to ensure that file system caching or GC
    effects did not influence the measurement, etc.
    If I'm not mistaken, you did not test cases where there is just a
    BufferedReader OR just a BufferedInputStream - which is what Andreas was
    hinting at. I don't think duplicate buffering will make things better.
    I would have expected at least three tests (double buffering, byte-only
    buffering, char-only buffering), plus the code.

    Cheers

    robert
     
    Robert Klemme, May 10, 2014
    #5
  6. Roedy Green

    Roedy Green Guest

    The code is posted at
    http://mindprod.com/applet/fileio.html#BUFFERSIZE

    These are not particularly subtle effects. I ran five trials and took
    the best time for each. I figured this would eliminate hotspot
    initialisation and GC effects.
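
    The measurement loop is roughly this shape (simplified; the names here
    are made up, not the ones on that page):

    import java.io.IOException;

    public class BestOfTrials
    {
        interface Trial { void run() throws IOException; }

        // run the same trial several times and keep the minimum wall-clock
        // time, which discounts HotSpot warm-up and the odd GC pause
        static double bestSeconds( int trials, Trial t ) throws IOException
        {
            long best = Long.MAX_VALUE;
            for ( int i = 0; i < trials; i++ )
            {
                long start = System.nanoTime();
                t.run();
                long elapsed = System.nanoTime() - start;
                if ( elapsed < best ) best = elapsed;
            }
            return best / 1_000_000_000d;
        }
    }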
     
    Roedy Green, May 11, 2014
    #6
  7. Roedy Green

    Roedy Green Guest

    It does. And allocating your space exactly 50:50 makes a noticeable
    difference.

    The plain BufferedReader was done like this:

    final BufferedReader ibr = new BufferedReader(
            new FileReader( sampleDataFile ), buffsize / 2 );

    It got whatever the default buffering on the internal InputStream is.
    It is not as fast as specifying buffers for the internal InputStream,
    lego-style.
     
    Roedy Green, May 11, 2014
    #7
  8. I followed a slightly different approach and modularized my test a bit
    more. You can find it here:

    https://gist.github.com/rklemme/fad399b5e7cc4d3b6d0c

    File opening and closing is not measured. You can assume the data comes
    purely from memory (vastly more RAM than file size, and no disk
    activity shown). This was on Xubuntu 14.04 64 bit, kernel version
    3.13.0-24. No special JVM command line switches were used.

    Sounds good - I followed the same approach (i.e. repeating and using the
    minimum time value).

    My results:

    - The higher the level at which you buffer, the more effective it is.
    - Buffering at the byte level often hurts rather than helps.
    - Reading in 1k chunks from the application level is already close to
    optimal (see the sketch below).
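
    By "reading in 1k chunks from the application level" I mean roughly the
    following shape - a simplified sketch, not the literal code from the gist:

    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;

    public class ChunkedRead
    {
        public static void main( String[] args ) throws IOException
        {
            char[] chunk = new char[ 1024 ];   // 1k chars per application-level read
            long total = 0;
            // FileReader uses the platform default charset; fine for a sketch
            try ( Reader in = new FileReader( args[ 0 ] ) )
            {
                int n;
                while ( ( n = in.read( chunk ) ) != -1 )
                {
                    total += n;                // real code would process the chunk here
                }
            }
            System.out.println( total + " chars read" );
        }
    }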

    Kind regards

    robert
     
    Robert Klemme, May 11, 2014
    #8
  9. markspace

    markspace Guest


    Good test suite all around, Robert. I just wanted to address this one
    point quickly. I'm not sure if this is new information or not, but in
    actual use, I've found that it helps to allocate large buffers when
    you're reading a large text file that must be processed quickly.

    Sample:

    ByteBuffer bb = ByteBuffer.allocateDirect( 64 * 1024 );
    RandomAccessFile f = new RandomAccessFile( "blet", "rw" );
    FileChannel fc = f.getChannel();
    readToCapacity( fc, bb );
    bb.flip();
    ByteBuffer b2 = bb.slice();
    b2.asReadOnlyBuffer();

    Allocating a direct byte buffer of 64k seems to help text processing
    speed by about an order of magnitude. Files that take minutes to
    process go to seconds when large direct byte buffers are used. It seems
    important to get that first big chunk of bytes in before doing anything
    else (like converting to chars).

    The input file for this was about 600k, so if you're reading much
    smaller files the difference might not show up.

    This is just a sample, I don't seem to be able to find the original, so
    I hope it's accurate. Here's the static method used by the above code:

    public static void readToCapacity( FileChannel fc, ByteBuffer buf )
            throws IOException
    {
        final int capacity = buf.capacity();
        int totalRead = 0;
        int bytesRead = 0;
        // keep reading until the buffer is full or the channel hits EOF
        while ( ( bytesRead = fc.read( buf ) ) != -1 ) {
            totalRead += bytesRead;
            if ( totalRead == capacity ) break;
        }
    }
     
    markspace, May 11, 2014
    #9
  10. markspace

    markspace Guest

    I think this might be the original code that I used to speed up my
    project. It's been a while so I don't know exactly how well this is
    working right now.

    You can wrap this in whatever you'd like -- BufferedReader for example --
    since I just make a plain InputStream here (see the sketch after the
    class). I also notice that there's not even a close method on this
    class, so obviously I didn't complete it.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;

    /**
     * Copyright 2013 Brenden Towey DONATED TO THE PUBLIC DOMAIN.
     *
     * PROVIDED AS-IS WITH NO WARRANTY. FURTHER TESTING IS
     * REQUIRED TO COMPLETE THIS CODE. CAVEAT EMPTOR.
     *
     * @author Brenden Towey
     */
    public class FileByteBufferInputStream extends InputStream {

        private ByteBuffer buffer = ByteBuffer.allocateDirect( 64 * 1024 );
        private FileInputStream file;
        int read;                  // bytes remaining in the buffer
        boolean eof = false;

        public FileByteBufferInputStream( FileInputStream file )
                throws IOException
        {
            this.file = file;
            read = file.getChannel().read( buffer );
            if ( read == -1 ) eof = true;
            buffer.rewind();
        }

        @Override
        public int read() throws IOException
        {
            if ( eof ) return -1;
            if ( read-- == 0 )
            {
                // buffer exhausted: refill it from the channel
                buffer.clear();
                read = file.getChannel().read( buffer );
                if ( read == -1 ) eof = true;
                buffer.rewind();
                if ( eof ) {
                    return -1;
                } else {
                    // mask so bytes >= 0x80 are not mistaken for the -1 EOF value
                    int retval = buffer.get() & 0xFF;
                    assert read > 0;
                    read--;
                    return retval;
                }
            } else {
                return buffer.get() & 0xFF;
            }
        }
    }
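
    A wrapped use might look roughly like this; the file name is a
    placeholder and the charset choice is just an assumption for the example:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class WrapExample
    {
        public static void main( String[] args ) throws IOException
        {
            // hold the FileInputStream separately so it actually gets closed;
            // FileByteBufferInputStream has no close() of its own yet
            try ( FileInputStream fis = new FileInputStream( "blet" );
                  BufferedReader br = new BufferedReader(
                          new InputStreamReader(
                                  new FileByteBufferInputStream( fis ),
                                  StandardCharsets.US_ASCII ) ) )
            {
                String line;
                while ( ( line = br.readLine() ) != null )
                {
                    // process the line here
                }
            }
        }
    }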
     
    markspace, May 11, 2014
    #10
  11. Thank you!
    I seem to recall having done similar tests in the past but found that
    overall the benefit for processing speed was negligible because at some
    point the data has to be moved into the Java heap. But my memory is
    faint here and the situation may have changed in the meantime.
    The code shown so far differs in a couple of ways from the tests Roedy
    and I were doing:

    1. You never transfer those bytes into Java land (i.e. into a byte or a
    byte[] allocated on the heap) - the data stays in native land.

    2. You are not reading chars and hence also do not do the character
    decoding.

    You likely did some processing but we cannot see it. But I agree,
    another test with nio would be in order to show whether there is more
    potential for speedup.
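
    For illustration, an nio-based test that also does the character
    decoding could look roughly like this - a sketch with made-up names, not
    code from the gist:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class DirectBufferDecode
    {
        public static void main( String[] args ) throws IOException
        {
            ByteBuffer bb = ByteBuffer.allocateDirect( 64 * 1024 );
            CharBuffer cb = CharBuffer.allocate( 64 * 1024 );
            CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput( CodingErrorAction.REPLACE )
                    .onUnmappableCharacter( CodingErrorAction.REPLACE );
            long chars = 0;
            try ( RandomAccessFile f = new RandomAccessFile( args[ 0 ], "r" );
                  FileChannel fc = f.getChannel() )
            {
                while ( fc.read( bb ) != -1 )
                {
                    bb.flip();                    // switch buffer from filling to draining
                    dec.decode( bb, cb, false );  // bytes -> chars; may leave a partial char behind
                    cb.flip();
                    chars += cb.remaining();      // "processing" would happen here
                    cb.clear();
                    bb.compact();                 // keep any undecoded tail bytes
                }
                bb.flip();
                dec.decode( bb, cb, true );       // end of input
                dec.flush( cb );
                cb.flip();
                chars += cb.remaining();
            }
            System.out.println( chars + " chars decoded" );
        }
    }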

    Btw. a memory mapped file may also be an alternative. This can be done
    with nio. It does have some downsides (e.g. different behavior on
    different OS in my experience, address space) and proved not to be
    faster than plain sequential reading with InputStreams in an application
    I have been working on.
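
    The memory-mapped variant I am referring to looks roughly like this
    (again just a sketch; note the address-space caveat and the 2 GB
    per-mapping limit):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedRead
    {
        public static void main( String[] args ) throws IOException
        {
            try ( RandomAccessFile f = new RandomAccessFile( args[ 0 ], "r" );
                  FileChannel fc = f.getChannel() )
            {
                // map the whole file read-only into virtual memory (watch the
                // address space on 32 bit JVMs; a single mapping is capped at 2 GB)
                MappedByteBuffer map = fc.map(
                        FileChannel.MapMode.READ_ONLY, 0, fc.size() );
                long sum = 0;
                while ( map.hasRemaining() )
                {
                    sum += map.get() & 0xFF;   // stand-in for real processing
                }
                System.out.println( sum );
            }
        }
    }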

    Kind regards

    robert
     
    Robert Klemme, May 11, 2014
    #11
  12. markspace

    markspace Guest


    Yes, I knew the data was "mostly ascii" and therefore I didn't have to
    do character decoding. An efficient UTF-8 converter shouldn't be much
    more complicated however.

    I appear to be building a histogram of the words in the file, I'm not
    sure why at this point. Some more found code:

    FileInputStream fins = new FileInputStream( path.toFile() );
    FileByteBufferInputStream fbbins =
            new FileByteBufferInputStream( fins );

    int charRead;
    HashedHistogram histogram = new HashedHistogram();
    charRead = fbbins.read();
    StringBuilder sb = new StringBuilder();
    while ( charRead != -1 )
    {
        if ( charRead < 128 && !Character.isWhitespace( charRead ) ) {
            // still inside a word: accumulate the character
            sb.append( (char) charRead );
            charRead = fbbins.read();
        } else {
            // end of a word: record it, then skip whitespace and non-ASCII
            histogram.add( sb.toString() );
            sb.delete( 0, sb.length() );
            while ( ( Character.isWhitespace( charRead = fbbins.read() )
                    || charRead >= 128 ) && charRead != -1 )
            {
                // nothing
            }
        }
    }
    System.out.println( histogram.size() + " words" );
    Entry<Comparable, Integer>[] entries = histogram.getSortedEntries();
    System.out.println( "Bottom words:" );
    for ( int i = 0; i < 20; i++ )
        System.out.println( entries[ i ].getKey() + ", " + entries[ i ].getValue() );
    System.out.println( "Top words:" );
    for ( int i = entries.length - 1; i > entries.length - 41; i-- )
        System.out.println( entries[ i ].getKey() + ", " + entries[ i ].getValue() );


    Kind of ugly, but that's what I have.
     
    markspace, May 11, 2014
    #12
  13. Roedy Green

    Roedy Green Guest


    private static void withBufferedReader( double ratio, int buffsize )
            throws IOException
    {
        assert .1 <= ratio && ratio <= .9 : "ratio must be in range [0.1,0.9].";
        long prevTime = System.nanoTime();
        final FileInputStream fis = new FileInputStream( sampleDataFile );
        final BufferedInputStream bis = new BufferedInputStream( fis,
                (int) ( buffsize * ( 1 - ratio ) ) );   /* buffsize in bytes */
        final InputStreamReader isr = new InputStreamReader( bis,
                EIO.UTF8Charset );
        final BufferedReader ibr = new BufferedReader( isr,
                (int) ( buffsize * ratio / 2 ) );       /* in chars */


    So, for example, if ratio = .5 and buffsize = 32K, the stream gets a
    16K-byte buffer and the reader gets an 8K-char buffer.

    This is the optimum ratio.
     
    Roedy Green, May 12, 2014
    #13
  14. Silvio

    Silvio Guest

    Roedy,

    It would be interesting to repeat this experiment with different
    character encodings and see if using UTF-16 versus UTF-8 makes a
    difference here.

    Silvio
     
    Silvio, May 12, 2014
    #14
  15. Roedy Green

    Roedy Green Guest

    You are doing your own buffering, not using BufferedReader, right?

    In that case, when you read a char[] it might do a big physical i/o,
    no?

    In the days of 16K computers, when you did this sort of thing, the OS
    would schedule a read to a second buffer while you were processing the
    first. Oddly now that we have RAM to burn, we no longer do that though
    I suppose the OS might do some speculative reading.
     
    Roedy Green, May 17, 2014
    #15
  16. Roedy Green

    Roedy Green Guest

    I would like to use this as an example of nio. I am wondering why each
    step is necessary.


    // sample use of NIO to read a text file very rapidly
    // by Mark Space

    // alloc buffer suited for direct physical I/O
    ByteBuffer bb = ByteBuffer.allocateDirect( 64 * 1024 );

    RandomAccessFile f = new RandomAccessFile( "blet", "rw" );

    FileChannel fc = f.getChannel();

    // A user method to consume one chunk of data
    // It would have to convert bytes to chars, fretting over boundaries.
    readToCapacity( fc, bb );

    // request the next chunk of data
    bb.flip();

    // ?
    ByteBuffer b2 = bb.slice();

    // creates readonly version and discards
    b2.asReadOnlyBuffer();
     
    Roedy Green, May 17, 2014
    #16
  17. markspace

    markspace Guest

    Yes, this bit of code was an amalgam of other bits that I had saved to
    remind myself how to perform some common operations on ByteBuffers. The
    longer sequence went something like:

    readToCapacity( fc, bb );
    bb.flip();
    ByteBuffer b2 = bb.slice();
    b2.limit( 100 );
    b2.asReadOnlyBuffer();
    readHeader( b2 );
    readFile( bb );

    And it was just a way to be able to isolate the (100 byte) header from
    the rest of the stream, as well as be able to retain all original bytes
    without a potentially expensive seek operation. Mostly just as a
    reminder how to slice up a ByteBuffer into smaller buffers so they can
    be read independently.
    BTW, I'd make that comment "reset limit and position for reading" rather
    than "next chunk" (the next chunk would probably be clear(), because if
    the buffer isn't full you'd run into problems with the limit when using
    flip()).

    Flip() is used to switch from put()s to get()s. See the docs:

    http://docs.oracle.com/javase/7/docs/api/java/nio/Buffer.html#flip()
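
    A tiny illustration of the difference (values and sizes are arbitrary):

    import java.nio.ByteBuffer;

    public class FlipDemo
    {
        public static void main( String[] args )
        {
            ByteBuffer buf = ByteBuffer.allocate( 8 );
            buf.put( (byte) 1 ).put( (byte) 2 );   // position = 2, limit = 8 (filling)
            buf.flip();                            // position = 0, limit = 2: ready to get() what was put()
            while ( buf.hasRemaining() )
                System.out.println( buf.get() );
            buf.clear();                           // position = 0, limit = 8: ready for a fresh fill
        }
    }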
     
    markspace, May 17, 2014
    #17
  18. Roedy Green

    Roedy Green Guest

    I read that but those operations have nothing to do with the
    traditional notions of I/O.

    I have tried to make sense of this and figure out what it does that
    ordinary I/O does not, especially since it legos on top of ordinary
    I/O. I have wimped out twice before.
     
    Roedy Green, May 17, 2014
    #18
  19. Roedy Green

    Roedy Green Guest

    With UTF-16 there are twice as many bytes to read, but the 50% magic
    ratio still prevails.

    Best of five trials shown.

    Using a random sample data file of 209,715,200 chars, 419,430,402 bytes.
    Using aggregate buffersize of 65,536 bytes.
    Using charset UTF-16

    BufferedReader backed with BufferedInputStream ratio 0.10 buffsize 65536 bytes: 2.64 seconds
    BufferedReader backed with BufferedInputStream ratio 0.20 buffsize 65536 bytes: 2.63 seconds
    BufferedReader backed with BufferedInputStream ratio 0.30 buffsize 65536 bytes: 2.64 seconds
    BufferedReader backed with BufferedInputStream ratio 0.40 buffsize 65536 bytes: 2.64 seconds
    BufferedReader backed with BufferedInputStream ratio 0.50 buffsize 65536 bytes: 2.64 seconds <--
    BufferedReader backed with BufferedInputStream ratio 0.60 buffsize 65536 bytes: 2.68 seconds
    BufferedReader backed with BufferedInputStream ratio 0.70 buffsize 65536 bytes: 2.71 seconds
    BufferedReader backed with BufferedInputStream ratio 0.80 buffsize 65536 bytes: 2.79 seconds
    BufferedReader backed with BufferedInputStream ratio 0.90 buffsize 65536 bytes: 2.70 seconds
    HunkIO: 2.70 seconds





    Using a random sample data file of 209,715,200 chars, 209,715,200 bytes.
    Using aggregate buffersize of 65,536 bytes.
    Using charset UTF-8

    HunkIO: 0.73 seconds
    BufferedReader backed with BufferedInputStream ratio 0.10 buffsize 65536 bytes: 0.89 seconds
    BufferedReader backed with BufferedInputStream ratio 0.20 buffsize 65536 bytes: 0.88 seconds
    BufferedReader backed with BufferedInputStream ratio 0.30 buffsize 65536 bytes: 0.88 seconds
    BufferedReader backed with BufferedInputStream ratio 0.40 buffsize 65536 bytes: 0.89 seconds
    BufferedReader backed with BufferedInputStream ratio 0.50 buffsize 65536 bytes: 0.83 seconds <--
    BufferedReader backed with BufferedInputStream ratio 0.60 buffsize 65536 bytes: 0.91 seconds
    BufferedReader backed with BufferedInputStream ratio 0.70 buffsize 65536 bytes: 0.92 seconds
    BufferedReader backed with BufferedInputStream ratio 0.80 buffsize 65536 bytes: 0.96 seconds
    BufferedReader backed with BufferedInputStream ratio 0.90 buffsize 65536 bytes: 0.92 seconds

    There is something strange here. UTF-16 should be faster to convert to
    Strings, and should at worst take twice as long for the physical I/O.
     
    Roedy Green, May 17, 2014
    #19
  20. Roedy Green

    Roedy Green Guest

    UTF-16 ends up reading twice as many bytes. However, the magic 50%
    ratio is still optimal.


    Using a random sample data file of 209,715,200 chars, 419,430,402 bytes.
    Using aggregate buffersize of 65,536 bytes.
    Using charset UTF-16

    end read using HunkIO: 2.65 seconds
    end read using BufferedReader backed with BufferedInputStream ratio 0.10 buffsize 65536 bytes: 2.70 seconds
    end read using BufferedReader backed with BufferedInputStream ratio 0.20 buffsize 65536 bytes: 2.68 seconds
    end read using BufferedReader backed with BufferedInputStream ratio 0.30 buffsize 65536 bytes: 2.67 seconds
    end read using BufferedReader backed with BufferedInputStream ratio 0.40 buffsize 65536 bytes: 2.68 seconds
    end read using BufferedReader backed with BufferedInputStream ratio 0.50 buffsize 65536 bytes: 2.61 seconds <<<
    end read using BufferedReader backed with BufferedInputStream ratio 0.60 buffsize 65536 bytes: 2.72 seconds
    end read using BufferedReader backed with BufferedInputStream ratio 0.70 buffsize 65536 bytes: 2.73 seconds
    end read using BufferedReader backed with BufferedInputStream ratio 0.80 buffsize 65536 bytes: 2.78 seconds
    end read using BufferedReader backed with BufferedInputStream ratio 0.90 buffsize 65536 bytes: 2.72 seconds
     
    Roedy Green, May 18, 2014
    #20
