Byte Stream Vs Char Stream Buffer

R

Roedy Green

When you do your lego thing to create a BufferedReader or
BufferedWriter, you get to specify two buffer sizes, one in bytes for
the byte stream and one in chars for the character stream.

Given that you have N bytes to spend on buffers for this file, how
should you allocate them?

I have been told the best ratio is 32K bytes for every 24K chars. I
can't see why that would be. I would have thought most of it should go
to the byte stream for physical I/O. Has anyone ever experimented?
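
For reference, the two sizes in question are the two int arguments in a
stack like this (a minimal sketch; the file name and sizes are
placeholders). The outer size is counted in chars, the inner one in bytes.

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class TwoBufferSizes
{
    public static void main( String[] args ) throws IOException
    {
        final int byteBuffSize = 32 * 1024;  // buffer for the byte stream, in bytes
        final int charBuffSize = 16 * 1024;  // buffer for the character stream, in chars
        try ( BufferedReader br = new BufferedReader(
                new InputStreamReader(
                    new BufferedInputStream(
                        new FileInputStream( "sample.txt" ), byteBuffSize ),
                    StandardCharsets.UTF_8 ),
                charBuffSize ) )
        {
            String line;
            while ( ( line = br.readLine() ) != null )
            {
                // process line
            }
        }
    }
}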
 
A

Andreas Leitgeb

Roedy Green said:
When you do your lego thing to create a BufferedReader or
BufferedWriter, you get to specify two buffer sizes, one in bytes for
the byte stream and one in chars for the character stream.

I'm wondering if your lego stack is higher than necessary. Are you
placing BufferedWriter over BufferedOutputStream? I think you can
put the BufferedOutputStream back into the lego bag in that case. ;)
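
To illustrate (a minimal sketch; file names and sizes are placeholders):
the BufferedWriter already collects chars before they reach the
OutputStreamWriter, so the extra BufferedOutputStream underneath mainly
adds another copy step - whether it ever pays off is the thing to measure.

import java.io.BufferedOutputStream;
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterStacks
{
    public static void main( String[] args ) throws IOException
    {
        // Taller stack: chars buffered in the BufferedWriter, bytes buffered again underneath.
        try ( Writer tall = new BufferedWriter(
                new OutputStreamWriter(
                    new BufferedOutputStream( new FileOutputStream( "tall.txt" ), 32 * 1024 ),
                    StandardCharsets.UTF_8 ),
                16 * 1024 ) )
        {
            tall.write( "hello" );
        }

        // Shorter stack: the BufferedWriter alone does the buffering.
        try ( Writer flat = new BufferedWriter(
                new OutputStreamWriter( new FileOutputStream( "flat.txt" ), StandardCharsets.UTF_8 ),
                16 * 1024 ) )
        {
            flat.write( "hello" );
        }
    }
}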
 
R

Roedy Green

I'm wondering if your lego stack is higher than necessary. Are you
placing BufferedWriter over BufferedOutputStream? I think you can
put the BufferedOutputStream back into the lego bag in that case. ;)

I think some experiments are in order.
 
R

Roedy Green

I think some experiments are in order.

Here are the results, using a file of random characters of 400 MB and
buffers of 32K bytes:

HunkIO (does an all-at-once read): 1.20 seconds
BufferedReader backed with BufferedInputStream, ratio 0.50, buffsize 32768 bytes: 1.65 seconds
plain BufferedReader, buffsize 32768 bytes: 1.76 seconds

The bottom line: either use HunkIO when you have the RAM, or give 50%
of your buffer space to the InputStream and 50% to the Reader. In other
words, 100K of buffer space would mean a 50K InputStream buffer and a
25K buffer for the Reader, because the Reader's buffer is measured in
chars. I tried ratios from .1 to .9; 50% was best.

There is a second problem with plain buffered readers:

final BufferedReader ibr =
    new BufferedReader( new FileReader( sampleDataFile ), buffsize / 2 );

There is no place to put the encoding.


The "lego" (encapsulated) is what you should use for big files:

double ratio = .5;

final FileInputStream fis = new FileInputStream( sampleDataFile );
final BufferedInputStream bis =
    new BufferedInputStream( fis, ( int ) ( buffsize * ( 1 - ratio ) ) );  /* buffsize in bytes */
final InputStreamReader isr = new InputStreamReader( bis, UTF8Charset );
final BufferedReader ibr =
    new BufferedReader( isr, ( int ) ( buffsize * ratio / 2 ) /* in chars */ );

I have an SSD. I will retest this on a hard drive.
 
R

Robert Klemme

Here are the results, using a file of random characters of 400 MB and
buffers of 32K bytes:

HunkIO (does an all-at-once read): 1.20 seconds
BufferedReader backed with BufferedInputStream, ratio 0.50, buffsize 32768 bytes: 1.65 seconds
plain BufferedReader, buffsize 32768 bytes: 1.76 seconds

I find this a fairly unspecific description of the test you did. We do
not know what your test did to ensure file system caching or GC effects
did not influence the measurement etc.
The bottom line: either use HunkIO when you have the RAM, or give 50%
of your buffer space to the InputStream and 50% to the Reader. In other
words, 100K of buffer space would mean a 50K InputStream buffer and a
25K buffer for the Reader, because the Reader's buffer is measured in
chars. I tried ratios from .1 to .9; 50% was best.

If I'm not mistaken, you did not test cases where there is just a
BufferedReader OR a BufferedInputStream - which is what Andreas was
hinting at. I don't think duplicate buffering will make things better.
I had expected at least three tests (double buffering, byte buffering,
char buffering), plus the code.

Cheers

robert
 
R

Roedy Green

I find this a fairly unspecific description of the test you did. We do
not know what your test did to ensure file system caching or GC effects
did not influence the measurement etc.

The code is posted at
http://mindprod.com/applet/fileio.html#BUFFERSIZE

These are not particularly subtle effects. I ran five trials and took
the best time for each. I figured this would eliminate hotspot
initialisation and GC effects.
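
The measurement discipline boils down to something like this (a minimal
sketch; BestOfN and bestSeconds are illustrative names, not the code on
the page above). Taking the minimum rather than the mean discards the
warm-up runs and GC hiccups entirely.

import java.util.concurrent.Callable;

public class BestOfN
{
    // Run a task several times and keep the best wall-clock time,
    // which discounts HotSpot warm-up and stray GC pauses.
    public static double bestSeconds( Callable<?> task, int trials ) throws Exception
    {
        long best = Long.MAX_VALUE;
        for ( int i = 0; i < trials; i++ )
        {
            long start = System.nanoTime();
            task.call();
            long elapsed = System.nanoTime() - start;
            best = Math.min( best, elapsed );
        }
        return best / 1_000_000_000.0;
    }
}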
 
R

Roedy Green

I don't think duplicate buffering will
make things better.

It does. And allocating your space exactly 50:50 makes a noticeable
difference.

The plain BufferedReader was done like this:

final BufferedReader ibr =
    new BufferedReader( new FileReader( sampleDataFile ), buffsize / 2 );

It gets whatever the default buffering on the internal InputStream is.
That is not as fast as specifying a buffer for the internal InputStream
yourself, lego-style.
 
R

Robert Klemme


I followed a slightly different approach and modularized my test a bit
more. You can find it here:

https://gist.github.com/rklemme/fad399b5e7cc4d3b6d0c

File opening and closing is not measured. You can assume the data comes
purely from memory (much more RAM than file size, and no disk activity
shown). This was on Xubuntu 14.04 64 bit, kernel version 3.13.0-24.
No special JVM command line switches were used.
These are not particularly subtle effects. I ran five trials and took
the best time for each. I figured this would eliminate hotspot
initialisation and GC effects.

Sounds good. I followed the same approach (i.e. repeating and using the
min time value).

My results:

- The higher the level at which you buffer, the more effective it is.
- Buffering at the byte level often hurts rather than helps.
- Reading in 1k chunks from the application level is already close to
optimal (see the sketch below).
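
"Reading in 1k chunks from the application level" means something like
the following minimal sketch (the file name is a placeholder; the char[]
passed to read() is the only buffer in play):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ChunkRead
{
    public static void main( String[] args ) throws IOException
    {
        char[] chunk = new char[ 1024 ];   // the 1 KB application-level buffer
        long chars = 0;
        try ( Reader r = new InputStreamReader(
                new FileInputStream( "sample.txt" ), StandardCharsets.UTF_8 ) )
        {
            int n;
            while ( ( n = r.read( chunk ) ) != -1 )
            {
                chars += n;   // stand-in for real processing of chunk[0..n)
            }
        }
        System.out.println( chars + " chars read" );
    }
}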

Kind regards

robert
 
M

markspace

- Reading in 1k chunks from the application level is already close to
optimal.


Good test suite here all around, Robert. I just wanted to address this
one point quickly. I'm not sure if this is new information or not, but
in actual use I've found that it helps to allocate large buffers when
you're reading a large text file that must be processed quickly.

Sample:

ByteBuffer bb = ByteBuffer.allocateDirect( 64 * 1024 );
RandomAccessFile f = new RandomAccessFile( "blet", "rw" );
FileChannel fc = f.getChannel();
readToCapacity( fc, bb );
bb.flip();
ByteBuffer b2 = bb.slice();
b2.asReadOnlyBuffer();

Allocating a direct byte buffer of 64k seems to help text processing
speed by about an order of magnitude. Files that take minutes to
process go to seconds when large direct byte buffers are used. It seems
important to get that first big chunk of bytes in before doing anything
else (like converting to chars).

The input file for this was about 600k, so if you're reading much
smaller files the difference might not show up.

This is just a sample, I don't seem to be able to find the original, so
I hope it's accurate. Here's the static method used by the above code:

public static void readToCapacity( FileChannel fc, ByteBuffer buf )
    throws IOException
{
    final int capacity = buf.capacity();
    int totalRead = 0;
    int bytesRead = 0;
    while( ( bytesRead = fc.read( buf ) ) != -1 ) {
        totalRead += bytesRead;
        if( totalRead == capacity ) break;
    }
}
 
M

markspace

I think this might be the original code that I used to speed up my
project. It's been a while so I don't know exactly how well this is
working right now.

You can wrap this in whatever you'd like -- BufferedReader for example --
since I just make a plain InputStream here. I also notice that there's
not even a close method on this class, so obviously I didn't complete it.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

/**
 * Copyright 2013 Brenden Towey DONATED TO THE PUBLIC DOMAIN.
 *
 * PROVIDED AS-IS WITH NO WARRANTY. FURTHER TESTING IS
 * REQUIRED TO COMPLETE THIS CODE. CAVEAT EMPTOR.
 *
 * @author Brenden Towey
 */
public class FileByteBufferInputStream extends InputStream {

    // 64K direct buffer refilled straight from the FileChannel
    private ByteBuffer buffer = ByteBuffer.allocateDirect( 64 * 1024 );
    private FileInputStream file;
    int read;             // count of unread bytes left in the buffer
    boolean eof = false;

    public FileByteBufferInputStream( FileInputStream file ) throws IOException
    {
        this.file = file;
        read = file.getChannel().read( buffer );
        if( read == -1 ) eof = true;
        buffer.rewind();
    }

    @Override
    public int read() throws IOException
    {
        if( eof ) return -1;
        if( read-- == 0 )
        {
            // buffer exhausted: refill it from the channel
            buffer.clear();
            read = file.getChannel().read( buffer );
            if( read == -1 ) eof = true;
            buffer.rewind();
            if( eof ) {
                return -1;
            } else {
                int retval = buffer.get() & 0xFF;   // mask to 0..255 as the InputStream contract requires
                assert( read > 0 );
                read--;
                return retval;
            }
        } else {
            return buffer.get() & 0xFF;             // mask to 0..255 as the InputStream contract requires
        }
    }
}
 
R

Robert Klemme

Good test suite here all around, Robert.

Thank you!
I just wanted to address this one
point quickly. I'm not sure if this is new information or not, but in
actual use I've found that it helps to allocate large buffers when
you're reading a large text file that must be processed quickly.

Sample:

ByteBuffer bb = ByteBuffer.allocateDirect( 64 * 1024 );
RandomAccessFile f = new RandomAccessFile( "blet", "rw" );
FileChannel fc = f.getChannel();
readToCapacity( fc, bb );
bb.flip();
ByteBuffer b2 = bb.slice();
b2.asReadOnlyBuffer();

Allocating a direct byte buffer of 64k seems to help text processing
speed by about an order of magnitude. Files that take minutes to
process go to seconds when large direct byte buffers are used. It seems
important to get that first big chunk of bytes in before doing anything
else (like converting to chars).

I seem to recall having done similar tests in the past but found that
overall the benefit for processing speed was negligible because at some
point the data has to be moved into the Java heap. But my memory is
faint here and the situation may have changed in the meantime.
The input file for this was about 600k, so if you're reading much
smaller files the difference might not show up.

This is just a sample, I don't seem to be able to find the original, so
I hope it's accurate. Here's the static method used by the above code:

public static void readToCapacity( FileChannel fc, ByteBuffer buf )
    throws IOException
{
    final int capacity = buf.capacity();
    int totalRead = 0;
    int bytesRead = 0;
    while( ( bytesRead = fc.read( buf ) ) != -1 ) {
        totalRead += bytesRead;
        if( totalRead == capacity ) break;
    }
}

The code shown so far has some differences from the tests Roedy and I
were doing:

1. You are never transferring those bytes to Java land (i.e. into a byte
or byte[] which is allocated on the heap) - data stays in native land.

2. You are not reading chars and hence also do not do the character
decoding.

You likely did some processing but we cannot see it. But I agree,
another test with nio would be in order to show whether there is more
potential for speedup.
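
To make point 2 concrete, here is roughly what character decoding from a
direct ByteBuffer looks like when done by hand (a minimal sketch; a real
stream would loop, refill on underflow and carry split multi-byte
sequences over - that is the cost the byte-only test never pays):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeSketch
{
    // Decode whatever bytes are currently readable in bb into chars.
    static CharBuffer decode( ByteBuffer bb )
    {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        // UTF-8 never produces more chars than bytes, so this capacity suffices.
        CharBuffer cb = CharBuffer.allocate( bb.remaining() );
        dec.decode( bb, cb, true );   // true: treat this as the end of the input
        dec.flush( cb );
        cb.flip();
        return cb;
    }
}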

Btw, a memory-mapped file may also be an alternative; this can be done
with nio. It does have some downsides (e.g. different behavior on
different OSes in my experience, and address-space limits) and it proved
not to be faster than plain sequential reading with InputStreams in an
application I have been working on.
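
For completeness, the memory-mapped variant would look roughly like this
(a minimal sketch; the path is a placeholder, and a file larger than
2 GB would have to be mapped in pieces):

import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedReadSketch
{
    public static void main( String[] args ) throws IOException
    {
        try ( FileChannel fc = FileChannel.open( Paths.get( "sample.txt" ),
                                                 StandardOpenOption.READ ) )
        {
            // Map the whole file into the address space; the OS pages it in on demand.
            MappedByteBuffer map = fc.map( FileChannel.MapMode.READ_ONLY, 0, fc.size() );
            CharBuffer chars = StandardCharsets.UTF_8.decode( map );
            System.out.println( chars.length() + " chars mapped and decoded" );
        }
    }
}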

Kind regards

robert
 
M

markspace

1. You are never transferring those bytes to Java land (i.e. into a byte
or byte[] which is allocated on the heap) - data stays in native land.

2. You are not reading chars and hence also do not do the character
decoding.


Yes, I knew the data was "mostly ascii" and therefore I didn't have to
do character decoding. An efficient UTF-8 converter shouldn't be much
more complicated, however.

I appear to be counting words in the file; I'm not sure why at this
point. Some more found code:

FileInputStream fins = new FileInputStream( path.toFile() );
FileByteBufferInputStream fbbins = new FileByteBufferInputStream( fins );

int charRead;
HashedHistogram histogram = new HashedHistogram();
charRead = fbbins.read();
StringBuilder sb = new StringBuilder();
while( charRead != -1 )
{
    if( charRead < 128 && !Character.isWhitespace( charRead ) ) {
        // still inside a word: accumulate the character
        sb.append( (char) charRead );
        charRead = fbbins.read();
    } else {
        // word boundary: record the word, then skip whitespace and non-ASCII bytes
        histogram.add( sb.toString() );
        sb.delete( 0, sb.length() );
        while( ( Character.isWhitespace( charRead = fbbins.read() )
                 || charRead >= 128 ) && charRead != -1 )
        {
            // nothing
        }
    }
}
System.out.println( histogram.size() + " words" );
Entry<Comparable,Integer>[] entries = histogram.getSortedEntries();
System.out.println( "Bottom words:" );
for( int i = 0; i < 20; i++ )
    System.out.println( entries[ i ].getKey() + ", " + entries[ i ].getValue() );
System.out.println( "Top words:" );
for( int i = entries.length - 1; i > entries.length - 41; i-- )
    System.out.println( entries[ i ].getKey() + ", " + entries[ i ].getValue() );


Kind of ugly, but that's what I have.
 
R

Roedy Green

Instead the BufferedInputStream is getting part of a 'buffer space
allocation', which is set to 32K. This part varies from 10% to 90% of the
32K while the BufferedReader always gets 16K.


private static void withBufferedReader( double ratio, int buffsize ) throws IOException
{
    assert .1 <= ratio && ratio <= .9 : "ratio must be in range [0.1,0.9].";
    long prevTime = System.nanoTime();
    final FileInputStream fis = new FileInputStream( sampleDataFile );
    final BufferedInputStream bis =
        new BufferedInputStream( fis, ( int ) ( buffsize * ( 1 - ratio ) ) );  /* buffsize in bytes */
    final InputStreamReader isr = new InputStreamReader( bis, EIO.UTF8Charset );
    final BufferedReader ibr =
        new BufferedReader( isr, ( int ) ( buffsize * ratio / 2 ) /* in chars */ );


So for example, if ratio = .5 and buffsize = 32K, then the stream will
get 16K bytes and the reader will get 8K chars.

This is the optimum ratio.
 
S

Silvio

Instead the BufferedInputStream is getting part of a 'buffer space
allocation', which is set to 32K. This part varies from 10% to 90% of the
32K while the BufferedReader always gets 16K.


private static void withBufferedReader( double ratio, int buffsize ) throws IOException
{
    assert .1 <= ratio && ratio <= .9 : "ratio must be in range [0.1,0.9].";
    long prevTime = System.nanoTime();
    final FileInputStream fis = new FileInputStream( sampleDataFile );
    final BufferedInputStream bis =
        new BufferedInputStream( fis, ( int ) ( buffsize * ( 1 - ratio ) ) );  /* buffsize in bytes */
    final InputStreamReader isr = new InputStreamReader( bis, EIO.UTF8Charset );
    final BufferedReader ibr =
        new BufferedReader( isr, ( int ) ( buffsize * ratio / 2 ) /* in chars */ );


So for example, if ratio = .5 and buffsize = 32K, then the stream will
get 16K bytes and the reader will get 8K chars.

This is the optimum ratio.

Roedy,

It would be interesting to repeat this experiment with different
character encodings and see if using UTF-16 versus UTF-8 makes a
difference here.

Silvio
 
R

Roedy Green

- The higher the level at which you buffer, the more effective it is.

You are doing your own buffering, not using BufferedReader, right?

In that case, when you read into a char[] it might do one big physical
I/O, no?

In the days of 16K computers, when you did this sort of thing, the OS
would schedule a read into a second buffer while you were processing the
first. Oddly, now that we have RAM to burn, we no longer do that, though
I suppose the OS might do some speculative read-ahead.
 
R

Roedy Green

ByteBuffer bb = ByteBuffer.allocateDirect( 64 * 1024 );
RandomAccessFile f = new RandomAccessFile( "blet", "rw" );
FileChannel fc = f.getChannel();
readToCapacity( fc, bb );
bb.flip();
ByteBuffer b2 = bb.slice();
b2.asReadOnlyBuffer();

I would like to use this as an example of nio. I wonder why each step is
necessary.


// sample use of NIO to read a text file very rapidly
// by Mark Space

// alloc buffer suited for direct physical I/O
ByteBuffer bb = ByteBuffer.allocateDirect( 64 * 1024 );

RandomAccessFile f = new RandomAccessFile( "blet", "rw" );

FileChannel fc = f.getChannel();

// A user method to consume one chunk of data
// It would have to convert bytes to chars, fretting over boundaries.
readToCapacity( fc, bb );

// request the next chunk of data
bb.flip();

// ?
ByteBuffer b2 = bb.slice();

// creates readonly version and discards
b2.asReadOnlyBuffer();
 
M

markspace

// ?
ByteBuffer b2 = bb.slice();

Yes, this bit of code was an amalgam of other bits that I had saved to
remind myself how to perform some common operations on ByteBuffers. The
longer sequence went something like:

readToCapacity( fc, bb );
bb.flip();
ByteBuffer b2 = bb.slice();
b2.limit( 100 );
b2.asReadOnlyBuffer();
readHeader( b2 );
readFile( bb );

And it was just a way to be able to isolate the (100 byte) header from
the rest of the stream, as well as be able to retain all original bytes
without a potentially expensive seek operation. Mostly just as a
reminder how to slice up a ByteBuffer into smaller buffers so they can
be read independently.
// request the next chunk of data
bb.flip();

BTW, I'd make that comment "reset limit and position for reading" rather
than "next chunk" (the next chunk would probably need clear(), because if
the buffer isn't full you'd run into problems with the limit using flip()).

Flip() is used to switch from put()s to get()s. See the docs:

http://docs.oracle.com/javase/7/docs/api/java/nio/Buffer.html#flip()
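
The whole put/flip/get/clear cycle in miniature (values are arbitrary):

import java.nio.ByteBuffer;

public class FlipDemo
{
    public static void main( String[] args )
    {
        ByteBuffer bb = ByteBuffer.allocate( 8 );
        bb.put( (byte) 1 ).put( (byte) 2 ).put( (byte) 3 ); // filling: position = 3, limit = 8
        bb.flip();                                          // reading: position = 0, limit = 3
        while ( bb.hasRemaining() )
        {
            System.out.println( bb.get() );
        }
        bb.clear();                                         // back to filling: position = 0, limit = 8
    }
}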
 
R

Roedy Green

It would be interesting to repeat this experiment with different
character encodings and see if using UTF-16 versus UTF-8 makes a
difference here.

With UTF-16 there are twice as many bytes to read, but the 50% magic
ratio still prevails.

Best of five trials shown.

Using a random sample data file of 209,715,200 chars (419,430,402 bytes).
Using aggregate buffer size of 65,536 bytes.
Using charset UTF-16

BufferedReader backed with BufferedInputStream ratio 0.10 buffsize 65536 bytes: 2.64 seconds
BufferedReader backed with BufferedInputStream ratio 0.20 buffsize 65536 bytes: 2.63 seconds
BufferedReader backed with BufferedInputStream ratio 0.30 buffsize 65536 bytes: 2.64 seconds
BufferedReader backed with BufferedInputStream ratio 0.40 buffsize 65536 bytes: 2.64 seconds
BufferedReader backed with BufferedInputStream ratio 0.50 buffsize 65536 bytes: 2.64 seconds <--
BufferedReader backed with BufferedInputStream ratio 0.60 buffsize 65536 bytes: 2.68 seconds
BufferedReader backed with BufferedInputStream ratio 0.70 buffsize 65536 bytes: 2.71 seconds
BufferedReader backed with BufferedInputStream ratio 0.80 buffsize 65536 bytes: 2.79 seconds
BufferedReader backed with BufferedInputStream ratio 0.90 buffsize 65536 bytes: 2.70 seconds
HunkIO: 2.70 seconds


Using a random sample data file of 209,715,200 chars (209,715,200 bytes).
Using aggregate buffer size of 65,536 bytes.
Using charset UTF-8

HunkIO: 0.73 seconds
BufferedReader backed with BufferedInputStream ratio 0.10 buffsize 65536 bytes: 0.89 seconds
BufferedReader backed with BufferedInputStream ratio 0.20 buffsize 65536 bytes: 0.88 seconds
BufferedReader backed with BufferedInputStream ratio 0.30 buffsize 65536 bytes: 0.88 seconds
BufferedReader backed with BufferedInputStream ratio 0.40 buffsize 65536 bytes: 0.89 seconds
BufferedReader backed with BufferedInputStream ratio 0.50 buffsize 65536 bytes: 0.83 seconds <--
BufferedReader backed with BufferedInputStream ratio 0.60 buffsize 65536 bytes: 0.91 seconds
BufferedReader backed with BufferedInputStream ratio 0.70 buffsize 65536 bytes: 0.92 seconds
BufferedReader backed with BufferedInputStream ratio 0.80 buffsize 65536 bytes: 0.96 seconds
BufferedReader backed with BufferedInputStream ratio 0.90 buffsize 65536 bytes: 0.92 seconds

There is something strange here: UTF-16 should be faster to convert to
Strings, and should at worst take twice as long for the physical I/O.
 
R

Roedy Green

It would be interesting to repeat this experiment with different
character encodings and see if using UTF-16 versus UTF-8 makes a
difference here.

UTF-16 ends up reading twice as many bytes. However the magic 50%
ratio is still optimal.


Using a random sample data file of 209,715,200 chars (419,430,402 bytes).
Using aggregate buffer size of 65,536 bytes.
Using charset UTF-16

end read using HunkIO: 2.65 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.10 buffsize 65536 bytes: 2.70 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.20 buffsize 65536 bytes: 2.68 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.30 buffsize 65536 bytes: 2.67 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.40 buffsize 65536 bytes: 2.68 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.50 buffsize 65536 bytes: 2.61 seconds <<<
end read using BufferedReader backed with BufferedInputStream ratio 0.60 buffsize 65536 bytes: 2.72 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.70 buffsize 65536 bytes: 2.73 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.80 buffsize 65536 bytes: 2.78 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.90 buffsize 65536 bytes: 2.72 seconds
 
