Byte Stream Vs Char Stream Buffer

R

Roedy Green

When you do your lego thing to create a BufferedReader or
BufferedWriter, you get to specify two buffer sizes, one in bytes for
the byte stream and one in chars for the character stream.

Given that you have N bytes to spend on buffers for this file, how
should you allocate them?

I have been told the best ratio is 32K bytes for every 24K chars. I
can't see why that would be. I would have thought most of it should go
to the byte stream for physical I/O. Has anyone ever experimented?
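
For reference, the two sizes in question are the two int arguments in a
stack like this (a minimal sketch; the file name and sizes are
placeholders). The outer size is counted in chars, the inner one in bytes.

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class TwoBufferSizes
{
    public static void main( String[] args ) throws IOException
    {
        final int byteBuffSize = 32 * 1024;  // buffer for the byte stream, in bytes
        final int charBuffSize = 16 * 1024;  // buffer for the character stream, in chars
        try ( BufferedReader br = new BufferedReader(
                new InputStreamReader(
                    new BufferedInputStream(
                        new FileInputStream( "sample.txt" ), byteBuffSize ),
                    StandardCharsets.UTF_8 ),
                charBuffSize ) )
        {
            String line;
            while ( ( line = br.readLine() ) != null )
            {
                // process line
            }
        }
    }
}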
 
A

Andreas Leitgeb

Roedy Green said:
When you do your lego thing to create a BufferedReader or
BufferedWriter, you get to specify two buffer sizes, one in bytes for
the byte stream and one in chars for the character stream.

I'm wondering if your lego stack is higher than necessary. Are you
placing BufferedWriter over BufferedOutputStream? I think you can
put the BufferedOutputStream back into the lego bag in that case. ;)
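
To illustrate (a minimal sketch; file names and sizes are placeholders):
the BufferedWriter already collects chars before they reach the
OutputStreamWriter, so the extra BufferedOutputStream underneath mainly
adds another copy step - whether it ever pays off is the thing to measure.

import java.io.BufferedOutputStream;
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterStacks
{
    public static void main( String[] args ) throws IOException
    {
        // Taller stack: chars buffered in the BufferedWriter, bytes buffered again underneath.
        try ( Writer tall = new BufferedWriter(
                new OutputStreamWriter(
                    new BufferedOutputStream( new FileOutputStream( "tall.txt" ), 32 * 1024 ),
                    StandardCharsets.UTF_8 ),
                16 * 1024 ) )
        {
            tall.write( "hello" );
        }

        // Shorter stack: the BufferedWriter alone does the buffering.
        try ( Writer flat = new BufferedWriter(
                new OutputStreamWriter( new FileOutputStream( "flat.txt" ), StandardCharsets.UTF_8 ),
                16 * 1024 ) )
        {
            flat.write( "hello" );
        }
    }
}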
 
R

Roedy Green

I'm wondering if your lego stack is higher than necessary. Are you
placing BufferedWriter over BufferedOutputStream? I think you can
put the BufferedOutputStream back into the lego bag in that case. ;)

I think some experiments are in order.
 
R

Roedy Green

I think some experiments are in order.

Here are the results, using a file of random characters of 400 MB and
buffers of 32K bytes:

HunkIO (does an all-at-once read): 1.20 seconds
BufferedReader backed with BufferedInputStream, ratio 0.50, buffsize 32768 bytes: 1.65 seconds
plain BufferedReader, buffsize 32768 bytes: 1.76 seconds

The bottom line: either use HunkIO when you have the RAM, or give 50%
of your buffer space to the InputStream and 50% to the Reader. In other
words, 100K of buffer space would mean a 50K InputStream buffer and a
25K buffer for the Reader, because the Reader's buffer is measured in
chars. I tried ratios from .1 to .9; 50% was best.

There is a second problem with plain buffered readers:

final BufferedReader ibr =
    new BufferedReader( new FileReader( sampleDataFile ), buffsize / 2 );

There is no place to put the encoding.


The "lego" (encapsulated) is what you should use for big files:

double ratio = .5;

final FileInputStream fis = new FileInputStream( sampleDataFile );
final BufferedInputStream bis =
    new BufferedInputStream( fis, ( int ) ( buffsize * ( 1 - ratio ) ) );  /* buffsize in bytes */
final InputStreamReader isr = new InputStreamReader( bis, UTF8Charset );
final BufferedReader ibr =
    new BufferedReader( isr, ( int ) ( buffsize * ratio / 2 ) /* in chars */ );

I have an SSD. I will retest this on a hard drive.
 
R

Robert Klemme

Here are the results, using a file of random characters of 400 MB and
buffers of 32K bytes:

HunkIO (does an all-at-once read): 1.20 seconds
BufferedReader backed with BufferedInputStream, ratio 0.50, buffsize 32768 bytes: 1.65 seconds
plain BufferedReader, buffsize 32768 bytes: 1.76 seconds

I find this a fairly unspecific description of the test you did. We do
not know what your test did to ensure file system caching or GC effects
did not influence the measurement etc.
The bottom line: either use HunkIO when you have the RAM, or give 50%
of your buffer space to the InputStream and 50% to the Reader. In other
words, 100K of buffer space would mean a 50K InputStream buffer and a
25K buffer for the Reader, because the Reader's buffer is measured in
chars. I tried ratios from .1 to .9; 50% was best.

If I'm not mistaken, you did not test cases where there is just a
BufferedReader OR a BufferedInputStream - which is what Andreas was
hinting at. I don't think duplicate buffering will make things better.
I had expected at least three tests (double buffering, byte buffering,
char buffering), plus the code.

Cheers

robert
 
R

Roedy Green

I find this a fairly unspecific description of the test you did. We do
not know what your test did to ensure file system caching or GC effects
did not influence the measurement etc.

The code is posted at
http://mindprod.com/applet/fileio.html#BUFFERSIZE

These are not particularly subtle effects. I ran five trials and took
the best time for each. I figured this would eliminate hotspot
initialisation and GC effects.
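
The measurement discipline boils down to something like this (a minimal
sketch; BestOfN and bestSeconds are illustrative names, not the code on
the page above). Taking the minimum rather than the mean discards the
warm-up runs and GC hiccups entirely.

import java.util.concurrent.Callable;

public class BestOfN
{
    // Run a task several times and keep the best wall-clock time,
    // which discounts HotSpot warm-up and stray GC pauses.
    public static double bestSeconds( Callable<?> task, int trials ) throws Exception
    {
        long best = Long.MAX_VALUE;
        for ( int i = 0; i < trials; i++ )
        {
            long start = System.nanoTime();
            task.call();
            long elapsed = System.nanoTime() - start;
            best = Math.min( best, elapsed );
        }
        return best / 1_000_000_000.0;
    }
}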
 
R

Roedy Green

I don't think duplicate buffering will
make things better.

It does. And allocating your space exactly 50:50 makes a noticeable
difference.

The plain BufferedReader was done like this:

final BufferedReader ibr =
    new BufferedReader( new FileReader( sampleDataFile ), buffsize / 2 );

It gets whatever the default buffering on the internal InputStream is.
That is not as fast as specifying a buffer for the internal InputStream
yourself, lego-style.
 
R

Robert Klemme


I followed a slightly different approach and modularized my test a bit
more. You can find it here:

https://gist.github.com/rklemme/fad399b5e7cc4d3b6d0c

File opening and closing is not measured. You can assume the data comes
purely from memory (much more RAM than file size, and no disk activity
shown). This was on Xubuntu 14.04 64 bit, kernel version 3.13.0-24.
No special JVM command line switches were used.
These are not particularly subtle effects. I ran five trials and took
the best time for each. I figured this would eliminate hotspot
initialisation and GC effects.

Sounds good. I followed the same approach (i.e. repeating and using the
min time value).

My results:

- The higher the level at which you buffer, the more effective it is.
- Buffering at the byte level often hurts rather than helps.
- Reading in 1k chunks from the application level is already close to
optimal (see the sketch below).
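
"Reading in 1k chunks from the application level" means something like
the following minimal sketch (the file name is a placeholder; the char[]
passed to read() is the only buffer in play):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ChunkRead
{
    public static void main( String[] args ) throws IOException
    {
        char[] chunk = new char[ 1024 ];   // the 1 KB application-level buffer
        long chars = 0;
        try ( Reader r = new InputStreamReader(
                new FileInputStream( "sample.txt" ), StandardCharsets.UTF_8 ) )
        {
            int n;
            while ( ( n = r.read( chunk ) ) != -1 )
            {
                chars += n;   // stand-in for real processing of chunk[0..n)
            }
        }
        System.out.println( chars + " chars read" );
    }
}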

Kind regards

robert
 
M

markspace

- Reading in 1k chunks from the application level is already close to
optimal.


Good test suite here all around, Robert. I just wanted to address this
one point quickly. I'm not sure if this is new information or not, but
in actual use I've found that it helps to allocate large buffers when
you're reading a large text file that must be processed quickly.

Sample:

ByteBuffer bb = ByteBuffer.allocateDirect( 64 * 1024 );
RandomAccessFile f = new RandomAccessFile( "blet", "rw" );
FileChannel fc = f.getChannel();
readToCapacity( fc, bb );
bb.flip();
ByteBuffer b2 = bb.slice();
b2.asReadOnlyBuffer();

Allocating a direct byte buffer of 64k seems to help text processing
speed by about an order of magnitude. Files that take minutes to
process go to seconds when large direct byte buffers are used. It seems
important to get that first big chunk of bytes in before doing anything
else (like converting to chars).

The input file for this was about 600k, so if you're reading much
smaller files the difference might not show up.

This is just a sample, I don't seem to be able to find the original, so
I hope it's accurate. Here's the static method used by the above code:

public static void readToCapacity( FileChannel fc, ByteBuffer buf )
    throws IOException
{
    final int capacity = buf.capacity();
    int totalRead = 0;
    int bytesRead = 0;
    while( ( bytesRead = fc.read( buf ) ) != -1 ) {
        totalRead += bytesRead;
        if( totalRead == capacity ) break;
    }
}
 
M

markspace

I think this might be the original code that I used to speed up my
project. It's been a while so I don't know exactly how well this is
working right now.

You can wrap this in whatever you'd like -- BufferedReader for example --
since I just make a plain InputStream here. I also notice that there's
not even a close method on this class, so obviously I didn't complete it.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

/**
 * Copyright 2013 Brenden Towey DONATED TO THE PUBLIC DOMAIN.
 *
 * PROVIDED AS-IS WITH NO WARRANTY. FURTHER TESTING IS
 * REQUIRED TO COMPLETE THIS CODE. CAVEAT EMPTOR.
 *
 * @author Brenden Towey
 */
public class FileByteBufferInputStream extends InputStream {

    // 64K direct buffer refilled straight from the FileChannel
    private ByteBuffer buffer = ByteBuffer.allocateDirect( 64 * 1024 );
    private FileInputStream file;
    int read;             // count of unread bytes left in the buffer
    boolean eof = false;

    public FileByteBufferInputStream( FileInputStream file ) throws IOException
    {
        this.file = file;
        read = file.getChannel().read( buffer );
        if( read == -1 ) eof = true;
        buffer.rewind();
    }

    @Override
    public int read() throws IOException
    {
        if( eof ) return -1;
        if( read-- == 0 )
        {
            // buffer exhausted: refill it from the channel
            buffer.clear();
            read = file.getChannel().read( buffer );
            if( read == -1 ) eof = true;
            buffer.rewind();
            if( eof ) {
                return -1;
            } else {
                int retval = buffer.get() & 0xFF;   // mask to 0..255 as the InputStream contract requires
                assert( read > 0 );
                read--;
                return retval;
            }
        } else {
            return buffer.get() & 0xFF;             // mask to 0..255 as the InputStream contract requires
        }
    }
}
 
R

Robert Klemme

Good test suite here all around, Robert.

Thank you!
I just wanted to address this one
point quickly. I'm not sure if this is new information or not, but in
actual use I've found that it helps to allocate large buffers when
you're reading a large text file that must be processed quickly.

Sample:

ByteBuffer bb = ByteBuffer.allocateDirect( 64 * 1024 );
RandomAccessFile f = new RandomAccessFile( "blet", "rw" );
FileChannel fc = f.getChannel();
readToCapacity( fc, bb );
bb.flip();
ByteBuffer b2 = bb.slice();
b2.asReadOnlyBuffer();

Allocating a direct byte buffer of 64k seems to help text processing
speed by about an order of magnitude. Files that take minutes to
process go to seconds when large direct byte buffers are used. It seems
important to get that first big chunk of bytes in before doing anything
else (like converting to chars).

I seem to recall having done similar tests in the past but found that
overall the benefit for processing speed was negligible because at some
point the data has to be moved into the Java heap. But my memory is
faint here and the situation may have changed in the meantime.
The input file for this was about 600k, so if you're reading much
smaller files the difference might not show up.

This is just a sample, I don't seem to be able to find the original, so
I hope it's accurate. Here's the static method used by the above code:

public static void readToCapacity( FileChannel fc, ByteBuffer buf )
    throws IOException
{
    final int capacity = buf.capacity();
    int totalRead = 0;
    int bytesRead = 0;
    while( ( bytesRead = fc.read( buf ) ) != -1 ) {
        totalRead += bytesRead;
        if( totalRead == capacity ) break;
    }
}

The code shown so far has some differences from the tests Roedy and I
were doing:

1. You are never transferring those bytes to Java land (i.e. into a byte
or byte[] which is allocated on the heap) - data stays in native land.

2. You are not reading chars and hence also do not do the character
decoding.

You likely did some processing but we cannot see it. But I agree,
another test with nio would be in order to show whether there is more
potential for speedup.
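
To make point 2 concrete, here is roughly what character decoding from a
direct ByteBuffer looks like when done by hand (a minimal sketch; a real
stream would loop, refill on underflow and carry split multi-byte
sequences over - that is the cost the byte-only test never pays):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeSketch
{
    // Decode whatever bytes are currently readable in bb into chars.
    static CharBuffer decode( ByteBuffer bb )
    {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        // UTF-8 never produces more chars than bytes, so this capacity suffices.
        CharBuffer cb = CharBuffer.allocate( bb.remaining() );
        dec.decode( bb, cb, true );   // true: treat this as the end of the input
        dec.flush( cb );
        cb.flip();
        return cb;
    }
}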

Btw, a memory-mapped file may also be an alternative; this can be done
with nio. It does have some downsides (e.g. different behavior on
different OSes in my experience, and address-space limits) and it proved
not to be faster than plain sequential reading with InputStreams in an
application I have been working on.
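
For completeness, the memory-mapped variant would look roughly like this
(a minimal sketch; the path is a placeholder, and a file larger than
2 GB would have to be mapped in pieces):

import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedReadSketch
{
    public static void main( String[] args ) throws IOException
    {
        try ( FileChannel fc = FileChannel.open( Paths.get( "sample.txt" ),
                                                 StandardOpenOption.READ ) )
        {
            // Map the whole file into the address space; the OS pages it in on demand.
            MappedByteBuffer map = fc.map( FileChannel.MapMode.READ_ONLY, 0, fc.size() );
            CharBuffer chars = StandardCharsets.UTF_8.decode( map );
            System.out.println( chars.length() + " chars mapped and decoded" );
        }
    }
}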

Kind regards

robert
 
M

markspace

1. You are never transferring those bytes to Java land (i.e. into a byte
or byte[] which is allocated on the heap) - data stays in native land.

2. You are not reading chars and hence also do not do the character
decoding.


Yes, I knew the data was "mostly ascii" and therefore I didn't have to
do character decoding. An efficient UTF-8 converter shouldn't be much
more complicated, however.

I appear to be counting words in the file; I'm not sure why at this
point. Some more found code:

FileInputStream fins = new FileInputStream( path.toFile() );
FileByteBufferInputStream fbbins = new FileByteBufferInputStream( fins );

int charRead;
HashedHistogram histogram = new HashedHistogram();
charRead = fbbins.read();
StringBuilder sb = new StringBuilder();
while( charRead != -1 )
{
    if( charRead < 128 && !Character.isWhitespace( charRead ) ) {
        // still inside a word: accumulate the character
        sb.append( (char) charRead );
        charRead = fbbins.read();
    } else {
        // word boundary: record the word, then skip whitespace and non-ASCII bytes
        histogram.add( sb.toString() );
        sb.delete( 0, sb.length() );
        while( ( Character.isWhitespace( charRead = fbbins.read() )
                 || charRead >= 128 ) && charRead != -1 )
        {
            // nothing
        }
    }
}
System.out.println( histogram.size() + " words" );
Entry<Comparable,Integer>[] entries = histogram.getSortedEntries();
System.out.println( "Bottom words:" );
for( int i = 0; i < 20; i++ )
    System.out.println( entries[ i ].getKey() + ", " + entries[ i ].getValue() );
System.out.println( "Top words:" );
for( int i = entries.length - 1; i > entries.length - 41; i-- )
    System.out.println( entries[ i ].getKey() + ", " + entries[ i ].getValue() );


Kind of ugly, but that's what I have.
 
R

Roedy Green

Instead the BufferedInputStream is getting part of a 'buffer space
allocation', which is set to 32K. This part varies from 10% to 90% of the
32K while the BufferedReader always gets 16K.


private static void withBufferedReader( double ratio, int buffsize ) throws IOException
{
    assert .1 <= ratio && ratio <= .9 : "ratio must be in range [0.1,0.9].";
    long prevTime = System.nanoTime();
    final FileInputStream fis = new FileInputStream( sampleDataFile );
    final BufferedInputStream bis =
        new BufferedInputStream( fis, ( int ) ( buffsize * ( 1 - ratio ) ) );  /* buffsize in bytes */
    final InputStreamReader isr = new InputStreamReader( bis, EIO.UTF8Charset );
    final BufferedReader ibr =
        new BufferedReader( isr, ( int ) ( buffsize * ratio / 2 ) /* in chars */ );


So for example, if ratio = .5 and buffsize = 32K, then the stream will
get 16K bytes and the reader will get 8K chars.

This is the optimum ratio.
 
S

Silvio

Instead the BufferedInputStream is getting part of a 'buffer space
allocation', which is set to 32K. This part varies from 10% to 90% of the
32K while the BufferedReader always gets 16K.


private static void withBufferedReader( double ratio, int buffsize ) throws IOException
{
    assert .1 <= ratio && ratio <= .9 : "ratio must be in range [0.1,0.9].";
    long prevTime = System.nanoTime();
    final FileInputStream fis = new FileInputStream( sampleDataFile );
    final BufferedInputStream bis =
        new BufferedInputStream( fis, ( int ) ( buffsize * ( 1 - ratio ) ) );  /* buffsize in bytes */
    final InputStreamReader isr = new InputStreamReader( bis, EIO.UTF8Charset );
    final BufferedReader ibr =
        new BufferedReader( isr, ( int ) ( buffsize * ratio / 2 ) /* in chars */ );


So for example, if ratio = .5 and buffsize = 32K, then the stream will
get 16K bytes and the reader will get 8K chars.

This is the optimum ratio.

Roedy,

It would be interesting to repeat this experiment with different
character encodings and see if using UTF-16 versus UTF-8 makes a
difference here.

Silvio
 
R

Roedy Green

- The higher the level at which you buffer, the more effective it is.

You are doing your own buffering, not using BufferedReader, right?

In that case, when you read into a char[] it might do one big physical
I/O, no?

In the days of 16K computers, when you did this sort of thing, the OS
would schedule a read into a second buffer while you were processing the
first. Oddly, now that we have RAM to burn, we no longer do that, though
I suppose the OS might do some speculative read-ahead.
 
R

Roedy Green

ByteBuffer bb = ByteBuffer.allocateDirect( 64 * 1024 );
RandomAccessFile f = new RandomAccessFile( "blet", "rw" );
FileChannel fc = f.getChannel();
readToCapacity( fc, bb );
bb.flip();
ByteBuffer b2 = bb.slice();
b2.asReadOnlyBuffer();

I would like to use this as an example of nio. I wonder why each step is
necessary.


// sample use of NIO to read a text file very rapidly
// by Mark Space

// alloc buffer suited for direct physical I/O
ByteBuffer bb = ByteBuffer.allocateDirect( 64 * 1024 );

RandomAccessFile f = new RandomAccessFile( "blet", "rw" );

FileChannel fc = f.getChannel();

// A user method to consume one chunk of data
// It would have to convert bytes to chars, fretting over boundaries.
readToCapacity( fc, bb );

// request the next chunk of data
bb.flip();

// ?
ByteBuffer b2 = bb.slice();

// creates readonly version and discards
b2.asReadOnlyBuffer();
 
M

markspace

// ?
ByteBuffer b2 = bb.slice();

Yes, this bit of code was an amalgam of other bits that I had saved to
remind myself how to perform some common operations on ByteBuffers. The
longer sequence went something like:

readToCapacity( fc, bb );
bb.flip();
ByteBuffer b2 = bb.slice();
b2.limit( 100 );
b2.asReadOnlyBuffer();
readHeader( b2 );
readFile( bb );

And it was just a way to be able to isolate the (100 byte) header from
the rest of the stream, as well as be able to retain all original bytes
without a potentially expensive seek operation. Mostly just as a
reminder how to slice up a ByteBuffer into smaller buffers so they can
be read independently.
// request the next chunk of data
bb.flip();

BTW, I'd make that comment "reset limit and position for reading" rather
than "next chunk" (the next chunk would probably need clear(), because if
the buffer isn't full you'd run into problems with the limit using flip()).

Flip() is used to switch from put()s to get()s. See the docs:

http://docs.oracle.com/javase/7/docs/api/java/nio/Buffer.html#flip()
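
The whole put/flip/get/clear cycle in miniature (values are arbitrary):

import java.nio.ByteBuffer;

public class FlipDemo
{
    public static void main( String[] args )
    {
        ByteBuffer bb = ByteBuffer.allocate( 8 );
        bb.put( (byte) 1 ).put( (byte) 2 ).put( (byte) 3 ); // filling: position = 3, limit = 8
        bb.flip();                                          // reading: position = 0, limit = 3
        while ( bb.hasRemaining() )
        {
            System.out.println( bb.get() );
        }
        bb.clear();                                         // back to filling: position = 0, limit = 8
    }
}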
 
R

Roedy Green

It would be interesting to repeat this experiment with different
character encodings and see if using UTF-16 versus UTF-8 makes a
difference here.

With UTF-16 there are twice as many bytes to read, but the 50% magic
ratio still prevails.

Best of five trials shown.

Using a random sample data file of 209,715,200 chars (419,430,402 bytes).
Using aggregate buffer size of 65,536 bytes.
Using charset UTF-16

BufferedReader backed with BufferedInputStream ratio 0.10 buffsize 65536 bytes: 2.64 seconds
BufferedReader backed with BufferedInputStream ratio 0.20 buffsize 65536 bytes: 2.63 seconds
BufferedReader backed with BufferedInputStream ratio 0.30 buffsize 65536 bytes: 2.64 seconds
BufferedReader backed with BufferedInputStream ratio 0.40 buffsize 65536 bytes: 2.64 seconds
BufferedReader backed with BufferedInputStream ratio 0.50 buffsize 65536 bytes: 2.64 seconds <--
BufferedReader backed with BufferedInputStream ratio 0.60 buffsize 65536 bytes: 2.68 seconds
BufferedReader backed with BufferedInputStream ratio 0.70 buffsize 65536 bytes: 2.71 seconds
BufferedReader backed with BufferedInputStream ratio 0.80 buffsize 65536 bytes: 2.79 seconds
BufferedReader backed with BufferedInputStream ratio 0.90 buffsize 65536 bytes: 2.70 seconds
HunkIO: 2.70 seconds


Using a random sample data file of 209,715,200 chars (209,715,200 bytes).
Using aggregate buffer size of 65,536 bytes.
Using charset UTF-8

HunkIO: 0.73 seconds
BufferedReader backed with BufferedInputStream ratio 0.10 buffsize 65536 bytes: 0.89 seconds
BufferedReader backed with BufferedInputStream ratio 0.20 buffsize 65536 bytes: 0.88 seconds
BufferedReader backed with BufferedInputStream ratio 0.30 buffsize 65536 bytes: 0.88 seconds
BufferedReader backed with BufferedInputStream ratio 0.40 buffsize 65536 bytes: 0.89 seconds
BufferedReader backed with BufferedInputStream ratio 0.50 buffsize 65536 bytes: 0.83 seconds <--
BufferedReader backed with BufferedInputStream ratio 0.60 buffsize 65536 bytes: 0.91 seconds
BufferedReader backed with BufferedInputStream ratio 0.70 buffsize 65536 bytes: 0.92 seconds
BufferedReader backed with BufferedInputStream ratio 0.80 buffsize 65536 bytes: 0.96 seconds
BufferedReader backed with BufferedInputStream ratio 0.90 buffsize 65536 bytes: 0.92 seconds

There is something strange here: UTF-16 should be faster to convert to
Strings, and should at worst take twice as long for the physical I/O.
 
R

Roedy Green

It would be interesting to repeat this experiment with different
character encodings and see if using UTF-16 versus UTF-8 makes a
difference here.

UTF-16 ends up reading twice as many bytes. However the magic 50%
ratio is still optimal.


Using a random sample data file of 209,715,200 chars (419,430,402 bytes).
Using aggregate buffer size of 65,536 bytes.
Using charset UTF-16

end read using HunkIO: 2.65 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.10 buffsize 65536 bytes: 2.70 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.20 buffsize 65536 bytes: 2.68 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.30 buffsize 65536 bytes: 2.67 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.40 buffsize 65536 bytes: 2.68 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.50 buffsize 65536 bytes: 2.61 seconds <<<
end read using BufferedReader backed with BufferedInputStream ratio 0.60 buffsize 65536 bytes: 2.72 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.70 buffsize 65536 bytes: 2.73 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.80 buffsize 65536 bytes: 2.78 seconds
end read using BufferedReader backed with BufferedInputStream ratio 0.90 buffsize 65536 bytes: 2.72 seconds
 
