Fell Swoop I/O

Roedy Green

Writing or reading a file as one big byte[] in one fell swoop should be
extremely efficient. In theory, the bytes could go straight from your
array to the hard disk controller.

I wonder if that is indeed true for unbuffered files, or whether they
are copied some sub-chunk at a time. Has anyone peeked under the hood,
or done some experiments to deduce what happens from timings?

Encoding, though, even when you have a 1-1 char-to-byte encoding,
requires Java to allocate some sort of transparent intermediate byte
buffer, even for unbuffered Writers. How does Java decide how big to
make it? Does it make it big enough to contain the entire String?

Has anyone peeked under the hood or experimented?

A practical way of asking this question is:

Is it better to write an entire file unbuffered, or with a buffer? If
buffered, what is a reasonable buffer size? Making it too big causes
more frequent GC. Making it too small causes more physical I/Os.
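
One way to start answering that empirically is to time both variants.
Here is a minimal sketch of such a test; the class name, file size and
buffer size are arbitrary choices of mine:

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

public class WriteTimer {
    public static void main(String[] args) throws Exception {
        byte[] data = new byte[8 * 1024 * 1024]; // 8 MB payload, size arbitrary
        File f = File.createTempFile("writetimer", ".tmp");
        f.deleteOnExit();

        // one fell swoop, unbuffered
        long t0 = System.currentTimeMillis();
        OutputStream out = new FileOutputStream(f);
        try {
            out.write(data);
        } finally {
            out.close();
        }
        long unbuffered = System.currentTimeMillis() - t0;

        // the same write funnelled through a 64K BufferedOutputStream
        t0 = System.currentTimeMillis();
        out = new BufferedOutputStream(new FileOutputStream(f), 64 * 1024);
        try {
            out.write(data);
        } finally {
            out.close();
        }
        long buffered = System.currentTimeMillis() - t0;

        System.out.println("unbuffered: " + unbuffered
                + " ms, buffered: " + buffered + " ms");
    }
}

Beware that the OS file cache warms up on the first pass, so run it
several times and alternate the order.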

This is a place I would like tweakable constants: you could write your
code and let a tweaker optimiser, AT THE CLIENT SITE, home in on the
optimum settings for that platform.

see http://mindprod.com/jgloss/tweakable.html
 
NOBODY

Writing or reading a file as one big byte[] in one fell swoop should be
extremely efficient. In theory, the bytes could go straight from your
array to the hard disk controller.

I wonder if that is indeed true for unbuffered files, or whether they
are copied some sub-chunk at a time. Has anyone peeked under the hood,
or done some experiments to deduce what happens from timings?

Encoding, though, even when you have a 1-1 char-to-byte encoding,
requires Java to allocate some sort of transparent intermediate byte
buffer, even for unbuffered Writers. How does Java decide how big to
make it? Does it make it big enough to contain the entire String?

Has anyone peeked under the hood or experimented?

A practical way of asking this question is:

Is it better to write an entire file unbuffered, or with a buffer? If
buffered, what is a reasonable buffer size? Making it too big causes
more frequent GC. Making it too small causes more physical I/Os.


Let's think about what Sun has done since '95...
FileOutputStream has 2 native methods:
write(byte)
write(byte[], off, len)
that thousands of classes depend on.
Even the NIO channels are slower, from what I hear, since they were
designed for selectable channels and locks, not so much for performance.
So, yeah, it is safe to say it is fast enough.


Optimal byte[] buffer size comes from one thing: TESTING.

Keep it a multiple of your cluster size to be friendly; trust HDD
controllers and I/O schedulers to pull at least the cluster size, with
all sorts of read or write prediction exploiting the disk cache.

Understand that 2 long writes at the same time on a single HDD will make
its head jump all over and drop to much less than just half the
performance. Your tests could be biased if you are swapping or have
other disk activity going on.

Use the largest chunk possible, to reduce the I/O scheduling pieces and
reassembly, and hope the I/O scheduler will thank you for a big
contiguous array of bytes.
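
A minimal sketch of such a test, sweeping power-of-two chunk sizes over
a file copy; the file name comes from the command line, and again the
OS cache warming up between passes will bias the numbers:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class BufferSweep {
    public static void main(String[] args) throws Exception {
        File src = new File(args[0]); // pick a large test file
        for (int size = 512; size <= 1024 * 1024; size *= 2) {
            byte[] buf = new byte[size];
            long t0 = System.currentTimeMillis();
            InputStream in = new FileInputStream(src);
            OutputStream out = new FileOutputStream("copy.tmp");
            try {
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n); // one write per chunk of 'size' bytes
                }
            } finally {
                in.close();
                out.close();
            }
            System.out.println(size + " bytes: "
                    + (System.currentTimeMillis() - t0) + " ms");
        }
    }
}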
 
Roedy Green

Use the largest chunk possible, to reduce the I/O scheduling pieces and
reassembly, and hope the I/O scheduler will thank you for a big
contiguous array of bytes.

There are some complications from the traditional wisdom.

1. Java's buffering can be inserted at various layers. Only the lowest
layer offers any help for I/O.

2. Java does encoding transformations. This implies hidden buffers over
which you have no control.

I need to do some experiments, but I think the fastest way to read a
file of chars will be:

1. find the length in bytes. This is not necessarily the length in
chars.

2. read the entire file in one read (buffered or unbuffered?) into a
byte[].

3. use a new String which has a built in encoding conversion.
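
A minimal sketch of those three steps; the encoding name is just an
example, and readFully does the one-swoop read:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;

public class FellSwoopRead {
    public static String readWholeFile(File f) throws Exception {
        int len = (int) f.length();   // 1. length in bytes, not necessarily
                                      //    chars; assumes file < 2 GB
        byte[] buf = new byte[len];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
            in.readFully(buf);        // 2. the entire file in one fell swoop
        } finally {
            in.close();
        }
        return new String(buf, "ISO-8859-1"); // 3. String does the decoding
    }
}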
 
Andrey Kuznetsov

There are some complications from the traditional wisdom.

1. Java's buffering can be inserted at various layers. Only the lowest
layer offers any help for I/O.

Roedy,

just put Unified I/O in the lowest layer and forget about performance.

I remember that you asked me about a tutorial.

I don't have one yet, but I can give you some advice:

The Unified I/O interface looks just like RandomAccessFile's (with some
extras).

The important thing is RandomAccessFactory.

It has the following methods:

RandomAccess create();
RandomAccessRO createRO();
RandomAccessBuffer createBuffered();
RandomAccessBufferRO createBufferedRO();

(RO means read only)

That was the difficult part.

The easy part is that you can create an InputStream from a RandomAccessRO,
or an OutputStream from a RandomAccess,
and use it as usual without changing your code.
See com.imagero.uio.io.RandomAccessInputStream
and com.imagero.uio.io.RandomAccessOutputStream.
 
NOBODY

There are some complications from the traditional wisdom.

1. Java's buffering can be inserted at various layers. Only the lowest
layer offers any help for I/O.


To me a simple FileOutputStream is the closest to the I/O chunk.
Just do your buffering yourself if a layer of uncontrolled buffering
scares you. But you did say you had files, not streams, so you control
how it is read.

My I/O test:

-----
import java.io.File;
import java.io.FileOutputStream;

public class IOSizer {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("_IOSizer_", ".tmp", new File("."));
        f.deleteOnExit();
        FileOutputStream fos = new FileOutputStream(f);
        try {
            fos.write(1);
            fos.write(new byte[Integer.parseInt(args[0])]);
        } finally {
            fos.close();
        }
    }
}


---- and trace system write calls ----
/usr/bin/strace -x -e write java IOSizer 33333

[...]
write(5, "\x01", 1) = 1
write(5, "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"...,
33333) = 33333
[...]



2. Java does encoding transformations. This implies hidden buffers over
which you have no control.

If your first stream is a BufferedInputStream (over a FileInputStream)
with a buffer size of your choice, the only hidden buffers should be a
few bytes long, and most probably reused: just enough for the longest
charset sequence, which for UTF-8 is at most 6 bytes in the original
31-bit scheme (4 bytes cover the 21-bit Unicode range).


I need to do some experiments, but I think the fastest way to read a
file of chars will be:

1. find the length in bytes. This is not necessarily the length in
chars.

2. read the entire file in one read (buffered or unbuffered?) into a
byte[].

3. use a new String which has a built in encoding conversion.


How were you intending to read a unique String otherwise? :-/
But if you can process your HTML in chunks (tabs, spaces, and all you
mentioned), you can probably just use a BufferedReader over an
InputStreamReader over the BufferedInputStream. Read a pack of lines
(say 200, or whenever you reach a string-length threshold), and process
it in smaller pieces, keeping a stateful engine of where you are (open
tags and such annoying things).
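
A sketch of that stacked-reader approach; the file name, the "UTF-8"
encoding and the 200-line pack size are my assumptions, and process()
is a hypothetical stand-in for the stateful engine:

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class ChunkedHtmlReader {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        new BufferedInputStream(
                                new FileInputStream(args[0]), 64 * 1024),
                        "UTF-8"));
        try {
            StringBuffer pack = new StringBuffer();
            String line;
            int lines = 0;
            while ((line = in.readLine()) != null) {
                pack.append(line).append('\n');
                if (++lines == 200) {        // a pack of 200 lines at a time
                    process(pack.toString());
                    pack.setLength(0);
                    lines = 0;
                }
            }
            if (pack.length() > 0) {
                process(pack.toString());    // the leftover partial pack
            }
        } finally {
            in.close();
        }
    }

    static void process(String chunk) {
        // stateful engine goes here: open tags and such annoying things
    }
}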
 
NOBODY

RandomAccess create();
RandomAccessRO createRO();
RandomAccessBuffer createBuffered();
RandomAccessBufferRO createBufferedRO();


Simpler: knowing that a seek on a RAF will move the FD with it,
you can reposition buffered streams on it. Here:
(I was too lazy to implement the DataInput and DataOutput, but you get
the point)



-----------

import java.io.*;

public class SuperRAF {

    public final RandomAccessFile raf;
    public final MyBIS bis;
    public final BufferedOutputStream bos;
    public final DataInputStream dis;
    public final DataOutputStream dos;

    public SuperRAF(RandomAccessFile raf, int bufsize) throws IOException {
        this.raf = raf;
        bis = new MyBIS(new FileInputStream(raf.getFD()), bufsize);
        bos = new BufferedOutputStream(new FileOutputStream(raf.getFD()), bufsize);
        dis = new DataInputStream(bis);
        dos = new DataOutputStream(bos);
    }

    public void flush() throws IOException {
        bos.flush();
    }

    public void seek(long pos) throws IOException {
        bos.flush();   // push pending writes out before moving the FD
        bis.clear();   // discard read-ahead that is now mispositioned
        raf.seek(pos);
    }

    //=======

    static class MyBIS extends BufferedInputStream {
        MyBIS(InputStream is, int size) {
            super(is, size);
        }

        MyBIS(InputStream is) {
            super(is);
        }

        void clear() {
            super.count = 0;
            super.markpos = -1;
            super.pos = 0;
            super.marklimit = 0;
            // super.buf = ... don't waste that
        }
    }
}
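
A hypothetical round trip through that class, to make the seek
semantics concrete (the file name is arbitrary):

import java.io.RandomAccessFile;

public class SuperRAFDemo {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("test.dat", "rw");
        SuperRAF s = new SuperRAF(raf, 64 * 1024);
        s.dos.writeInt(42);       // buffered write; may sit in the 64K buffer
        s.seek(0);                // flushes the write, clears the read buffer,
                                  // and moves the shared FD back to 0
        System.out.println(s.dis.readInt()); // prints 42
        raf.close();
    }
}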
 
Andrey Kuznetsov

Simpler: knowing that a seek on a RAF will move the FD with it,
you can reposition buffered streams on it.

oh yes, and with raf.seek(0) you can just rewind your IS.
 
Thomas Hawtin

Roedy said:
Writing or reading a file as one big byte[] in one fell swoop should be
extremely efficient. In theory, the bytes could go straight from your
array to the hard disk controller.

Almost certainly the biggest overhead here is going to be the disc
drive: depending on circumstances, the seek time, or the transfer time
for long files. Possibly, if buffering causes a spike in memory usage,
there could be other problems.

There will be at least one additional copy for your operating system's
file cache. Also, you aren't going to want your byte[] pinned while the
file system blocks, so direct-allocated ByteBuffers may be a win (for
the careful, or carefree).

Is it better to write an entire file unbuffered, or with a buffer? If
buffered, what is a reasonable buffer size? Making it too big causes
more frequent GC. Making it too small causes more physical I/Os.

I suspect there is a huge middle ground, where the exact size doesn't
matter.

Memory mapping is another way to go.
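
A minimal sketch of the memory-mapped route; the checksum loop is just
a placeholder for real work that touches every byte:

import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapRead {
    public static void main(String[] args) throws Exception {
        FileChannel ch = new FileInputStream(args[0]).getChannel();
        try {
            // the OS pages the file in on demand; no read-loop sizing needed
            MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get();
            }
            System.out.println("sum: " + sum);
        } finally {
            ch.close();
        }
    }
}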

Tom Hawtin
 
Dimitri Maziuk

Roedy Green sez:
Writing or reading a file as one big byte[] in one fell swoop should be
extremely efficient. In theory, the bytes could go straight from your
array to the hard disk controller.

There are a couple of buffering stages involved even before
the data gets to JVM:

1. HD reads and writes are done in chunks (> 1 byte, configurable
on some systems).

2. Assuming a single disk, the slowest part of file copy process
is positioning disk head to write to destination file and then
re-positioning it back to read from the source. So OS and/or HD
controller buffer I/O requests and schedule them for optimal head
movement.

3. File data is buffered by OS (size depends on OS, available RAM,
number of open files, etc.)

(Now add concurrent I/O requests coming from multiple processes
on a time-sharing system to the mix.)

4. Then the data gets to the JVM, which may (or may not) do still more
buffering.

5. Finally, you code yet another buffer -- your byte[] -- on
top of all that.

In theory, if you could read the entire file into a byte[] and
then write the entire thing out, it should be the fastest:
let JVM, OS, and hardware optimize the actual disk I/O. In
practice you seldom have enough RAM for that.

In practice, with all that stuff going on behind the scenes
(that you have no control over), I wouldn't worry about it
at all: code what makes sense for your application. I tend
to use buffered readers when I need line-based reads -- not
because it's supposed to be faster but because I need readLine().

Dima
 
Raymond DeCampo

Roedy said:
Is it better to write an entire file unbuffered, or with a buffer? If
buffered, what is a reasonable buffer size? Making it too big causes
more frequent GC. Making it too small causes more physical I/Os.

Roedy,

What is your reasoning behind saying that a large buffer causes more
frequent garbage collection?

Thanks,
Ray
 
Andrey Kuznetsov

In practice, with all that stuff going on behind the scenes
(that you have no control over), I wouldn't worry about it
at all: code what makes sense for your application. I tend
to use buffered readers when I need line-based reads -- not
because it's supposed to be faster but because I need readLine().

For small files you can safely ignore buffering.
For huge files buffering can significantly speed up I/O.
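
The effect is easiest to see with single-byte reads, where every
unbuffered read() is a system call. A minimal sketch, file name from
the command line:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class ReadTimer {
    public static void main(String[] args) throws Exception {
        // unbuffered: each read() goes all the way to the OS
        long t0 = System.currentTimeMillis();
        InputStream in = new FileInputStream(args[0]);
        while (in.read() != -1) { /* discard */ }
        in.close();
        long plain = System.currentTimeMillis() - t0;

        // buffered: most read() calls are served from the internal buffer
        t0 = System.currentTimeMillis();
        in = new BufferedInputStream(new FileInputStream(args[0]));
        while (in.read() != -1) { /* discard */ }
        in.close();
        long buffered = System.currentTimeMillis() - t0;

        System.out.println("plain: " + plain
                + " ms, buffered: " + buffered + " ms");
    }
}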
 
Roedy Green

What is your reasoning behind saying that a large buffer causes more
frequent garbage collection?

Imagine a case where you had 1000 files each 100 bytes long and you
allocated 64K buffers. You would fill up RAM faster than had you used
no buffering or 100-byte buffers: 1000 × 64 KB is some 64 MB of buffer
for a mere 100 KB of data.
 
Raymond DeCampo

Roedy said:
Imagine a case where you had 1000 files each 100 bytes long and you
allocated 64K buffers. You would fill up RAM faster than had you used
no buffering or 100-byte buffers.

I see; I thought you meant in the case where there was one buffer and I
could not imagine how that applied.

Ray
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,781
Messages
2,569,615
Members
45,303
Latest member
Ketonara

Latest Threads

Top