Hmm. I'm starting to think that you might want to take that option...
What I had in mind was a simple loop where, at each step, you feed 1
byte to the CharsetDecoder and get back 0, 1, or 2 chars.
A nice idea, and unfortunately a necessary one, because CharsetDecoder
has no decodeChar() method.
Unfortunately I was wrong. Although the documentation doesn't say so,
and although the design is clearly set up to be used like that, it
doesn't work. At least the UTF-8 decoder doesn't work if used like that.
It doesn't retain enough state to remember that it has seen the start
of an encoded character, and so it cannot be trusted to decode
successfully across buffer boundaries (I don't know whether that's a bug
or simply that it isn't expected to be able to do so).
I've worked around this.
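For anyone following along, here is a small sketch of the behaviour being described, assuming the stock UTF-8 decoder and a heap ByteBuffer. The class and method names are made up for illustration. As far as I can tell, when the decoder is handed only the first byte of a two-byte sequence it reports UNDERFLOW and leaves that byte unread in the input buffer, so it is the caller who has to keep the byte around (here via compact()) until the rest arrives.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

// Hypothetical demo class, not part of the posted implementation.
public class PartialDecodeDemo {
    /** Feeds the two UTF-8 bytes of U+00E9 to a decoder one at a time. */
    public static String decodeInTwoSteps() {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        ByteBuffer in = ByteBuffer.allocate(8);
        CharBuffer out = CharBuffer.allocate(2);

        in.put((byte) 0xC3);
        in.flip();                          // expose only the first byte
        decoder.decode(in, out, false);     // more input may follow
        int leftover = in.remaining();      // bytes the decoder did not consume

        in.compact();                       // keep the unread byte at the front
        in.put((byte) 0xA9);                // append the second byte
        in.flip();
        decoder.decode(in, out, false);

        out.flip();
        return leftover + ":" + out;        // leftover count, then decoded text
    }

    public static void main(String[] args) {
        System.out.println(decodeInTwoSteps());
    }
}
```

If the first byte were dropped instead of compacted, the second decode call would see a lone continuation byte, which matches the failure across buffer boundaries mentioned above.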
So I think that the loop has to look more like
0) clear a small buffer
1) get the next byte
2) append it to the small buffer
3) attempt to decode that into up to 2 chars
4) if that works[*] then process the chars and goto (0)
5) goto (1)
and that -- when expressed using the magic of nio ByteBuffers and
CharBuffers -- looks as if it'd be extremely messy...
([*] by "works" I mean produces at least 1 char)
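The numbered steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: the class and method names are invented, the 16-byte scratch buffer is assumed to be large enough for one encoded character, and the charset is assumed to be one (such as UTF-8) where no proper prefix of an encoded character decodes on its own. Stateful encodings would need something smarter.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical sketch of the small-buffer loop; names are illustrative.
public class SmallBufferLoop {
    /** Decodes a whole stream one character at a time via a scratch buffer. */
    public static String decodeAll(InputStream in, Charset cs) throws IOException {
        CharsetDecoder decoder = cs.newDecoder();
        ByteBuffer small = ByteBuffer.allocate(16); // assumed big enough for one character
        CharBuffer chars = CharBuffer.allocate(2);  // room for a surrogate pair
        StringBuilder sb = new StringBuilder();
        small.clear();                              // (0) clear the small buffer
        int b;
        while ((b = in.read()) != -1) {             // (1) get the next byte
            if (!small.hasRemaining()) throw new IOException("undecodable input");
            small.put((byte) b);                    // (2) append it
            small.flip();
            chars.clear();
            decoder.reset();
            CoderResult r = decoder.decode(small, chars, true); // (3) attempt to decode
            if (r.isUnderflow() && chars.position() > 0) {      // (4) it "worked"
                chars.flip();
                sb.append(chars);                   // process the chars
                small.clear();                      // and go back to (0)
            } else {
                // (5) not a complete character yet: switch back to append mode
                small.position(small.limit());
                small.limit(small.capacity());
            }
        }
        return sb.toString();
    }

    /** Convenience wrapper: encode a string, then decode it with the loop. */
    public static String roundTrip(String s, Charset cs) {
        try {
            return decodeAll(new ByteArrayInputStream(s.getBytes(cs)), cs);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Passing endOfInput as true on every attempt is what turns an incomplete sequence into a reported error rather than a silent underflow, so the "produces at least 1 char" test stays simple.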
My approach is similar. I fill a byte buffer (currently 1024 bytes) using
a bulk read operation. Without any further copying I supply successive
windows of the byte buffer to the charset decoder which places the decoded
result into a char buffer of size 2. If there is no result the byte buffer
position is reset to its previous value and the buffer limit is increased
by 1 (leading to a visible byte window of 1, 2, 3, ... bytes). If the
limit exceeds the size of the byte buffer then the few bytes yet to be
decoded are copied back to the start of the byte buffer and the rest of
the byte buffer is filled via another bulk read operation. Eventually a
character is read.
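To make the window idea concrete, here is a stripped-down sketch of just the widening step, with the refill from the stream left out. The names are invented for illustration and the buffer is assumed to already hold all the bytes; it is not the implementation posted further down, only the same technique in isolation.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical sketch of the growing-window decode; refill logic omitted.
public class WindowDecode {
    /**
     * Decodes one code point starting at data.position(), trying windows of
     * 1, 2, 3, ... bytes. Returns -1 if no window up to data.limit() decodes.
     */
    public static int nextCodePoint(ByteBuffer data, CharsetDecoder decoder) {
        CharBuffer out = CharBuffer.allocate(2);    // room for a surrogate pair
        int start = data.position();
        int end = data.limit();
        try {
            for (int windowEnd = start + 1; windowEnd <= end; windowEnd++) {
                data.limit(windowEnd);              // widen the visible window
                data.position(start);               // rewind to the window start
                out.clear();
                decoder.reset();
                CoderResult r = decoder.decode(data, out, true);
                if (r.isUnderflow() && out.position() > 0) {
                    // whole window consumed and at least one char produced
                    return Character.codePointAt(out.array(), 0, out.position());
                }
            }
            data.position(start);                   // nothing decoded: leave bytes unread
            return -1;
        } finally {
            data.limit(end);                        // restore the caller's limit
        }
    }

    /** Decodes every code point in the array using the window technique. */
    public static String decodeAll(byte[] bytes, Charset cs) {
        ByteBuffer data = ByteBuffer.wrap(bytes);
        CharsetDecoder decoder = cs.newDecoder();
        StringBuilder sb = new StringBuilder();
        while (data.hasRemaining()) {
            int cp = nextCodePoint(data, decoder);
            if (cp == -1) break;                    // trailing bytes are not a code point
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }
}
```

Because the window is expressed purely through position and limit, no bytes are copied during the attempts; copying only becomes necessary when the window runs off the end of the buffered data and a refill is needed.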
There will be bugs in the implementation below. I have successfully run a
test class that includes a selection of Unicode code points and 256 bytes
of binary data. To execute:
javac BivalentInputStream.java BivalentInputStreamTest.java &&
java BivalentInputStreamTest
Many thanks for the feedback, Chris.
Regards,
Adam
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
public class BivalentInputStream
{
public static int bufSize=1024;
private InputStream in;
private ByteBuffer bb=ByteBuffer.allocate(bufSize);
private byte[] ba=bb.array();
private int maxLimit;
private CharsetDecoder decoder;
private CharBuffer cb=CharBuffer.allocate(2); //support surrogate chars
private char[] ca=cb.array();
/** @return The number of bytes read into the buffer. */
private int saneBulkRead(byte[] b, int offset) throws java.io.IOException {
int brokenNumBytesRead=in.read(b, offset, b.length-offset);
if (brokenNumBytesRead==-1) return 0;
return brokenNumBytesRead;
}
public BivalentInputStream(InputStream in) throws java.io.IOException {
this.in=in;
maxLimit=saneBulkRead(ba, 0);
bb.limit(1);
decoder=Charset.defaultCharset().newDecoder();
}
public BivalentInputStream(InputStream in, Charset cs) throws java.io.IOException {
this.in=in;
maxLimit=saneBulkRead(ba, 0);
bb.limit(1);
this.decoder=cs.newDecoder();
}
private char cachedSurrogate;
private boolean storedSurrogate=false;
/** @return '\uFFFF' if the stream is exhausted or the remaining bytes
do not comprise a 16-bit char. */
public char readChar() throws java.io.IOException {
if (storedSurrogate==true) {
storedSurrogate=false;
return cachedSurrogate;
}
int codePoint=readCodePoint();
if (codePoint==-1) return '\uFFFF';
if (codePoint>0xFFFF) {
char[] chars=Character.toChars(codePoint);
storedSurrogate=true;
cachedSurrogate=chars[1];
return chars[0];
}
return (char) codePoint;
}
/** @return -1 if the stream is exhausted or the remaining bytes
do not comprise a Unicode code point. */
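public int readCodePoint() throws java.io.IOException {
//Refill logic for a fully consumed buffer
if (bb.position()==maxLimit) {
if (maxLimit==0) return -1;
maxLimit=saneBulkRead(ba, 0);
bb.position(0);
bb.limit(Math.min(1, maxLimit));
if (maxLimit==0) return -1; //stream exhausted
}
cb.position(0);
int bbStartPos=bb.position();
decoder.reset();
CoderResult result=decoder.decode(bb, cb, true);
decoder.flush(cb);
//success: the whole window was consumed and at least one char was produced
if (result==CoderResult.UNDERFLOW && cb.position()>0) {
if (bb.limit()<maxLimit) bb.limit(bb.limit()+1);
return Character.codePointAt(ca, 0, cb.position());
}
//no complete code point yet: rewind and widen the window by one byte
bb.position(bbStartPos);
if (bb.limit()<maxLimit) {
bb.limit(bb.limit()+1);
return readCodePoint();
}
//the window already ends at the last valid byte: move the pending bytes
//up to position 0 and fill the rest of the buffer with another bulk read
int remainingBytes=maxLimit-bbStartPos;
System.arraycopy(ba, bbStartPos, ba, 0, remainingBytes);
int numBytesRead=saneBulkRead(ba, remainingBytes);
maxLimit=remainingBytes+numBytesRead;
bb.position(0);
if (numBytesRead==0) {
bb.limit(maxLimit);
return -1; //remaining bytes do not comprise a code point
}
bb.limit(remainingBytes+1);
return readCodePoint();
}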
/** @return -1 if the stream is exhausted. */
public int readByte() throws java.io.IOException {
if (bb.position()==maxLimit) {
if (maxLimit==0) return -1;
//refill the byte buffer
maxLimit=saneBulkRead(ba, 0);
bb.position(0);
bb.limit(Math.min(1, maxLimit));
if (maxLimit==0) return -1; //stream exhausted, nothing valid to return
}
if (bb.limit()<maxLimit) bb.limit(bb.limit()+1);
return ((int) bb.get()) & 0xFF;
}
}
//////////////////////////////////////////////////////////////////////////////
import java.io.*;
public class BivalentInputStreamTest
{
static int numCharUnits=0;
public static byte[] buildTestArray() throws java.io.IOException {
ByteArrayOutputStream baos=new ByteArrayOutputStream();
DataOutputStream dos=new DataOutputStream(baos);
//write code points
String intro="Hello, World";
dos.writeChars(intro);
numCharUnits+=intro.length();
for (int i=0; i<0x110000; i+=128) {
//avoid writing lone surrogates
if (((i>=0xD800 && i<=0xDBFF) || (i>=0xDC00 && i<=0xDFFF))!=true) {
char[] chars=Character.toChars(i);
dos.writeChars(new String(chars));
numCharUnits+=chars.length;
}
}
//write binary data
for (int i=0; i<256; ++i) {
dos.writeByte(i);
}
dos.flush(); dos.close();
return baos.toByteArray();
}
public static void printByteArrayDifferences(byte[] array1, byte[] array2) {
System.out.println("array1.length="+array1.length+
"; array2.length="+array2.length);
byte[] smaller=array1, larger=array2;
if (array1.length>array2.length) { smaller=array2; larger=array1; }
for(int i=0; i<smaller.length; ++i) {
if (array1[i]!=array2[i])
System.out.println("position "+i+": "+(((int) array1[i]) & 0xFF)+
" "+(((int) array2[i]) & 0xFF));
}
for (int i=smaller.length; i<larger.length; ++i) {
System.out.println("position "+i+": "+(((int) larger[i]) & 0xFF));
}
}
public static void main(String[] args) throws java.io.IOException {
byte[] ba=buildTestArray();
ByteArrayInputStream bais=new ByteArrayInputStream(ba);
BivalentInputStream in=new BivalentInputStream(bais, java.nio.charset.Charset.forName("UTF-16"));
ByteArrayOutputStream baos=new ByteArrayOutputStream();
DataOutputStream dos=new DataOutputStream(baos);
//read char data (for testing purposes using the stored number of char units)
for (int i=0; i<numCharUnits; ++i) {
char c=in.readChar();
dos.writeChar(c);
}
//read binary data
for (int i=0; i<256; ++i) {
dos.writeByte(in.readByte());
}
//Compare the arrays
dos.flush(); dos.close();
byte[] newBA=baos.toByteArray();
//compare contents; ba.equals(newBA) would only compare array identity
if (!java.util.Arrays.equals(ba, newBA)) printByteArrayDifferences(ba, newBA);
}
}