Dual binary/character streams?

Adam Warner

Hi all,

Suppose a stream contains text and binary data. The text will describe how
many bytes to read as binary data before switching back to reading text.
It appears Java provides no library upon which to reasonably build this
functionality!

Let's make up an example:

"character data" #10 [..10 octets of binary data..] "character data continues"

The token #10 means: read 10 bytes of binary data. Thereafter continue
reading characters in the default character set.

An InputStream supports reading binary data. But an InputStreamReader is
permitted to act like a BufferedReader: "To enable the efficient
conversion of bytes to characters, more bytes may be read ahead from the
underlying stream than are necessary to satisfy the current read operation."

Thus an InputStreamReader cannot be relied upon to just read a character.
It may read ahead, removing the binary data from the InputStream.

Is there a character reader for Java that only reads the number of bytes
necessary to satisfy a read() request?
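
For what it's worth, the read-ahead is easy to observe. A minimal sketch (the stream contents are invented for illustration):

```java
import java.io.*;

public class ReadAheadDemo {
    public static void main(String[] args) throws IOException {
        // One ASCII character followed by a 10-byte binary payload
        byte[] data = new byte[11];
        data[0] = 'A';
        for (int i = 1; i < 11; i++) data[i] = (byte) i;

        ByteArrayInputStream in = new ByteArrayInputStream(data);
        InputStreamReader reader = new InputStreamReader(in, "US-ASCII");

        System.out.println((char) reader.read());  // 'A'
        // The reader buffered far more than the single byte it needed,
        // so the binary payload is no longer in the InputStream:
        System.out.println("bytes left in the InputStream: " + in.available());
    }
}
```

On the JVMs I'd expect, the second line prints 0: the payload has been swallowed into the reader's internal buffer.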

Regards,
Adam
 
Roedy Green

> Let's make up an example:
>
> "character data" #10 [..10 octets of binary data..] "character data continues"

You use a DataInputStream. You read the binary with readInt,
readDouble, etc.

You read the character data, presumably 8-bit encoded as bytes, then
convert the byte array to a string using the desired encoding.

// byte[] -> String
String t = new String( b , "Cp1252" /* encoding */ );

If you have control over the stream, you get the person sending it to
you to encode the strings in counted UTF-8 format. Then you can read
them easily with readUTF.
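
A sketch of that suggestion (the field layout is invented for illustration): with counted UTF strings every field announces its own length, so no read-ahead guessing is needed:

```java
import java.io.*;

public class CountedUtfDemo {
    public static void main(String[] args) throws IOException {
        // Writer side: counted UTF string, raw binary payload, another string
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bout);
        dos.writeUTF("character data");        // 2-byte length prefix + modified UTF-8
        dos.write(new byte[]{1, 2, 3});        // 3 octets of binary data
        dos.writeUTF("character data continues");
        dos.flush();

        // Reader side: lengths are embedded in the stream itself
        DataInputStream dis = new DataInputStream(
                new ByteArrayInputStream(bout.toByteArray()));
        System.out.println(dis.readUTF());     // "character data"
        byte[] payload = new byte[3];
        dis.readFully(payload);                // exactly 3 bytes, no over-read
        System.out.println(dis.readUTF());     // "character data continues"
    }
}
```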
 
Adam Warner

> You use a DataInputStream. You read the binary with readInt,
> readDouble, etc.
>
> You read the character data, presumably 8-bit encoded as bytes, then
> convert the byte array to a string using the desired encoding.

Thanks for the suggestion Roedy. I'm attempting to avoid any presumption
about the default character set (it could for example be UTF-8 or UTF-16)
so this isn't a general solution.
> // byte[] -> String
> String t = new String( b , "Cp1252" /* encoding */ );

At this point one doesn't know where the characters terminate and the
binary data begins.

One could say InputStreamReader is missing a readByte() method. It is
permitted to read ahead bytes yet it provides no way to access
those subsequent bytes.

By reading ahead and not providing a readByte method the Java standard
library appears to provide no reasonable way to decode a char (in an
arbitrary character encoding) within a binary stream while preserving
the rest of the binary data.
> If you have control over the stream, you get the person sending it to
> you to encode the strings in counted UTF-8 format. Then you can read
> them easily with readUTF.

The character encoding is not fixed. And readUTF is Java-specific junk.
It's impressive how Sun managed to come up with a way to waste 50% more
space than four-byte-encoded UTF-8 code points.
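
That arithmetic can be checked directly. A supplementary code point takes 4 bytes in standard UTF-8 but 6 bytes in the modified UTF-8 that writeUTF emits, because each surrogate char is encoded separately as 3 bytes (the code point below is chosen arbitrarily):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Waste {
    public static void main(String[] args) throws IOException {
        // U+10302: one code point, two Java chars (a surrogate pair)
        String s = new String(Character.toChars(0x10302));

        int standard = s.getBytes(StandardCharsets.UTF_8).length;

        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bout);
        dos.writeUTF(s);
        int modified = bout.size() - 2;  // drop the 2-byte length prefix

        System.out.println(standard + " vs " + modified);  // 4 vs 6
    }
}
```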

Regards,
Adam
 
Roedy Green

> At this point one doesn't know where the characters terminate and the
> binary data begins.

If you can't tell, your protocol is broken. You will have to do
something to fix it. I suggest using counted UTF strings.

Maybe you mean that you have to KNOW the lengths in your code to read the
stream, that they are not embedded in the stream and there is no
format description in the stream.

There is no way you can process a stream without knowing the encoding.
The encoding may be 7-bit ASCII, but you still have to know what it
is.

You can COPY such a stream, but you can't process it.

The beauty of UTF-8 is that it works for any platform and you don't
have to customize it for different locales.

If this stream is a legacy, and you can't change its format at all,
and this stream was actually read and processed at one point in
history, there must be some hidden assumptions you can take advantage
of. e.g. null terminated strings, the encoding used, fixed lengths
of fields, a file header ...
 
Chris Uppal

Adam said:
> Suppose a stream contains text and binary data. The text will describe how
> many bytes to read as binary data before switching back to reading text.
> It appears Java provides no library upon which to reasonably build this
> functionality!

Does your format have a reliable way of spotting the end of a stream of
character data /without/ decoding it ? E.g. in HTTP the headers can specify
the length of the (binary) body, but the headers can be separated reliably from
the body before they are decoded. Or, failing that, is there a hard limit to
how many bytes of character data are allowed in one "chunk" (so that you can
make a copy of that data and decode it independently) ?

If not then the format is rather awkwardly designed, and you will have to mess
around with more complicated code to unravel it character by character. I
suggest using a java.nio.charset.CharsetDecoder directly.

BTW, since you will have to work character-by-character, even if you were able
to use a stock InputStreamReader (if it didn't read ahead), it wouldn't be
buying you much at all compared with using your own CharsetDecoder.

BTW2: don't forget that Unicode characters, unlike Java chars, are not limited
to 16 bits. So one logical character of input may require two actual chars of
output.
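
A quick illustration (the code point is chosen arbitrarily):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+10302 lies outside the Basic Multilingual Plane, so it cannot
        // fit in one Java char; it becomes a surrogate pair.
        char[] units = Character.toChars(0x10302);
        System.out.println(units.length);  // 2
        System.out.printf("U+%04X U+%04X%n",
                (int) units[0], (int) units[1]);  // U+D800 U+DF02
    }
}
```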

-- chris
 
Adam Warner

> If you can't tell, your protocol is broken.

No. You are simply unable to solve the stated issue: "By reading ahead and
not providing a readByte method the Java standard library appears to
provide no reasonable way to decode a char (in an arbitrary character
encoding) within a binary stream while preserving the rest of the binary
data."
> You will have to do something to fix it. I suggest using counted UTF
> strings.
>
> Maybe you mean you have to KNOW the lengths in your code to read the
> stream, that they are not embedded in the stream and there is no format
> description in the stream.
>
> There is no way you can process a stream without knowing the encoding.
> The encoding may be 7-bit ASCII, but you still have to know what it is.

This is not the issue. InputStreamReader has a default encoding and a
named encoding can also be specified. Unfortunately it may read extra
bytes from the underlying binary stream without providing a way to access
them as binary data.
> You can COPY such a stream, but you can't process it.
>
> The beauty of UTF-8 is that it works for any platform and you don't have
> to customize it for different locales.
>
> If this stream is a legacy, and you can't change its format at all, and
> this stream was actually read and processed at one point in history,
> there must be some hidden assumptions you can take advantage of. e.g.
> null terminated strings, the encoding used, fixed lengths of fields, a
> file header ...

There is no hidden assumption. The decoding information is contained in
the character stream. Conceptually it's a type of bivalent stream:
<http://www.franz.com/support/documentation/7.0/doc/socket.htm#socket-characteristics-1>
("Bivalent means that the stream will accept text and binary stream
functions. That is, you can write-byte or write-char, read-byte or
read-char.")

The protocol is: Decode and interpret a string token. The interpretation
of the token determines whether the next datum in the stream will be read
as a character or a byte.
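
For the trivial case of a single-byte encoding the protocol is easy to sketch, since one read() is one character and no read-ahead can occur (the token syntax and payload here are invented for illustration); the multi-byte case is the hard part:

```java
import java.io.*;
import java.util.*;

public class TokenDispatch {
    // Read ASCII text until a "#N " token, then read exactly N raw bytes,
    // then resume reading text. Each binary chunk is collected into `chunks`.
    static String parse(InputStream in, List<byte[]> chunks) throws IOException {
        StringBuilder text = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '#') {                    // token: switch to binary mode
                int count = 0;
                while ((b = in.read()) != ' ') count = count * 10 + (b - '0');
                byte[] binary = new byte[count];
                new DataInputStream(in).readFully(binary);  // exactly count bytes
                chunks.add(binary);
            } else {
                text.append((char) b);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        bout.write("text #3 ".getBytes("US-ASCII"));
        bout.write(new byte[]{1, 2, 3});
        bout.write("more text".getBytes("US-ASCII"));

        List<byte[]> chunks = new ArrayList<byte[]>();
        String text = parse(new ByteArrayInputStream(bout.toByteArray()), chunks);
        System.out.println("text: " + text);                      // text more text
        System.out.println("binary: " + Arrays.toString(chunks.get(0)));  // [1, 2, 3]
    }
}
```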

Given the specification of InputStreamReader this protocol appears to be
difficult to implement. A simple solution is unlikely.

Regards,
Adam
 
Chris Uppal

Adam said:
> One could say InputStreamReader is missing a readByte() method. It is
> permitted to read ahead bytes yet it provides no way to access
> those subsequent bytes.

One other problem -- more than just being unable to retrieve bytes that it
has read ahead -- is that those bytes might form an invalid or illegal
sequence for the given encoder. Logically it should not throw an error until
it is asked for the "character" at the illegal position, but I bet it's not
implemented that way.

-- chris
 
Adam Warner

> Does your format have a reliable way of spotting the end of a stream of
> character data /without/ decoding it ?

No. While I can come up with a different format (e.g. encoding the binary
data in base 64) I'd like to solve the problem as specified.
> E.g. in HTTP the headers can specify the length of the (binary) body,
> but the headers can be separated reliably from the body before they are
> decoded. Or, failing that, is there a hard limit to how many bytes of
> character data are allowed in one "chunk" (so that you can make a copy
> of that data and decode it independently) ?

Since I'll be supporting arbitrary precision integers I guess the
character data is effectively unlimited.
> If not then the format is rather awkwardly designed, and you will have
> to mess around with more complicated code to unravel it character by
> character. I suggest using a java.nio.charset.CharsetDecoder directly.

The NIO could be helpful. But I still wouldn't know where to cut off a
chunk from the stream without potentially splitting a character and
breaking the decoding.
> BTW, since you will have to work character-by-character, even if you
> were able to use a stock InputStreamReader (if it didn't read ahead), it
> wouldn't be buying you much at all compared with using your own
> CharsetDecoder.
>
> BTW2: don't forget that Unicode characters, unlike Java chars, are not
> limited to 16 bits. So one logical character of input may require two
> actual chars of output.

Indeed. Java chars are not only sufficient for building code points but
also serve as input for decoding graphemes via IBM's ICU4J library:
<http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/BreakIterator.html>

Thanks for the ideas Chris.

Regards,
Adam
 
Chris Uppal

Adam said:
> No. While I can come up with a different format (e.g. encoding the binary
> data in base 64) I'd like to solve the problem as specified.

Hmm. I'm starting to think that you might want to take that option...

> The NIO could be helpful. But I still wouldn't know where to cut off a
> chunk from the stream without potentially splitting a character and
> breaking the decoding.

What I had in mind was a simple loop where, at each step, you feed 1 byte to
the CharsetDecoder and get back 0, 1, or 2 chars.

Unfortunately I was wrong. Although the documentation doesn't say so, and
although the design is clearly set up to be used like that, it doesn't work.
At least the UTF-8 decoder doesn't work if used like that. It doesn't retain
enough state to remember that it has seen the start of an encoded character,
and so it cannot be trusted to decode successfully across buffer boundaries (I
don't know whether that's a bug or simply that it isn't expected to be able to
do so). So I think that the loop has to look more like

0) clear a small buffer
1) get the next byte
2) append it to the small buffer
3) attempt to decode that into up to 2 chars
4) if that works[*] then process the chars and goto (0)
5) goto (1)

and that -- when expressed using the magic of nio ByteBuffers and
CharBuffers -- looks as if it'd be extremely messy...

([*] by "works" I mean produces at least 1 char)
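
That loop can be sketched roughly as follows. The 8-byte cap, the error handling, and creating a fresh decoder per attempt (so no cross-call decoder state is relied on) are all my own assumptions:

```java
import java.io.*;
import java.nio.*;
import java.nio.charset.*;

public class DecodeOneChar {
    // Decode exactly one character, pulling bytes from the stream one at a
    // time so no binary data beyond the character is consumed.
    static String readOneChar(InputStream in, Charset cs) throws IOException {
        byte[] buf = new byte[8];             // assumed cap on bytes per character
        int len = 0;
        while (len < buf.length) {
            int b = in.read();
            if (b == -1) return null;         // stream exhausted mid-character
            buf[len++] = (byte) b;
            CharsetDecoder dec = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT);
            CharBuffer out = CharBuffer.allocate(2);  // room for a surrogate pair
            // endOfInput=false: an incomplete sequence yields UNDERFLOW with no
            // output, while a genuinely bad byte is reported as malformed.
            CoderResult r = dec.decode(ByteBuffer.wrap(buf, 0, len), out, false);
            if (out.position() > 0) {
                out.flip();
                return out.toString();
            }
            if (r.isMalformed()) throw new CharacterCodingException();
        }
        throw new IOException("no character found in " + buf.length + " bytes");
    }

    public static void main(String[] args) throws IOException {
        // U+4E8C (3 bytes of UTF-8) followed by one raw binary byte
        byte[] data = { (byte) 0xE4, (byte) 0xBA, (byte) 0x8C, 0x2A };
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(readOneChar(in, Charset.forName("UTF-8")));
        System.out.println(in.read());  // 42 -- the binary byte survived
    }
}
```

Passing endOfInput=false is what lets the loop tell "incomplete so far" apart from "genuinely malformed", which avoids the infinite-retry trap on bad input.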

-- chris
 
Knute Johnson

Adam said:
> Hi all,
>
> Suppose a stream contains text and binary data. The text will describe how
> many bytes to read as binary data before switching back to reading text.
> It appears Java provides no library upon which to reasonably build this
> functionality!
>
> Let's make up an example:
>
> "character data" #10 [..10 octets of binary data..] "character data continues"
>
> The token #10 means: read 10 bytes of binary data. Thereafter continue
> reading characters in the default character set.
>
> An InputStream supports reading binary data. But an InputStreamReader is
> permitted to act like a BufferedReader: "To enable the efficient
> conversion of bytes to characters, more bytes may be read ahead from the
> underlying stream than are necessary to satisfy the current read operation."
>
> Thus an InputStreamReader cannot be relied upon to just read a character.
> It may read ahead, removing the binary data from the InputStream.
>
> Is there a character reader for Java that only reads the number of bytes
> necessary to satisfy a read() request?
>
> Regards,
> Adam

I don't know why anybody would create a data file in this format but you
are going to have to read it with an InputStream, not a Reader. So the
answer to your question is no! There must be some method of determining
when you have found a 'binary is coming' tag or nobody could decode this
data. Use an InputStream and look for the tag, collect your data and
proceed. What are you going to do with the binary data? Is it images
or something like that? Or is it going to be converted to characters too?
 
Roedy Green

> No. While I can come up with a different format (e.g. encoding the binary
> data in base 64) I'd like to solve the problem as specified.

You say you CAN tell the end in the DECODED stream but not in the byte
stream. How do you notice the end in the DECODED stream?
 
Adam Warner

> Hmm. I'm starting to think that you might want to take that option...

> What I had in mind was a simple loop where, at each step, you feed 1
> byte to the CharsetDecoder and get back 0, 1, or 2 chars.

A nice idea that's unfortunately necessary because CharsetDecoder omits a
decodeChar() method.
> Unfortunately I was wrong. Although the documentation doesn't say so,
> and although the design is clearly set up to be used like that, it
> doesn't work. At least the UTF-8 decoder doesn't work if used like that.
> It doesn't retain enough state to remember that it has seen the start
> of an encoded character, and so it cannot be trusted to decode
> successfully across buffer boundaries (I don't know whether that's a bug
> or simply that it isn't expected to be able to do so).

I've worked around this.
> So I think that the loop has to look more like
>
> 0) clear a small buffer
> 1) get the next byte
> 2) append it to the small buffer
> 3) attempt to decode that into up to 2 chars
> 4) if that works[*] then process the chars and goto (0)
> 5) goto (1)
>
> and that -- when expressed using the magic of nio ByteBuffers and
> CharBuffers -- looks as if it'd be extremely messy...
>
> ([*] by "works" I mean produces at least 1 char)

My approach is similar. I fill a byte buffer (currently 1024 bytes) using
a bulk read operation. Without any further copying I supply successive
windows of the byte buffer to the charset decoder which places the decoded
result into a char buffer of size 2. If there is no result the byte buffer
position is reset to its previous value and the buffer limit is increased
by 1 (leading to a visible byte window of 1, 2, 3, ... bytes). If the
limit exceeds the size of the byte buffer then the few bytes yet to be
decoded are copied back to the start of the byte buffer and the rest of
the byte buffer is filled via another bulk read operation. Eventually a
character is read.

There will be bugs in the implementation below. I have successfully run a
test class that includes a selection of Unicode code points and 256 bytes
of binary data. To execute:

javac BivalentInputStream.java BivalentInputStreamTest.java &&
java BivalentInputStreamTest

Many thanks for the feedback Chris.

Regards,
Adam


import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

public class BivalentInputStream
{
    public static int bufSize=1024;

    private InputStream in;
    private ByteBuffer bb=ByteBuffer.allocate(bufSize);
    private byte[] ba=bb.array();
    private int maxLimit;

    private CharsetDecoder decoder;
    private CharBuffer cb=CharBuffer.allocate(2); //support surrogate chars
    private char[] ca=cb.array();


    /** @return The number of bytes read into the buffer. */
    private int saneBulkRead(byte[] b, int offset) throws java.io.IOException {
        int brokenNumBytesRead=in.read(b, offset, b.length-offset);
        if (brokenNumBytesRead==-1) return 0;
        return brokenNumBytesRead;
    }


    public BivalentInputStream(InputStream in) throws java.io.IOException {
        this.in=in;
        maxLimit=saneBulkRead(ba, 0);
        bb.limit(1);
        decoder=Charset.defaultCharset().newDecoder();
    }

    public BivalentInputStream(InputStream in, Charset cs) throws java.io.IOException {
        this.in=in;
        maxLimit=saneBulkRead(ba, 0);
        bb.limit(1);
        this.decoder=cs.newDecoder();
    }


    private char cachedSurrogate;
    private boolean storedSurrogate=false;

    /** @return '\uFFFF' if the stream is exhausted or the remaining bytes
        do not comprise a 16-bit char. */
    public char readChar() throws java.io.IOException {
        if (storedSurrogate==true) {
            storedSurrogate=false;
            return cachedSurrogate;
        }
        int codePoint=readCodePoint();
        if (codePoint==-1) return '\uFFFF';
        if (codePoint>0xFFFF) {
            char[] chars=Character.toChars(codePoint);
            storedSurrogate=true;
            cachedSurrogate=chars[1];
            return chars[0];
        }
        return (char) codePoint;
    }


    /** @return -1 if the stream is exhausted or the remaining bytes
        do not comprise a Unicode code point. */
    public int readCodePoint() throws java.io.IOException {
        //Buffer refill logic
        if (bb.position()==maxLimit) {
            if (maxLimit==0) return -1;
            //refill the byte buffer after moving the remaining bytes up to position 0
            int remainingBytes=maxLimit-bb.position();
            System.arraycopy(ba, bb.position(), ba, 0, remainingBytes);
            maxLimit=saneBulkRead(ba, remainingBytes);
            if (maxLimit==0) return -1; //remaining bytes do not comprise a code point
            maxLimit+=remainingBytes;
            bb.position(0);
            bb.limit(remainingBytes+1);
        }

        cb.position(0);
        int bbStartPos=bb.position();
        decoder.reset();
        CoderResult result=decoder.decode(bb, cb, true);
        decoder.flush(cb);
        if (result==CoderResult.UNDERFLOW) {
            if (bb.limit()<maxLimit) bb.limit(bb.limit()+1);
            return Character.codePointAt(ca, 0);
        }
        bb.position(bbStartPos);
        bb.limit(bb.limit()+1);
        return readCodePoint();
    }


    /** @return -1 if the stream is exhausted. */
    public int readByte() throws java.io.IOException {
        if (bb.position()==maxLimit) {
            if (maxLimit==0) return -1;
            //refill the byte buffer
            maxLimit=saneBulkRead(ba, 0);
            bb.position(0);
            bb.limit(1);
        }
        if (bb.limit()<maxLimit) bb.limit(bb.limit()+1);
        return ((int) bb.get()) & 0xFF;
    }
}

//////////////////////////////////////////////////////////////////////////////

import java.io.*;

public class BivalentInputStreamTest
{
    static int numCharUnits=0;

    public static byte[] buildTestArray() throws java.io.IOException {
        ByteArrayOutputStream baos=new ByteArrayOutputStream();
        DataOutputStream dos=new DataOutputStream(baos);

        //write code points
        String intro="Hello, World";
        dos.writeChars(intro);
        numCharUnits+=intro.length();

        for (int i=0; i<0x110000; i+=128) {
            //avoid writing lone surrogates
            if (((i>=0xD800 && i<=0xDBFF) || (i>=0xDC00 && i<=0xDFFF))!=true) {
                char[] chars=Character.toChars(i);
                dos.writeChars(new String(chars));
                numCharUnits+=chars.length;
            }
        }
        //write binary data
        for (int i=0; i<256; ++i) {
            dos.writeByte(i);
        }

        dos.flush(); dos.close();
        return baos.toByteArray();
    }


    public static void printByteArrayDifferences(byte[] array1, byte[] array2) {
        System.out.println("array1.length="+array1.length+
                           "; array2.length="+array2.length);
        byte[] smaller=array1, larger=array2;
        if (array1.length>array2.length) { smaller=array2; larger=array1; }
        for (int i=0; i<smaller.length; ++i) {
            if (array1[i]!=array2[i])
                System.out.println("position "+i+": "+(((int) array1[i]) & 0xFF)+
                                   " "+(((int) array2[i]) & 0xFF));
        }
        for (int i=smaller.length; i<larger.length; ++i) {
            System.out.println("position "+i+": "+(((int) larger[i]) & 0xFF));
        }
    }


    public static void main(String[] args) throws java.io.IOException {
        byte[] ba=buildTestArray();
        ByteArrayInputStream bais=new ByteArrayInputStream(ba);
        BivalentInputStream in=new BivalentInputStream(bais, java.nio.charset.Charset.forName("UTF-16"));
        ByteArrayOutputStream baos=new ByteArrayOutputStream();
        DataOutputStream dos=new DataOutputStream(baos);
        //read char data (for testing purposes using the stored number of char units)
        for (int i=0; i<numCharUnits; ++i) {
            char c=in.readChar();
            dos.writeChar(c);
        }
        //read binary data
        for (int i=0; i<256; ++i) {
            dos.writeByte(in.readByte());
        }
        //Compare the arrays
        dos.flush(); dos.close();
        byte[] newBA=baos.toByteArray();
        if (java.util.Arrays.equals(ba, newBA)!=true) printByteArrayDifferences(ba, newBA);
    }
}
 
Adam Warner

> You say you CAN tell the end in the DECODED stream but not in the byte
> stream. How do you notice the end in the DECODED stream?

If I call readCodePoint() upon a BivalentInputStream with valid character
data then a Unicode code point is returned or -1 to signal the end of the
stream. This is the first way of noticing the end of the decoded stream.

Alternatively I could decide that a newline code point terminates the end
of decoding. Again this is easy to detect.

More complicated protocols are possible. A programming language could
provide syntax to switch to binary decoding to reduce overhead when
transferring code and data over a network. A kind of Binary XML could use
this approach to switch to binary encoding. A tag such as
<binary octets="12345"/> could read 12345 octets of binary data
immediately following the closing > before switching back to reading text.

A bivalent approach avoids the overhead of encoding binary data in the
current character set, and the high CPU burden of compressing that data for
transmission, decompressing it again at the other end, and finally
translating the characters back to binary data. One clearly needs control
over the whole communication process, because the transformed data is
unlikely to be legal text unless the character set is a legacy encoding
such as ISO-8859-1. And even if the resulting text is legal, the binary
data will be corrupted by different operating system newline conventions.

Regards,
Adam
 
Roedy Green

> No. While I can come up with a different format (e.g. encoding the binary
> data in base 64) I'd like to solve the problem as specified.

If you use counted UTF, the problem goes away. You don't have a slow
Mickey Mouse solution. The String is handled with equal ease to any
binary field. Why goof around with baling wire?

see DataOutputStream.writeUTF and DataInputStream.readUTF
 
Roedy Green

> BivalentInputStream

I am not familiar with that class. Further, I have never heard the
term bivalent used outside the chemistry or genetics contexts.

What do you mean by "bivalent" in terms of datastreams? Do you just
mean having two different encodings, e.g. encoded char and binary and
some mechanism to toggle?
 
Chris Uppal

Adam said:
> There will be bugs in the implementation below.

You might like a couple of test inputs. The following byte array defines a
sequence of 4 Unicode code points, or 5 Java chars (sorry about the layout
mangling).

Charset utf8 = Charset.forName("UTF-8");
byte[] bytes = new byte[] {
    0x32,                                           // = U+000032
    (byte)0xD0, (byte)0xB0,                         // = U+000430
    (byte)0xE4, (byte)0xBA, (byte)0x8C,             // = U+004E8C
    (byte)0xF0, (byte)0x90, (byte)0x8C, (byte)0x82  // = U+010302
};

Also this sequence defines an /invalid/ UTF-8 sequence:

byte[] bytes = new byte[] {
    (byte)0xB0, (byte)0xD0  // = invalid
};

A couple of comments, if you want 'em:

> private int saneBulkRead(byte[] b, int offset) throws java.io.IOException {
>     int brokenNumBytesRead=in.read(b, offset,

I see you prefer self-documenting code ;-) Nice...

> public int readCodePoint() throws java.io.IOException {
>     [...]
>     if (result==CoderResult.UNDERFLOW) {
>         if (bb.limit()<maxLimit) bb.limit(bb.limit()+1);
>         return Character.codePointAt(ca, 0);
>     }
>     bb.position(bbStartPos);
>     bb.limit(bb.limit()+1);
>     return readCodePoint();
> }

If the input data is mangled, then 'result' will be isMalformed() and no amount
of extra data added to the end will fix it, so in that case the recursion will
continue more-or-less indefinitely.

I /think/ you may also have a problem with the bb.limit(..) line. It assumes
that there is enough space in bb, which I don't think is necessarily the case.

-- chris
 
Adam Warner

> You might like a couple of test inputs. The following byte array defines
> a sequence of 4 Unicode code points, or 5 Java chars (sorry about the
> layout mangling).

Many thanks. I do have to improve handling of malformed data.
> A couple of comments, if you want 'em:
>
> > private int saneBulkRead(byte[] b, int offset) throws java.io.IOException {
> >     int brokenNumBytesRead=in.read(b, offset,
>
> I see you prefer self-documenting code ;-) Nice...

An Enterprise API isn't complete until the documentation for x.plus(y)
reads: /** @return The sum of x and y, unless the sum is 42 then -1 is
returned. */

java.io.InputStream.read(byte[] b, int off, int len) returns the number of
bytes written to the byte array. Except when it doesn't. A better language
would support seamless multiple return values and their efficient
implementation. If Java had multiple return values the first return value
for this method could simply be the number of bytes written to the byte
array. The second return value, to be optionally captured, could be a
boolean denoting the end of stream. Instead of conflating two return
values there could also be a separate isEndofStream() method.

As JVMs become capable of stack allocating many new objects via escape
analysis there's potential for the efficient return of multiple values
within an explicit new array. If Java the language is changed to support
seamless multiple return values (like the recent introduction of variable
arguments on the input side) then more consistent libraries are likely.

Regards,
Adam
 
Roedy Green

> As JVMs become capable of stack allocating many new objects via escape
> analysis there's potential for the efficient return of multiple values
> within an explicit new array. If Java the language is changed to support
> seamless multiple return values (like the recent introduction of variable
> arguments on the input side) then more consistent libraries are likely.

Java the language is fine. The Jet people automatically allocate some
objects on the stack. Allocating objects there would likely require an
overhaul of the JVM.
 
