Mixing text and binary I/O

Ivan Voras

In implementing a network protocol, there's a text (ASCII) phase and a
binary phase. The ideal thing to use would be BufferedReader, but it
doesn't allow reading raw bytes. The next best thing (though slower)
would be DataInputStream, but its readLine() method is deprecated for
silly reasons (IMO). Any other suggestions?
 
Mike Schilling

Ivan Voras said:
In implementing a network protocol, there's a text (ASCII) phase and a
binary phase. The ideal thing to use would be BufferedReader, but it
doesn't allow reading raw bytes. The next best thing (though slower)
would be DataInputStream, but its readLine() method is deprecated for
silly reasons (IMO). Any other suggestions?

The simplest thing would be to read each message into a byte array and
convert the bytes in the text portion appropriately (using, for instance, an
InputStreamReader on top of a ByteArrayInputStream).
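That suggestion can be sketched like this; the message layout and the 4-byte text length are invented for illustration:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class MessageSplit {
    public static void main(String[] args) throws IOException {
        // Invented message layout: the first 4 bytes are an ASCII line,
        // the remainder is binary payload.
        byte[] message = { 'H', 'i', '\r', '\n', 0x01, 0x02 };
        int textLen = 4;

        // Decode only the text portion, explicitly as ASCII.
        BufferedReader text = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(message, 0, textLen),
                StandardCharsets.US_ASCII));
        System.out.println(text.readLine());

        // The binary portion is read straight from the byte array.
        InputStream binary = new ByteArrayInputStream(
                message, textLen, message.length - textLen);
        System.out.println(binary.read());
    }
}
```

In a real protocol, something (a length prefix, a delimiter) has to tell you where the text portion ends before you can split the array this way.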
 
Ivan Voras

Mike said:
The simplest thing would be to read each message into a byte array and
convert the bytes in the text portion appropriately (using, for instance, an
InputStream Reader on top of a ByteArrayInputStream.)

Rolling my own is not a problem, it just seems like it should belong in
the basic library.
 
Mike Schilling

Ivan Voras said:
Rolling my own is not a problem, it just seems like it should belong in
the basic library.

You'd need to define a standard way to express the boundary between the
binary and text portions.
 
frankgerlach

When converting a java char or String to bytes, you should *always*
specify the encoding, which can be "UTF-8", "ISO-8859-1", "ASCII" etc.
*Never* use the default encoding - this is system dependent. Use
String.getBytes("ASCII"), do not use String.getBytes() !
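For instance (the string is arbitrary; the byte counts are what the named encodings guarantee):

```java
import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "caf\u00e9";  // "café"
        // Explicit encodings behave the same on every platform:
        System.out.println(s.getBytes("UTF-8").length);       // 5 bytes: é takes two
        System.out.println(s.getBytes("ISO-8859-1").length);  // 4 bytes: é takes one
        // s.getBytes() uses the platform default encoding -- avoid it.
    }
}
```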
 
Chris Uppal

Ivan said:
Rolling my own is not a problem, it just seems like it should belong in
the basic library.

It would require a fairly major redesign of the standard library -- there is no
way for the current application of the Decorator pattern to express the idea
that a decorator is responsible for pushing buffered-but-unused data back
onto the underlying stream.

A pity, really. If the design were fixed (and they might as well get buffering
and random access fixed too, while they were at it), then many things would
become much easier.

-- chris
 
Ivan Voras

Mike said:
You'd need to define a standard way to express the boundary between the
binary and text portions.

Um, "this byte position here" (i.e. ftell()) is good enough, no need to
overengineer it. :)

In case of complex encodings like UTF-8, I'd expect (and will probably
create for my case) its behaviour to be like this:

- Backed by a buffer (the usual way, probably byte[])
- readByte() reads from the buffer, handles buffering of new data, etc.
- readChar() reads as many bytes as it needs to reconstitute a
character; in the case of UTF-8 it could be one or several - it doesn't
matter. If it encounters an invalid byte (by the expectations set by the
encoding in use), it raises a proper exception, because it's an encoding
error in the stream.
- Introduce private or protected pushByte() and pushChar() that do the
reverse of readXXX on the buffer. "Fix up" the fact that one character
can span several bytes by initially making the buffer 4+ bytes longer, but
don't use this extra space when filling the buffer in readByte(). Like
in C, make pushXXX work only for a single byte/character.
- Modify readLine() to use readChar(), reading characters until CR+LF; it
can use the existing logic that reads one char after CR to see if it's LF
and pushes it back if it isn't.
- Every other readXXX method uses readByte() as usual.

The intended result: freely mix bytes and characters. In the extreme
(but supported!) case, the stream can have a UTF-8 character (encoded by
one or several bytes) followed by a "raw" byte, followed by a UTF-8
character, etc. The programmer is responsible for knowing how the stream
is formatted.
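A minimal sketch of this design, restricted to single-byte (ASCII) characters for brevity; the class name MixedInput is invented here, and java.io.PushbackInputStream supplies the pushByte() behaviour:

```java
import java.io.*;

// Sketch of the design above: readByte() and readLine() over one buffer,
// with push-back provided by PushbackInputStream. ASCII-only for brevity.
class MixedInput {
    private final PushbackInputStream in;

    MixedInput(InputStream raw) {
        this.in = new PushbackInputStream(new BufferedInputStream(raw), 4);
    }

    int readByte() throws IOException {
        return in.read();
    }

    // Read characters until CR+LF; a non-LF byte after CR is pushed back.
    String readLine() throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\r') {
                int next = in.read();
                if (next != '\n' && next != -1)
                    in.unread(next);   // push the non-LF byte back
                break;
            }
            sb.append((char) b);
        }
        return sb.toString();
    }
}

public class MixedInputDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = { 'O', 'K', '\r', '\n', 0x7F };
        MixedInput mi = new MixedInput(new ByteArrayInputStream(data));
        System.out.println(mi.readLine());
        System.out.println(mi.readByte());
    }
}
```

A full readChar() for UTF-8 would sit on top of readByte() in the same class, as described in the list above.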
 
Soren Kuula

Ivan said:
In case of complex encodings like UTF-8, I'd expect (and will probably
create for my case) its behaviour to be like this:

- Backed by a buffer (the usual way, probably byte[])
- readByte() reads from the buffer, handles buffering of new data, etc.
- readChar() reads as many bytes as it needs to reconstitute a
character; in the case of UTF-8 it could be one or several - it doesn't
matter. If it encounters an invalid byte (by the expectations set by the
encoding in use), it raises a proper exception, because it's an encoding
error in the stream.

I think that java.nio.CharsetEncoder and CharsetDecoder do just that.

BTW, I agree with Frank that you should take character encoding
seriously!! Do not assume anything, and do not use defaults. Otherwise,
you will end up with something that never really works -- in other
places than yours, on other computers than yours.

Søren
 
Mike Schilling

Ivan Voras said:
In case of complex encodings like UTF-8, I'd expect (and will probably
create for my case) its behaviour to be like this:

- Backed by a buffer (the usual way, probably byte[])

In fact, I think you can build it on top of an InputStream, which is more
flexible and more general, since all you need is a source of bytes.
- readByte() reads from the buffer, handles buffering of new data, etc.

Let the underlying stream handle buffering.
- readChar() reads as many bytes as it needs to reconstitute a
character; in the case of UTF-8 it could be one or several - it doesn't
matter. If it encounters an invalid byte (by the expectations set by the
encoding in use), it raises a proper exception, because it's an encoding
error in the stream.

I don't know how to build this in general. It's mostly straightforward to
build for a specific encoding, say UTF-8, but CharsetDecoder has no method
that means "decode exactly one character". (I suppose you could give it one
byte, then two, then three, etc. until it stops returning a failure status,
but that seems inelegant.) Even in UTF-8, you get oddities where a
codepoint > FFFF returns two characters; returning the first consumes 4
bytes, and returning the second consumes 0 bytes. In other words, you'd
have to be careful with logic like "I know that this set of characters
occupies bytes 3-10, and I've processed all of them, so I'll switch to
reading bytes again."
- Introduce private or protected pushByte() and pushChar() that do the
reverse of readXXX on the buffer. "Fix up" the fact that one character
can span several bytes by initially making the buffer 4+ bytes longer, but
don't use this extra space when filling the buffer in readByte(). Like
in C, make pushXXX work only for a single byte/character.
- Modify readLine() to use readChar(), reading characters until CR+LF; it
can use the existing logic that reads one char after CR to see if it's LF
and pushes it back if it isn't.

More precisely, it reads until CR, LF, or CRLF. You're right that pushing back
a non-LF after CR is easy enough.
 
Ivan Voras

Mike said:
I don't know how to build this in general. It's mostly straightforward to
build for a specific encoding, say UTF-8, but CharsetDecoder has no method
that means "decode exactly one character". (I suppose you could give it one

Hmm, ok. This is a slight problem (and IMO a candidate for rectifying),
but for my current purpose, I can limit it to UTF-8 and use the sort-of
implementation in DataInputStream. I can always point people to file a
problem report with Java if they need 4-byte characters :) (just kidding)
 
Chris Uppal

Ivan said:
Hmm, ok. This is a slight problem (and IMO a candidate for rectifying),
but for my current purpose, I can limit it to UTF-8 and use the sort-of
implementation in DataInputStream.

Based on a previous attempt to use CharsetDecoder "raw", I suggest that you at
least consider not using one at all, but doing your own UTF-8 {en/de}code logic
instead.

-- chris
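A hand-rolled UTF-8 readChar() along these lines might look like this; it is a sketch limited to the BMP (1- to 3-byte sequences, no surrogate pairs), and the class name Utf8Reader is invented:

```java
import java.io.*;

// Minimal hand-rolled UTF-8 decoding over an InputStream; BMP only.
class Utf8Reader {
    private final InputStream in;

    Utf8Reader(InputStream in) { this.in = in; }

    int readByte() throws IOException { return in.read(); }

    int readChar() throws IOException {
        int b = in.read();
        if (b == -1) return -1;
        if ((b & 0x80) == 0) return b;                         // 0xxxxxxx
        if ((b & 0xE0) == 0xC0)                                // 110xxxxx
            return ((b & 0x1F) << 6) | cont();
        if ((b & 0xF0) == 0xE0)                                // 1110xxxx
            return ((b & 0x0F) << 12) | (cont() << 6) | cont();
        throw new IOException("bad or unsupported UTF-8 lead byte");
    }

    private int cont() throws IOException {
        int b = in.read();
        if (b == -1 || (b & 0xC0) != 0x80)
            throw new IOException("bad UTF-8 continuation byte");
        return b & 0x3F;
    }
}

public class Utf8ReaderDemo {
    public static void main(String[] args) throws IOException {
        // 0xC3 0xA9 is UTF-8 for U+00E9 ('é'); 0x05 is a raw binary byte.
        byte[] data = { (byte) 0xC3, (byte) 0xA9, 0x05 };
        Utf8Reader r = new Utf8Reader(new ByteArrayInputStream(data));
        System.out.println(r.readChar());  // 233, i.e. U+00E9
        System.out.println(r.readByte());  // 5
    }
}
```

Because each character is assembled byte by byte, the stream position after readChar() is exactly the end of that character, so bytes and characters can be mixed freely.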
 
Mike Schilling

Chris Uppal said:
Based on a previous attempt to use CharsetDecoder "raw", I suggest that you
at least consider not using one at all, but doing your own UTF-8 {en/de}code
logic instead.

Here's a place where I wish Java had output parameters; the signature I'd
want for readChar is

/** @returns a character in the range 0 to 65535, or -1 at EOF
* @param moreDecoded returns true if another character is available
without consuming more bytes
*/
int readChar(out boolean moreDecoded);

As it is, I suppose a moreDecoded() method is the least of evils, i.e.
better than forcing the client to check that the returned character is in
the range D800-DBFF.
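The surrogate situation can be seen directly: a code point above U+FFFF becomes two Java chars, the first of which falls in the D800-DBFF range:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1F600 lies above U+FFFF, so in Java it becomes two chars:
        // a high surrogate in D800-DBFF followed by a low surrogate.
        char[] pair = Character.toChars(0x1F600);
        System.out.println(pair.length);
        System.out.println(Character.isHighSurrogate(pair[0]));
        System.out.println(Character.isLowSurrogate(pair[1]));
    }
}
```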
 
Ivan Voras

Chris said:
Based on a previous attempt to use CharsetDecoder "raw", I suggest that you at
least consider not using one at all, but doing your own UTF-8 {en/de}code logic
instead.

Agreed. The "sort-of" implementation in DataInputStream is good enough
for me to adapt it.
 
Stefan Ram

Mike Schilling said:
/** @returns a character in the range 0 to 65535, or -1 at EOF
* @param moreDecoded returns true if another character is available
without consuming more bytes
*/
int readChar(out boolean moreDecoded);
As it is, I suppose a moreDecoded() method is the least of evils, i.e.
better than forcing the client to check that the returned character is in
the range D800-DBFF.

Possibly, the Java SE way would be to implement the following interface.

http://download.java.net/jdk7/docs/api/java/util/Iterator.html
 
Mike Schilling

Stefan Ram said:
Possibly, the Java SE way would be to implement the following interface.

I don't think so. An Iterator knows when there's nothing more to return.
Part of the assumption here is that the client knows (based on the protocol
definition) how many characters to ask for before switching back to binary.
There's nothing to tell the Iterator that.

Unless you mean that each call to readChar() returns an
Iterator<Character>. But that seems awfully cumbersome, both for the
implementation, which has to wrap each decoded char into a Character and
then wrap *that* in an Iterator, and for the client which has to unwrap each
of those.
 
Stefan Ram

Mike Schilling said:
I don't think so. An Iterator knows when there's nothing more to return.
Part of the assumption here is that the client knows (based on the protocol
definition) how many characters to ask for before switching back to binary.
There's nothing to tell the Iterator that.

I have not read the whole thread, but was just responding
to what I have quoted. If one wants something like

int readChar( out boolean moreDecoded )

, then this can be done with an iterator.

The »out« means that »readChar« tells its client whether there
are more characters by this out-parameter. When you now say
that the client already knows how many characters will be
coming, you might be talking about something else beyond the
scope of my answer. I was just referring to

int readChar( out boolean moreDecoded )

in isolation.
Unless you mean that each call to readChar() returns an
Iterator<Character>. But that seems awfully cumbersome, both
for the implementation, which has to wrap each decoded char
into a Character and then wrap *that* in an Iterator, and for
the client which has to unwrap each of those..

For a sequence of multiple iterations using an iterator, there
is no need to create a new iterator object for each iteration.
The same iterator object might be reused using "set" instead
of "wrap". This, however, is not possible with
java.lang.Character, because it is immutable.

An iterator object also might implement other methods than
those of the interface "Iterator" to read information from
its client.
 
Dale King

Ivan said:
In implementing a network protocol, there's a text (ASCII) phase and a
binary phase. The ideal thing to use would be BufferedReader, but it
doesn't allow reading raw bytes. The next best thing (though slower)
would be DataInputStream, but its readLine() method is deprecated for
silly reasons (IMO). Any other suggestions?

In general there is no good way to do this using an InputStreamReader
wrapping the raw InputStream. The only way would require your protocol
to contain information that tells you how many of the following bytes
are part of the ASCII phase. You then read out those bytes, wrap them in a
ByteArrayInputStream and InputStreamReader. You cannot just read from
the InputStreamReader and then go back to InputStream. The problem is
that the character decoder can buffer up a few bytes and read ahead into
the InputStream.

You have to look beyond the original I/O classes to the new I/O (NIO)
classes introduced in JDK 1.4. They will be able to
handle this better because the buffering can be more explicit. You can
use a ByteBuffer from which a CharsetDecoder extracts bytes. But it will
not have the problem of reading too far, because it can look into the
ByteBuffer without gobbling up the bytes.

See the NIO documentation.
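A sketch of that approach, with invented buffer contents: the decoder advances the ByteBuffer's position only over the bytes it actually consumed, so the binary portion stays available afterwards:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class NioMixDemo {
    public static void main(String[] args) {
        // Invented contents: "Hi" in ASCII followed by two binary bytes.
        ByteBuffer buf = ByteBuffer.wrap(new byte[] { 'H', 'i', 1, 2 });

        // Ask the decoder for exactly two characters; it advances buf's
        // position only over the bytes it actually consumed.
        CharsetDecoder dec = Charset.forName("US-ASCII").newDecoder();
        CharBuffer chars = CharBuffer.allocate(2);
        dec.decode(buf, chars, true);
        chars.flip();

        System.out.println(chars);       // the text portion
        System.out.println(buf.get());   // first binary byte
        System.out.println(buf.get());   // second binary byte
    }
}
```

Here the CharBuffer's capacity bounds how far the decoder may read, which is the "explicit buffering" Dale describes; with an InputStreamReader you have no such control.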
 
