Stream, Reader and text vs binary


Russell Wallace

Suppose one needs to both store (in a file) and transmit (via sockets)
data that will be mostly text, but with the occasional chunk of binary
(e.g. GIF images).

It seems to me that there are three possible ways:

1) Use a Reader (intended for text) and write the binary data directly
as 16 bits to a character.

I assume this _won't_ work, at least not reliably, because various
translations will be done that would mess up the binary data?

2) Use a Reader (intended for text) and encode the binary data as text
in hex, base64 or similar. This would work, though I was hoping for a
more elegant solution.

3) Use a Stream (intended for binary) and write strings as sequences of
16-bit integers.

Is it safe to do this? That is, if you put a Java String through a
channel that treats it as a literal sequence of 16-bit integers, are you
guaranteed to get the same character sequence out the other end? Or are
there Unicode complications, bank switching to squeeze different chunks
of the 32-bit code point space into the space of 16 bit Java characters,
that sort of thing that might mean (char)1234 on system A doesn't mean
the same character as (char)1234 on system B?

In general, what's the recommended way to do this - what do people
normally do if they want to put images in an XML file, say? Is there a
fourth way I haven't thought of?

Thanks,
 

Thomas Hawtin

Russell said:
Suppose one needs to both store (in a file) and transmit (via sockets)
data that will be mostly text, but with the occasional chunk of binary
(e.g. GIF images).

It seems to me that there are three possible ways:

1) Use a Reader (intended for text) and write the binary data directly
as 16 bits to a character.

I assume this _won't_ work, at least not reliably, because various
translations will be done that would mess up the binary data?

If you use a Reader you will need to decide how character data is encoded
on the underlying stream. Beware: much of the Java library is
booby-trapped. For instance, if you use the
java.io.InputStreamReader(InputStream) constructor, you are leaving the
library to make the character-encoding decision for you; in that case it
uses whatever the machine happens to be set to. If you choose an encoding
explicitly, say UTF-8, then every value of char will be preserved.
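For illustration, a minimal fragment along those lines (the file name is a
placeholder, and exception handling is left out, as in the example further
down the thread):

Writer writer = new OutputStreamWriter(
        new FileOutputStream("data.txt"), "UTF-8");   // encoding named explicitly
writer.write("some text");
writer.close();

Reader reader = new InputStreamReader(
        new FileInputStream("data.txt"), "UTF-8");    // must match on the way back in
// new InputStreamReader(in) with no charset argument would silently use
// whatever the platform default happens to be
int c;
while ((c = reader.read()) != -1) {
    // each value of c is a char decoded from UTF-8
}
reader.close();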
2) Use a Reader (intended for text) and encode the binary data as text
in hex, base64 or similar. This would work, though I was hoping for a
more elegant solution.

No, not elegant.
3) Use a Stream (intended for binary) and write strings as sequences of
16-bit integers.

Is it safe to do this? That is, if you put a Java String through a
channel that treats it as a literal sequence of 16-bit integers, are you
guaranteed to get the same character sequence out the other end?

That should work. char is a 16-bit value. UTF-8 would be more conventional.

Or are there Unicode complications, bank switching to squeeze different
chunks of the 32-bit code point space into the space of 16 bit Java
characters, that sort of thing that might mean (char)1234 on system A
doesn't mean the same character as (char)1234 on system B?

There are char values that represent surrogate pairs. However, the
Unicode code-points they represent are above 0x10000. So there should be
no loss of information (although not every sequence of octets represents
valid UTF-8).
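As a sketch of option 3 taken literally (the file name and the int length
prefix are just one way to do it; error handling omitted):

DataOutputStream out = new DataOutputStream(new FileOutputStream("strings.bin"));
String s = "any text at all";
out.writeInt(s.length());            // so the reader knows where the chars stop
for (int i = 0; i < s.length(); i++) {
    out.writeChar(s.charAt(i));      // raw 16-bit value, no encoding applied
}
out.close();

DataInputStream in = new DataInputStream(new FileInputStream("strings.bin"));
int len = in.readInt();
StringBuilder sb = new StringBuilder(len);
for (int i = 0; i < len; i++) {
    sb.append(in.readChar());        // identical char values come back
}
in.close();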
In general, what's the recommended way to do this - what do people
normally do if they want to put images in an XML file, say? Is there a
fourth way I haven't thought of?

I believe XML either uses out-of-channel binary data (XHTML img, for
instance) or Base64 encoding. You can have a perfectly valid XML
document that is just a Base64 blob between a pair of tags. XML does not
necessarily mean interoperable.

Much better is to use a binary data format, and encode Strings as UTF-8.
You could even cheat and use serialisation, if you don't mind a
Java-only protocol.
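For example, a crude framing for a mixed file or socket stream might look
like this (the type bytes, the length prefixes and loadGifSomehow() are all
made up for the sake of the sketch):

DataOutputStream out = new DataOutputStream(new FileOutputStream("mixed.dat"));

byte[] text = "hello, world".getBytes("UTF-8");   // encode Strings as UTF-8
out.writeByte(0);                                 // 0 = text record
out.writeInt(text.length);
out.write(text);

byte[] gif = loadGifSomehow();                    // placeholder for your image bytes
out.writeByte(1);                                 // 1 = binary record
out.writeInt(gif.length);
out.write(gif);

out.close();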

Tom Hawtin
 

Arne Vajhøj

Russell said:
Suppose one needs to both store (in a file) and transmit (via sockets)
data that will be mostly text, but with the occasional chunk of binary
(e.g. GIF images).

It seems to me that there are three possible ways:

1) Use a Reader (intended for text) and write the binary data directly
as 16 bits to a character.

I assume this _won't_ work, at least not reliably, because various
translations will be done that would mess up the binary data?

2) Use a Reader (intended for text) and encode the binary data as text
in hex, base64 or similar. This would work, though I was hoping for a
more elegant solution.

3) Use a Stream (intended for binary) and write strings as sequences of
16-bit integers.

I would suggest:

4) use DataInputStream/DataOutputStream (see the sketch just after this list)

5) have both InputStream/OutputStream and BufferedReader/PrintWriter and
a protocol that enables both ends to switch between them
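A rough sketch of option 4 over a socket (the socket and imageBytes
variables are assumed to exist; error handling omitted):

DataOutputStream out = new DataOutputStream(socket.getOutputStream());
out.writeUTF("a line of text");          // length-prefixed modified UTF-8
out.writeInt(imageBytes.length);         // binary chunk carries its own length
out.write(imageBytes);
out.flush();

DataInputStream in = new DataInputStream(socket.getInputStream());
String line = in.readUTF();
byte[] image = new byte[in.readInt()];
in.readFully(image);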

Arne
 

Chris

Russell said:
Suppose one needs to both store (in a file) and transmit (via sockets)
data that will be mostly text, but with the occasional chunk of binary
(e.g. GIF images).

It seems to me that there are three possible ways:

1) Use a Reader (intended for text) and write the binary data directly
as 16 bits to a character.

I assume this _won't_ work, at least not reliably, because various
translations will be done that would mess up the binary data?

2) Use a Reader (intended for text) and encode the binary data as text
in hex, base64 or similar. This would work, though I was hoping for a
more elegant solution.

3) Use a Stream (intended for binary) and write strings as sequences of
16-bit integers.

Is it safe to do this? That is, if you put a Java String through a
channel that treats it as a literal sequence of 16-bit integers, are you
guaranteed to get the same character sequence out the other end? Or are
there Unicode complications, bank switching to squeeze different chunks
of the 32-bit code point space into the space of 16 bit Java characters,
that sort of thing that might mean (char)1234 on system A doesn't mean
the same character as (char)1234 on system B?

In general, what's the recommended way to do this - what do people
normally do if they want to put images in an XML file, say? Is there a
fourth way I haven't thought of?

You should encode the text and transmit everything as a stream of bytes.
There are methods built into Java to handle this for you. The methods
are reliable, reversible, and they will encode most western characters
in a single byte. The best encoding to use is UTF-8, because it handles
all characters in Unicode in a clean way.

Example:

String str = "My String";
FileOutputStream fos = new FileOutputStream("/myfile.out");
// name the charset explicitly rather than relying on the platform default
OutputStreamWriter writer = new OutputStreamWriter(fos, "UTF-8");
writer.write(str);
writer.flush();   // push the encoded text out before writing raw bytes

// write your images to the FileOutputStream directly and
// bypass the writer

fos.close();
 

Thomas Hawtin

Arne said:
4) use DataInputStream/DataOutputStream

writeUTF/readUTF only allows strings of up to 65535 bytes of modified
UTF-8. (In the worst case, a String made entirely of characters from
U+0800 upwards, each taking three bytes, could be at most 65535/3
characters long.)
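A quick way to see the limit bite, for what it's worth (the 70000 is
arbitrary, just over the ceiling):

char[] big = new char[70000];
Arrays.fill(big, 'a');                       // one byte each in modified UTF-8
DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
try {
    out.writeUTF(new String(big));
} catch (UTFDataFormatException e) {
    // thrown: the encoded form would exceed 65535 bytes
}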
5) have both InputStream/OutputStream and BufferedReader/PrintWriter and
a protocol that enables both ends to switch between them

You would have to be surprisingly careful getting that to work. The
Readers/Writers will probably over read/under write, so switching will
be difficult.

Tom Hawtin
 

Arne Vajhøj

Thomas said:
writeUTF/readUTF only allows strings of up to 65535 bytes of modified
UTF-8. (In the worst case, a String made entirely of characters from
U+0800 upwards, each taking three bytes, could be at most 65535/3
characters long.)

I can usually live with lines smaller than that. Maybe the original
poster can too.
You would have to be surprisingly careful getting that to work. The
Readers/Writers will probably over read/under write, so switching will
be difficult.

I am not so worried about the output - flush should do that.

But maybe it would be wise on the input side to only use
InputStreamReader and not BufferedReader.

Arne
 

Thomas Hawtin

Arne said:
I can usually live with lines smaller than that. Maybe the original
poster can too.

In this day and age, I'd find it very surprising to come across such a
limitation. Several years ago I became very unpopular because an
implementation of an API I was using couldn't cope with strings of more
than a certain length.
I am not so worried about the output - flush should do that.

So long as you don't mind not inconsiderable inefficiency. And don't
forget to flush every time.
But maybe it would be wise on the input side to only use
InputStreamReader and not BufferedReader.

Still won't work consistently.

Tom Hawtin
 

Arne Vajhøj

Thomas said:
In this day and age, I'd find it very surprising to come across such a
limitation. Several years ago I became very unpopular because an
implementation of an API I was using couldn't cope with strings of more
than a certain length.

You did notice that I am talking about lines - not text fragments?
So long as you don't mind not inconsiderable inefficiency. And don't
forget to flush every time.

Only when switching mode.
Still won't work consistently.

Why?

Arne
 

Thomas Hawtin

Arne said:
Only when switching mode.

Mode switching may be frequent. Particularly if it is dealing with lots
of small sections of text.

Even if you ask for a single character, Sun's implementation attempts to
grab a block of three bytes.

Tom Hawtin
 

Russell Wallace

Thanks to everyone who replied!
I can usually live with lines smaller than that. Maybe the original
poster can too.

I'm of the school of thought that says hardcoded limits are at best a
venial sin; but I can easily roll my own UTF-8 methods (or just use a
Deflater for better compression), so DataInputStream/DataOutputStream
look like the way to go.
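Roughly what such a pair of classes might look like (names and the int
length prefix are only for illustration):

import java.io.*;

// illustrative names only - see the naming question below
class BigUTFOutputStream extends DataOutputStream {
    BigUTFOutputStream(OutputStream out) { super(out); }

    public void writeBigUTF(String s) throws IOException {
        byte[] bytes = s.getBytes("UTF-8");   // standard UTF-8, not the modified form
        writeInt(bytes.length);               // int length prefix, so no 64K ceiling
        write(bytes);
    }
}

class BigUTFInputStream extends DataInputStream {
    BigUTFInputStream(InputStream in) { super(in); }

    public String readBigUTF() throws IOException {
        byte[] bytes = new byte[readInt()];
        readFully(bytes);
        return new String(bytes, "UTF-8");
    }
}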

Hmm, suppose one were to wrap said UTF-8 methods in a class
ModifiedDataOutputStream extends DataOutputStream... is there a
convention as to what actual name to use for ModifiedDataOutputStream?
 

Thomas Hawtin

Russell said:
Hmm, suppose one were to wrap said UTF-8 methods in a class
ModifiedDataOutputStream extends DataOutputStream... is there a
convention as to what actual name to use for ModifiedDataOutputStream?

Privately, ObjectOutputStream uses writeLongUTF for an eight-byte (long)
length followed by a modified UTF-8 body, and writeString to decide
whether to use that or the old format (signified by a type byte).

Tom Hawtin
 

Arne Vajhøj

Thomas said:
Even if you ask for a single character, Sun's implementation attempts to
grab a block of three bytes.

Ouch.

That is not very friendly towards others using the same stream.

#5 is out.

Arne
 

Mike Schilling

Arne Vajhøj said:
Ouch.

That is not very friendly towards others using the same stream.

#5 is out.

The thing is, any text data within the stream should be clearly delimited
(either with markers or by length.) It's simple enough to read it into a
byte array, wrap that with a ByteArrayInputStream, and read *that* with an
InputStreamReader.
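Something like the following, assuming an int byte-count in front of each
text section (the framing and the socket variable are invented for the
sketch; the wrapping trick is the point):

DataInputStream in = new DataInputStream(socket.getInputStream());
byte[] textBytes = new byte[in.readInt()];
in.readFully(textBytes);                 // pull exactly the delimited bytes off the wire

BufferedReader reader = new BufferedReader(new InputStreamReader(
        new ByteArrayInputStream(textBytes), "UTF-8"));
String line;
while ((line = reader.readLine()) != null) {
    // handle the text; the underlying stream position is untouched,
    // so a binary chunk can follow immediately
}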
 
