.read() returns an int not a char, why?

JM

Why do the Java Reader classes (File/Buffered/Stream) etc .read()
methods return an int not a char?

For example the javadoc for BufferedReader

.... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

"public int read()"

and then the javadoc indicates return value as:

"The character read, as an integer in the range 0 to 65535
(0x00-0xffff), or -1 if the end of the stream has been reached"

The only reason I have come up with is that the class wants to
indicate end-of-stream with a -1. Incidentally when did the character
(singular) become two bytes?

I am an engineer, not a comp.sci. person, so I'd appreciate some patience in
your reply.

Jonathan
 
Chris Dollin

JM said:
The only reason I have come up with is that the class wants to
indicate end-of-stream with a -1. Incidentally when did the character
(singular) become two bytes?

Java's chars have always been two bytes, so as to store 16-bit
Unicode characters.

(We'll pass quietly over the problems with Unicode now needing more than
16 bits for an unpacked character.)
 
Mike Schilling

JM said:
Why do the Java Reader classes (File/Buffered/Stream) etc .read()
methods return an int not a char?

For example the javadoc for BufferedReader

... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

"public int read()"

and then the javadoc indicates return value as:

"The character read, as an integer in the range 0 to 65535
(0x00-0xffff), or -1 if the end of the stream has been reached"

The only reason I have come up with is that the class wants to
indicate end-of-stream with a -1.

That's exactly right. If it returned a char, there would be no
"illegal" value left to indicate EOF.
Incidentally when did the character
(singular) become two bytes?

A char in Java is a 16-bit Unicode value (technically a UTF-16 code unit), not
a byte.
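
For what it's worth, a minimal sketch of the usual idiom (the file name
"a.txt" is just an example): keep the result in an int, test it against -1,
and only then narrow it to a char.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadLoop {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("a.txt"));
        int c;                      // int, not char, so -1 can mean end-of-stream
        while ((c = in.read()) != -1) {
            char ch = (char) c;     // safe: c is now known to be 0..65535
            System.out.print(ch);
        }
        in.close();
    }
}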
 
Patricia Shanahan

JM said:
Why do the Java Reader classes (File/Buffered/Stream) etc .read()
methods return an int not a char?

For example the javadoc for BufferedReader

... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

"public int read()"

and then the javadoc indicates return value as:

"The character read, as an integer in the range 0 to 65535
(0x00-0xffff), or -1 if the end of the stream has been reached"

The only reason I have come up with is that the class wants to
indicate end-of-stream with a -1. Incidentally when did the character
(singular) become two bytes?

Yes, read returns a wider type than char so that there is a spare value
to represent end-of-stream.

One of the continuing trends in computing has been increasing numbers of
bits to represent a character, from 6 to 7 to 8 to 16... Java char is 16
bits.

Patricia
 
Lew

JM said:
Why do the Java Reader classes (File/Buffered/Stream) etc .read()
methods return an int not a char?

For example the javadoc for BufferedReader

.... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

"public int read()"

and then the javadoc indicates return value as:

"The character read, as an integer in the range 0 to 65535
(0x00-0xffff), or -1 if the end of the stream has been reached"

The only reason I have come up with is that the class wants to
indicate end-of-stream with a -1.

It allows any value in the range of char to be represented as a positive
value. -1 is therefore guaranteed to be distinct from any valid value.

If read() returned a char instead, every 16-bit value would be a legitimate
character, leaving no value free to signal end-of-stream.
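
A small illustration of that point (just a sketch, not from the thread):
casting -1 to char silently turns it into the perfectly legal character
U+FFFF, so a char return type could not carry an unambiguous EOF marker.

public class CharSentinel {
    public static void main(String[] args) {
        char c = (char) -1;
        System.out.println((int) c);        // prints 65535
        System.out.println(c == '\uffff');  // prints true
        // As an int, -1 lies outside 0..65535, so it can never collide
        // with a character the stream might actually deliver.
    }
}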
Incidentally when did the character (singular) become two bytes?

In Java's case, with the invention of Java.
 
Roedy Green

Incidentally when did the character
(singular) become two bytes?

With Java 1.0. C++ is in transition from 8 to 16.

It is now much more common to have a document containing multiple
languages. You can't encode such a document with only 8 bits per char, so
Java used Unicode from day one, with 16 bits per char. 16-bit Unicode was
even big enough to include Chinese. However, Unicode has since been
extended beyond 16 bits to allow Ugaritic (cuneiform), musical symbols,
Cypriot, etc. Java has somewhat baling-wire support for these
supplementary characters.

See http://mindprod.com/jgloss/unicode.html

Of course this would make documents on average twice as big as they
used to be, so UTF-8 was invented to make simple documents almost as
compact as if they had been encoded with an 8-bit national encoding.

see http://mindprod.com/jgloss/utf.html
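
As a rough illustration of that size trade-off (a sketch with made-up sample
strings): ASCII text stays at one byte per character in UTF-8 but takes two
in UTF-16, while Chinese text takes three bytes per character in UTF-8 and
two in UTF-16.

public class EncodingSizes {
    public static void main(String[] args) throws Exception {
        String english = "hello";
        String chinese = "\u4e2d\u6587";                           // two Chinese characters
        System.out.println(english.getBytes("UTF-8").length);     // 5
        System.out.println(english.getBytes("UTF-16BE").length);  // 10
        System.out.println(chinese.getBytes("UTF-8").length);     // 6  (3 bytes each)
        System.out.println(chinese.getBytes("UTF-16BE").length);  // 4  (2 bytes each)
    }
}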

Encoding is about how documents are stored externally, which is very
complicated and varied so as to deal with interchange with other computer
languages and legacy applications. Internally, Java strings are all stored
simply as 16-bit Unicode (UTF-16).

See http://mindprod.com/jgloss/encoding.html
 
John W. Kennedy

Patricia said:
One of the continuing trends in computing has been increasing numbers of
bits to represent a character, from 6 to 7 to 8 to 16... Java char is 16
bits.

Not if you go back far enough, though. The IBM 650 took 14 bits to
represent a character (double bi-quinary), and its market successor, the
707x series, took 10 (double 2-of-5).
 
Roedy Green

Not if you go back far enough, though. The IBM 650 took 14 bits to
represent a character (double bi-quinary), and its market successor, the
707x series, took 10 (double 2-of-5).

In the olden days, each site would invent its own private 6-bit
encoding. I recall sitting with Vern Detwiler (later of MacDonald
Detwiler) looking at this newfangled 7-bit ASCII code and playing
with how we might make UBC's 6-bit code somewhat ASCII compatible for
the new IBM 7044. We had to decide what characters to include. Back
then popular characters included the word mark and record mark.

Later with the IBM 360 we had ENORMOUS 8-bit EBCDIC character sets
that came in a zillion variants. You still constrained yourself mainly
to upper case because printers used a rotating chain or band of
pre-formed characters, and extra chars slowed them down drastically.
 
JM

Mike Schilling said:
That's exactly right. If it returned a char, there would be no
"illegal" value left to indicate EOF.

A char in Java is a 16-bit Unicode value (technically a UTF-16 code unit), not
a byte.

Many thanks for everyone's replies. Now what does not make sense is
that when I call BufferedWriter.write(int), only one 8-bit byte gets
written.

BufferedWriter bw = new BufferedWriter(new FileWriter("a"));
bw.write(1);
bw.write(256);
bw.close();
System.exit(0);

Creates a file of length 2 (bytes) containing
01
3F
in file "a" and not 16 bits.

Makes no sense to me.

Jonathan
 
Lew

JM said:
BufferedWriter bw = new BufferedWriter(new FileWriter("a"));

Don't use TAB characters in Usenet listings. They make them very hard to read.
bw.write(1);
bw.write(256);
bw.close();
System.exit(0);

Creates a file of length 2 (bytes) containing
01
3F
in file "a" and not 16 bits.

Makes no sense to me.

What is the default character encoding for your platform?

The Writer will translate the characters into that encoding unless you specify a
different one. Many encodings use only one byte per character, or one byte for
each of the most common characters. It seems that UTF-16 is not your default
encoding for files, eh?

Google for "character encoding" and "Unicode", and read the material about
these concepts on java.sun.com, then ask about what is left out in those
references.
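
If you want to take the platform default out of the picture, one way (a
sketch, not the only way) is to wrap a FileOutputStream in an
OutputStreamWriter and name the encoding yourself:

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        BufferedWriter bw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("a"), "UTF-8"));
        bw.write(1);    // one byte in UTF-8: 0x01
        bw.write(256);  // two bytes in UTF-8: 0xC4 0x80 (U+0100), no '?' substitution
        bw.close();     // the file "a" ends up 3 bytes long
    }
}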
 
Mike Schilling

JM said:
Many thanks for everyone's replies. Now what does not make sense is
that when I call BufferedWriter.write(int), only one 8-bit byte gets
written.

BufferedWriter bw = new BufferedWriter(new FileWriter("a"));
bw.write(1);
bw.write(256);
bw.close();
System.exit(0);

Creates a file of length 2 (bytes) containing
01
3F

note that 3F isn't 256; it's an ASCII question mark (?). I'll explain
why below.
in file "a" and not 16 bits.

Makes no sense to me.

Internally, (that is, in memory), Java represents characters as
Unicode. Externally (in files, on the wire, etc.), characters are
"encoded" into one or more bytes, using some encoding. The most
common ones are:

UTF-16: two bytes for each character. Includes all of Unicode.
UTF-8: one byte for ASCII characters (0-127); two or three bytes for
other characters. Includes all of Unicode.
ASCII: one byte per character. Includes only the first 128 Unicode
characters.
CP-1252: one byte per character, including all the ASCII characters
plus some Microsoft-specific extensions. Includes 256 Unicode characters.
ISO-LATIN-1: one byte per character, including all the ASCII characters
plus some special characters used in European languages. Includes 256
Unicode characters.

There are many others. If you don't specify an encoding, as in your
example, Java chooses a default one which is system-dependent.
Encodings will, in general, replace characters they don't contain by a
question mark, which is what you're seeing. (I don't know what your
system's default encoding is. If you're on Windows, it's probably
CP-1252, but ASCII would do the same thing, since neither of them
contains the character 256.)

This is a complicated subject, and I've omitted many issues (including
the fact that Unicode now requires 21 bits to represent all of its
characters, not 16). I hope that this helped, but to really
understand it you'll need to find a more detailed writeup. Here's a
start: http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings
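
A small demonstration of that substitution (a sketch using String.getBytes
with a few named charsets): the character 256 (U+0100) becomes a '?' (0x3F)
in ASCII and Latin-1 but survives in UTF-8 and UTF-16.

import java.util.Arrays;

public class ReplacementDemo {
    public static void main(String[] args) throws Exception {
        String s = "\u0100";                                            // character 256
        System.out.println(Arrays.toString(s.getBytes("US-ASCII")));   // [63]        -> '?'
        System.out.println(Arrays.toString(s.getBytes("ISO-8859-1"))); // [63]        -> '?'
        System.out.println(Arrays.toString(s.getBytes("UTF-8")));      // [-60, -128] -> 0xC4 0x80
        System.out.println(Arrays.toString(s.getBytes("UTF-16BE")));   // [1, 0]      -> 0x01 0x00
    }
}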
 
Mike Schilling

I was trying to keep things simple by pretending that Unicode is still
16 bits. Time enough to introduce surrogate pairs later on.
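
For the curious, a tiny sketch of what "surrogate pairs" means in practice
(U+1D11E, the musical G clef, is just an example): a code point above U+FFFF
occupies two Java chars.

public class SurrogateDemo {
    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2 chars (one surrogate pair)
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
        System.out.printf("%04X %04X%n", (int) clef.charAt(0), (int) clef.charAt(1));
        // prints D834 DD1E: the high and low surrogates for U+1D11E
    }
}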
 
