.read() returns an int not a char, why?

JM

Why do the Java Reader classes (File/Buffered/Stream) etc .read()
methods return an int not a char?

For example the javadoc for BufferedReader

.... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

"public int read()"

and then the javadoc indicates return value as:

"The character read, as an integer in the range 0 to 65535
(0x00-0xffff), or -1 if the end of the stream has been reached"

The only reason I have come up with is that the class wants to
indicate end-of-stream with a -1. Incidentally when did the character
(singular) become two bytes?

I am an engineer, not a comp.sci. person, so I'd appreciate some patience in
your reply.

Jonathan
 
Chris Dollin

JM said:
The only reason I have come up with is that the class wants to
indicate end-of-stream with a -1. Incidentally when did the character
(singular) become two bytes?

Java's chars have always been two bytes, so as to store 16-bit
Unicode characters.

(We'll pass quietly over the problems with Unicode now needing more than
16 bits for an unpacked character.)
 
Mike Schilling

JM said:
Why do the Java Reader classes (File/Buffered/Stream) etc .read()
methods return an int not a char?

For example the javadoc for BufferedReader

... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

"public int read()"

and then the javadoc indicates return value as:

"The character read, as an integer in the range 0 to 65535
(0x00-0xffff), or -1 if the end of the stream has been reached"

The only reason I have come up with is that the class wants to
indicate end-of-stream with a -1.

That's exactly right. If it returned a char, there would be no
"illegal" value left to indicate EOF.
Incidentally when did the character
(singular) become two bytes?

A char in Java is a 16-bit Unicode value (technically a UTF-16 code unit), not
a byte.
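
For what it's worth, a minimal sketch of the usual idiom (the file name
"a.txt" is just an example): keep the result in an int, test it against -1,
and only then narrow it to a char.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadLoop {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("a.txt"));
        int c;                      // int, not char, so -1 can mean end-of-stream
        while ((c = in.read()) != -1) {
            char ch = (char) c;     // safe: c is now known to be 0..65535
            System.out.print(ch);
        }
        in.close();
    }
}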
 
Patricia Shanahan

JM said:
Why do the Java Reader classes (File/Buffered/Stream) etc .read()
methods return an int not a char?

For example the javadoc for BufferedReader

... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

"public int read()"

and then the javadoc indicates return value as:

"The character read, as an integer in the range 0 to 65535
(0x00-0xffff), or -1 if the end of the stream has been reached"

The only reason I have come up with is that the class wants to
indicate end-of-stream with a -1. Incidentally when did the character
(singular) become two bytes?

Yes, read returns a wider type than char so that there is a spare value
to represent end-of-stream.

One of the continuing trends in computing has been increasing numbers of
bits to represent a character, from 6 to 7 to 8 to 16... Java char is 16
bits.

Patricia
 
Lew

JM said:
Why do the Java Reader classes (File/Buffered/Stream) etc .read()
methods return an int not a char?

For example the javadoc for BufferedReader

.... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

"public int read()"

and then the javadoc indicates return value as:

"The character read, as an integer in the range 0 to 65535
(0x00-0xffff), or -1 if the end of the stream has been reached"

The only reason I have come up with is that the class wants to
indicate end-of-stream with a -1.

It allows any value in the range of char to be represented as a positive
value. -1 is therefore guaranteed to be distinct from any valid value.

If read() returned a char instead, every 16-bit value would be a legitimate
character, leaving no value free to signal end-of-stream.
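
A small illustration of that point (just a sketch, not from the thread):
casting -1 to char silently turns it into the perfectly legal character
U+FFFF, so a char return type could not carry an unambiguous EOF marker.

public class CharSentinel {
    public static void main(String[] args) {
        char c = (char) -1;
        System.out.println((int) c);        // prints 65535
        System.out.println(c == '\uffff');  // prints true
        // As an int, -1 lies outside 0..65535, so it can never collide
        // with a character the stream might actually deliver.
    }
}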
Incidentally when did the character (singular) become two bytes?

In Java's case, with the invention of Java.
 
Roedy Green

Incidentally when did the character
(singular) become two bytes?

With Java 1.0. C++ is in transition from 8 to 16.

It is now much more common to have a document containing multiple
languages. You can't encode such a document with only 8 bits per char, so
Java used Unicode from day one, with 16 bits per char. 16-bit Unicode was
even big enough to include Chinese. However, Unicode has since been
extended beyond 16 bits to allow Ugaritic (cuneiform), musical symbols,
Cypriot, etc. Java has somewhat baling-wire support for these
supplementary characters.

See http://mindprod.com/jgloss/unicode.html

Of course this would make documents on average twice as big as they
used to be, so UTF-8 was invented to make simple documents almost as
compact as if they had been encoded with an 8-bit national encoding.

see http://mindprod.com/jgloss/utf.html
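
As a rough illustration of that size trade-off (a sketch with made-up sample
strings): ASCII text stays at one byte per character in UTF-8 but takes two
in UTF-16, while Chinese text takes three bytes per character in UTF-8 and
two in UTF-16.

public class EncodingSizes {
    public static void main(String[] args) throws Exception {
        String english = "hello";
        String chinese = "\u4e2d\u6587";                           // two Chinese characters
        System.out.println(english.getBytes("UTF-8").length);     // 5
        System.out.println(english.getBytes("UTF-16BE").length);  // 10
        System.out.println(chinese.getBytes("UTF-8").length);     // 6  (3 bytes each)
        System.out.println(chinese.getBytes("UTF-16BE").length);  // 4  (2 bytes each)
    }
}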

Encoding is about how documents are stored externally, which is very
complicated and varied so as to deal with interchange with other computer
languages and legacy applications. Internally, Java strings are all stored
simply as 16-bit Unicode (UTF-16).

See http://mindprod.com/jgloss/encoding.html
 
John W. Kennedy

Patricia said:
One of the continuing trends in computing has been increasing numbers of
bits to represent a character, from 6 to 7 to 8 to 16... Java char is 16
bits.

Not if you go back far enough, though. The IBM 650 took 14 bits to
represent a character (double bi-quinary), and its market successor, the
707x series, took 10 (double 2-of-5).
 
Roedy Green

Not if you go back far enough, though. The IBM 650 took 14 bits to
represent a character (double bi-quinary), and its market successor, the
707x series, took 10 (double 2-of-5).

In the olden days, each site would invent its own private 6-bit
encoding. I recall sitting with Vern Detwiler (later of MacDonald
Detwiler) looking at this newfangled 7-bit ASCII code and playing
with how we might make UBC's 6-bit code somewhat ASCII compatible for
the new IBM 7044. We had to decide what characters to include. Back
then popular characters included the word mark and record mark.

Later with the IBM 360 we had ENORMOUS 8-bit EBCDIC character sets
that came in a zillion variants. You still constrained yourself mainly
to upper case because printers used a rotating chain or band of
pre-formed characters, and extra chars slowed them down drastically.
 
JM

Mike Schilling said:
That's exactly right. If it returned a char, there would be no
"illegal" value left to indicate EOF.

A char in Java is a 16-bit Unicode value (technically a UTF-16 code unit), not
a byte.

Many thanks for everyone's replies. Now what does not make sense is
that when I call BufferedWriter.write(int), only one 8-bit byte gets
written.

BufferedWriter bw = new BufferedWriter(new FileWriter("a"));
bw.write(1);
bw.write(256);
bw.close();
System.exit(0);

Creates a file of length 2 (bytes) containing
01
3F
in file "a" and not 16 bits.

Makes no sense to me.

Jonathan
 
Lew

JM said:
BufferedWriter bw = new BufferedWriter(new FileWriter("a"));

Don't use TAB characters in Usenet listings. They make them very hard to read.
bw.write(1);
bw.write(256);
bw.close();
System.exit(0);

Creates a file of length 2 (bytes) containing
01
3F
in file "a" and not 16 bits.

Makes no sense to me.

What is the default character encoding for your platform?

The Writer will translate the characters into that encoding unless you specify a
different one. Many encodings use only one byte per character, or one byte for
each of the most common characters. It seems that UTF-16 is not your default
encoding for files, eh?

Google for "character encoding" and "Unicode", and read the material about
these concepts on java.sun.com, then ask about what is left out in those
references.
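
If you want to take the platform default out of the picture, one way (a
sketch, not the only way) is to wrap a FileOutputStream in an
OutputStreamWriter and name the encoding yourself:

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        BufferedWriter bw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("a"), "UTF-8"));
        bw.write(1);    // one byte in UTF-8: 0x01
        bw.write(256);  // two bytes in UTF-8: 0xC4 0x80 (U+0100), no '?' substitution
        bw.close();     // the file "a" ends up 3 bytes long
    }
}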
 
Mike Schilling

JM said:
Many thanks for everyone's replies. Now what does not make sense is
that when I call BufferedWriter.write(int), only one 8-bit byte gets
written.

BufferedWriter bw = new BufferedWriter(new FileWriter("a"));
bw.write(1);
bw.write(256);
bw.close();
System.exit(0);

Creates a file of length 2 (bytes) containing
01
3F

note that 3F isn't 256; it's an ASCII question mark (?). I'll explain
why below.
in file "a" and not 16 bits.

Makes no sense to me.

Internally, (that is, in memory), Java represents characters as
Unicode. Externally (in files, on the wire, etc.), characters are
"encoded" into one or more bytes, using some encoding. The most
common ones are:

UTF-16: two bytes for each character. Includes all of Unicode.
UTF-8: one byte for ASCII characters (0-127); two or three bytes for
other characters. Includes all of Unicode.
ASCII: one byte per character. Includes only the first 128 Unicode
characters.
CP-1252: one byte per character, including all the ASCII characters
plus some Microsoft-specific extensions. Includes 256 Unicode characters.
ISO-LATIN-1: one byte per character, including all the ASCII characters
plus some special characters used in European languages. Includes 256
Unicode characters.

There are many others. If you don't specify an encoding, as in your
example, Java chooses a default one which is system-dependent.
Encodings will, in general, replace characters they don't contain by a
question mark, which is what you're seeing. (I don't know what your
system's default encoding is. If you're on Windows, it's probably
CP-1252, but ASCII would do the same thing, since neither of them
contains the character 256.)

This is a complicated subject, and I've omitted many issues (including
the fact that Unicode now requires 21 bits to represent all of its
characters, not 16). I hope that this helped, but to really
understand it you'll need to find a more detailed writeup. Here's a
start: http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings
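
A small demonstration of that substitution (a sketch using String.getBytes
with a few named charsets): the character 256 (U+0100) becomes a '?' (0x3F)
in ASCII and Latin-1 but survives in UTF-8 and UTF-16.

import java.util.Arrays;

public class ReplacementDemo {
    public static void main(String[] args) throws Exception {
        String s = "\u0100";                                            // character 256
        System.out.println(Arrays.toString(s.getBytes("US-ASCII")));   // [63]        -> '?'
        System.out.println(Arrays.toString(s.getBytes("ISO-8859-1"))); // [63]        -> '?'
        System.out.println(Arrays.toString(s.getBytes("UTF-8")));      // [-60, -128] -> 0xC4 0x80
        System.out.println(Arrays.toString(s.getBytes("UTF-16BE")));   // [1, 0]      -> 0x01 0x00
    }
}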
 
Mike Schilling

I was trying to keep things simple by pretending that Unicode is still
16 bits. Time enough to introduce surrogate pairs later on.
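
For the curious, a tiny sketch of what "surrogate pairs" means in practice
(U+1D11E, the musical G clef, is just an example): a code point above U+FFFF
occupies two Java chars.

public class SurrogateDemo {
    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2 chars (one surrogate pair)
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
        System.out.printf("%04X %04X%n", (int) clef.charAt(0), (int) clef.charAt(1));
        // prints D834 DD1E: the high and low surrogates for U+1D11E
    }
}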
 
