.read() returns an int, not a char. Why?

Discussion in 'Java' started by JM, Dec 12, 2007.

  1. JM

    JM Guest

    Why do the Java Reader classes (File/Buffered/Stream) etc .read()
    methods return an int not a char?

    For example the javadoc for BufferedReader

    .... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

    "public int read()"

    and then the javadoc indicates return value as:

    "The character read, as an integer in the range 0 to 65535
    (0x00-0xffff), or -1 if the end of the stream has been reached"

    The only reason I have come up with is that the class wants to
    indicate end-of-stream with a -1. Incidentally when did the character
    (singular) become two bytes?

    I am engineer and not a comp.sci so I'd appreciate some patience in
    your reply.

    Jonathan
     
    JM, Dec 12, 2007
    #1

  2. JM

    Chris Dollin Guest

    JM wrote:

    > The only reason I have come up with is that the class wants to
    > indicate end-of-stream with a -1. Incidentally when did the character
    > (singular) become two bytes?


    Java's chars have always been two bytes, so as to store 16-bit
    Unicode characters.

    (We'll pass quietly over the problems with Unicode now needing more than
    16 bits for an unpacked character.)

    --
    Chris "whistling, but not in the dark" Dollin

    Hewlett-Packard Limited registered office: Cain Road, Bracknell,
    registered no: 690597 England Berks RG12 1HN
     
    Chris Dollin, Dec 12, 2007
    #2

  3. JM wrote:
    > Why do the Java Reader classes (File/Buffered/Stream) etc .read()
    > methods return an int not a char?
    >
    > For example the javadoc for BufferedReader
    >
    > ... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares
    >
    > "public int read()"
    >
    > and then the javadoc indicates return value as:
    >
    > "The character read, as an integer in the range 0 to 65535
    > (0x00-0xffff), or -1 if the end of the stream has been reached"
    >
    > The only reason I have come up with is that the class wants to
    > indicate end-of-stream with a -1.


    That's exactly right. If it returned a char, there would be no
    "illegal" value left to indicate EOF.

    > Incidentally when did the character
    > (singular) become two bytes?


    A char in Java is a 16-bit Unicode (technically UTF-16) character, not
    a byte.
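
A minimal sketch of the usual read loop, assuming a StringReader as a stand-in for a file-backed reader (the class and method names are made up for illustration). The result of read() is kept in an int so that -1 can still be told apart from the legal character 0xFFFF:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadLoopSketch {
    // Hypothetical helper: drains a Reader one character at a time.
    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c; // int, not char, so the -1 end-of-stream marker stays distinguishable
        while ((c = in.read()) != -1) {
            sb.append((char) c); // here c is known to be in 0..65535
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // '\uffff' is a legal char value and must not be mistaken for end-of-stream.
        String text = "ab\uffffcd";
        System.out.println(readAll(new StringReader(text)).equals(text)); // true
    }
}
```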
     
    Mike Schilling, Dec 12, 2007
    #3
  4. JM wrote:
    > Why do the Java Reader classes (File/Buffered/Stream) etc .read()
    > methods return an int not a char?
    >
    > For example the javadoc for BufferedReader
    >
    > ... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares
    >
    > "public int read()"
    >
    > and then the javadoc indicates return value as:
    >
    > "The character read, as an integer in the range 0 to 65535
    > (0x00-0xffff), or -1 if the end of the stream has been reached"
    >
    > The only reason I have come up with is that the class wants to
    > indicate end-of-stream with a -1. Incidentally when did the character
    > (singular) become two bytes?


    Yes, read returns a wider type than char so that there is a spare value
    to represent end-of-stream.

    One of the continuing trends in computing has been increasing numbers of
    bits to represent a character, from 6 to 7 to 8 to 16... Java char is 16
    bits.

    Patricia
     
    Patricia Shanahan, Dec 12, 2007
    #4
  5. JM

    Lew Guest

    JM wrote:
    > Why do the Java Reader classes (File/Buffered/Stream) etc .read()
    > methods return an int not a char?
    >
    > For example the javadoc for BufferedReader
    >
    > .... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares
    >
    > "public int read()"
    >
    > and then the javadoc indicates return value as:
    >
    > "The character read, as an integer in the range 0 to 65535
    > (0x00-0xffff), or -1 if the end of the stream has been reached"
    >
    > The only reason I have come up with is that the class wants to
    > indicate end-of-stream with a -1.


    It allows any value in the range of char to be represented as a positive
    value. -1 is therefore guaranteed to be distinct from any valid value.

    If you return a char, you cannot get the value 32768 or larger.

    > Incidentally when did the character (singular) become two bytes?


    In Java's case, with the invention of Java.

    --
    Lew
     
    Lew, Dec 12, 2007
    #5
  6. JM

    Lew Guest

    Lew wrote:
    > If you return a char, you cannot get the value 32768 or larger.


    Oops, that's wrong. If you return a *short* you cannot get such values.
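
A small sketch of the ranges being discussed (the class and method names are illustrative only). It also shows what would happen if -1 were squeezed into a char: it turns into 65535, which is itself a legal character value:

```java
public class CharRangeSketch {
    // Narrows an int to char and widens it back, as a char-returning read() would.
    static int throughChar(int value) {
        return (char) value;
    }

    // Same round trip through a signed 16-bit short.
    static int throughShort(int value) {
        return (short) value;
    }

    public static void main(String[] args) {
        System.out.println((int) Character.MIN_VALUE); // 0
        System.out.println((int) Character.MAX_VALUE); // 65535
        System.out.println(Short.MAX_VALUE);           // 32767
        System.out.println(throughChar(-1));           // 65535, collides with a real char
        System.out.println(throughShort(32768));       // -32768, a short cannot hold 32768
    }
}
```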

    --
    Lew
     
    Lew, Dec 12, 2007
    #6
  7. JM

    Roedy Green Guest

    On Wed, 12 Dec 2007 06:45:36 -0800 (PST), JM <>
    wrote, quoted or indirectly quoted someone who said :

    >Incidentally when did the character
    >(singular) become two bytes?


    With Java 1.0. C++ is in transition from 8 to 16.

    It is now much more common to have a document containing multiple
    languages. You can't encode it with only 8 bits per char. So Java
    from day one used Unicode, which at the time had 16 bits per char.
    16-bit Unicode was even big enough to include Chinese. However, Unicode
    has since been extended beyond 16 bits (code points now need up to 21
    bits) to allow Ugaritic (cuneiform), musical symbols, Cypriot etc.
    Java has somewhat baling-wire support for characters beyond 16 bits.

    See http://mindprod.com/jgloss/unicode.html
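
To make the "beyond 16 bits" point concrete, here is a rough sketch using a musical symbol, U+1D11E (the class and method names are invented for the example). Such a character occupies two Java chars, a surrogate pair, and the code-point methods on String and Character are how you see it as one character again:

```java
public class SupplementarySketch {
    // Builds a String holding the single code point U+1D11E (MUSICAL SYMBOL G CLEF).
    static String gClef() {
        return new String(Character.toChars(0x1D11E));
    }

    public static void main(String[] args) {
        String clef = gClef();
        System.out.println(clef.length());                             // 2 chars
        System.out.println(clef.codePointCount(0, clef.length()));     // 1 code point
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true
        System.out.println(Integer.toHexString(clef.codePointAt(0)));  // 1d11e
    }
}
```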

    Of course this would make documents on average twice as big as they
    used to be. So UTF-8 was invented to make simple documents almost as
    compact as if they had been encoded with an 8-bit national encoding.

    see http://mindprod.com/jgloss/utf.html
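
A quick sketch of the size difference, assuming two made-up sample strings. UTF_16BE is used rather than UTF_16 so that no byte-order mark is added to the count:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedSizeSketch {
    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedSize(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String english = "Hello, world";           // 12 ASCII characters
        String mixed = "na\u00efve \u4e2d\u6587";  // 8 characters, three of them non-ASCII

        System.out.println(encodedSize(english, StandardCharsets.UTF_8));    // 12
        System.out.println(encodedSize(english, StandardCharsets.UTF_16BE)); // 24
        System.out.println(encodedSize(mixed, StandardCharsets.UTF_8));      // 13
        System.out.println(encodedSize(mixed, StandardCharsets.UTF_16BE));   // 16
    }
}
```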

    Encoding is about how documents are stored externally, which is very
    complicated and varied because of interchange with other computer
    languages and legacy applications. Internally, Java strings are all
    stored simply as 16-bit Unicode (UTF-16).

    See http://mindprod.com/jgloss/encoding.html
    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Dec 12, 2007
    #7
  8. Patricia Shanahan wrote:
    > One of the continuing trends in computing has been increasing numbers of
    > bits to represent a character, from 6 to 7 to 8 to 16... Java char is 16
    > bits.


    Not if you go back far enough, though. The IBM 650 took 14 bits to
    represent a character (double bi-quinary), and its market successor, the
    707x series, took 10 (double 2-of-5).

    --
    John W. Kennedy
    "The grand art mastered the thudding hammer of Thor
    And the heart of our lord Taliessin determined the war."
    -- Charles Williams. "Mount Badon"
     
    John W. Kennedy, Dec 13, 2007
    #8
  9. JM

    Roedy Green Guest

    On Wed, 12 Dec 2007 19:03:54 -0500, "John W. Kennedy"
    <> wrote, quoted or indirectly quoted someone who
    said :

    >Not if you go back far enough, though. The IBM 650 took 14 bits to
    >represent a character (double bi-quinary), and its market successor, the
    >707x series, took 10 (double 2-of-5).


    In the olden days, each site would invent its own private 6-bit
    encoding. I recall sitting with Vern Detwiler (later of MacDonald
    Detwiler) looking at this new fangled 7-bit ASCII code and playing
    with how we might make UBC's 6-bit code somewhat ASCII compatible for
    the new IBM 7044. We had to decide what characters to include. Back
    then popular characters included the word mark and record mark.

    Later with the IBM 360 we had ENORMOUS 8-bit EBCDIC character sets
    that came in a zillion variants. You still constrained yourself mainly
    to upper case because printers used a rotating chain or band of
    pre-formed characters, and extra chars slowed it down drastically.
    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Dec 13, 2007
    #9
  10. JM

    JM Guest

    On Dec 12, 2:58 pm, "Mike Schilling" <>
    wrote:
    > JM wrote:
    > > Why do the Java Reader classes (File/Buffered/Stream) etc .read()
    > > methods return an int not a char?

    >
    > > For example the javadoc for BufferedReader

    >
    > > ... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

    >
    > > "public int read()"

    >
    > > and then the javadoc indicates return value as:

    >
    > > "The character read, as an integer in the range 0 to 65535
    > > (0x00-0xffff), or -1 if the end of the stream has been reached"

    >
    > > The only reason I have come up with is that the class wants to
    > > indicate end-of-stream with a -1.

    >
    > That's exactly right. If it returned a char, there would be no
    > "illegal" value left to indicate EOF.
    >
    > > Incidentally when did the character
    > > (singular) become two bytes?

    >
    > A char in Java is a 16-bit Unicode (technically UTF-16) character, not
    > a byte.


    Many thanks for everyone's replies. Now what does not make sense is
    that when I call BufferedWriter.write(int), only one 8-bit byte gets
    written.

    BufferedWriter bw = new BufferedWriter(new FileWriter("a"));
    bw.write(1);
    bw.write(256);
    bw.close();
    System.exit(0);

    Creates a file of length 2 (bytes) containing
    01
    3F
    in file "a" and not 16 bits.

    Makes no sense to me.

    Jonathan
     
    JM, Dec 15, 2007
    #10
  11. JM

    Lew Guest

    JM wrote:
    >
    > BufferedWriter bw = new BufferedWriter(new FileWriter("a"));


    Don't use TAB characters in Usenet listings. It makes them very hard to read.

    > bw.write(1);
    > bw.write(256);
    > bw.close();
    > System.exit(0);
    >
    > Creates a file of length 2 (bytes) containing
    > 01
    > 3F
    > in file "a" and not 16 bits.
    >
    > Makes no sense to me.


    What is the default character encoding for your platform?

    The Writer will translate the characters into that encoding unless you specify a
    different one. Many encodings use only one byte per character, or one byte for
    each of the most common characters. It seems that UTF-16 is not your default
    encoding for files, eh?

    Google for "character encoding" and "Unicode", and read the material about
    these concepts on java.sun.com, then ask about what is left out in those
    references.
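
One way to check the default, and to take it out of the picture, is sketched below. It assumes an in-memory ByteArrayOutputStream in place of the file "a", and the helper name writeChars is made up; the point is only that an OutputStreamWriter can be handed an explicit charset:

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingSketch {
    // Hypothetical helper: writes each value with Writer.write(int) using the
    // given charset and returns the bytes that actually reached the stream.
    static byte[] writeChars(Charset charset, int... chars) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new BufferedWriter(new OutputStreamWriter(sink, charset))) {
            for (int c : chars) {
                w.write(c);
            }
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Platform-dependent, so no particular output is promised here.
        System.out.println("default charset: " + Charset.defaultCharset());

        // With UTF-16 (big-endian, no byte-order mark) each char becomes two bytes.
        byte[] utf16 = writeChars(StandardCharsets.UTF_16BE, 1, 256);
        System.out.println(utf16.length); // 4
    }
}
```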

    --
    Lew
     
    Lew, Dec 15, 2007
    #11
  12. "JM" <> wrote in message
    news:...
    > On Dec 12, 2:58 pm, "Mike Schilling" <>
    > wrote:
    >> JM wrote:
    >> > Why do the Java Reader classes (File/Buffered/Stream) etc .read()
    >> > methods return an int not a char?

    >>
    >> > For example the javadoc for BufferedReader

    >>
    >> > ... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

    >>
    >> > "public int read()"

    >>
    >> > and then the javadoc indicates return value as:

    >>
    >> > "The character read, as an integer in the range 0 to 65535
    >> > (0x00-0xffff), or -1 if the end of the stream has been reached"

    >>
    >> > The only reason I have come up with is that the class wants to
    >> > indicate end-of-stream with a -1.

    >>
    >> That's exactly right. If it returned a char, there would be no
    >> "illegal" value left to indicate EOF.
    >>
    >> > Incidentally when did the character
    >> > (singular) become two bytes?

    >>
    >> A char in Java is a 16-bit Unicode (technically UTF-16) character,
    >> not
    >> a byte.

    >
    > Many thanks for everyone's replies. Now what does not make sense is
    > when I call BufferedWriter.write(int) only one 8 bit byte gets
    > written.
    >
    > BufferedWriter bw = new BufferedWriter(new FileWriter("a"));
    > bw.write(1);
    > bw.write(256);
    > bw.close();
    > System.exit(0);
    >
    > Creates a file of length 2 (bytes) containing
    > 01
    > 3F


    Note that 3F isn't 256; it's an ASCII question mark (?). I'll explain
    why below.

    > in file "a" and not 16 bits.
    >
    > Makes no sense to me.


    Internally, (that is, in memory), Java represents characters as
    Unicode. Externally (in files, on the wire, etc.), characters are
    "encoded" into one or more bytes, using some encoding. The most
    common ones are:

    UTF-16: two bytes for each character. Includes all of Unicode.
    UTF-8: one byte for ASCII characters (0-127); two or three bytes for
    other characters. Includes all of Unicode.
    ASCII: one byte per character. Includes only the first 128 Unicode
    characters.
    CP-1252: one byte per character, including all the ASCII characters
    plus some Microsoft-specific extensions. Includes 256 Unicode characters.
    ISO-LATIN-1: one byte per character, including all the ASCII characters
    plus some special characters used in European languages. Includes 256
    Unicode characters.

    There are many others. If you don't specify an encoding, as in your
    example, Java chooses a default one which is system-dependent.
    Encodings will, in general, replace characters they don't contain with a
    question mark, which is what you're seeing. (I don't know what your
    system's default encoding is. If you're on Windows, it's probably
    CP-1252, but ASCII would do the same thing, since neither of them
    contains the character 256.)
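
The question-mark substitution can be reproduced without touching a file. The sketch below is only an illustration (class and method names are invented): it encodes the same two characters, 1 and 256, with a few charsets that every JVM is required to have. ISO-8859-1 and US-ASCII stand in for the Windows code page here, on the assumption that it treats character 256 the same way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementSketch {
    // Encodes the text and renders the resulting bytes as space-separated hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The same two characters the original post wrote: 1 and 256.
        String text = new String(new char[] {1, 256});

        System.out.println(hex(text, StandardCharsets.US_ASCII));   // 01 3F
        System.out.println(hex(text, StandardCharsets.ISO_8859_1)); // 01 3F
        System.out.println(hex(text, StandardCharsets.UTF_8));      // 01 C4 80
        System.out.println(hex(text, StandardCharsets.UTF_16BE));   // 00 01 01 00
    }
}
```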

    This is a complicated subject, and I've omitted many issues (including
    the fact that Unicode now requires 21 bits to represent all of its
    characters, not 16). I hope that this helped, but to really
    understand it you'll need to find a more detailed writeup. Here's a
    start: http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings
     
    Mike Schilling, Dec 15, 2007
    #12
  13. JM

    Lew Guest

    Mike Schilling wrote:
    > UTF-8: one byte for ASCII characters (0-127); two or three

    or four

    > bytes for other characters. Includes all of Unicode.
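
A short sketch of the one-to-four-byte range, using four arbitrarily chosen sample code points (the class and method names are illustrative only):

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthSketch {
    // Number of bytes UTF-8 needs for a single Unicode code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x41));    // 1: 'A', plain ASCII
        System.out.println(utf8Length(0xE9));    // 2: e with acute accent
        System.out.println(utf8Length(0x4E2D));  // 3: a common Chinese character
        System.out.println(utf8Length(0x1D11E)); // 4: musical symbol G clef
    }
}
```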


    --
    Lew
     
    Lew, Dec 15, 2007
    #13
  14. "Lew" <> wrote in message
    news:...
    > Mike Schilling wrote:
    >> UTF-8: one byte for ASCII characters (0-127); two or three

    > or four

    >> bytes for other characters. Includes all of Unicode.


    I was trying to keep things simple by pretending that Unicode is still
    16 bits. Time enough to introduce surrogate pairs later on.
     
    Mike Schilling, Dec 15, 2007
    #14
