'A'++ == 'B': Always True?

Fritz Foetzl

[snip]
The OP (to whom I was responding) was asking an important question
about a fundamental difference between the languages that he was
accustomed to and Java. Languages like C process characters internally
in 'native' form with no translation during I/O, while Java processes
characters internally in Unicode and translates during I/O. This
difference trips up a LOT of beginning Java programmers, and I felt
that it was worthwhile to be explicit about what was going on.

...and the OP appreciates it. This has been a lively, stimulating
discussion - better than I anticipated. The difference between
character I/O and internal processing is important, and I've learned
much from reading this thread. Thanks to all who have responded!

ff
 
Gary Labowitz

Doug> you'll usually be using a character encoding that
Doug> translates 0x0000-0x007F into byte values 0x00-0x7F.
Doug> A counter-example would be if
Doug> you were running on an IBM mainframe,

Chris> I don't think that's relevant. A typical EBCDIC machine would
Chris> take a different route to get there, but the resulting output
Chris> would still be an 'A' followed by a 'B'.

Not true. EBCDIC does not guarantee that the alphabetic characters are
contiguous. I believe 'R' is not followed by 'S'. There may also be other
"breaks" in the sequence.
 
Gary Labowitz

Fritz Foetzl said:
"Doug Pardee" wrote in message
[snip]

The OP (to whom I was responding) was asking an important question
about a fundamental difference between the languages that he was
accustomed to and Java. Languages like C process characters internally
in 'native' form with no translation during I/O, while Java processes
characters internally in Unicode and translates during I/O. This
difference trips up a LOT of beginning Java programmers, and I felt
that it was worthwhile to be explicit about what was going on.

...and the OP appreciates it. This has been a lively, stimulating
discussion - better than I anticipated. The difference between
character I/O and internal processing is important, and I've learned
much from reading this thread. Thanks to all who have responded!

Interesting. Also, since the OP used the postfix operator, I'm just
wondering whether it was the original 'A' that was being compared to 'B',
in which case the result would always be false.
As confused as ever, I remain
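
The postfix question can be checked directly. The following is a minimal sketch, not the OP's code: the class and method names are made up for illustration. Note that the expression exactly as written in the subject line, 'A'++, should not compile at all, because ++ needs a variable rather than a literal, so the sketch uses a char variable instead.

```java
public class CharIncrementDemo {
    // Postfix: the comparison sees the old value ('A'); c only becomes 'B' afterwards.
    static boolean postfixEqualsB() {
        char c = 'A';
        return c++ == 'B';
    }

    // Prefix: c is incremented first, so the comparison sees 'B'.
    static boolean prefixEqualsB() {
        char c = 'A';
        return ++c == 'B';
    }

    // Plain arithmetic: 'A' is promoted to the int 65, and 65 + 1 == 66 == 'B'.
    static boolean literalPlusOneEqualsB() {
        return 'A' + 1 == 'B';
    }

    public static void main(String[] args) {
        System.out.println(postfixEqualsB());        // false
        System.out.println(prefixEqualsB());         // true
        System.out.println(literalPlusOneEqualsB()); // true
    }
}
```

So with the postfix form the comparison is indeed always false, while ++c == 'B' and 'A' + 1 == 'B' are always true.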
 
Chris Smith

Gary Labowitz said:
Not true. EBCDIC does not guarantee that the alphabetic characters are
contiguous. I believe 'R' is not followed by 'S'. There may also be other
"breaks" in the sequence.

That doesn't matter. The point is that the literal 'A' is a Unicode
code point. Incrementing it will always give the code point for 'B'.
The translation to EBCDIC is only performed during the output phase.
The resulting output may not contain consecutive EBCDIC values (I don't
know enough about EBCDIC to say whether it will or not), but it WILL be
A followed by B -- not because A and B are consecutive in EBCDIC, but
because A and B are consecutive in Unicode.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Gary Labowitz

Chris Smith said:
That doesn't matter. The point is that the literal 'A' is a Unicode
code point. Incrementing it will always give the code point for 'B'.
The translation to EBCDIC is only performed during the output phase.
The resulting output may not contain consecutive EBCDIC values (I don't
know enough about EBCDIC to say whether it will or not), but it WILL be
A followed by B -- not because A and B are consecutive in EBCDIC, but
because A and B are consecutive in Unicode.

You are getting me confused. EBCDIC is not Unicode. Are you saying that on
an EBCDIC machine it uses Unicode for internal storage of data?
Unicode: 'A' = \u0041
EBCDIC: 'A' = 0xC1
 
John C. Bollinger

Michael said:
You really didn't get the point. The Java Language Specification isn't
even relevant at this point. The source code is originally composed of
bytes, not characters. And the compiler has to use some sort of encoding
to convert these bytes into characters. It's not the fault of the compiler
that it may use a wrong one.

And here again I voice my dissent. The source code is composed of
characters. JLS says so. The _representation_ of the source code in
most media consists of bytes, but that's not what we said we were
talking about, and it's not a generally useful thing to bring up in
discussion of language issues. As you say, the compiler must decode the
source code representation correctly in order to produce classes that
properly correspond to that source, but that's not a language issue,
it's a tools issue. Am I splitting hairs? Certainly! But so is all
the rest of this subthread.


John Bollinger
(e-mail address removed)
 
Chris Smith

Gary Labowitz said:
You are getting me confused. EBCDIC is not Unicode. Are you saying that on
an EBCDIC machine it uses Unicode for internal storage of data?

Yes. That's a quite fundamental concept of Java. Java *always* uses
Unicode to store internal character data. If it's running on an EBCDIC
machine, then it translates to EBCDIC during the output process.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
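
A rough sketch of the "translate during output" step follows. It assumes the optional charset IBM037 (one common EBCDIC code page) is installed in the runtime, which is not guaranteed, so the code checks for it first. The class and helper names are invented for this example.

```java
import java.nio.charset.Charset;

public class EbcdicOutputDemo {
    // Encodes text with the named charset and returns the bytes as unsigned hex.
    static String hexBytes(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        char c = 'R';
        c++; // Unicode arithmetic: U+0052 + 1 = U+0053, which is 'S'
        System.out.println(c); // S

        String ebcdic = "IBM037"; // assumption: this optional charset is available
        if (Charset.isSupported(ebcdic)) {
            // On code page 037, R and S are not adjacent byte values.
            System.out.println(hexBytes("RS", ebcdic)); // D9 E2
        } else {
            System.out.println(ebcdic + " is not available in this runtime");
        }
    }
}
```

The increment happens on the Unicode value, so it lands on 'S' regardless of platform. The gap between 0xD9 and 0xE2 only shows up in the encoded bytes.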
 
Michael Borgwardt

Gary said:
You are getting me confused. EBCDIC is not Unicode. Are you saying that on
an EBCDIC machine it uses Unicode for internal storage of data?

As far as Java is concerned, the only thing that distinguishes an
"EBCDIC machine" from, say, an "ASCII machine" is the platform default
encoding, which is used for char/String <--> byte[]/file conversion
in cases where the encoding is not explicitly specified.

Java chars and Strings are Unicode, or at least must behave as if they
were. A JVM implementation is free to use EBCDIC for its internal storage
of chars, but to fulfill the JLS, char must behave in every aspect as if
it were a 16-bit Unicode value, which means that 'A'+1 == 'B'. Since that
means that using EBCDIC for internal storage would make the JVM
implementation complex and inefficient without (IMO) any gains to show
for it, yes, JVMs on an "EBCDIC machine" are going to use Unicode for
internal storage of data.
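
To make the char versus byte[] boundary concrete, here is a small sketch using only charsets that every Java runtime is required to provide. The class and method names are illustrative, and the default charset it prints depends on the machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingBoundaryDemo {
    // Explicit conversion from String to bytes; the platform default plays no part here.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        // The char value itself is the same on every platform.
        System.out.println((int) 'A'); // 65

        // This is the only part that varies from machine to machine.
        System.out.println("default charset: " + Charset.defaultCharset());

        // The same two chars become different byte sequences depending on the encoding.
        System.out.println(Arrays.toString(encode("AB", StandardCharsets.US_ASCII))); // [65, 66]
        System.out.println(Arrays.toString(encode("AB", StandardCharsets.UTF_16BE))); // [0, 65, 0, 66]
    }
}
```

Passing the charset explicitly, as above, sidesteps the platform default entirely.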
 
Michael Borgwardt

John said:
And here again I voice my dissent. The source code is composed of
characters. JLS says so. The _representation_ of the source code in
most media consists of bytes, but that's not what we said we were
talking about, and it's not a generally useful thing to bring up in
discussion of language issues.

Well, IMO we were talking about whether 'A'+1 == 'B' always.
In practice, this expression is entered in some kind of text editor and
saved as a file, which is then fed to a compiler. So in practice
it may end up being false in some circumstances. But only through faulty
assumptions about or misuse of the tools, not the language.
As you say, the compiler must decode the
source code representation correctly in order to produce classes that
properly correspond to that source, but that's not a language issue,
it's a tools issue. Am I splitting hairs? Certainly! But so is all
the rest of this subthread.

Certainly.
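
For what it's worth, the "tools issue" can be imitated in a few lines. This is only an analogy for what a compiler does when it reads a source file with the wrong encoding, not a reproduction of javac itself, and the names are invented for the example. The accented character is written as a Unicode escape so the snippet is pure ASCII and compiles the same way under any source encoding.

```java
import java.nio.charset.StandardCharsets;

public class WrongDecodingDemo {
    // Encodes the text as UTF-8, then decodes those bytes with the wrong charset.
    static String decodeUtf8BytesAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // e with acute accent, one char
        String misread = decodeUtf8BytesAsLatin1(original);

        System.out.println(original.length()); // 1
        System.out.println(misread.length());  // 2
        System.out.println(misread.equals("\u00c3\u00a9")); // true

        // A literal written as an escape does not depend on the file's encoding.
        System.out.println('\u0041' + 1 == 'B'); // true
    }
}
```

Characters in the ASCII range such as 'A' and 'B' survive this particular mix-up unchanged, which is why the problem usually only bites with an encoding that is not ASCII-compatible.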
 
