'A'++ == 'B': Always True?

Fritz Foetzl

The program

public class IncrementCharMain {
    public static void main(String[] args) {
        char ltr = 'A';
        System.out.println(ltr);
        ltr++;
        System.out.println(ltr);
    }
}

produces the output

A
B

on my system, as I'd expect. However, I'm not very familiar with
Unicode or with Java's handling of character data. Will this program
always produce the same output, regardless of the platform hosting the
JRE/JVM? If not, is there a better way to get "the next letter"
without writing a long and clunky switch statement?

(I'm cautious because in C, incrementing characters is non-portable;
although adjacent letters have consecutive values in ASCII, this is
not true of EBCDIC. I've never actually seen an EBCDIC system, but
presumably they're out there.)

ff
 
Mike Schilling

Yes. All Java systems use Unicode as their in-memory character set.
 
Chris Smith

Fritz said:
(I'm cautious because in C, incrementing characters is non-portable;
although adjacent letters have consecutive values in ASCII, this is
not true of EBCDIC. I've never actually seen an EBCDIC system, but
presumably they're out there.)

No need to worry too much. Incrementing 'A' will always produce 'B',
because the two code points are adjacent in Unicode (Unicode is a
superset of ASCII).

The only case where your program would produce different output is in
the unlikely event that the system default encoding doesn't actually
represent one or both of the characters 'A' and 'B'. Then, of course,
it's impossible to print 'A' or 'B', and a placeholder character
(generally '?') will be printed instead. That relates to output,
though; the variables still have values of 'A' and 'B', but those values
just may not print correctly to the screen.

An EBCDIC system is likewise translated as part of the output process,
so the actual values of variables in Java are still Unicode code points.
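A minimal sketch of the placeholder behaviour described above. It assumes US-ASCII stands in for a restrictive default encoding, and it uses an accented letter because 'A' and 'B' are representable in practically every encoding; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class PlaceholderDemo {
    // Encode one char the way a stream using US-ASCII would. String.getBytes
    // substitutes the charset's replacement byte for unmappable characters.
    static byte[] encodeAscii(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        char ok = 'A';
        char unmappable = '\u00E9'; // e with acute accent, not in US-ASCII
        System.out.println(encodeAscii(ok)[0]);         // 65
        System.out.println(encodeAscii(unmappable)[0]); // 63, i.e. '?'
        // The variable itself is unchanged; only the encoded bytes differ.
        System.out.println((int) unmappable);           // 233
    }
}
```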

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Gordon Beaton

'A' cannot be incremented, it is not a variable :p

And even if it could, the expression 'A'++ == 'B' would be false due
to the choice of postincrement operator.

A better example would have been 'A' + 1 == 'B', for both reasons.
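The two objections can be checked with a small sketch like the following (the class and method names are illustrative only). Note that 'A'++ itself cannot appear in the code because it does not compile.

```java
public class CharLiteralArithmetic {
    // 'A' + 1 is a constant expression of type int (the char is promoted),
    // so the comparison is 66 == 66.
    static boolean literalPlusOneIsB() {
        return 'A' + 1 == 'B';
    }

    // Post-increment yields the value *before* the increment.
    static boolean postIncrementEqualsB() {
        char ltr = 'A';
        return ltr++ == 'B';
    }

    // Pre-increment yields the value *after* the increment.
    static boolean preIncrementEqualsB() {
        char ltr = 'A';
        return ++ltr == 'B';
    }

    public static void main(String[] args) {
        System.out.println(literalPlusOneIsB());    // true
        System.out.println(postIncrementEqualsB()); // false
        System.out.println(preIncrementEqualsB());  // true
    }
}
```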

/gordon
 
Michael Borgwardt

Chris said:
The only case where your program would produce different output is in
the unlikely event that the system default encoding doesn't actually
represent one or both of the characters 'A' and 'B'.

But this would only matter during compilation. Once the program is compiled,
it will produce the same results on any standards-conforming JVM.
 
Chris Smith

Michael said:
But this would only matter during compilation. Once the program is compiled,
it will produce the same results on any standards-conforming JVM.

No, that's not true. Use of the platform default encoding is one of
several ways of introducing platform-dependency to a Java application.
System.out is an instance of PrintStream which is initialized to the
platform default encoding. The actual bytes sent to standard output are
dependent on that encoding.

As I said, it's unlikely to be an issue when there are ASCII characters
involved, since very few encodings omit them.

Michael Borgwardt

Chris said:
No, that's not true.

OK, it might influence what actually shows up on STDOUT, but values in the
program itself would not be affected (they might be, during compilation).
 
Chris Smith

Michael said:
OK, it might influence what actually shows up on STDOUT, but values in the
program itself would not be affected (they might be, during compilation).

Yes, that's true. I don't understand your point about compilation. Are
you referring to the possibility that Fritz intends to write the code
that he posted, but doesn't because of platform encoding issues? That's
even more tenuous, because it's really impossible to write Java code in
a language that doesn't have at least the basic ASCII characters (even
Unicode escapes require a backslash and the letter u).

Michael Borgwardt

Chris said:
Yes, that's true. I don't understand your point about compilation. Are
you referring to the possibility that Fritz intends to write the code
that he posted, but doesn't because of platform encoding issues?

I meant that while the Unicode code point 'A' is followed numerically by 'B',
the literal 'A' in the source code could theoretically end up not being
turned into the Unicode character 'A', if the compiler uses some sort of
really weird encoding.

Chris said:
That's
even more tenuous, because it's really impossible to write Java code in
a language that doesn't have at least the basic ASCII characters (even
Unicode escapes require a backslash and the letter u).

Hey, it was *you* who started talking about encodings that don't know 'A'!
*Are* there actually such pathological cases?

OTOH, considering encodings pathological for not including Latin letters
is kinda ethnocentric...
 
Chris Smith

Michael said:
Hey, it was *you* who started talking about encodings that don't know 'A'!
*Are* there actually such pathological cases?

I don't know for certain. I have vague memories of a Japanese encoding
that cannot represent all ASCII characters... but I'm not sure if it
lacks 'A' and 'B' or not.

steve

As long as you "assume" you are using the Western alphabet.

It would break on all south Asian languages (Taiwan/China/Japan); I would
guess any language using pictograms would mess it up.

An interesting one would be German, Norwegian, or possibly Spanish, but I
don't know enough about these languages.

Personally, I would use an array or character string, then index into it.
That way I would have complete control over what it returns.
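A rough sketch of that lookup idea, assuming a 26-letter uppercase alphabet and wrap-around from the last letter to the first. Both are illustrative choices, and the class and method names are hypothetical.

```java
public class NextLetter {
    // Hypothetical alphabet; replace with whatever sequence "next" should follow.
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Returns the letter after c in ALPHABET, wrapping from the last to the first.
    static char next(char c) {
        int i = ALPHABET.indexOf(c);
        if (i < 0) {
            throw new IllegalArgumentException("not in alphabet: " + c);
        }
        return ALPHABET.charAt((i + 1) % ALPHABET.length());
    }

    public static void main(String[] args) {
        System.out.println(next('A')); // B
        System.out.println(next('Z')); // A (wraps around)
    }
}
```

The lookup string makes the ordering explicit, which matters for alphabets whose letters are not contiguous in Unicode; for plain 'A' to 'Z' it gives the same answer as incrementing the char.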

steve
 
Doug Pardee

Okay, everyone's been running all over the place with this one,
dragging in all sorts of irrelevant stuff like Oriental languages.
Let's nail this puppy down.

First, I'm going to ignore the title ('A'++ == 'B': Always True?)
because as Vincent and Gordon noted, it has two silly typos. Let's talk
about the example code instead.

First, the character constant 'A' will *always* result in the char
value of \u0041. This is defined by chapter 3 of the Java Language
Specification (second edition) and by the nature of Unicode. Anyone who
thinks that Oriental languages matter doesn't understand the whole
point of Unicode: it combines ALL language 'glyphs' into a single
unambiguous numbering system. Unicode values \u0000-\u007F always
represent ASCII codes 0-127, okay?

So, after the initial assignment ltr='A' we know that the value in ltr
is 0x0041. Guaranteed, because 'A' is a character constant.

Then after the increment ltr++, we know that the value in ltr is
0x0042, by the rules of 16-bit unsigned arithmetic. This we know
corresponds to the value of 'B', because 'B' is also a character
constant.
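The values can be inspected directly. The sketch below (class and method names are illustrative) prints the numeric value before and after the increment, and also shows the wrap-around that 16-bit unsigned arithmetic implies at the top of the range.

```java
public class CharValues {
    // Increment a char and return the result; char arithmetic wraps modulo 65536.
    static char increment(char c) {
        c++;
        return c;
    }

    public static void main(String[] args) {
        char ltr = 'A';
        System.out.println(Integer.toHexString(ltr));            // 41
        System.out.println(Integer.toHexString(increment(ltr))); // 42
        System.out.println(increment(ltr) == 'B');               // true
        // At the top of the range the value wraps to zero.
        System.out.println((int) increment('\uFFFF'));           // 0
    }
}
```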

But! We cannot say absolutely for sure what will be printed. The
System.out.println results in a conversion from the 16-bit Unicode
value in the char to 8-bit byte(s). This conversion is performed "using
the default character encoding." Most of the time this is a theoretical
issue; you'll usually be using a character encoding that translates
0x0000-0x007F into byte values 0x00-0x7F. A counter-example would be if
you were running on an IBM mainframe, and the default character
encoding for the JVM is something like ebcdic-us; in that case the
output byte values would be EBCDIC (e.g. outputting 0x0041 would result
in 0xC1, which is an EBCDIC 'A').

Then there is the question of how those byte values will be interpreted
by your display (or print) device. Again, this is primarily a
theoretical issue, as most display and print devices map byte values
0x00-0x7F into the appropriate ASCII glyphs. Again, one counter-example
is that IBM mainframe terminals and printers will be expecting EBCDIC
codes.

So: the computation is defined to do what you expect. It will produce
the exact same output as System.out.println("A\nB"), whatever that
output might actually look like.

Now for the question that you didn't ask. If you're working with
characters that you read in, rather than 'X' constants that you coded
into your program, then you have another character-set conversion to
worry about. The basic rules are the same as for the output: in most
cases you'll have a direct conversion from the ASCII range 0x00-0x7F
into 0x0000-0x007F. But there could be unusual cases such as EBCDIC. In
general, these unusual cases will be automagically handled by "the
default character encoding" for your JVM and environment.

There are two main places that you can run into grief:
1) the default character encoding turns out not to be appropriate, or
2) you try to read/write binary data through the character encoder.
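The second kind of grief can be reproduced in a few lines. This sketch assumes US-ASCII as the encoding and uses an in-memory round trip in place of a real Reader/Writer pair; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryThroughEncoder {
    // Decode bytes as US-ASCII text and encode them again, as happens when
    // binary data is pushed through a character decoder and encoder.
    static byte[] roundTrip(byte[] data) {
        String asText = new String(data, StandardCharsets.US_ASCII);
        return asText.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] text = {0x41, 0x42};                 // "AB": survives
        byte[] binary = {(byte) 0x89, (byte) 0xFF}; // not valid US-ASCII
        System.out.println(Arrays.toString(roundTrip(text)));   // [65, 66]
        System.out.println(Arrays.toString(roundTrip(binary))); // [63, 63]
    }
}
```

Bytes outside the charset's range are decoded to a replacement character and come back out as '?', so the original binary values are lost.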

In summary: character constants coded into your Java source are
absolutely defined. Computations on character values cannot produce
variable results. What can vary is the *external* representation on
disk, screen, printer, keyboard, TCP/IP packet, or whatever. This is
because those media almost universally expect 8-bit coding, and there
are any number of possible mappings between those 8-bit values and the
internal 16-bit Unicode values.

In most simple cases, the default mapping (encoding) will do, but
different JVMs running in different environments may have different
default encodings. This can cause trouble if, for example, your program
finds itself running on an IBM mainframe and uses the default EBCDIC
encoding to try to write TCP/IP packet data which has to be in ASCII.
You can always specify a particular encoding, but then you need to be
careful to specify the right encoding for the situation; e.g., don't
force ASCII encoding on an EBCDIC machine if you're outputting to a
native EBCDIC device.
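As a sketch of naming the encoding explicitly, the following writes the same text once with a fixed charset and once with the JVM default. The in-memory sink stands in for a socket or file, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    // Write text through a Writer bound to the given charset and return the bytes.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory sink
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Same bytes on every platform, because the charset is named explicitly.
        byte[] wire = write("AB", StandardCharsets.US_ASCII);
        System.out.println(wire[0] + " " + wire[1]); // 65 66
        // Platform-dependent: whatever this JVM's default charset produces.
        byte[] local = write("AB", Charset.defaultCharset());
        System.out.println(local.length);
    }
}
```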
 
Chris Smith

steve said:
As long as you "assume" you are using the Western alphabet.

It would break on all south Asian languages (Taiwan/China/Japan); I would
guess any language using pictograms would mess it up.

No, that's not correct. Fritz, please pay attention to other answers.

Chris Smith

Doug Pardee said:
Okay, everyone's been running all over the place with this one,
dragging in all sorts of irrelevant stuff like Oriental languages.

Yes, and it's been such fun. ;)

Doug Pardee said:
First, the character constant 'A' will *always* result in the char
value of \u0041. This is defined by chapter 3 of the Java Language
Specification (second edition) and by the nature of Unicode. Anyone who
thinks that Oriental languages matter doesn't understand the whole
point of Unicode: it combines ALL language 'glyphs' into a single
unambiguous numbering system. Unicode values \u0000-\u007F always
represent ASCII codes 0-127, okay?

I think you missed the point of Michael's response. That was that if
the source file was written in a text editor, and then saved in some
encoding that doesn't represent the character A, then the input to the
compiler could be wrong, and hence the result would be wrong.

It's a silly little exercise in abstract thought, and not really meant
for the OP at all. If you insist on making me take a side, it would be
that this possibility is irrelevant in discussing the result of Fritz's
code anyway, since in this scenario Fritz's code never really gets run.
I'd consider this analogous to proposing that disk corruption of the
class file might cause the wrong result.

Nevertheless, despite my disagreement with the semantic matter, I
believe Michael understands Unicode quite well.

Doug Pardee said:
But! We cannot say absolutely for sure what will be printed. [...]
A counter-example would be if
you were running on an IBM mainframe,

I don't think that's relevant. A typical EBCDIC machine would take a
different route to get there, but the resulting output would still be an
'A' followed by a 'B'.

Michael Borgwardt

Chris said:
I don't know for certain. I have vague memories of a Japanese encoding
that cannot represent all ASCII characters... but I'm not sure if it
lacks 'A' and 'B' or not.

I happen to be pretty familiar with Japanese encodings, and there are none
that don't support the Latin letters. In fact, most support them *twice*,
once in an ASCII-compatible way and once as "full-width" versions, i.e.
filling a square like the Japanese characters do.

What you remember is probably the absence of the backslash and tilde in
a very early Japanese encoding (that influenced the current ones), which
were replaced by the yen sign and an overline. This has the result that
when you switch Windows 2000 or XP to Japanese encoding for non-Unicode
applications, all file paths suddenly use yen signs instead of backslashes...
 
Michael Borgwardt

Chris said:
I think you missed the point of Michael's response. That was that if
the source file was written in a text editor, and then saved in some
encoding that doesn't represent the character A,

Or represents it in a different way than the compiler expects.

I guess my real point was that while Java chars and literals are Unicode, Java
source code before being fed to the compiler consists of bytes, i.e. is
*not* Unicode.
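A toy illustration of that point, using an in-memory string in place of a real source file. The pairing of UTF-16BE and ISO-8859-1 is an arbitrary assumption chosen because both charsets are always available, and the class and method names are hypothetical. (For real source files, javac's -encoding option tells the compiler which encoding to expect.)

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SourceBytes {
    // Simulate an editor saving source text in one encoding and a compiler
    // reading the same bytes in another.
    static String saveThenRead(String source, Charset savedAs, Charset readAs) {
        return new String(source.getBytes(savedAs), readAs);
    }

    public static void main(String[] args) {
        String literal = "'A'";
        // Encodings agree: the compiler sees the literal the author typed.
        String same = saveThenRead(literal, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(literal));    // true
        // Saved as UTF-16BE, read as ISO-8859-1: every char gains a NUL in front.
        String misread = saveThenRead(literal, StandardCharsets.UTF_16BE, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals(literal)); // false
        System.out.println(misread.length());        // 6
    }
}
```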
 
John C. Bollinger

Chris said:
Yes, and it's been such fun. ;)

I think you missed the point of Michael's response. That was that if
the source file was written in a text editor, and then saved in some
encoding that doesn't represent the character A, then the input to the
compiler could be wrong, and hence the result would be wrong.

But I think Michael's point, albeit correct in some sense, was not
germane. The OP was asking about Java source, but Michael was focusing
on the _representation_ of the Java source. Something like, "the byte
sequence you presented might be interpreted under some character
encodings as Java source where the equality condition in question is
false." (By all means correct me if I have misrepresented the point.)
That misses on a very important issue: if we are considering Java source
then we are necessarily considering _character_ data already. This is a
key aspect of the Java programming language, albeit a sufficiently
low-level one that it rarely rates a second thought. If the Java source
is character data then converting it to bytes according to some
(possibly identity) conversion and converting it back to characters via
some different mechanism might do any manner of nasty things to it,
true, but it's an invalid transformation.

Chris said:
It's a silly little exercise in abstract thought, and not really meant
for the OP at all. If you insist on making me take a side, it would be
that this possibility is irrelevant in discussing the result of Fritz's
code anyway, since in this scenario Fritz's code never really gets run.
I'd consider this analogous to proposing that disk corruption of the
class file might cause the wrong result.

And in some sense I'm continuing the silly little exercise by trying to
poke a hole in the argument. I think the analogy to class file
corruption is particularly apt in light of my take on the issue (above).

Chris said:
Nevertheless, despite my disagreement with the semantic matter, I
believe Michael understands Unicode quite well.

I have no doubt about that.


John Bollinger
(e-mail address removed)
 
Doug Pardee

A clarification: my posting was trying to combine all of the points
from the thread into a single comprehensive posting, while directly
refuting the following bit of misinformation:

steve> It would break on all south Asian languages
steve> (Taiwan/China/Japan); I would guess any language
steve> using pictograms would mess it up.

Now, as to Chris Smith and Michael Borgwardt's discussion:

Chris> I think you missed the point of Michael's response.
Chris> That was that if the source file was written in a
Chris> text editor, and then saved in some encoding that
Chris> doesn't represent the character A,

Michael> Or represents it in a different way than the
Michael> compiler expects.

I didn't miss the point at all. I think that Michael's point is not
well-taken, and I stated that the hypothesized problem simply can't
happen. Chris had already said pretty much the same thing:

Chris> it's really impossible to write Java code in a language that
Chris> doesn't have at least the basic ASCII characters (even
Chris> Unicode escapes require a backslash and the letter u).

If you can't generally trust the compiler to implement the Java
Language Specification correctly, you're doomed. And specifically, if
the compiler can't look at an 'A' in your source code and get it as
\u0041, then you'll never be able to catch an
ArrayIndexOutOfBoundsException thrown by the JVM.

Doug> you'll usually be using a character encoding that
Doug> translates 0x0000-0x007F into byte values 0x00-0x7F.
Doug> A counter-example would be if
Doug> you were running on an IBM mainframe,

Chris> I don't think that's relevant. A typical EBCDIC machine would
Chris> take a different route to get there, but the resulting output
Chris> would still be an 'A' followed by a 'B'.

For the simple case of System.out.println() as shown by the OP, I'll
agree that it's (virtually) irrelevant.

However, it IS relevant if you're running on an EBCDIC machine but
working with a non-EBCDIC data stream or I/O device. The example that I
already gave was TCP/IP data. Another example that could surprise the
programmer would be outputting "RS" to a ByteArrayOutputStream using
the default encoding, which can result in the array {0xD9, 0xE2} when
run on an EBCDIC machine instead of the {0x52,0x53} the programmer
always saw when running on an ASCII machine.
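The "RS" case can be approximated on an ordinary machine by naming an EBCDIC charset explicitly instead of relying on the default. This sketch assumes the charset name "IBM037" (EBCDIC US/Canada) as a stand-in for a mainframe's default encoding; it is an optional charset, so the code checks for it first. The class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EbcdicBytes {
    // Assumption: "IBM037" stands in for the mainframe's default encoding.
    // It is an optional charset and may be absent from a given JVM.
    static final String EBCDIC_NAME = "IBM037";

    // Format bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex("RS".getBytes(StandardCharsets.US_ASCII))); // 52 53
        if (Charset.isSupported(EBCDIC_NAME)) {
            System.out.println(hex("RS".getBytes(Charset.forName(EBCDIC_NAME))));
        } else {
            System.out.println(EBCDIC_NAME + " is not available on this JVM");
        }
    }
}
```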

The OP (to whom I was responding) was asking an important question
about a fundamental difference between the languages that he was
accustomed to and Java. Languages like C process characters internally
in 'native' form with no translation during I/O, while Java processes
characters internally in Unicode and translates during I/O. This
difference trips up a LOT of beginning Java programmers, and I felt
that it was worthwhile to be explicit about what was going on.
 
Michael Borgwardt

Doug said:
If you can't generally trust the compiler to implement the Java
Language Specification correctly, you're doomed. And specifically, if
the compiler can't look at an 'A' in your source code and get it as
\u0041, then you'll never be able to catch an
ArrayIndexOutOfBoundsException thrown by the JVM.

You really didn't get the point. The Java Language Specification isn't
even relevant at this point. The source code is originally composed of
bytes, not characters. And the compiler has to use some sort of encoding
to convert these bytes into characters. It's not the fault of the compiler
that it may use a wrong one.
 
