# "char math"?

Discussion in 'Java' started by Steven Coco, Sep 8, 2003.

1. ### Steven Coco (Guest)

Given code such as this:

char c = 'a';
// (int) c == 97,
// int i = 'b' - c == 1,
// and, c >= 'a' && c <= 'h'

(1) Are the commented assumptions guaranteed to be true
programmatically? And,
(2) Is arithmetic with primitive char values valid programming practice?

The best I can do is: based on the language and Unicode specs, such
arithmetic is simply dependent upon the Unicode code point for 'a'
never changing (no matter what character encoding is used for the
.java file).

My "beef" is that: it would seem that the integer value of (char) c is
dependent upon the Unicode spec.

Even though the consortium guarantees that the code point for the
character 'a' will never change for the life of the spec, a reliance on
that spec would make this type of programming technically unstable.
This would seem to be scary since java's relational operators can be
(and are) used with char values such as those pointed out above.

Peace,
Steev.

Steven Coco, Sep 8, 2003

2. ### Roedy Green (Guest)

On Mon, 08 Sep 2003 06:02:24 GMT, Steven Coco
wrote or quoted:

>
> char c = 'a';
> // (int) c == 97,
> // int i = 'b' - c == 1,
> // and, c >= 'a' && c <= 'h'

everything works because chars are promoted to int without sign
extension before almost any operation.
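A minimal Java sketch of the promotion rule described above. The class name `CharPromotion` and its helper methods are illustrative, not from the thread. Binary numeric promotion widens each `char` operand to a non-negative `int` before the subtraction or comparison is carried out.

```java
final class CharPromotion {
    // Difference of two chars; both operands are widened to int first.
    static int distance(char from, char to) {
        return to - from;
    }

    // Range check using relational operators on char values.
    static boolean inRange(char c, char low, char high) {
        return c >= low && c <= high;
    }

    public static void main(String[] args) {
        char c = 'a';
        System.out.println((int) c);                    // 97
        System.out.println(distance(c, 'b'));           // 1
        System.out.println(inRange(c, 'a', 'h'));       // true
        // char is unsigned: the largest value widens to 65535, never to -1.
        System.out.println((int) Character.MAX_VALUE);  // 65535
    }
}
```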

--
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.

Roedy Green, Sep 8, 2003

3. ### Thomas G. Marshall (Guest)

Steven Coco horrified us with:

> Can someone definitively answer this:
>
> Given code such as this:
>
> char c = 'a';
> // (int) c == 97,
> // int i = 'b' - c == 1,
> // and, c >= 'a' && c <= 'h'
>
> (1) Are the commented assumptions guaranteed to be true
> programmatically? And,
> (2) Is arithmetic with primitive char values valid programming
> practice?
>
> The best I can do is: based on the language and Unicode specs; such
> arithmetic is simply dependent upon the Unicode code point for
> 'a' never changing (no matter even what character encoding is used
> for the .java file).
>
> My "beef" is that: it would seem that the integer value of (char) c is
> dependent upon the Unicode spec.
>
> Even though the consortium guarantees that the code point for the
> character 'a' will never change for the life of the spec, a reliance
> on that spec would make this type of programming technically unstable.
> This would seem to be scary since java's relational operators can be
> (and are) used with char values such as those pointed out above.

I think I see the thrust of your question, and I'm really not sure. You may
have to just trust that the ISO spec does not change.

If there were such a thing as Character.a, which is 'a' regardless of where
the ISO spec places it, then you might be able to use that as a stake in the
ground. Seems to me I've bumped into this before but I cannot find it.

See if
java.lang.Character
java.lang.Character.Subset
java.lang.Character.Unicode
help at all.

(?)

Thomas G. Marshall, Sep 8, 2003
4. ### Niels Dybdahl (Guest)

> (1) Are the commented assumptions guaranteed to be true
> programmatically?

Yes, for 'a' they are, but there are other Unicode characters that have
more than one value. For example, the character 'µ' has different values
depending on whether it means the Greek letter mu or the prefix 'micro'. So
there is no guarantee that a second code will not be added for the same
character.

In practice this could mean that 'µ'!='µ' if the two characters have been
written with different editors. The same problem occurs for other
characters, for example 'Å', which is a common character in Danish and Swedish.
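The look-alike problem described above can be reproduced with the two code points involved: U+00B5 (MICRO SIGN) and U+03BC (GREEK SMALL LETTER MU), and likewise U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) and U+212B (ANGSTROM SIGN). The sketch below is an illustration only; the class name `LookAlikes` is made up, and `java.text.Normalizer` was added in Java 6, so it was not available when this thread was written. NFKC normalization is one way to fold such duplicates together before comparing, not the only one.

```java
import java.text.Normalizer;

final class LookAlikes {
    // Compare two strings after NFKC normalization, which maps
    // compatibility duplicates onto a single code point.
    static boolean sameAfterNfkc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFKC));
    }

    public static void main(String[] args) {
        char microSign = '\u00B5'; // MICRO SIGN
        char greekMu = '\u03BC';   // GREEK SMALL LETTER MU
        // Same glyph in most fonts, but different char values.
        System.out.println(microSign == greekMu);                // false
        System.out.println(sameAfterNfkc("\u00B5", "\u03BC"));   // true
        // ANGSTROM SIGN versus LATIN CAPITAL LETTER A WITH RING ABOVE.
        System.out.println('\u212B' == '\u00C5');                // false
        System.out.println(sameAfterNfkc("\u212B", "\u00C5"));   // true
    }
}
```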

Niels Dybdahl

Niels Dybdahl, Sep 8, 2003
5. ### Steven Coco (Guest)

Thomas G. Marshall wrote:

> I think I see the thrust of your question,

Yeah: I happened to be working on a Chess game problem--where squares
on the board are uniquely identified by "file" and "rank" thusly: a1,
a2, ... h8--and making accessor and mutator methods--as *well* as
algorithms--is made --real fast-- with things like:

'a' + 2 // yields the 'c' file

and the bounds checking:

if (charParam >= 'a' && charParam <= 'h')
ok();

--and you can implement int transformations like

char transform(char file, int nFiles) {
return (char) (file + nFiles); // given ('a', 2) yields 'c'
}

--basically using chars as unsigned "shorts"--that they in fact are.
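The chess-file arithmetic described above might look like the following as compilable Java. This is a sketch under assumptions: the class name `ChessFiles` and the helpers `requireFile` and `index` are hypothetical, and throwing `IllegalArgumentException` is just one reasonable way to handle an off-board file.

```java
final class ChessFiles {
    // Throws if the char is not a board file 'a'..'h'.
    static void requireFile(char file) {
        if (file < 'a' || file > 'h') {
            throw new IllegalArgumentException("file must be one of a-h: " + file);
        }
    }

    // Shift a file by nFiles; the sum is checked as an int before the
    // narrowing cast so an overflowing shift cannot wrap back onto the board.
    static char transform(char file, int nFiles) {
        requireFile(file);
        int shifted = file + nFiles;
        if (shifted < 'a' || shifted > 'h') {
            throw new IllegalArgumentException("shift leaves the board: " + nFiles);
        }
        return (char) shifted;
    }

    // Zero-based column index for array lookups: 'a' maps to 0, 'h' to 7.
    static int index(char file) {
        requireFile(file);
        return file - 'a';
    }

    public static void main(String[] args) {
        System.out.println(transform('a', 2)); // c
        System.out.println(index('h'));        // 7
    }
}
```

Doing the bounds check on the `int` sum rather than on the cast result is a deliberate choice here: the cast to `char` silently discards the high bits.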

What I found the language spec states is that: When processed: the
character literal 'a' found in the source file will be translated to a
16-bit unsigned integer value--the integral type char; and its value
will be based on the version of Unicode used by the Java release
interpreting the source file.

What I really don't know--even after reading through the VM spec--is:
what happens when the Unicode version used in the release interpreting
that _class_ file maps 'a' to a different code position. (I found no
mention of "Unicode version" translation--the char value is promoted to
an int and operated upon by int operations throughout the VM.)

I'm going to love this: I'm going to post some bug at java.sun.com and
see if a definition comes back... But FWIW:

Inside of java.lang.Character are constants defining Unicode points
\u0000-\uFFFE; and methods that depend on those. The class
documentation does state that it follows Unicode v 3.0. But if you
happen to write something using that implementation, its behavior may
be undefined under implementations using other Unicode versions--in
truth, unless the VM translates the value stored in the class file from
your implementing Unicode version to the interpreting version, the
behavior would be undefined.

Peace,
Steev.

----
Yes: Java will be 100000000000000000000000000000000% portable.....

Steven Coco, Sep 8, 2003
6. ### Steven Coco (Guest)

Niels Dybdahl wrote:

> Yes, for 'a' they are, but there are other Unicode characters that have
> more than one value. For example, the character 'µ' has different values
> depending on whether it means the Greek letter mu or the prefix 'micro'. So
> there is no guarantee that a second code will not be added for the same
> character.
>
> In practice this could mean that 'µ'!='µ' if the two characters have been
> written with different editors. The same problem occurs for other
> characters, for example 'Å', which is a common character in Danish and Swedish.

Ah! Excellent light shed on the subject.

But I'm still scared witless because of the uncertainty of handling for
different characters. I'll need to know what gets digested and stored
into the class file to know anything.

Peace,
Steev.

Steven Coco, Sep 8, 2003
7. ### Roedy Green (Guest)

On Mon, 08 Sep 2003 21:55:35 GMT, Steven Coco
wrote or quoted:

>maps 'a' to a different code position.

Even if the spec does not specifically mention the mappings of a-z A-Z
and 0-9 it would break all kinds of programs if some other encoding
incompatible with ASCII-7 assignments were presumed.

Just the same, you should still write code that does not presume a particular encoding:

char digit = '3';
int num = digit - '0';

rather than
int num = digit - 0x30;

There are codes such as EBCDIC where a-z and A-Z are not contiguous.
In other words if you write:

int sum = 0;
for ( char c = 'a'; c <= 'z'; c++ )
{
sum++;
}

sum won't necessarily be 26.
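For Java specifically, a quick check suggests the loop above always counts 26: a `char` holds a UTF-16 code unit regardless of the platform's native encoding, and 'a' through 'z' occupy the contiguous range U+0061 to U+007A. The EBCDIC caveat applies to languages whose character type follows the platform encoding, and in Java to bytes read from or written to an EBCDIC stream. The class name `AlphabetCount` is illustrative.

```java
final class AlphabetCount {
    // Counts the char values from 'a' through 'z' inclusive.
    static int countLowercase() {
        int sum = 0;
        for (char c = 'a'; c <= 'z'; c++) {
            sum++;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(countLowercase()); // 26
        System.out.println('z' - 'a' + 1);    // 26
    }
}
```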


Roedy Green, Sep 9, 2003
8. ### Thomas G. Marshall (Guest)

"Steven Coco" wrote in message
news:dm77b.6074\$...
> Niels Dybdahl wrote:
>
> > Yes, for 'a' they are, but there are other Unicode characters that have
> > more than one value. For example, the character 'µ' has different values
> > depending on whether it means the Greek letter mu or the prefix 'micro'.
> > So there is no guarantee that a second code will not be added for the
> > same character.
> >
> > In practice this could mean that 'µ'!='µ' if the two characters have
> > been written with different editors. The same problem occurs for other
> > characters, for example 'Å', which is a common character in Danish and
> > Swedish.

>
> Ah! Excellent light shed on the subject.
>
> But I'm still scared witless because of the uncertainty of handling for
> different characters. I'll need to know what gets digested and stored
> into the class file to know anything.

I'd suggest just not worrying about it.

Assuming that the unicode character set is static and omnipresent is not so
dangerous an idea.

If someone complains and tries to point out that there are other character
sets, just smile smugly and ask them to prove it

lol

Thomas G. Marshall, Sep 9, 2003
9. ### Steven Coco (Guest)

Roedy Green wrote:

>> maps 'a' to a different code position.

>
> Even if the spec does not specifically mention the mappings of a-z A-Z
> and 0-9 it would break all kinds of programs if some other encoding
> incompatible with ASCII-7 assignments were presumed.

This is true; but unless the behavior is well-defined, your code would
be technically not programmatically correct.

> Just the same, should still write code that does not presume a
>
> char digit = '3';
> int num = digit - '0';

There would appear to be a better way to do this particular thing--where
you have actual numbers in the chars: Character.getNumericValue(char
ch)--returns the character's "numeric value".

Peace,
Steev.

Steven Coco, Sep 9, 2003
10. ### Roedy Green (Guest)

On Tue, 09 Sep 2003 01:32:52 GMT, Steven Coco
wrote or quoted:

>: Character.getNumericValue(char
>ch)--returns the character's "numeric value".

If you peek inside, you will see the code is NOT just - '0', which in
99% of the cases is all you want.


Roedy Green, Sep 9, 2003
11. ### Steven Coco (Guest)

Thomas G. Marshall wrote:

> I'd suggest just not worrying about it.

Tempting.

Peace,
Steev.

Steven Coco, Sep 9, 2003
12. ### John C. Bollinger (Guest)

Steven Coco wrote:
> What I found the language spec states is that: When processed: the
> character literal 'a' found in the source file will be translated to a
> 16-bit unsigned integer value--the integral type char; and its value
> will be based on the version of Unicode used by the Java release
> interpreting the source file.
>
> What I really don't know--even after reading through the VM spec--is:
> what happens when the Unicode version used in the release interpreting
> that _class_ file maps 'a' to a different code position. (I found no
> mention of "Unicode version" translation--the char value is promoted to
> an int and operated upon by int operations throughout the VM.)

The answer is "nothing unusual." Once the character literal is
converted to an unsigned number for storage in a class file it is
decoupled from its character representation in the source. You just
have a 16-bit unsigned number, whose significance as a character code is
wholly supplied by its context (in those very few circumstances in which
it matters).

Moreover, I think you are quite safe in assuming that future versions of
Unicode will not remap the characters from the ASCII set. For the most
part, what you should worry about is the charset with which the source
is read, and likewise the charsets used when you do character I/O or
other forms of character <-> byte interconversions. (That's in general;
it is unlikely that you will need to worry about such issues with regard

John Bollinger

John C. Bollinger, Sep 9, 2003
13. ### Steven Coco (Guest)

Roedy Green wrote:

>> : Character.getNumericValue(char
>> ch)--returns the character's "numeric value".

>
> If you peek inside, you will see the code is NOT just - '0', which in
> 99% of the cases is all you want.

I'm not 100% sure what you meant there. What I was sharing is what I
gathered about their intention with that method: It gives you the
facility to get a character's numeric value where it has one. In the
case of '0', it's 0; but it applies to all characters.

For example: The Roman numerals 'X' and 'I' would yield the numeric
values 10 and 1, so you could perform math with those characters through
that method; but using the code points wouldn't accomplish that.
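A short Java check of how `Character.getNumericValue` behaves may help here. As documented for `java.lang.Character`, the Latin letters A-Z return 10 through 35 (their values as digits in bases up to 36), so the plain letters 'X' and 'I' do not return 10 and 1; only the dedicated Roman numeral code points such as U+2169 (ROMAN NUMERAL TEN) and U+2160 (ROMAN NUMERAL ONE) do. The class name `NumericValues` and the helper `viaSubtraction` are illustrative.

```java
final class NumericValues {
    // The plain subtraction idiom from earlier in the thread.
    static int viaSubtraction(char digit) {
        return digit - '0';
    }

    public static void main(String[] args) {
        System.out.println(viaSubtraction('3'));                  // 3
        System.out.println(Character.digit('3', 10));             // 3
        System.out.println(Character.getNumericValue('3'));       // 3
        // Latin letters are treated as digits in bases up to 36.
        System.out.println(Character.getNumericValue('X'));       // 33
        System.out.println(Character.getNumericValue('I'));       // 18
        // The dedicated Roman numeral code points carry the numeral's value.
        System.out.println(Character.getNumericValue('\u2169'));  // 10
        System.out.println(Character.getNumericValue('\u2160'));  // 1
        // Character.digit rejects anything that is not a digit in the radix.
        System.out.println(Character.digit('X', 10));             // -1
    }
}
```

When only decimal digits are expected, `Character.digit(ch, 10)` has the advantage of returning -1 for everything else instead of a surprising positive number.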

Earth shattering? ...

.. Steven Coco .
.........................................................................
Life is mysterious and we are creepy.

Steven Coco, Sep 10, 2003
14. ### Roedy Green (Guest)

On Wed, 10 Sep 2003 02:50:31 GMT, Steven Coco
wrote or quoted:

>>> : Character.getNumericValue(char
>>> ch)--returns the character's "numeric value".

>>
>> If you peek inside, you will see the code is NOT just - '0', which in
>> 99% of the cases is all you want.

>
>I'm not 100% sure what you meant there.

it is an elaborate tool that masquerades as a simple convenience
method for a subtraction. 99% of the time the elaborate features would
be surprising not delightful. In any case, you likely don't want this
method in time critical code.


Roedy Green, Sep 10, 2003
15. ### Steven Coco (Guest)

You know what's weird? Thinking about all those other programmers in
who-knows-what-country that might be writing java files in just about
any language: What are they doing when *they* need to use a character
literal in their source file? You know what I mean?

It's a bunch of messages, but in all of it, I'm just trying to know what
is the intended usage of these facilities so I can fit in the global
community.

--

.. Steven Coco .
.........................................................................
When you're not sure; "Confess your heart" says the Lord, "and you'll be
freed."

Steven Coco, Sep 10, 2003
16. ### John C. Bollinger (Guest)

Steven Coco wrote:
> You know what's weird? Thinking about all those other programmers in
> who-knows-what-country that might be writing java files in just about
> any language: What are they doing when *they* need to use a character
> literal in their source file? You know what I mean?

Well, identifiers and string and character literals can contain
arbitrary characters, which is I guess what you mean. Java keywords and
punctuation are universal. That means, for one thing, that a Java
source file cannot be written using a charset that does not support the
ASCII characters (although the encoding need not be congruent with ASCII
for the relevant characters).

> It's a bunch of messages, but in all of it, I'm just trying to know what
> is the intended usage of these facilities so I can fit in the global
> community.

The key is that the Java compiler must know what charset to apply to any
particular source file. If it can choose the correct one, by any means,
then everything is fine. It's really more an issue of the programming
tools than of the language. Once a class is compiled it is independent
of the source, and characters are characters are characters.
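The point above, that a compiled `char` is just a number and that charsets only matter at character-to-byte boundaries, can be illustrated with a small sketch. The class name `CharsetBoundary` and the helper `encode` are made up for this example; it uses only the standard charsets that every Java platform is required to provide.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CharsetBoundary {
    // Encode a single char into bytes using the given charset.
    static byte[] encode(char c, Charset charset) {
        return String.valueOf(c).getBytes(charset);
    }

    public static void main(String[] args) {
        char a = 'a';
        // The char value itself never depends on a charset.
        System.out.println((int) a); // 97
        // The byte form does depend on the charset chosen for I/O.
        System.out.println(Arrays.toString(encode(a, StandardCharsets.UTF_8)));    // [97]
        System.out.println(Arrays.toString(encode(a, StandardCharsets.UTF_16BE))); // [0, 97]

        char micro = '\u00B5'; // MICRO SIGN
        System.out.println(Arrays.toString(encode(micro, StandardCharsets.ISO_8859_1))); // [-75]
        System.out.println(Arrays.toString(encode(micro, StandardCharsets.UTF_8)));      // [-62, -75]
    }
}
```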

John Bollinger

John C. Bollinger, Sep 10, 2003
17. ### Thomas G. Marshall (Guest)

John C. Bollinger horrified us with:

> Steven Coco wrote:
>> You know what's weird? Thinking about all those other programmers in
>> who-knows-what-country that might be writing java files in just about
>> any language: What are they doing when *they* need to use a
>> character literal in their source file? You know what I mean?

>
> Well, identifiers and string and character literals can contain
> arbitrary characters, which is I guess what you mean. Java keywords
> and punctuation are universal. That means, for one thing, that a Java
> source file cannot be written using a charset that does not support
> the ASCII characters (although the encoding need not be congruent
> with ASCII for the relevant characters).

I know how you mean this, but I read the JLS slightly differently. I
believe that what it is saying is that the language itself must be
/precisely/ ASCII, not just "congruent" to it.

From JLS, "3.1 Unicode"

Except for comments (§3.7), identifiers, and the
contents of character and string literals (§3.10.4,
§3.10.5), all input elements (§3.5) in a program are
formed /only/ from ASCII characters (or Unicode
escapes (§3.3) which result in ASCII characters).
ASCII (ANSI X3.4) is the American Standard Code
for Information Interchange. The first 128 characters
of the Unicode character encoding are the ASCII
characters.

When it says "only from ASCII characters or Unicode escapes which result in
ASCII characters" it's giving only two possibilities: ASCII proper or the
ASCII that is the start of Unicode.

Thomas G. Marshall, Sep 11, 2003
18. ### Steven Coco (Guest)

Roedy Green wrote:

> Even if the spec does not specifically mention the mappings of a-z A-Z
> and 0-9 it would break all kinds of programs if some other encoding
> incompatible with ASCII-7 assignments were presumed.

You're probably not following this thread anymore, but I *just* "got"
some of this...

> Just the same, should still write code that does not presume a
>
> char digit = '3';
> int num = digit - '0';
>
> rather than
> int num = digit - 0x30;

The whole question can really be eliminated by knowing these points for
sure:

(a) When you insert a character literal of the form 'a' in your source
file, you are only (?) banking on that actual glyph in your source file
not becoming confused. This really depends more on the glyph than
anything else. Your source encoding will map that glyph to a Unicode
point, and there it stays unless something unexpected happens.

(b) Yet that may be possible with a glyph that is mapped to two code
points, like the Greek mu and the micro characters--as somebody
posted--both are mapped to the same glyph so you wouldn't know just by
looking.

(c) What's more, even though most people have noted that ASCII characters
are stable, predictable, etc., that does not exempt them from the
problem--when in the future the Unicode table contains a code point
labeled "First sub-item in a bulleted list" or something and it is
mapped to the glyph "a", then you have to be careful. In fact, right
now there is something like full-width Roman or ASCII characters in the
charts which might be able to fool you.

(d) When you read your source file and see int i = 'a', is that "Roman
small letter a" or our other character? That makes its (integer) value
different and would break your math--which is near the core of my
original question.
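The full-width look-alike mentioned above does exist: U+FF41 (FULLWIDTH LATIN SMALL LETTER A) renders much like 'a' but has a different value. The sketch below, with the made-up class name `EscapeCheck` and helper `isBoardFile`, shows that a Unicode escape and the plain literal denote the same `char`, and that a range check on 'a'..'h' rejects the look-alike rather than silently accepting it.

```java
final class EscapeCheck {
    // The a-h range check used for chess files in this thread.
    static boolean isBoardFile(char c) {
        return c >= 'a' && c <= 'h';
    }

    public static void main(String[] args) {
        // A Unicode escape and the plain literal denote the same char value.
        System.out.println('\u0061' == 'a');         // true
        // FULLWIDTH LATIN SMALL LETTER A looks similar but is a different code point.
        char fullWidthA = '\uFF41';
        System.out.println(fullWidthA == 'a');       // false
        System.out.println((int) fullWidthA);        // 65345
        System.out.println(isBoardFile('\u0061'));   // true
        System.out.println(isBoardFile(fullWidthA)); // false
    }
}
```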

Now, probably more safe is to restrict yourself to Unicode escape values
only; and not character literals. This way, you are banking only on the
Unicode code charts; and if you used those escapes throughout--including
your documentation--you will have pinned down the characters to numeric
values absolutely! Granted, it won't be as readable, but note this
point too:

I'm also thinking about things like making public classes; so where I've
written a method like this:

/** file must be one of a-h */
public void setLocation(char file, int rank) {
if (file < 'a' || file > 'h')
throw new Exception...

I've got to be sure about the character values throughout; including
what someone else might see on a different machine; so I'm trying to be
definitive--in some way.

In the end, it is up to you; the implementor; to understand and be sure

> There are codes such as EBCDIC where a-z and A-Z are not contiguous.
> In other words if you write:
>
> sum = 0;
> for ( char c = 'a'; c <= 'z'; c++ )
> {
> sum++;
> }
>
> sum won't necessarily be 26.

Fun! The good news is that the ascii code points are almost definitely
not volatile--that is what Unicode themselves say...

Peace; and good luck!
Steev.

--

..Steven Coco.
.........................................................................
When you're not sure:
"Confess your heart" says the Lord, "and you'll be freed."

Steven Coco, Sep 23, 2003
19. ### Thomas G. Marshall (Guest)

....[thwack!]...

> When you're not sure:
> "Confess your heart" says the Lord, "and you'll be freed."

"Confess thy reference" sayeth the gc, "and the will be freed."

Thomas G. Marshall, Sep 23, 2003
20. ### Roedy Green (Guest)

On Tue, 23 Sep 2003 02:18:23 GMT, "Thomas G. Marshall"
wrote or quoted:

>
>"Confess thy reference" sayeth the gc, "and the will be freed."

Renounce thy reference, sayeth the gc, and thou shalt be freed.
All will be freed eventually whether you do or not. Oblivion is
inevitable. Freedom == death.

--