"char math"?

Steven Coco

Can someone definitively answer this:

Given code such as this:

char c = 'a';
// (int) c == 97,
// int i = 'b' - c == 1,
// and, c >= 'a' && c <= 'h'

(1) Are the commented assumptions guaranteed to be true
programmatically? And,
(2) Is arithmetic with primitive char values valid programming practice?

The best I can do is: based on the language and Unicode specs, such
arithmetic simply depends on the Unicode code point for 'a' never
changing (no matter what character encoding is used for the .java
file).

My "beef" is that: it would seem that the integer value of (char) c is
dependent upon the Unicode spec.

Even though the consortium guarantees that the code point for the
character 'a' will never change for the life of the spec, a reliance on
that spec would make this type of programming technically unstable.
This would seem to be scary since Java's relational operators can be
(and are) used with char values like those pointed out above.

Comments?

Peace,
Steev.
 
Roedy Green

char c = 'a';
// (int) c == 97,
// int i = 'b' - c == 1,
// and, c >= 'a' && c <= 'h'

everything works because chars are promoted to int without sign
extension before almost any operation.
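
Roedy's point about promotion can be seen in a tiny complete program (a
sketch; the class name is mine):

```java
public class CharMath {
    public static void main(String[] args) {
        char c = 'a';
        int code = c;           // widening char -> int, no sign extension: 97
        int diff = 'b' - c;     // both operands promoted to int before subtracting: 1
        boolean inRange = c >= 'a' && c <= 'h'; // comparison also works on the promoted ints
        System.out.println(code + " " + diff + " " + inRange); // 97 1 true
    }
}
```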
 
Thomas G. Marshall

Steven Coco said:
Can someone definitively answer this:

Given code such as this:

char c = 'a';
// (int) c == 97,
// int i = 'b' - c == 1,
// and, c >= 'a' && c <= 'h'

(1) Are the commented assumptions guaranteed to be true
programmatically? And,
(2) Is arithmetic with primitive char values valid programming
practice?

The best I can do is: based on the language and Unicode specs, such
arithmetic simply depends on the Unicode code point for
'a' never changing (no matter what character encoding is used
for the .java file).

My "beef" is that: it would seem that the integer value of (char) c is
dependent upon the Unicode spec.

Even though the consortium guarantees that the code point for the
character 'a' will never change for the life of the spec, a reliance
on that spec would make this type of programming technically unstable.
This would seem to be scary since Java's relational operators can be
(and are) used with char values like those pointed out above.


I think I see the thrust of your question, and I'm really not sure. You may
have to just trust that the ISO spec does not change.

If there were such a thing as Character.a, which is 'a' regardless of where
the ISO spec places it, then you might be able to use that as a stake in the
ground. Seems to me I've bumped into this before but I cannot find it.

See if
java.lang.Character
java.lang.Character.Subset
java.lang.Character.UnicodeBlock
help at all.

(?)
 
Niels Dybdahl

(1) Are the commented assumptions guaranteed to be true
programmatically?

Yes, for 'a' they are, but there are other Unicode characters which have
more than one value. For example, the character 'µ' has different code
points depending on whether it means the Greek letter 'mu' (U+03BC) or
the 'micro' prefix (U+00B5). So there is no guarantee that a second code
will not be added for the same character.

In practice this could mean that 'µ' != 'µ' if the two characters have
been written with different editors. The same problem occurs for other
characters, e.g. 'Å', which is a common character in Danish and Swedish.

Niels Dybdahl
 
Steven Coco

Thomas said:
I think I see the thrust of your question,

Yeah: I happened to be working on a chess game problem--where squares
on the board are uniquely identified by "file" and "rank" thusly: a1,
a2, ... h8--and writing accessor and mutator methods--as *well* as
algorithms--goes *real fast* with things like:

(char) ('a' + 2) // yields the 'c' file

and the bounds checking:

if (charParam >= 'a' && charParam <= 'h')
    ok();

--and you can implement int transformations like

char transform(char file, int nFiles) {
    return (char) (file + nFiles); // given ('a', 2) yields 'c'
}

--basically using chars as unsigned "shorts"--that they in fact are.
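
A minimal runnable sketch of the chess-file arithmetic described above;
the class and method names (Board, isValidFile, transform) are
illustrative, not from any real API:

```java
public class Board {
    /** True when file is one of 'a'..'h'. */
    static boolean isValidFile(char file) {
        return file >= 'a' && file <= 'h';
    }

    /** Shift a file by nFiles columns, e.g. ('a', 2) yields 'c'. */
    static char transform(char file, int nFiles) {
        // file + nFiles is an int; cast back to char for the result
        return (char) (file + nFiles);
    }

    public static void main(String[] args) {
        System.out.println(transform('a', 2)); // c
        System.out.println(isValidFile('h'));  // true
        System.out.println(isValidFile('i'));  // false
    }
}
```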


What I found the language spec states is: when processed, the
character literal 'a' found in the source file will be translated to a
16-bit unsigned integer value--the integral type char--and its value
will be based on the version of Unicode used by the Java release
interpreting the source file.

What I really don't know--even after reading through the VM spec--is
what happens when the Unicode version used by the release interpreting
that _class_ file maps 'a' to a different code position. (I found no
mention of "Unicode version" translation--the char value is promoted to
an int and operated upon by int operations throughout the VM.)

I'm going to love this: I'm going to post some bug at java.sun.com and
see if a definition comes back... But FWIW:

Inside of java.lang.Character are constants defining Unicode points
\u0000-\uFFFE, and methods that depend on those. The class
documentation does state that it follows Unicode v3.0. But if you
happen to write something using that implementation, its behavior may
be undefined under implementations using other Unicode versions--in
truth, unless the VM translates the value stored in the class file from
your implementing Unicode version to the interpreting version, the
behavior would be undefined.

Peace,
Steev.
 
Steven Coco

Niels said:
Yes, for 'a' they are, but there are other Unicode characters which have
more than one value. For example, the character 'µ' has different code
points depending on whether it means the Greek letter 'mu' (U+03BC) or
the 'micro' prefix (U+00B5). So there is no guarantee that a second code
will not be added for the same character.

In practice this could mean that 'µ' != 'µ' if the two characters have
been written with different editors. The same problem occurs for other
characters, e.g. 'Å', which is a common character in Danish and Swedish.

Ah! Excellent light shed on the subject.

But I'm still scared witless because of the uncertainty of handling for
different characters. I'll need to know what gets digested and stored
into the class file before I can know anything.

Peace,
Steev.
 
Roedy Green

maps 'a' to a different code position.

Even if the spec does not specifically mention the mappings of a-z A-Z
and 0-9 it would break all kinds of programs if some other encoding
incompatible with ASCII-7 assignments were presumed.


Just the same, you should still write code that does not presume a
particular numeric value, just to make your code readable. E.g.


char digit = '3';
int num = digit - '0';

rather than
int num = digit - 0x30;

There are codes such as EBCDIC where a-z and A-Z are not contiguous.
In other words if you write:

int sum = 0;
for (char c = 'a'; c <= 'z'; c++) {
    sum++;
}

sum won't necessarily be 26. (In Java, where chars are always Unicode,
it will be; the warning applies to languages whose chars follow the
platform encoding.)
 
Thomas G. Marshall

Steven Coco said:
Ah! Excellent light shed on the subject.

But I'm still scared witless because of the uncertainty of handling for
different characters. I'll need to know what gets digested and stored
into the class file before I can know anything.

I'd suggest just not worrying about it.

Assuming that the Unicode character set is static and omnipresent is not
so dangerous an idea.

If someone complains and tries to point out that there are other character
sets, just smile smugly and ask them to prove it :)

lol
 
Steven Coco

Roedy said:
Even if the spec does not specifically mention the mappings of a-z A-Z
and 0-9 it would break all kinds of programs if some other encoding
incompatible with ASCII-7 assignments were presumed.

This is true; but unless the behavior is well-defined, your code would
not be technically correct.

Just the same, should still write code that does not presume a
particular numeric value, just to make your code readable. E.g.

char digit = '3';
int num = digit - '0';

There would appear to be a better way to do this particular thing--where
the chars actually hold numbers: Character.getNumericValue(char ch)
returns the character's "numeric value".

Peace,
Steev.
 
Roedy Green

: Character.getNumericValue(char
ch)--returns the character's "numeric value".


If you peek inside, you will see the code is NOT just - '0', which in
99% of the cases is all you want.
 
John C. Bollinger

Steven said:
What I found the language spec states is: when processed, the
character literal 'a' found in the source file will be translated to a
16-bit unsigned integer value--the integral type char--and its value
will be based on the version of Unicode used by the Java release
interpreting the source file.

What I really don't know--even after reading through the VM spec--is
what happens when the Unicode version used by the release interpreting
that _class_ file maps 'a' to a different code position. (I found no
mention of "Unicode version" translation--the char value is promoted to
an int and operated upon by int operations throughout the VM.)

The answer is "nothing unusual." Once the character literal is
converted to an unsigned number for storage in a class file it is
decoupled from its character representation in the source. You just
have a 16-bit unsigned number, whose significance as a character code is
wholly supplied by its context (in those very few circumstances in which
it matters).

Moreover, I think you are quite safe in assuming that future versions of
Unicode will not remap the characters from the ASCII set. For the most
part, what you should worry about is the charset with which the source
is read, and likewise the charsets used when you do character I/O or
other forms of character <-> byte interconversions. (That's in general;
it is unlikely that you will need to worry about such issues with regard
to your specific question.)
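
John's advice about the char <-> byte boundary can be sketched with
explicit charsets. (This uses java.nio.charset.StandardCharsets, which
postdates this thread; it is a modern equivalent of passing "UTF-8" by
name, and the class name below is mine.)

```java
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        // Being explicit at every char <-> byte conversion avoids
        // surprises from the platform default charset.
        String original = "a\u00B5"; // 'a' plus MICRO SIGN U+00B5
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(original.equals(decoded)); // true
        System.out.println(utf8.length); // 3: 'a' is 1 byte, U+00B5 is 2 in UTF-8
    }
}
```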


John Bollinger
 
Steven Coco

Roedy said:
If you peek inside, you will see the code is NOT just - '0', which in
99% of the cases is all you want.

I'm not 100% sure what you meant there. What I was sharing is what I
gathered about their intention with that method: It gives you the
facility to get a character's numeric value where it has one. In the
case of '0', it's 0; but it applies to all characters.

For example: the dedicated Roman numeral characters '\u2169' (Ⅹ) and
'\u2160' (Ⅰ) yield the numeric values 10 and 1, so you could perform
math with those characters through that method; using the code points
alone wouldn't accomplish that. (The ASCII letters 'X' and 'I' instead
yield 33 and 18, since A-Z map to the radix digit values 10-35.)
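
A small demonstration of the difference between plain subtraction and
Character.getNumericValue (the class name is mine):

```java
public class NumericValues {
    public static void main(String[] args) {
        char digit = '3';
        System.out.println(digit - '0');                         // 3
        System.out.println(Character.getNumericValue(digit));    // 3
        // ASCII letters map to the radix digit values 10-35, not Roman values:
        System.out.println(Character.getNumericValue('X'));      // 33
        // The dedicated character ROMAN NUMERAL TEN does yield 10:
        System.out.println(Character.getNumericValue('\u2169')); // 10
    }
}
```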

Earth shattering? ...

.. Steven Coco .
.........................................................................
Life is mysterious and we are creepy.
 
Roedy Green

I'm not 100% sure what you meant there.

it is an elaborate tool that masquerades as a simple convenience
method for a subtraction. 99% of the time the elaborate features would
be surprising, not delightful. In any case, you likely don't want this
method in time-critical code.
 
Steven Coco

You know what's weird? Thinking about all those other programmers in
who-knows-what-country who might be writing Java files in just about
any language: what are they doing when *they* need to use a character
literal in their source file? You know what I mean?

It's a bunch of messages, but in all of it, I'm just trying to know what
is the intended usage of these facilities so I can fit in the global
community.

--

.. Steven Coco .
.........................................................................
When you're not sure; "Confess your heart" says the Lord, "and you'll be
freed."
 
John C. Bollinger

Steven said:
You know what's weird? Thinking about all those other programmers in
who-knows-what-country who might be writing Java files in just about
any language: what are they doing when *they* need to use a character
literal in their source file? You know what I mean?

Well, identifiers and string and character literals can contain
arbitrary characters, which is I guess what you mean. Java keywords and
punctuation are universal. That means, for one thing, that a Java
source file cannot be written using a charset that does not support the
ASCII characters (although the encoding need not be congruent with ASCII
for the relevant characters).
It's a bunch of messages, but in all of it, I'm just trying to know what
is the intended usage of these facilities so I can fit in the global
community.

The key is that the Java compiler must know what charset to apply to any
particular source file. If it can choose the correct one, by any means,
then everything is fine. It's really more an issue of the programming
tools than of the language. Once a class is compiled it is independent
of the source, and characters are characters are characters.


John Bollinger
 
Thomas G. Marshall

John C. Bollinger said:
Well, identifiers and string and character literals can contain
arbitrary characters, which is I guess what you mean. Java keywords
and punctuation are universal. That means, for one thing, that a Java
source file cannot be written using a charset that does not support
the ASCII characters (although the encoding need not be congruent
with ASCII for the relevant characters).

I know how you mean this, but I read the JLS slightly differently. I
believe that what it is saying is that the language itself must be
/precisely/ ASCII, not just "congruent" to it.

From JLS, "3.1 Unicode"

Except for comments (§3.7), identifiers, and the
contents of character and string literals (§3.10.4,
§3.10.5), all input elements (§3.5) in a program are
formed /only/ from ASCII characters (or Unicode
escapes (§3.3) which result in ASCII characters).
ASCII (ANSI X3.4) is the American Standard Code
for Information Interchange. The first 128 characters
of the Unicode character encoding are the ASCII
characters.

When it says "only from ASCII characters or Unicode escapes which result
in ASCII characters" it's giving only two possibilities: ASCII proper,
or the ASCII range at the start of Unicode.
 
Steven Coco

Roedy said:
Even if the spec does not specifically mention the mappings of a-z A-Z
and 0-9 it would break all kinds of programs if some other encoding
incompatible with ASCII-7 assignments were presumed.

You're probably not following this thread anymore, but I *just* "got"
some of this...

Just the same, should still write code that does not presume a
particular numeric value, just to make your code readable. E.g.

char digit = '3';
int num = digit - '0';

rather than
int num = digit - 0x30;

The whole question can really be eliminated by knowing these points for
sure:

(a) When you insert a character literal of the form 'a' in your source
file, you are only (?) banking on that actual glyph in your source file
not becoming confused. This really depends more on the glyph than
anything else. Your source encoding will map that glyph to a Unicode
point, and there it stays unless something unexpected happens.

(b) Yet that may be possible with a glyph that maps to two code points,
like the Greek mu and the micro characters--as somebody posted--both are
rendered with the same glyph, so you wouldn't know just by looking.

(c) What's more, even though most people have noted that ASCII
characters are stable, predictable, etc., that does not exclude them
from the problem--if in the future the Unicode table contains a code
point labeled "First sub-item in a bulleted list" or something and it is
mapped to the glyph "a", then you have to be careful. In fact, right now
there are things like the fullwidth Latin characters in the charts which
might be able to fool you.

(d) When you read your source file and see int i = 'a', is that "Latin
small letter a" or our other character? That makes its (integer) value
different and would break your math--which is near the core of my
original question.
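
Point (d) can be demonstrated directly: two characters that may render
near-identically are still distinct code points with distinct integer
values (the class name is mine):

```java
public class Lookalikes {
    public static void main(String[] args) {
        char ascii = 'a';          // U+0061 LATIN SMALL LETTER A
        char fullwidth = '\uFF41'; // U+FF41 FULLWIDTH LATIN SMALL LETTER A
        System.out.println((int) ascii);        // 97
        System.out.println((int) fullwidth);    // 65345
        System.out.println(ascii == fullwidth); // false
    }
}
```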

Now, probably safer is to restrict yourself to Unicode escape values
only, and not character literals. This way you are banking only on the
Unicode code charts; and if you used those escapes throughout--including
your documentation--you would have pinned down the characters to numeric
values absolutely! Granted, it won't be as readable, but note this
point too:

I'm also thinking about things like making public classes; so where I've
written a method like this:

/** file must be one of a-h */
public void setLocation(char file, int rank) {
    if (file < 'a' || file > 'h')
        throw new IllegalArgumentException...

I've got to be sure about the character values throughout; including
what someone else might see on a different machine; so I'm trying to be
definitive--in some way.

In the end, it is up to you, the implementor, to understand and be sure
about what you write.
There are codes such as EBCDIC where a-z and A-Z are not contiguous.
In other words if you write:

int sum = 0;
for (char c = 'a'; c <= 'z'; c++) {
    sum++;
}

sum won't necessarily be 26.

Fun! The good news is that the ASCII code points are almost definitely
not volatile--that is what Unicode themselves say...

Peace; and good luck!
Steev.

--

..Steven Coco.
.........................................................................
When you're not sure:
"Confess your heart" says the Lord, "and you'll be freed."
 
Thomas G. Marshall

....[thwack!]...
When you're not sure:
"Confess your heart" says the Lord, "and you'll be freed."


"Confess thy reference" sayeth the gc, "and thee will be freed."
 
Roedy Green

"Confess thy reference" sayeth the gc, "and thee will be freed."

Renounce thy reference, sayeth the gc, and thou shalt be freed.
All will be freed eventually whether you do or not. Oblivion is
inevitable. Freedom == death.
 
