"char math"?

Discussion in 'Java' started by Steven Coco, Sep 8, 2003.

  1. Steven Coco

    Steven Coco Guest

    Can someone definitively answer this:

    Given code such as this:

    char c = 'a';
    // (int) c == 97,
    // int i = 'b' - c == 1,
    // and, c >= 'a' && c <= 'h'

    (1) Are the commented assumptions guaranteed to be true
    programmatically? And,
    (2) Is arithmetic with primitive char values valid programming practice?

    The best I can do is this: based on the language and Unicode specs, such
    arithmetic simply depends on the Unicode code point for 'a'
    never changing (no matter what character encoding is used for the
    .java file).

    My "beef" is that: it would seem that the integer value of (char) c is
    dependent upon the Unicode spec.

    Even though the consortium guarantees that the code point for the
    character 'a' will never change for the life of the spec, a reliance on
    that spec would make this type of programming technically unstable.
    This would seem to be scary since Java's relational operators can be
    (and are) used with char values, as in the examples above.

    Comments?

    Peace,
    Steev.
    Steven Coco, Sep 8, 2003
    #1

  2. Steven Coco

    Roedy Green Guest

    On Mon, 08 Sep 2003 06:02:24 GMT, Steven Coco
    <> wrote or quoted :

    >
    > char c = 'a';
    > // (int) c == 97,
    > // int i = 'b' - c == 1,
    > // and, c >= 'a' && c <= 'h'


    everything works because chars are promoted to int without sign
    extension before almost any operation.
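
A minimal sketch of the promotion described above, assuming nothing beyond standard Java. The class name `CharPromotion` and its helper methods are made up for illustration; the point is only that `char` is an unsigned 16-bit type, so widening it to `int` never produces a negative number.

```java
final class CharPromotion {
    // Widening a char to int: char is unsigned, so the result is 0..65535.
    static int widen(char c) {
        return c;
    }

    // Both operands are promoted to int before the subtraction happens.
    static int distance(char from, char to) {
        return to - from;
    }

    public static void main(String[] args) {
        System.out.println(widen('a'));          // 97
        System.out.println(distance('a', 'b'));  // 1
        System.out.println(widen('\uFFFF'));     // 65535, not -1
        System.out.println((short) '\uFFFF');    // -1, because short is signed
    }
}
```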

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Sep 8, 2003
    #2

  3. Steven Coco <> horrified us with:

    > Can someone definitively answer this:
    >
    > Given code such as this:
    >
    > char c = 'a';
    > // (int) c == 97,
    > // int i = 'b' - c == 1,
    > // and, c >= 'a' && c <= 'h'
    >
    > (1) Are the commented assumptions guaranteed to be true
    > programmatically? And,
    > (2) Is arithmetic with primitive char values valid programming
    > practice?
    >
    > The best I can do is this: based on the language and Unicode specs, such
    > arithmetic simply depends on the Unicode code point for
    > 'a' never changing (no matter what character encoding is used
    > for the .java file).
    >
    > My "beef" is that: it would seem that the integer value of (char) c is
    > dependent upon the Unicode spec.
    >
    > Even though the consortium guarantees that the code point for the
    > character 'a' will never change for the life of the spec, a reliance
    > on that spec would make this type of programming technically unstable.
    > This would seem to be scary since Java's relational operators can be
    > (and are) used with char values, as in the examples above.



    I think I see the thrust of your question, and I'm really not sure. You may
    have to just trust that the ISO spec does not change.

    If there were such a thing as Character.a, which is 'a' regardless of where
    the ISO spec places it, then you might be able to use that as a stake in the
    ground. Seems to me I've bumped into this before but I cannot find it.

    See if
    java.lang.Character
    java.lang.Character.Subset
    java.lang.Character.UnicodeBlock
    help at all.

    (?)
    Thomas G. Marshall, Sep 8, 2003
    #3
  4. > (1) Are the commented assumptions guaranteed to be true
    > programmatically?


    Yes, for 'a' they are, but there are other Unicode characters that have
    more than one value. For example, the character 'µ' has different values depending
    on whether it means the Greek letter mu or the prefix 'micro'. So there is
    no guarantee that a second code will not be added for the same
    character.

    In practice this could mean that 'µ' != 'µ' if the two characters have been
    written with different editors. The same problem occurs for other
    characters, for example 'Å', which is a common character in Danish and Swedish.
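
A rough sketch of the look-alike problem, assuming a JDK that ships `java.text.Normalizer` (added in Java 6, so later than this thread). The class and constant names are invented for the example. It compares the micro sign with Greek mu, and A-with-ring with the angstrom sign, before and after NFKC normalization.

```java
import java.text.Normalizer;

final class LookalikeChars {
    static final char MICRO_SIGN = '\u00B5'; // MICRO SIGN
    static final char GREEK_MU   = '\u03BC'; // GREEK SMALL LETTER MU
    static final char A_RING     = '\u00C5'; // LATIN CAPITAL LETTER A WITH RING ABOVE
    static final char ANGSTROM   = '\u212B'; // ANGSTROM SIGN

    // True when the two chars become identical after NFKC normalization.
    static boolean sameAfterNfkc(char a, char b) {
        String na = Normalizer.normalize(String.valueOf(a), Normalizer.Form.NFKC);
        String nb = Normalizer.normalize(String.valueOf(b), Normalizer.Form.NFKC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        System.out.println(MICRO_SIGN == GREEK_MU);              // false: different code points
        System.out.println(sameAfterNfkc(MICRO_SIGN, GREEK_MU)); // true
        System.out.println(A_RING == ANGSTROM);                  // false
        System.out.println(sameAfterNfkc(A_RING, ANGSTROM));     // true
    }
}
```

Plain `==` on `char` compares code points only; whether normalizing first is appropriate depends on what the program treats as "the same character".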

    Niels Dybdahl
    Niels Dybdahl, Sep 8, 2003
    #4
  5. Steven Coco

    Steven Coco Guest

    Thomas G. Marshall wrote:

    > I think I see the thrust of your question,


    Yeah: I happened to be working on a Chess game problem--where squares
    on the board are uniquely identified by "file" and "rank" thusly: a1,
    a2, ... h8--and making accessor and mutator methods--as *well* as
    algorithms--is made --real fast-- with things like:

    'a' + 2 // yields the 'c' file

    and the bounds checking:

    if (charParam >= 'a' && charParam <= 'h')
    ok();

    --and you can implement int transformations like

    char transform(char file, int nFiles) {
        return (char) (file + nFiles); // given ('a', 2) yields 'c'
    }

    --basically using chars as unsigned "shorts"--that they in fact are.
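
The file arithmetic above could be sketched like this. `ChessFiles`, `isValidFile`, `shiftFile` and `fileIndex` are hypothetical names, and the range check on the shifted value is an added assumption (it is done in `int` so that a huge offset cannot wrap back into a-h after the cast).

```java
final class ChessFiles {
    // Bounds check relying on 'a'..'h' being contiguous code points.
    static boolean isValidFile(char file) {
        return file >= 'a' && file <= 'h';
    }

    // Moves a file letter by nFiles; the sum is checked as an int before casting.
    static char shiftFile(char file, int nFiles) {
        int shifted = file + nFiles;
        if (!isValidFile(file) || shifted < 'a' || shifted > 'h') {
            throw new IllegalArgumentException("off the board: " + file + " + " + nFiles);
        }
        return (char) shifted;
    }

    // Zero-based column index: 'a' is 0, 'h' is 7.
    static int fileIndex(char file) {
        return file - 'a';
    }

    public static void main(String[] args) {
        System.out.println(shiftFile('a', 2)); // c
        System.out.println(fileIndex('h'));    // 7
    }
}
```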


    What I found the language spec states is that: When processed: the
    character literal 'a' found in the source file will be translated to a
    16-bit unsigned integer value--the integral type char; and its value
    will be based on the version of Unicode used by the Java release
    interpreting the source file.

    What I really don't know--even after reading through the VM spec--is:
    what happens when the Unicode version used in the release interpreting
    that _class_ file maps 'a' to a different code position. (I found no
    mention of "Unicode version" translation--the char value is promoted to
    an int and operated upon by int operations throughout the VM.)

    I'm going to love this: I'm going to post some bug at java.sun.com and
    see if a definition comes back... But FWIW:

    Inside of java.lang.Character are constants defining Unicode points
    \u0000-\uFFFE; and methods that depend on those. The class
    documentation does state that it follows Unicode v 3.0. But if you
    happen to write something using that implementation, its behavior may
    be undefined under implementations using other Unicode versions--in
    truth, unless the VM translates the value stored in the class file from
    your implementing Unicode version to the interpreting version, the
    behavior would be undefined.

    Peace,
    Steev.

    ----
    Yes: Java will be 100000000000000000000000000000000% portable.....
    Steven Coco, Sep 8, 2003
    #5
  6. Steven Coco

    Steven Coco Guest

    Niels Dybdahl wrote:

    > Yes, for 'a' they are, but there are other Unicode characters that have
    > more than one value. For example, the character 'µ' has different values depending
    > on whether it means the Greek letter mu or the prefix 'micro'. So there is
    > no guarantee that a second code will not be added for the same
    > character.
    >
    > In practice this could mean that 'µ' != 'µ' if the two characters have been
    > written with different editors. The same problem occurs for other
    > characters, for example 'Å', which is a common character in Danish and Swedish.


    Ah! Excellent light shed on the subject.

    But I'm still scared witless because of the uncertainty of handling for
    different characters. I'll need to know what gets digested and stored
    into the class file to know anything.

    Peace,
    Steev.
    Steven Coco, Sep 8, 2003
    #6
  7. Steven Coco

    Roedy Green Guest

    On Mon, 08 Sep 2003 21:55:35 GMT, Steven Coco
    <> wrote or quoted :

    >maps 'a' to a different code position.


    Even if the spec does not specifically mention the mappings of a-z A-Z
    and 0-9 it would break all kinds of programs if some other encoding
    incompatible with ASCII-7 assignments were presumed.


    Just the same, you should still write code that does not presume a
    particular numeric value, just to make your code readable. E.g.


    char digit = '3';
    int num = digit - '0';

    rather than
    int num = digit - 0x30;

    There are codes such as EBCDIC where a-z and A-Z are not contiguous.
    In other words, if you write:

    int sum = 0;
    for ( char c = 'a'; c <= 'z'; c++ )
    {
        sum++;
    }

    on a system whose character codes follow such an encoding, sum won't
    necessarily be 26.
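
In Java itself a `char` is always a UTF-16 code unit, so the loop counts 26 regardless of platform. The sketch below shows that next to the same span computed from EBCDIC code values. The class name is made up, and the EBCDIC values for 'a' (0x81) and 'z' (0xA9) are hard-coded assumptions for illustration, not read from any charset API.

```java
final class AlphabetSpan {
    // Counts the chars from 'a' to 'z' inclusive using Java's own char values.
    static int countLowercaseLetters() {
        int sum = 0;
        for (char c = 'a'; c <= 'z'; c++) {
            sum++;
        }
        return sum;
    }

    // The same inclusive span using EBCDIC code values for 'a' and 'z'
    // (hard-coded here; the gaps between i/j and r/s inflate the count).
    static int ebcdicSpan() {
        final int ebcdicA = 0x81;
        final int ebcdicZ = 0xA9;
        return ebcdicZ - ebcdicA + 1;
    }

    public static void main(String[] args) {
        System.out.println(countLowercaseLetters()); // 26
        System.out.println(ebcdicSpan());            // 41
    }
}
```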

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Sep 9, 2003
    #7
  8. "Steven Coco" <> wrote in message
    news:dm77b.6074$...
    > Niels Dybdahl wrote:
    >
    > > Yes, for 'a' they are, but there are other Unicode characters that have
    > > more than one value. For example, the character 'µ' has different values
    > > depending on whether it means the Greek letter mu or the prefix 'micro'. So
    > > there is no guarantee that a second code will not be added for the same
    > > character.
    > >
    > > In practice this could mean that 'µ' != 'µ' if the two characters have
    > > been written with different editors. The same problem occurs for other
    > > characters, for example 'Å', which is a common character in Danish and Swedish.

    >
    > Ah! Excellent light shed on the subject.
    >
    > But I'm still scared witless because of the uncertainty of handling for
    > different characters. I'll need to know what gets digested and stored
    > into the class file to know anything.


    I'd suggest just not worrying about it.

    Assuming that the Unicode character set is static and omnipresent is not so
    dangerous an idea.

    If someone complains and tries to point out that there are other character
    sets, just smile smugly and ask them to prove it :)

    lol
    Thomas G. Marshall, Sep 9, 2003
    #8
  9. Steven Coco

    Steven Coco Guest

    Roedy Green wrote:

    >> maps 'a' to a different code position.

    >
    > Even if the spec does not specifically mention the mappings of a-z A-Z
    > and 0-9 it would break all kinds of programs if some other encoding
    > incompatible with ASCII-7 assignments were presumed.


    This is true; but unless the behavior is well-defined, your code would
    be technically not programmatically correct.


    > Just the same, you should still write code that does not presume a
    > particular numeric value, just to make your code readable. E.g.
    >
    > char digit = '3';
    > int num = digit - '0';


    There would appear to be a better way to do this particular thing--where
    you have actual numbers in the chars: Character.getNumericValue(char
    ch)--returns the character's "numeric value".

    Peace,
    Steev.
    Steven Coco, Sep 9, 2003
    #9
  10. Steven Coco

    Roedy Green Guest

    On Tue, 09 Sep 2003 01:32:52 GMT, Steven Coco
    <> wrote or quoted :

    >: Character.getNumericValue(char
    >ch)--returns the character's "numeric value".



    If you peek inside, you will see the code is NOT just - '0', which in
    99% of the cases is all you want.
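
A small comparison of the two approaches, as a sketch. `DigitValue` and `asciiDigit` are illustrative names; `Character.getNumericValue` and `Character.digit` are the real library methods. It shows where the library method does more than a subtraction.

```java
final class DigitValue {
    // Plain subtraction, guarded so it is only applied to '0'..'9'.
    static int asciiDigit(char c) {
        if (c < '0' || c > '9') {
            throw new IllegalArgumentException("not an ASCII digit: " + c);
        }
        return c - '0';
    }

    public static void main(String[] args) {
        System.out.println(asciiDigit('3'));                 // 3
        System.out.println(Character.getNumericValue('3'));  // 3
        System.out.println(Character.getNumericValue('a'));  // 10: letters count as digits
        System.out.println(Character.digit('a', 10));        // -1: not a digit in radix 10
        System.out.println(Character.digit('a', 16));        // 10
    }
}
```

`Character.digit(c, 10)` is a middle ground when a decimal digit is expected: it rejects letters instead of mapping them to 10 through 35.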

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Sep 9, 2003
    #10
  11. Steven Coco

    Steven Coco Guest

    Thomas G. Marshall wrote:

    > I'd suggest just not worrying about it.


    Tempting.

    Peace,
    Steev.
    Steven Coco, Sep 9, 2003
    #11
  12. Steven Coco wrote:
    > What I found the language spec states is that: When processed: the
    > character literal 'a' found in the source file will be translated to a
    > 16-bit unsigned integer value--the integral type char; and its value
    > will be based on the version of Unicode used by the Java release
    > interpreting the source file.
    >
    > What I really don't know--even after reading through the VM spec--is:
    > what happens when the Unicode version used in the release interpreting
    > that _class_ file maps 'a' to a different code position. (I found no
    > mention of "Unicode version" translation--the char value is promoted to
    > an int and operated upon by int operations throughout the VM.)


    The answer is "nothing unusual." Once the character literal is
    converted to an unsigned number for storage in a class file it is
    decoupled from its character representation in the source. You just
    have a 16-bit unsigned number, whose significance as a character code is
    wholly supplied by its context (in those very few circumstances in which
    it matters).

    Moreover, I think you are quite safe in assuming that future versions of
    Unicode will not remap the characters from the ASCII set. For the most
    part, what you should worry about is the charset with which the source
    is read, and likewise the charsets used when you do character I/O or
    other forms of character <-> byte interconversions. (That's in general;
    it is unlikely that you will need to worry about such issues with regard
    to your specific question.)
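
To illustrate the distinction between a char's numeric value and its byte encoding, here is a sketch assuming Java 7 or later for `java.nio.charset.StandardCharsets`. `CharVersusBytes` and `encode` are hypothetical names. The char value stays fixed; only the bytes produced at the I/O boundary depend on the charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharVersusBytes {
    // Encodes a single char with the given charset.
    static byte[] encode(char c, Charset charset) {
        return String.valueOf(c).getBytes(charset);
    }

    public static void main(String[] args) {
        // The literal and its Unicode escape are the same 16-bit number.
        System.out.println('a' == '\u0061');                                      // true
        System.out.println((int) 'a');                                            // 97

        // The bytes differ by charset, the char does not.
        System.out.println(encode('a', StandardCharsets.US_ASCII).length);        // 1
        System.out.println(encode('a', StandardCharsets.UTF_16BE).length);        // 2
        System.out.println(encode('\u00B5', StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(encode('\u00B5', StandardCharsets.UTF_8).length);      // 2
    }
}
```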


    John Bollinger
    John C. Bollinger, Sep 9, 2003
    #12
  13. Steven Coco

    Steven Coco Guest

    Roedy Green wrote:

    >> : Character.getNumericValue(char
    >> ch)--returns the character's "numeric value".

    >
    > If you peek inside, you will see the code is NOT just - '0', which in
    > 99% of the cases is all you want.


    I'm not 100% sure what you meant there. What I was sharing is what I
    gathered about their intention with that method: It gives you the
    facility to get a character's numeric value where it has one. In the
    case of '0', it's 0; but it applies to all characters.

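    For example: the Roman numeral characters U+2169 (ten) and U+2160 (one)
    would yield the numeric values 10 and 1, so you could perform math with
    those characters through that method; but using the code points wouldn't
    accomplish that. (The ASCII letters 'X' and 'I' are treated as digits in
    higher radixes instead and yield 33 and 18.)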
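
A short sketch of that behaviour. `NumericValues` and `sumOfNumericValues` are invented names, and the helper simply adds up per-character numeric values; it is not a Roman numeral parser.

```java
final class NumericValues {
    // Adds Character.getNumericValue for each char; rejects chars without a value.
    static int sumOfNumericValues(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            int value = Character.getNumericValue(s.charAt(i));
            if (value < 0) {
                throw new IllegalArgumentException("no numeric value at index " + i);
            }
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // U+2169 ROMAN NUMERAL TEN followed by U+2160 ROMAN NUMERAL ONE
        System.out.println(sumOfNumericValues("\u2169\u2160")); // 11
        // The ASCII letters X and I are radix-36 style digits: 33 + 18
        System.out.println(sumOfNumericValues("XI"));           // 51
    }
}
```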

    Earth shattering? ...

    .. Steven Coco .
    .........................................................................
    Life is mysterious and we are creepy.
    Steven Coco, Sep 10, 2003
    #13
  14. Steven Coco

    Roedy Green Guest

    On Wed, 10 Sep 2003 02:50:31 GMT, Steven Coco
    <> wrote or quoted :

    >>> : Character.getNumericValue(char
    >>> ch)--returns the character's "numeric value".

    >>
    >> If you peek inside, you will see the code is NOT just - '0', which in
    >> 99% of the cases is all you want.

    >
    >I'm not 100% sure what you meant there.


    It is an elaborate tool that masquerades as a simple convenience
    method for a subtraction. 99% of the time the elaborate features would
    be surprising, not delightful. In any case, you likely don't want this
    method in time-critical code.


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Sep 10, 2003
    #14
  15. Steven Coco

    Steven Coco Guest

    You know what's weird? Thinking about all those other programmers in
    who-knows-what-country that might be writing Java files in just about
    any language: What are they doing when *they* need to use a character
    literal in their source file? You know what I mean?

    It's a bunch of messages, but in all of it, I'm just trying to know what
    is the intended usage of these facilities so I can fit in the global
    community.

    --

    .. Steven Coco .
    .........................................................................
    When you're not sure; "Confess your heart" says the Lord, "and you'll be
    freed."
    Steven Coco, Sep 10, 2003
    #15
  16. Steven Coco wrote:
    > You know what's weird? Thinking about all those other programmers in
    > who-knows-what-country that might be writing Java files in just about
    > any language: What are they doing when *they* need to use a character
    > literal in their source file? You know what I mean?


    Well, identifiers and string and character literals can contain
    arbitrary characters, which is I guess what you mean. Java keywords and
    punctuation are universal. That means, for one thing, that a Java
    source file cannot be written using a charset that does not support the
    ASCII characters (although the encoding need not be congruent with ASCII
    for the relevant characters).

    > It's a bunch of messages, but in all of it, I'm just trying to know what
    > is the intended usage of these facilities so I can fit in the global
    > community.


    The key is that the Java compiler must know what charset to apply to any
    particular source file. If it can choose the correct one, by any means,
    then everything is fine. It's really more an issue of the programming
    tools than of the language. Once a class is compiled it is independent
    of the source, and characters are characters are characters.


    John Bollinger
    John C. Bollinger, Sep 10, 2003
    #16
  17. John C. Bollinger <> horrified us with:

    > Steven Coco wrote:
    >> You know what's weird? Thinking about all those other programmers in
    >> who-knows-what-country that might be writing Java files in just about
    >> any language: What are they doing when *they* need to use a
    >> character literal in their source file? You know what I mean?

    >
    > Well, identifiers and string and character literals can contain
    > arbitrary characters, which is I guess what you mean. Java keywords
    > and punctuation are universal. That means, for one thing, that a Java
    > source file cannot be written using a charset that does not support
    > the ASCII characters (although the encoding need not be congruent
    > with ASCII for the relevant characters).


    I know how you mean this, but I read the JLS slightly differently. I
    believe that what it is saying is that the language itself must be
    /precisely/ ASCII, not just "congruent" to it.

    From JLS, "3.1 Unicode"

    Except for comments (§3.7), identifiers, and the
    contents of character and string literals (§3.10.4,
    §3.10.5), all input elements (§3.5) in a program are
    formed /only/ from ASCII characters (or Unicode
    escapes (§3.3) which result in ASCII characters).
    ASCII (ANSI X3.4) is the American Standard Code
    for Information Interchange. The first 128 characters
    of the Unicode character encoding are the ASCII
    characters.

    When it says "only from ASCII characters (or Unicode escapes which result in
    ASCII characters)" it's giving only two possibilities: ASCII proper or the
    ASCII that is the start of Unicode.
    Thomas G. Marshall, Sep 11, 2003
    #17
  18. Steven Coco

    Steven Coco Guest

    Roedy Green wrote:

    > Even if the spec does not specifically mention the mappings of a-z A-Z
    > and 0-9 it would break all kinds of programs if some other encoding
    > incompatible with ASCII-7 assignments were presumed.


    You're probably not following this thread anymore, but I *just* "got"
    some of this...


    > Just the same, you should still write code that does not presume a
    > particular numeric value, just to make your code readable. E.g.
    >
    > char digit = '3';
    > int num = digit - '0';
    >
    > rather than
    > int num = digit - 0x30;


    The whole question can really be eliminated by knowing these points for
    sure:

    (a) When you insert a character literal of the form 'a' in your source
    file, you are only (?) banking on that actual glyph in your source file
    not becoming confused. This really depends more on the glyph than
    anything else. Your source encoding will map that glyph to a Unicode
    code point, and there it stays unless something unexpected happens.

    (b) Yet confusion may be possible with a glyph that is mapped to two
    code points, like the Greek mu and the micro sign--as somebody
    posted--both are drawn with the same glyph, so you wouldn't know just
    by looking.

    (c) What's more, even though most people have noted that ASCII characters
    are stable, predictable, etc., that does not exempt them from the
    problem: if in the future the Unicode table contains a code point
    labeled "First sub-item in a bulleted list" or something and it is
    drawn with the glyph "a", then you have to be careful. In fact, right
    now there are full-width Latin (ASCII) characters in the
    charts which might be able to fool you.

    (d) When you read your source file and see int i = 'a', is that "Latin
    small letter a" or our other character? That makes its (integer) value
    different and would break your math--which is near the core of my
    original question.
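
One way to find out which character a glyph really is: print its code point. The sketch below uses made-up names (`FullwidthCheck`, `describe`, `isChessFile`) and compares ordinary 'a' with the full-width form at U+FF41, which looks similar but fails an 'a'..'h' range check.

```java
final class FullwidthCheck {
    static final char LATIN_A = 'a';          // U+0061 LATIN SMALL LETTER A
    static final char FULLWIDTH_A = '\uFF41'; // FULLWIDTH LATIN SMALL LETTER A

    // Formats a char as U+XXXX so look-alike glyphs can be told apart.
    static String describe(char c) {
        return String.format("U+%04X", (int) c);
    }

    static boolean isChessFile(char c) {
        return c >= 'a' && c <= 'h';
    }

    public static void main(String[] args) {
        System.out.println(describe(LATIN_A));        // U+0061
        System.out.println(describe(FULLWIDTH_A));    // U+FF41
        System.out.println(isChessFile(LATIN_A));     // true
        System.out.println(isChessFile(FULLWIDTH_A)); // false
    }
}
```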

    Now, probably more safe is to restrict yourself to Unicode escape values
    only; and not character literals. This way, you are banking only on the
    Unicode code charts; and if you used those escapes throughout--including
    your documentation--you will have pinned down the characters to numeric
    values absolutely! Granted, it won't be as readable, but note this
    point too:

    I'm also thinking about things like making public classes; so where I've
    written a method like this:

    /** file must be one of a-h */
    public void setLocation(char file, int rank) {
    if (file < 'a' || file > 'h')
    throw new Exception...

    I've got to be sure about the character values throughout; including
    what someone else might see on a different machine; so I'm trying to be
    definitive--in some way.
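
A fuller version of such a method might look like the sketch below. The class name `BoardLocation`, the rank check, and the choice of `IllegalArgumentException` are assumptions added for the example; only the a-h test on the file comes from the snippet above. The Javadoc names the code points so a caller on any machine knows exactly which characters are accepted.

```java
final class BoardLocation {
    private char file = 'a';
    private int rank = 1;

    /**
     * Sets the square. file must be one of a-h (U+0061 to U+0068);
     * rank must be 1-8 (the rank check is an assumption of this sketch).
     */
    public void setLocation(char file, int rank) {
        if (file < 'a' || file > 'h') {
            throw new IllegalArgumentException("file must be one of a-h");
        }
        if (rank < 1 || rank > 8) {
            throw new IllegalArgumentException("rank must be 1-8");
        }
        this.file = file;
        this.rank = rank;
    }

    @Override
    public String toString() {
        return "" + file + rank;
    }

    public static void main(String[] args) {
        BoardLocation square = new BoardLocation();
        square.setLocation('c', 3);
        System.out.println(square); // c3
    }
}
```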

    In the end, it is up to you; the implementor; to understand and be sure
    about what you write.

    > There are codes such as EBCDIC where a-z and A-Z are not contiguous.
    > In other words if you write:
    >
    > sum = 0;
    > for ( char c = 'a'; c <= 'z'; c++ )
    > {
    > sum++;
    > }
    >
    > sum won't necessarily be 26.


    Fun! The good news is that the ASCII code points are almost definitely
    not volatile--that is what Unicode themselves say...

    Peace; and good luck!
    Steev.

    --

    ..Steven Coco.
    .........................................................................
    When you're not sure:
    "Confess your heart" says the Lord, "and you'll be freed."
    Steven Coco, Sep 23, 2003
    #18
  19. ....[thwack!]...

    > When you're not sure:
    > "Confess your heart" says the Lord, "and you'll be freed."



    "Confess thy reference" sayeth the gc, "and the will be freed."
    Thomas G. Marshall, Sep 23, 2003
    #19
  20. Steven Coco

    Roedy Green Guest

    On Tue, 23 Sep 2003 02:18:23 GMT, "Thomas G. Marshall"
    <> wrote or quoted
    :

    >
    >"Confess thy reference" sayeth the gc, "and the will be freed."


    Renounce thy reference, sayeth the gc, and thou shalt be freed.
    All will be freed eventually whether you do or not. Oblivion is
    inevitable. Freedom == death.


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Sep 23, 2003
    #20
