32-bit characters in Java string literals

Discussion in 'Java' started by Roedy Green, Dec 22, 2009.

  1. Roedy Green

    Roedy Green Guest

    Let's say you wanted to include some 32-bit characters in Java String
    literals.

    I understand what the stream would look like in UTF-8 or an int[], but
    what I am curious about is the cleanest way to create string literals
    in a Java program containing such awkward characters.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    If you think it’s expensive to hire a professional to do the job, wait until you hire an amateur.
    ~ Red Adair (born: 1915-06-18 died: 2004-08-07 at age: 89)
     
    Roedy Green, Dec 22, 2009
    #1

  2. Roedy Green

    Roedy Green Guest

    On 22 Dec 2009 20:47:39 GMT, Thomas Pornin <> wrote,
    quoted or indirectly quoted someone who said :

    >E.g., if you want to have a String literal with U+10C22 (that's
    >OLD TURKIC LETTER ORKHON EM; it somewhat looks like a fish),
    >then you first convert 0x10C22 to a surrogate pair:
    > 1. subtract 0x10000: you get 0xC22
    > 2. get the upper (u) and lower (l) 10 bits; you get u=0x3 and l=0x022
    > (i.e. (u << 10) + l == 0xC22)
    > 3. the high surrogate is 0xD800 + u, the low surrogate is 0xDC00 + l.


    That is what I was afraid of. I am doing that now to generate tables
    of char entities and the equivalent hex and \u entities on various
    pages of mindprod.com, e.g. http://mindprod.com/jgloss/html5.html
    which shows the new HTML entities in HTML 5.

    Here is my code:

    final int extract = theCharNumber - 0x10000;
    final int high = ( extract >>> 10 ) + 0xd800;
    final int low = ( extract & 0x3ff ) + 0xdc00;
    sb.append( "&quot;\\u" );
    sb.append( StringTools.toLzHexString( high, 4 ) );
    sb.append( "\\u" );
    sb.append( StringTools.toLzHexString( low, 4 ) );
    sb.append( "&quot;" );
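    The same conversion can be sketched using only JDK calls, for readers
    who do not have the StringTools helper. This is a minimal sketch, not
    the code used on mindprod.com: the names SurrogateEscape, toJavaEscape
    and toJavaEscapeByHand are made up for illustration, and it emits the
    bare escape text without the surrounding &quot; entities.

```java
final class SurrogateEscape {
    /** Returns backslash-u escape text for one code point (one or two escapes). */
    static String toJavaEscape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars yields one char for the BMP and a
        // (high, low) surrogate pair for code points above U+FFFF.
        for (char c : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    /** The same split done by hand, following the arithmetic in the post. */
    static String toJavaEscapeByHand(int codePoint) {
        int extract = codePoint - 0x10000;
        int high = (extract >>> 10) + 0xD800;   // upper 10 bits
        int low = (extract & 0x3FF) + 0xDC00;   // lower 10 bits
        return String.format("\\u%04X\\u%04X", high, low);
    }

    public static void main(String[] args) {
        // Both lines should print the same pair of escapes for U+10C22.
        System.out.println(toJavaEscape(0x10C22));
        System.out.println(toJavaEscapeByHand(0x10C22));
    }
}
```

    The by-hand variant is only meaningful for code points in the range
    0x10000 to 0x10FFFF; the toChars variant also handles the BMP.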


    I started to think about what would be needed to make this less
    onerous.

    1. an applet to convert hex to a surrogate pair.

    2. allow \u12345 in string literals. However that would break
    existing code. \u12345 currently means
    "\u1234" + "5".

    3. So you have to pick another letter, e.g. \c12345; for codepoint. It
    needs a terminator, so that in future it could also handle \c123456;
    I don't know what that might break.

    4. Introduce 32-bit CodePoint string literals with extensible \u
    mechanism. E.g. CString b = c"\u12345;Hello";

    5. specify weird chars with named entities to make the code more
    readable. Entities in String literals would be translated to binary
    at compile time, so the entities would not exist at run-time. The
    HTML 5 set would be greatly extended to give pretty well every Unicode
    glyph a name.
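    For comparison, here is a hedged sketch of what already works without
    any language change. The code point is supplied as a plain int at run
    time, through StringBuilder.appendCodePoint or the
    String(int[], int, int) constructor, both of which exist in the JDK.
    The class name SupplementaryStrings and the method of() are
    hypothetical. Note that neither route gives a compile-time constant,
    which is what the proposals above are really after.

```java
final class SupplementaryStrings {
    /** Builds a one-code-point String at run time; no surrogate arithmetic needed. */
    static String of(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        String fromBuilder = of(0x10C22);
        String fromArray = new String(new int[] {0x10C22}, 0, 1);
        String literal = "\uD803\uDC22"; // the surrogate pair written out by hand

        System.out.println(fromBuilder.equals(literal)); // true
        System.out.println(fromArray.equals(literal));   // true
        System.out.println(literal.length());            // 2 chars
        System.out.println(literal.codePointCount(0, literal.length())); // 1 code point

        // Point 2 above: five hex digits are read as a 4-digit escape plus '5'.
        System.out.println("\u12345".length());          // 2
    }
}
```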

    P.S. I have been poking around in HTML 5. W3C did an odd thing. They
    REDEFINED the entities &lang; and &rang; to different glyphs from HTML
    4. I don't think they have ever done anything like that before. I
    hope it was just an error. I have written the W3C asking if they
    really meant to do that.



     
    Roedy Green, Dec 23, 2009
    #2

  3. Roedy Green

    Roedy Green Guest

    On Tue, 22 Dec 2009 18:01:17 -0800, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >I started to think about what would be needed to make this less
    >onerous.


    If you had only a few, you could create a library of named constants for
    them, and glue them together with compile-time concatenation. With
    only a little cleverness, a compiler would avoid embedding constants
    it did not use.
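    A minimal sketch of that idea follows. The class Glyphs and the
    constant names are invented for illustration; only two code points
    are shown. Because each field is a static final String initialised
    from a literal, it is a compile-time constant, and a concatenation of
    such constants is folded by the compiler as well.

```java
final class Glyphs {
    private Glyphs() {}

    // Surrogate pairs worked out once, then referred to by name.
    static final String ORKHON_EM = "\uD803\uDC22"; // U+10C22
    static final String C_CLEF = "\uD834\uDD21";    // U+1D121
}

final class GlyphsDemo {
    // A constant expression: the compiler joins the pieces at compile time.
    static final String LABEL = "clef: " + Glyphs.C_CLEF;

    public static void main(String[] args) {
        // The clef starts at index 6, right after the six chars of "clef: ".
        System.out.println(Integer.toHexString(LABEL.codePointAt(6))); // 1d121
    }
}
```

    As far as I know, javac copies a constant's value into each class
    that uses it, so a client class carries only the constants it
    actually references.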


    Is any OS, JVM, utility, browser etc. capable of rendering a code
    point above 0xffff? I get the impression all we can do is embed them
    in UTF-8 files.
     
    Roedy Green, Dec 23, 2009
    #3
  4. Thomas Pornin <> quoted the JLS section 3.1:
    ><< The Unicode standard was originally designed as a fixed-width 16-bit
    > character encoding. It has since been changed to allow for characters
    > whose representation requires more than 16 bits. The range of legal
    > code points is now U+0000 to U+10FFFF


    I have problems understanding why the surrogate code points are counted
    twice: once as their code points isolated and then again as the code-points
    that are reached by an adjacent pair of them.

    In my understanding that would make 0x10F7FF really legal codepoints, as
    the surrogates wouldn't be legal as single code points, but only as pairs.

    But then again, perhaps my own understanding of "legal code points" just
    differs from some common definition.
     
    Andreas Leitgeb, Dec 23, 2009
    #4
  5. Mayeul

    Mayeul Guest

    Andreas Leitgeb wrote:
    > Thomas Pornin <> quoted the JLS section 3.1:
    >> << The Unicode standard was originally designed as a fixed-width 16-bit
    >> character encoding. It has since been changed to allow for characters
    >> whose representation requires more than 16 bits. The range of legal
    >> code points is now U+0000 to U+10FFFF

    >
    > I have problems understanding why the surrogate code points are counted
    > twice: once as their code points isolated and then again as the code-points
    > that are reached by an adjacent pair of them.


    It makes defining UTF-16 easy and less error-prone.

    Yet I guess the range of legal codepoints is still U+0000 to
    U+10FFFF, excluding the surrogates range in the middle.

    --
    Mayeul
     
    Mayeul, Dec 23, 2009
    #5
  6. Tom Anderson

    Tom Anderson Guest

    On Wed, 23 Dec 2009, Andreas Leitgeb wrote:

    > Thomas Pornin <> quoted the JLS section 3.1:
    >> << The Unicode standard was originally designed as a fixed-width 16-bit
    >> character encoding. It has since been changed to allow for characters
    >> whose representation requires more than 16 bits. The range of legal
    >> code points is now U+0000 to U+10FFFF

    >
    > I have problems understanding why the surrogate code points are counted
    > twice: once as their code points isolated and then again as the code-points
    > that are reached by an adjacent pair of them.


    The range is a bound - all legal code points are inside it. It doesn't
    mean that all numbers inside it are legal code points. There are plenty of
    numbers which aren't mapped to any character, and so aren't legal code
    points - the surrogates are just a special case of those. I reckon.
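    The distinction can be checked against the JDK, as a rough sketch.
    Character.isValidCodePoint only tests the bound U+0000..U+10FFFF, so
    it accepts surrogates; excluding them takes a separate test. The
    class name CodePointRange and its two methods are hypothetical, and
    the sketch says nothing about unassigned code points.

```java
final class CodePointRange {
    /** True for values inside the bound that are not surrogates. */
    static boolean isScalarValue(int cp) {
        return Character.isValidCodePoint(cp)
                && (cp < Character.MIN_SURROGATE || cp > Character.MAX_SURROGATE);
    }

    /** Counts the non-surrogate values in U+0000..U+10FFFF. */
    static int countScalarValues() {
        int n = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isScalarValue(cp)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(Character.isValidCodePoint(0xD800)); // true: inside the bound
        System.out.println(isScalarValue(0xD800));              // false: a lone surrogate
        System.out.println(Integer.toHexString(countScalarValues())); // 10f800
    }
}
```

    The count is 0x110000 minus the 0x800 surrogates, i.e. 0x10F800
    values.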

    tom

    --
    X is for ... EXECUTION!
     
    Tom Anderson, Dec 23, 2009
    #6
  7. Tom Anderson <> wrote:
    >> Thomas Pornin <> quoted the JLS section 3.1:
    >>> << The Unicode standard was originally designed as a fixed-width 16-bit
    >>> character encoding. It has since been changed to allow for characters
    >>> whose representation requires more than 16 bits. The range of legal
    >>> code points is now U+0000 to U+10FFFF

    >> I have problems understanding why the surrogate code points are counted
    >> twice: once as their code points isolated and then again as the code-points
    >> that are reached by an adjacent pair of them.

    > The range is a bound - all legal code points are inside it. It doesn't
    > mean that all numbers inside it are legal code points. There are plenty of
    > numbers which aren't mapped to any character, and so aren't legal code
    > points - the surrogates are just a special case of those. I reckon.


    Thanks, that was my catch: I somehow mistakenly took "range" as implying
    "all in the range" - and a codepoint with no char mapped to it wasn't
    necessarily illegal in my mind, but a single surrogate was.
     
    Andreas Leitgeb, Dec 23, 2009
    #7
  8. Roedy Green

    Roedy Green Guest

    On Wed, 23 Dec 2009 09:40:04 +0000, Steven Simpson <>
    wrote, quoted or indirectly quoted someone who said :

    >
    >IIRC, C99 introduced \uXXXX and \UXXXXXXXX.


    It would make sense to follow suit. Life is complicated enough already
    for people who code in more than one language each day.
     
    Roedy Green, Dec 23, 2009
    #8
  9. On 2009-12-23 03:30:29 -0500, Roedy Green
    <> said:

    > On Tue, 22 Dec 2009 18:01:17 -0800, Roedy Green
    > <> wrote, quoted or indirectly quoted
    > someone who said :
    >
    >> I started to think about what would be needed to make this less
    >> onerous.

    >
    > If you had only a few, you could create library of named constants for
    > them, and glue them together with compile time concatenation. With
    > only a little cleverness, a compiler would avoid embedding constants
    > it did not use.
    >
    >
    > Is any OS, JVM, utility, browser etc. capable of rendering a code
    > point above 0xffff? I get the impression all we can do is embed them
    > in UTF-8 files.


    OS X comes with fonts that contain glyphs for some (but not all)
    characters above U+FFFF out of the box, and can render them anywhere
    they appear. Their visibility in Swing apps depends heavily on the L&F;
    if you don't force it, Java will default to the Aqua L&F and render
    most things correctly.

    Webapps, obviously, render nothing; they send encoded characters to
    other things, which may render them. Safari, Chrome, and Firefox can
    all render U+1D360 (COUNTING ROD UNIT DIGIT ONE).

    In the interests of science, what characters do you see on the next line?

    𐄀 𐅀 𐆐 𐌀 𐐀 𐑐 𝄡

    This message is encoded as UTF-8, and those should be, in order,

    Codepoint (UTF-8 representation) NAME
    U+10100 (F0 90 84 80) AEGEAN WORD SEPARATOR LINE
    U+10140 (F0 90 85 80) GREEK ACROPHONIC ATTIC ONE QUARTER
    U+10190 (F0 90 86 90) ROMAN SEXTANS SIGN
    U+10300 (F0 90 8C 80) OLD ITALIC LETTER A
    U+10400 (F0 90 90 80) DESERET CAPITAL LETTER LONG I
    U+10450 (F0 90 91 90) SHAVIAN LETTER PEEP
    U+1D121 (F0 9D 84 A1) MUSICAL SYMBOL C CLEF

    with spaces between.
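    The byte sequences in the table can be reproduced with a short
    sketch. It assumes nothing beyond the JDK; the class name Utf8Bytes
    and the method utf8Hex are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bytes {
    /** Returns the UTF-8 encoding of one code point as spaced hex, e.g. "F0 90 84 80". */
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = {0x10100, 0x10140, 0x10190, 0x10300, 0x10400, 0x10450, 0x1D121};
        for (int cp : codePoints) {
            System.out.println(String.format("U+%04X (%s) %s",
                    cp, utf8Hex(cp), Character.getName(cp)));
        }
    }
}
```

    Character.getName needs Java 7 or later, and the names it returns
    depend on the Unicode version bundled with the JRE.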

    Cheers,
    -o
     
    Owen Jacobson, Dec 24, 2009
    #9
  10. Roedy Green

    Guest

    On Dec 24, 4:55 am, Owen Jacobson <> wrote:
    ....
    > In the interests of science, what characters do you see on the next line?
    >
    > 𐄀 𐅀 𐆐 𐌀 𐐀 𐑐 𝄡


    Debian Lenny / browser Iceweasel 3.0.6 (Firefox re-branded for true
    freedom ;)
    I see boxes with tiny hexcode in them not corresponding to the
    characters.

    But then I can select them and paste them into an xterm, where I see
    only '? ? ? ? ?' thingies. Yet the file I pasted them into from the
    terminal (using cat > aa.txt) contains the correct characters, as
    shown by a hexdump:

    $ hexdump aa.txt
    0000000 90f0 8084 f020 8590 2080 90f0 9086 f020
    0000010 8c90 2080 90f0 8090 f020 9190 2090 9df0
    0000020 a184 000a
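    The dump is easier to compare with the table once the bytes are put
    back in file order: plain hexdump prints 16-bit words with the low
    byte first on a little-endian machine. The sketch below is one way to
    undo that and decode the result. The names HexdumpWords,
    bytesFromWords and describe are hypothetical, and it assumes the
    final 00 byte is hexdump's padding for an odd-length file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class HexdumpWords {
    // The words from the dump above, with the offset column removed.
    static final String DUMP =
            "90f0 8084 f020 8590 2080 90f0 9086 f020 "
          + "8c90 2080 90f0 8090 f020 9190 2090 9df0 "
          + "a184 000a";

    /** Puts each 16-bit word back into file order (low byte first). */
    static byte[] bytesFromWords(String words) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String w : words.trim().split("\\s+")) {
            int v = Integer.parseInt(w, 16);
            out.write(v & 0xFF);         // low byte is first in the file
            out.write((v >>> 8) & 0xFF); // then the high byte
        }
        byte[] all = out.toByteArray();
        // Assumption: a trailing 00 is padding added for an odd-length file.
        boolean padded = all.length > 0 && all[all.length - 1] == 0;
        return Arrays.copyOf(all, padded ? all.length - 1 : all.length);
    }

    /** Decodes UTF-8 and lists the code points, skipping spaces and newlines. */
    static String describe(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == ' ' || cp == '\n') {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(bytesFromWords(DUMP)));
    }
}
```

    Running hexdump -C instead would show the bytes in file order
    directly.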

    :)
     
    , Dec 24, 2009
    #10
  11. Roedy Green

    Guest

    On Dec 22, 9:47 pm, Thomas Pornin <> wrote:
    > ...
    > ...(ASCII works everywhere...


    This

    Here we've got a mix of Windows, Linux and OS X
    devs, so we're using scripts called at (Ant) build time that
    enforce that all .java files:

    a) use a subset of ASCII in their names
    b) contain only ASCII characters

    You can't build an app with non-ASCII characters in our
    .java files and you certainly can't commit them :)

    It's in the guidelines.

    Better safe than sorry :)
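    A check along those lines might look like the sketch below. It is not
    the script described above, whose details are not given: the class
    name AsciiOnlyCheck and the method firstNonAscii are invented, it
    covers only rule b), and it treats every byte from 0x00 to 0x7F as
    acceptable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class AsciiOnlyCheck {
    /** Returns the offset of the first byte outside 7-bit ASCII, or -1 if there is none. */
    static int firstNonAscii(byte[] content) {
        for (int i = 0; i < content.length; i++) {
            if (content[i] < 0) { // bytes 0x80..0xFF are negative in Java
                return i;
            }
        }
        return -1;
    }

    /** Checks each file named on the command line; exit status 1 if any fails. */
    public static void main(String[] args) throws IOException {
        int failures = 0;
        for (String name : args) {
            int at = firstNonAscii(Files.readAllBytes(Paths.get(name)));
            if (at >= 0) {
                System.err.println(name + ": non-ASCII byte at offset " + at);
                failures++;
            }
        }
        System.exit(failures == 0 ? 0 : 1);
    }
}
```

    A non-zero exit status is what would let a build step or a commit
    hook refuse the file.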
     
    , Dec 24, 2009
    #11
  12. Tom Anderson

    Tom Anderson Guest

    On Wed, 23 Dec 2009, Owen Jacobson wrote:

    > In the interests of science, what characters do you see on the next line?
    >
    > ? ? ? ? ? ? ?


    Seven question marks.

    Using Alpine 1.10 on Debian 5.0.3 accessed over OpenSSH 5.1p1 from iTerm
    0.10 on OS X 10.4.11. Plus a few more layers i've forgotten, probably.
    Easily enough for one of them to drop the unicode ball somewhere!

    tom

    --
    science fiction, old TV shows, sports, food, New York City topography,
    and golden age hiphop
     
    Tom Anderson, Dec 24, 2009
    #12
  13. markspace

    markspace Guest

    Owen Jacobson wrote:

    > In the interests of science, what characters do you see on the next line?
    >
    > 𐄀 𐅀 𐆐 𐌀 𐐀 𐑐 𝄡


    6 question marks and a [1/4].

    I bet this has more to do with the news server we're each using than our
    client's OS or newsreader. Vista/Thunderbird here.
     
    markspace, Dec 24, 2009
    #13
  14. Roedy Green

    Roedy Green Guest

    On Wed, 23 Dec 2009 22:55:48 -0500, Owen Jacobson
    <> wrote, quoted or indirectly quoted someone
    who said :

    >
    >In the interests of science, what characters do you see on the next line?
    >
    >? ? ? ? ? ? ?


    Using Agent with Windows 7 64 bit I just see ? marks.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    If you give someone a program, you will frustrate them for a day; if you teach them how to program, you will frustrate them for a lifetime.
     
    Roedy Green, Dec 28, 2009
    #14
