Why No Supplemental Characters In Character Literals?

Discussion in 'Java' started by Lawrence D'Oliveiro, Feb 4, 2011.

  1. Why was it decreed in the language spec that characters beyond U+FFFF are
    not allowed in character literals, when they are allowed everywhere else (in
    string literals, in the program text, in character and string values etc)?
    Lawrence D'Oliveiro, Feb 4, 2011
    #1

  2. Lew Guest

    On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    > Why was it decreed in the language spec that characters beyond U+FFFF are
    > not allowed in character literals, when they are allowed everywhere else (in
    > string literals, in the program text, in character and string values etc)?


    Because a 'char' type holds only 16 bits.

    --
    Lew
    Ceci n'est pas une fenêtre.
    [ASCII art of a window]
    Lew, Feb 4, 2011
    #2

  3. In message <iig6j2$dul$>, Lew wrote:

    > On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >
    >> Why was it decreed in the language spec that characters beyond U+FFFF are
    >> not allowed in character literals, when they are allowed everywhere else
    >> (in string literals, in the program text, in character and string values
    >> etc)?

    >
    > Because a 'char' type holds only 16 bits.


    No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters in
    character and string values. Which you are.
    Lawrence D'Oliveiro, Feb 4, 2011
    #3
  4. "Lawrence D'Oliveiro" <_zealand> wrote in message
    news:iig84e$uqu$...
    > In message <iig6j2$dul$>, Lew wrote:
    >
    >> On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >>
    >>> Why was it decreed in the language spec that characters beyond U+FFFF
    >>> are
    >>> not allowed in character literals, when they are allowed everywhere else
    >>> (in string literals, in the program text, in character and string values
    >>> etc)?

    >>
    >> Because a 'char' type holds only 16 bits.

    >
    > No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters
    > in
    > character and string values. Which you are.


    Yes, it does (contain 16 bits). It was defined to do so before there were
    supplemental characters, and there was no way to extend it without breaking
    compatibility with some older programs.

    You can't put a supplementary character in a char. You can put them in
    strings, but only encoded as UTF-16, i.e. into two 16-bit chars.
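
    For instance (a minimal sketch, not from the original post; U+1D11E
    MUSICAL SYMBOL G CLEF stands in as an arbitrary supplementary
    character):

    public class SupplementaryDemo
    {
        public static void main( String[] args )
        {
            // One supplementary character, written as its surrogate pair:
            String s = "\uD834\uDD1E";
            System.out.println( s.length() );                        // 2 code units
            System.out.println( s.codePointCount( 0, s.length() ) ); // 1 code point
            // char c = '\uD834\uDD1E';  // won't compile: two code units, one char
        }
    }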
    Mike Schilling, Feb 4, 2011
    #4
  5. Lew Guest

    Lawrence D'Oliveiro wrote:
    >>>> Why was it decreed in the language spec that characters beyond U+FFFF are
    >>>> not allowed in character literals, when they are allowed everywhere else
    >>>> (in string literals, in the program text, in character and string values
    >>>> etc)?


    It takes TWO 'char' values to represent a supplemental character. 'char' !=
    "character".

    READ the documentation.

    Lew wrote:
    >>> Because a 'char' type holds only 16 bits.


    Lawrence D'Oliveiro wrote:
    >> No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters in
    >> character and string values. Which you are.


    I have an idea for you to try - check the documentation.
    <http://java.sun.com/docs/books/jls/third_edition/html/typesValues.html#4.2.1>

    where you will see in §4.2: "... char, whose values are 16-bit unsigned integers ..."

    Mike Schilling wrote:
    > Yes, it does (contain 16 bits.) It was defined to do so before there were
    > supplemental characters, and there was no way to extend it without breaking
    > compatibility with some older programs.
    >
    > You can't put a supplementary character in a char. You can put them in
    > strings, but only encoded as UTF-16, i.e. into two 16-bit chars.


    As the tutorials and JLS tell you, should you deign to read the documentation.
    (It's not a bad idea to do so.)
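
    A two-line check of that distinction (a sketch, using U+1D504 as an
    arbitrary supplementary code point):

    public class CharCountDemo
    {
        public static void main( String[] args )
        {
            // How many chars does each character need?
            System.out.println( Character.charCount( 0x00E9 ) );  // 1 -- é is BMP
            System.out.println( Character.charCount( 0x1D504 ) ); // 2 -- surrogate pair
        }
    }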

    --
    Lew
    Ceci n'est pas une fenêtre.
    [ASCII art of a window]
    Lew, Feb 4, 2011
    #5
  6. On 02/04/2011 01:59 AM, Lawrence D'Oliveiro wrote:
    > In message<iig6j2$dul$>, Lew wrote:
    >
    >> On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >>
    >>> Why was it decreed in the language spec that characters beyond U+FFFF are
    >>> not allowed in character literals, when they are allowed everywhere else
    >>> (in string literals, in the program text, in character and string values
    >>> etc)?

    >>
    >> Because a 'char' type holds only 16 bits.

    >
    > No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters in
    > character and string values. Which you are.


    The JLS clearly states that a char is an unsigned 16-bit value. Non-BMP
    Unicode characters cannot fit in a single unsigned 16-bit value. In the
    other contexts you can use non-BMP characters because, e.g., a String is
    not a single 16-bit value but an array of them, and can therefore safely
    hold a surrogate pair.
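
    Concretely, at the literal level (a sketch; the commented-out line is
    the one the spec rejects):

    public class LiteralDemo
    {
        public static void main( String[] args )
        {
            String ok = "\uD835\uDD04"; // U+1D504 in a String: two code units, fine
            char hi = '\uD835';         // a lone surrogate is still a legal char
            // char bad = '\uD835\uDD04'; // rejected: two code units in one char
            System.out.println( ok + " " + (int) hi );
        }
    }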

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Feb 4, 2011
    #6
  7. On 04-02-2011 01:59, Lawrence D'Oliveiro wrote:
    > In message<iig6j2$dul$>, Lew wrote:
    >> On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >>> Why was it decreed in the language spec that characters beyond U+FFFF are
    >>> not allowed in character literals, when they are allowed everywhere else
    >>> (in string literals, in the program text, in character and string values
    >>> etc)?

    >>
    >> Because a 'char' type holds only 16 bits.

    >
    > No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters in
    > character and string values. Which you are.


    It is very clearly specified that a Java char is 16 bit.

    You can't have the codepoints above U+FFFF in a char.

    You can have them in a string, but then they actually take
    two chars in that string.

    It is rather messy.

    If you look at the Java docs for String class you will see:

    charAt & codePointAt
    length & codePointCount

    which is not a nice API.

    But since codepoints above U+FFFF were added after the String
    class was defined, the options for how to handle it were
    pretty limited.
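
    A sketch of that split in action ('A', then U+1D504, then 'B'):

    public class CodePointApiDemo
    {
        public static void main( String[] args )
        {
            String s = "A\uD835\uDD04B";
            System.out.println( s.length() );                        // 4 code units
            System.out.println( s.codePointCount( 0, s.length() ) ); // 3 characters
            System.out.println( (int) s.charAt( 1 ) );  // 55349 (0xD835): a lone surrogate
            System.out.println( s.codePointAt( 1 ) );   // 120068 (0x1D504): the full code point
        }
    }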

    Arne
    Arne Vajhøj, Feb 4, 2011
    #7
  8. "Arne Vajhøj" <> wrote in message
    news:4d4c2019$0$23753$...
    > On 04-02-2011 01:59, Lawrence D'Oliveiro wrote:
    >> In message<iig6j2$dul$>, Lew wrote:
    >>> On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >>>> Why was it decreed in the language spec that characters beyond U+FFFF
    >>>> are
    >>>> not allowed in character literals, when they are allowed everywhere
    >>>> else
    >>>> (in string literals, in the program text, in character and string
    >>>> values
    >>>> etc)?
    >>>
    >>> Because a 'char' type holds only 16 bits.

    >>
    >> No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters
    >> in
    >> character and string values. Which you are.

    >
    > It is very clearly specified that a Java char is 16 bit.
    >
    > You can't have the codepoints above U+FFFF in a char.
    >
    > You can have them in a string, but then they actually take
    > two chars in that string.
    >
    > It is rather messy.
    >
    > If you look at the Java docs for String class you will see:
    >
    > charAt & codePointAt
    > length & codePointCount
    >
    > which is not a nice API.
    >
    > But since codepoints above U+FFFF were added after the String
    > class was defined, the options for how to handle it were
    > pretty limited.


    The sticky issue is, I think, that chars were defined as 16-bit. If that
    had been left undefined, they could have been extended to 24 bits, which
    would make things nice and regular again.
    Mike Schilling, Feb 4, 2011
    #8
  9. On 04-02-2011 12:10, Mike Schilling wrote:
    >
    >
    > "Arne Vajhøj" <> wrote in message
    > news:4d4c2019$0$23753$...
    >> On 04-02-2011 01:59, Lawrence D'Oliveiro wrote:
    >>> In message<iig6j2$dul$>, Lew wrote:
    >>>> On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >>>>> Why was it decreed in the language spec that characters beyond
    >>>>> U+FFFF are
    >>>>> not allowed in character literals, when they are allowed everywhere
    >>>>> else
    >>>>> (in string literals, in the program text, in character and string
    >>>>> values
    >>>>> etc)?
    >>>>
    >>>> Because a 'char' type holds only 16 bits.
    >>>
    >>> No it doesn’t. Otherwise you wouldn’t be allowed supplementary
    >>> characters in
    >>> character and string values. Which you are.

    >>
    >> It is very clearly specified that a Java char is 16 bit.
    >>
    >> You can't have the codepoints above U+FFFF in a char.
    >>
    >> You can have them in a string, but then they actually take
    >> two chars in that string.
    >>
    >> It is rather messy.
    >>
    >> If you look at the Java docs for String class you will see:
    >>
    >> charAt & codePointAt
    >> length & codePointCount
    >>
    >> which is not a nice API.
    >>
    >> But since codepoints above U+FFFF were added after the String
    >> class was defined, the options for how to handle it were
    >> pretty limited.

    >
    > The sticky issue is, I think, that chars were defined as 16-bit. If that
    > had been left undefined, they could have been extended to 24 bits, which
    > would make things nice and regular again.


    Yes.

    But having specific bit lengths for all types was a huge jump
    forward compared to C89 in the predictability of what code
    would do.

    Arne
    Arne Vajhøj, Feb 4, 2011
    #9
  10. On 04/02/2011 16:49, Arne Vajhøj allegedly wrote:
    > It is very clearly specified that a Java char is 16 bit.
    >
    > You can't have the codepoints above U+FFFF in a char.
    >
    > You can have them in a string, but then they actually take
    > two chars in that string.
    >
    > It is rather messy.
    >
    > If you look at the Java docs for String class you will see:
    >
    > charAt & codePointAt
    > length & codePointCount
    >
    > which is not a nice API.
    >
    > But since codepoints above U+FFFF were added after the String
    > class was defined, the options for how to handle it were
    > pretty limited.


    They've added supplementary character support to String, StringBuilder,
    StringBuffer.

    Pity they haven't touched upon java.lang.CharSequence. Probably out of
    concerns about compatibility.

    Anyone got an idea how supplementary character support could be
    integrated with CharSequence, or more generally, with an interface
    describing a sequence of code points? Creating a sub-interface, e.g.
    UnicodeSequence with int codePointAt(int), etc., doesn't seem like it'd
    do the trick, since a UnicodeSequence /is-not/ a CharSequence (char
    charAt(int) doesn't make sense for a UnicodeSequence). Adding a new
    interface would mean you don't get the interoperability with all the
    parts of the API that use CharSequences... The only option would seem
    to be to refactor CharSequence and all the classes that use or implement
    it. Which means no backwards compatibility.

    Bloody mess this is.
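
    For what it's worth, a hypothetical sketch of that shape (nothing like
    it exists in the JDK; the names beyond codePointAt are invented here):

    // Hypothetical only; deliberately NOT a CharSequence:
    // indices count code points, not chars.
    interface UnicodeSequence
    {
        int codePointAt( int index );  // index in code points
        int codePointCount();          // invented, by analogy with String
        CharSequence asCharSequence(); // explicit bridge to the legacy API
    }

    The explicit bridge method is the trade-off in miniature: you get a
    coherent code-point view, but every existing API that takes a
    CharSequence needs a conversion call.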

    --
    DF.
    Daniele Futtorovic, Feb 4, 2011
    #10
  11. Roedy Green Guest

    On Fri, 04 Feb 2011 18:59:30 +1300, Lawrence D'Oliveiro
    <_zealand> wrote, quoted or indirectly quoted
    someone who said :

    >Why was it decreed in the language spec that characters beyond U+FFFF are
    >not allowed in character literals, when they are allowed everywhere else (in
    >string literals, in the program text, in character and string values etc)?


    Because they did not exist at the time Java was invented. Extended
    literals were tacked on to the 16-bit internal scheme in a somewhat
    half-hearted way. To go to full 32-bit internally would gobble RAM
    hugely.

    Java does not have 32-bit string literals, i.e. C-style code points
    such as \U0001d504 (note the capital U vs. the usual \ud504). I wrote
    the SurrogatePair applet (see
    http://mindprod.com/applet/surrogatepair.html)
    to convert C-style code points to the arcane surrogate pairs, to let
    you use 32-bit Unicode glyphs in your programs.
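
    The standard library will do the same conversion (a minimal sketch,
    reusing U+1D504 from the example above):

    public class ToCharsDemo
    {
        public static void main( String[] args )
        {
            // Code point -> UTF-16 surrogate pair
            char[] pair = Character.toChars( 0x1D504 );
            System.out.printf( "\\u%04X\\u%04X%n",
                (int) pair[0], (int) pair[1] );  // prints \uD835\uDD04
        }
    }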


    Personally, I don’t see the point of any great rush to support 32-bit
    Unicode. The new symbols will be rarely used. Consider what’s there.
    The only ones I would conceivably use are musical symbols and
    Mathematical Alphanumeric Symbols (especially the German black letters
    so favoured in real analysis). The rest I can’t imagine ever using
    unless I took up a career in anthropology, e.g. the Linear B syllabary (I
    have not a clue what it is), Linear B ideograms (looks like symbols
    for categorising cave petroglyphs), Aegean Numbers (counting with
    stones and sticks), Old Italic (looks like Phoenician), Gothic
    (medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian
    (George Bernard Shaw’s phonetic script), Osmanya (Somalian), Cypriot
    syllabary, Byzantine music symbols (looks like Arabic), Musical
    Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK
    extensions (Chinese/Japanese/Korean) and tags (letters with blank
    “price tags”).


    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    To err is human, but to really foul things up requires a computer.
    ~ Farmer's Almanac
    It is breathtaking how a misplaced comma in a computer program can
    shred megabytes of data in seconds.
    Roedy Green, Feb 4, 2011
    #11
  12. Roedy Green Guest

    On Fri, 04 Feb 2011 08:04:23 -0500, Joshua Cranmer
    <> wrote, quoted or indirectly quoted someone
    who said :

    >The JLS clearly states that a char is an unsigned 16-bit value.


    Perhaps char will be redefined as 32 bits, or a new unsigned 32-bit
    echar type will be invented.

    It is an intractable problem. Consider the logic that uses indexOf and
    substring with character-index arithmetic. Most of it would go insane
    if you threw a few 32-bit chars in there. You need something that
    simulates an array of 32-bit chars to the programmer.
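
    A sketch of that breakage with today's 16-bit chars (U+1D11E again):

    public class IndexArithmeticDemo
    {
        public static void main( String[] args )
        {
            String s = "\uD834\uDD1E clef";            // one character, then " clef"
            System.out.println( s.indexOf( ' ' ) );    // 2 -- a code-unit index
            System.out.println( s.substring( 0, 1 ) ); // a lone surrogate: broken text
        }
    }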

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    To err is human, but to really foul things up requires a computer.
    ~ Farmer's Almanac
    It is breathtaking how a misplaced comma in a computer program can
    shred megabytes of data in seconds.
    Roedy Green, Feb 4, 2011
    #12
  13. On 02/04/2011 12:10 PM, Mike Schilling wrote:
    > "Arne Vajhøj" <> wrote in message
    >> But since codepoints above U+FFFF was added after the String
    >> class was defined, then the options on how to handle it were
    >> pretty limited.

    >
    > The sticky issue is, I think, that chars were defined as 16-bit. If that
    > had been left undefined, they could have been extended to 24 bits, which
    > would make things nice and regular again.


    Well, the real problem is that Unicode swore that 16 bits were enough
    for everybody, so people opted for the UTF-16 encoding in Unicode-aware
    platforms (e.g., Windows uses 16-bit char values for wchar_t). When they
    backtracked and expanded the range to 21 bits, every system that did
    UTF-16 was screwed, because UTF-16 "kind of" becomes a
    variable-width format like UTF-8... but not really. Instead you get a
    mess with surrogate characters, the distinction between UTF-16 and
    UCS-2, and, in short, anything not in the Basic Multilingual Plane is a
    recipe for disaster.

    Extending to 24 bits is problematic because 24 bits opens you up to
    unaligned memory access on most, if not all, platforms, so you'd have to
    go fully up to 32 bits (this is what the codePoint methods in String et
    al. do). But considering the sheer number of Strings in memory, going to
    32-bit storage doubles the size of that data... and can increase total
    memory consumption in some cases by 30-40%.

    To make a long story short: Unicode made a very, very big mistake, and
    everyone who designed their systems to be particularly i18n-aware before
    that is now really smarting as a result.

    It actually is possible to change the internal storage of String to a
    UTF-8 representation (while keeping UTF-16/UTF-32 API access) and still
    get good performance--people mostly use direct indexes into strings in
    largely consistent access patterns (e.g., str.substring(str.indexOf(":")
    + 1)), so you can cache index lookup tables for a few values. It's ugly
    as hell to code properly, taking into account proper multithreading,
    etc., but it is not impossible.
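
    A toy of that idea (all names invented; the cache is a single entry,
    and thread safety and validation are deliberately elided):

    import java.nio.charset.Charset;

    // Stores UTF-8 bytes and remembers the byte offset of the last code
    // point index it resolved, so forward scans stay cheap.
    final class Utf8Backed
    {
        private static final Charset UTF8 = Charset.forName( "UTF-8" );

        private final byte[] bytes;
        private int lastCp = 0, lastByte = 0;   // one-entry index cache

        Utf8Backed( String s )
        {
            bytes = s.getBytes( UTF8 );
        }

        int codePointAt( int cpIndex )
        {
            int cp = 0, b = 0;
            if ( cpIndex >= lastCp ) { cp = lastCp; b = lastByte; }  // reuse the cache
            while ( cp < cpIndex )
            {
                int lead = bytes[b] & 0xFF;  // sequence length from the UTF-8 lead byte
                b += lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
                cp++;
            }
            lastCp = cpIndex;
            lastByte = b;
            // Decode just the sequence that starts at b:
            return new String( bytes, b, Math.min( 4, bytes.length - b ), UTF8 )
                .codePointAt( 0 );
        }
    }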

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Feb 4, 2011
    #13
  14. markspace Guest

    On 2/4/2011 10:36 AM, Roedy Green wrote:
    > On Fri, 04 Feb 2011 08:04:23 -0500, Joshua Cranmer
    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    >> The JLS clearly states that a char is an unsigned 16-bit value.

    >
    > Perhaps char will be redefined as 32 bits, or a new unsigned 32-bit
    > echar type will be invented.



    An int is currently used for this purpose. For example,
    Character.codePointAt(CharSequence,int) returns an int.


    <http://download.oracle.com/javase/6/docs/api/java/lang/Character.html>


    Also, from that same page, this explains the whole story in one go:


    "Unicode Character Representations

    "The char data type (and therefore the value that a Character object
    encapsulates) are based on the original Unicode specification, which
    defined characters as fixed-width 16-bit entities. The Unicode standard
    has since been changed to allow for characters whose representation
    requires more than 16 bits. The range of legal code points is now U+0000
    to U+10FFFF, known as Unicode scalar value. (Refer to the definition of
    the U+n notation in the Unicode standard.)

    "The set of characters from U+0000 to U+FFFF is sometimes referred to as
    the Basic Multilingual Plane (BMP). Characters whose code points are
    greater than U+FFFF are called supplementary characters. The Java 2
    platform uses the UTF-16 representation in char arrays and in the String
    and StringBuffer classes. In this representation, supplementary
    characters are represented as a pair of char values, the first from the
    high-surrogates range, (\uD800-\uDBFF), the second from the
    low-surrogates range (\uDC00-\uDFFF).

    "A char value, therefore, represents Basic Multilingual Plane (BMP) code
    points, including the surrogate code points, or code units of the UTF-16
    encoding. An int value represents all Unicode code points, including
    supplementary code points. The lower (least significant) 21 bits of int
    are used to represent Unicode code points and the upper (most
    significant) 11 bits must be zero.


    ....etc....
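
    That quoted passage, condensed into code (a minimal sketch):

    public class CodePointRangeDemo
    {
        public static void main( String[] args )
        {
            System.out.printf( "U+%04X%n", Character.MAX_CODE_POINT ); // U+10FFFF
            // 0x10FFFF needs 21 bits; the upper 11 bits of an int stay zero:
            System.out.println(
                Integer.toBinaryString( Character.MAX_CODE_POINT ).length() ); // 21
            System.out.println( Character.isSupplementaryCodePoint( 0x1D11E ) ); // true
        }
    }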
    markspace, Feb 4, 2011
    #14
  15. markspace Guest

    On 2/4/2011 9:37 AM, Daniele Futtorovic wrote:

    > Pity they haven't touched upon java.lang.CharSequence. Probably out of
    > concerns about compatibility.



    You know that Character has static methods for pulling code points out
    of a CharSequence, right?
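
    For example (a minimal sketch of those helpers):

    public class CharSequenceHelpersDemo
    {
        public static void main( String[] args )
        {
            CharSequence cs = "x\uD834\uDD1Ey";
            // Read the full code point whose pair starts at index 1:
            System.out.printf( "U+%04X%n", Character.codePointAt( cs, 1 ) ); // U+1D11E
            // Step over the whole pair, not just one char:
            System.out.println( Character.offsetByCodePoints( cs, 1, 1 ) );  // 3
        }
    }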
    markspace, Feb 4, 2011
    #15
  16. Lew Guest

    Lawrence D'Oliveiro wrote:
    >>> Why was it decreed in the language spec that characters beyond U+FFFF are
    >>> not allowed in character literals, when they are allowed everywhere else
    >>> (in string literals, in the program text, in character and string values
    >>> etc)?

    >


    Lew wrote:
    >> Because a 'char' type holds only 16 bits.

    >


    Lawrence D'Oliveiro wrote:
    > No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters in
    > character and string values. Which you are.
    >


    /* DemoChar */
    package eg;

    public class DemoChar
    {
        public static void main( String [] args )
        {
            // char arithmetic promotes to int, so this prints 65536:
            System.out.println( "Character.MAX_VALUE + 1 = "
                + (Character.MAX_VALUE + 1) );

            char foo1, foo2;
            foo1 = (char) (Character.MAX_VALUE - 1);  // 65534
            foo2 = (char) (foo1 / 2);                 // 32767
            System.out.println( "foo1 = "+ (int) foo1 +", foo2 = "+ (int) foo2 );

            foo1 = '§';  // U+00A7, value 167
            foo2 = '@';  // U+0040, value 64
            char sum = (char) (foo1 + foo2);  // 231 = 'ç'
            System.out.println( "foo1 + foo2 = "+ sum );
        }
    }

    --
    Lew
    Lew, Feb 4, 2011
    #16
  17. On 04/02/2011 20:27, markspace allegedly wrote:
    > On 2/4/2011 9:37 AM, Daniele Futtorovic wrote:
    >
    >> Pity they haven't touched upon java.lang.CharSequence. Probably out of
    >> concerns about compatibility.

    >
    >
    > You know that Character has static methods for pulling code points out
    > of a CharSequence, right?


    Yeah. But that's not quite the same thing, is it? What with OOP and all.
    Daniele Futtorovic, Feb 4, 2011
    #17
  18. On 04/02/2011 19:26, Roedy Green allegedly wrote:
    > On Fri, 04 Feb 2011 18:59:30 +1300, Lawrence D'Oliveiro
    > <_zealand> wrote, quoted or indirectly quoted
    > someone who said :
    >
    >> Why was it decreed in the language spec that characters beyond U+FFFF are
    >> not allowed in character literals, when they are allowed everywhere else (in
    >> string literals, in the program text, in character and string values etc)?

    >
    > Because they did not exist at the time Java was invented. Extended
    > literals were tacked on to the 16-bit internal scheme in a somewhat
    > half-hearted way. To go to full 32-bit internally would gobble RAM
    > hugely.
    >
    > Java does not have 32-bit string literals, i.e. C-style code points
    > such as \U0001d504 (note the capital U vs. the usual \ud504). I wrote
    > the SurrogatePair applet (see
    > http://mindprod.com/applet/surrogatepair.html)
    > to convert C-style code points to the arcane surrogate pairs, to let
    > you use 32-bit Unicode glyphs in your programs.
    >
    >
    > Personally, I don’t see the point of any great rush to support 32-bit
    > Unicode. The new symbols will be rarely used. Consider what’s there.
    > The only ones I would conceivably use are musical symbols and
    > Mathematical Alphanumeric symbols (especially the German black letters
    > so favoured in real analysis). The rest I can’t imagine ever using
    > unless I took up a career in anthropology, e.g. the Linear B syllabary (I
    > have not a clue what it is), Linear B ideograms (looks like symbols
    > for categorising cave petroglyphs), Aegean Numbers (counting with
    > stones and sticks), Old Italic (looks like Phoenician), Gothic
    > (medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian
    > (George Bernard Shaw’s phonetic script), Osmanya (Somalian), Cypriot
    > syllabary, Byzantine music symbols (looks like Arabic), Musical
    > Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK
    > extensions(Chinese Japanese Korean) and tags (letters with blank
    > “price tags”).


    And Klingon!

    --
    DF.
    Daniele Futtorovic, Feb 4, 2011
    #18
  19. Tom Anderson Guest

    Efficient unicode string implementation (was: Re: Why No Supplemental Characters In Character Literals?)

    On Fri, 4 Feb 2011, Joshua Cranmer wrote:

    >> "Arne Vajhøj" <> wrote in message
    >>
    >>> But since codepoints above U+FFFF were added after the String class was
    >>> defined, the options for how to handle it were pretty limited.

    >
    > Extending to 24 bits is problematic because 24 bits opens you up to
    > unaligned memory access on most, if not all, platforms, so you'd have to
    > go fully up to 32 bits (this is what the codePoint methods in String et
    > al. do). But considering the sheer number of Strings in memory, going to
    > 32-bit storage doubles the size of that data... and can increase total
    > memory consumption in some cases by 30-40%.


    This is something i ponder quite a lot.

    It's essential that computers be able to represent characters from any
    living human script. The astral planes include some such characters,
    notably in the CJK extensions, without which it is impossible to write
    some people's names correctly. The necessity of supporting more than 2**16
    codepoints is simply beyond question.

    The problem is how to do it efficiently.

    Going to strings of 24- or 32-bit characters would indeed be prohibitive
    in its effect on memory. But isn't 16-bit already an eye-watering waste?
    Most characters currently sitting in RAM around the world are, i would
    wager, in the ASCII range: the great majority of characters in almost any
    text in a latin script will be ASCII, in that they won't have diacritics
    [1] (and most text is still in latin script), and almost all characters in
    non-natural-language text (HTML and XML markup, configuration files,
    filesystem paths) will be ASCII. A sizeable fraction of non-latin text is
    still encodable in one byte per character, using a national character set.
    Forcing all users of programs written in Java (or any other platform which
    uses UCS-2 encoding) to spend two bytes on each of those characters to
    ease the lives of the minority of users who store a lot of CJK text seems
    wildly regressive.

    I am, however, at a loss to suggest a practical alternative!

    A question to the house, then: has anyone ever invented a data structure
    for strings which allows space-efficient storage for strings in different
    scripts, but also allows time-efficient implementation of the common
    string operations?

    Upthread, Joshua mentions the idea of using UTF-8 strings and caching
    codepoint-to-bytepoint mappings. That's certainly an approach that would
    work, although i worry about the performance effect of generating so many
    writes, the difficulty of making it correct in multithreaded systems, and
    the dependency on a good cache hit rate to make it pay off.

    Anyone else?

    For extra credit, give a representation which also makes it simple and
    efficient to do normalisation, reversal, and "find the first occurrence of
    this character, ignoring diacritics".

    tom

    [1] I would be interested to hear of a language (more properly, an
    orthography) using latin script in which a majority of characters, or even
    an unusually large fraction, do have diacritics. The pinyin romanisation
    of Mandarin uses a lot of accents. Hawaiian uses quite a lot. Some ways of
    writing ancient Greek use a lot of diacritics, for breathings and accents
    and in verse, for long and short syllables.

    --
    Understand the world we're living in
    Tom Anderson, Feb 4, 2011
    #19
  20. In message <iigcva$90q$-september.org>, Mike Schilling wrote:

    > Yes, it does (contain 16 bits).


    Yeah, I didn’t realize it was spelled out that way in the original language
    spec. What a short-sighted decision.

    > It was defined to do so before there were supplemental characters ...


    Why was there a need to define the size of a character at all? Even in the
    early days of the unification of Unicode and ISO-10646, there was already
    provision for UCS-4. Did they really think that could safely be ignored?
    Lawrence D'Oliveiro, Feb 4, 2011
    #20