Why No Supplemental Characters In Character Literals?

  • Thread starter Lawrence D'Oliveiro
J

Joshua Cranmer

A question to the house, then: has anyone ever invented a data structure
for strings which allows space-efficient storage for strings in
different scripts, but also allows time-efficient implementation of the
common string operations?

I think the real answer is that maybe we need to rethink traditional
string APIs. In particular, we have the issue of diacritics, since "A"
followed by a combining diacritic is basically one character stored in
3, 4, or 8 bytes, depending on the storage format.
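
For illustration, here is a minimal sketch of those byte counts, assuming
a Java 7+ JDK (which supplies the UTF-8, UTF-16BE and UTF-32BE charsets);
the class name is made up:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class CombiningSizes {
        public static void main(String[] args) {
            // "A" followed by U+0300 COMBINING GRAVE ACCENT: one
            // user-perceived character, two code points.
            String s = "A\u0300";
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 3
            System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 4
            System.out.println(s.getBytes(Charset.forName("UTF-32BE")).length); // 8
        }
    }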

I would be surprised if there weren't already some studies on the impact
of using UTF-8 based strings in UTF-16/-32-ish contexts.
 
L

Lawrence D'Oliveiro

<http://download.oracle.com/javase/6/docs/api/java/lang/Character.html>

"The char data type (and therefore the value that a Character object
encapsulates) are based on the original Unicode specification, which
defined characters as fixed-width 16-bit entities.

When did the unification with ISO-10646 happen? That was already talking
about 32-bit characters.

"A char value, therefore, represents Basic Multilingual Plane (BMP) code
points, including the surrogate code points, or code units of the UTF-16
encoding. An int value represents all Unicode code points, including
supplementary code points.

Why was there even a need to spell out the size of a char? If you wanted
types with explicit sizes, there was already byte, short, int and long.
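
As a side note, a small sketch of the char/int distinction the javadoc
above is drawing (assumes Java 7+ for Character.isBmpCodePoint; the class
name is made up):

    public class CharVsCodePoint {
        public static void main(String[] args) {
            // MATHEMATICAL FRAKTUR CAPITAL A, a supplementary code point
            int fraktur = 0x1D504;
            System.out.println(Character.isBmpCodePoint(fraktur));           // false
            System.out.println(Character.isSupplementaryCodePoint(fraktur)); // true
            System.out.println(Character.charCount(fraktur));                // 2 chars needed
            // char c = (char) fraktur; // compiles, but silently truncates to 0xD504
        }
    }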
 
L

Lawrence D'Oliveiro

Personally, I don’t see the point of any great rush to support 32-bit
Unicode. ... The rest I can’t imagine ever using unless I took up a career
in anthropology ...

But you, or another programmer, might work for an anthropologist. The
computer is a universal machine, after all. If a programming language can’t
support that universality, what good is it?
 
A

Arne Vajhøj

When did the unification with ISO-10646 happen? That was already talking
about 32-bit characters.


Why was there even a need to spell out the size of a char? If you wanted
types with explicit sizes, there was already byte, short, int and long.

It provides well-defined semantics.

Nobody wanted to repeat C89's undefined/implementation-specific
behavior.

Arne
 
R

Roedy Green

Why was there even a need to spell out the size of a char? If you wanted
types with explicit sizes, there was already byte, short, int and long.

I think it is because Java's designers thought at the byte-code level.
There, chars are unsigned 16-bit. That they are used for characters was
not really of interest to them. Much of Java is just a thin wrapper
around byte code. It has no high-level features of its own.

--
Roedy Green Canadian Mind Products
http://mindprod.com
To err is human, but to really foul things up requires a computer.
~ Farmer's Almanac
It is breathtaking how a misplaced comma in a computer program can
shred megabytes of data in seconds.
 
R

Roedy Green

Why was there a need to define the size of a character at all?

Because C did not define it, and that led to non-WORA (write once, run
anywhere) code.
--
Roedy Green Canadian Mind Products
http://mindprod.com
To err is human, but to really foul things up requires a computer.
~ Farmer's Almanac
It is breathtaking how a misplaced comma in a computer program can
shred megabytes of data in seconds.
 
A

Arne Vajhøj

Yeah, I didn’t realize it was spelled out that way in the original language
spec.

It is. And given that you complain in another thread about problems in
the JLS, I think you should have read it.

It should also be in most Java beginners' books.

It is also in the Java tutorial:

http://download.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html

What a short-sighted decision.

Back then, Unicode was 16 bits.

The increase beyond 16 bits came in 1996, after the release of Java 1.0.

Why was there a need to define the size of a character at all?

Well-defined data types are a very good thing.

Even in the early days of the unification of Unicode and ISO-10646, there
was already provision for UCS-4.

Java decided to do Unicode. And at that time 16 bits were sufficient
for that.

Did they really think that could safely be ignored?

Apparently yes.

Given that 16 bits had just replaced 8, I think it is understandable.

Arne
 
R

Roedy Green

Well, the real problem is that Unicode swore that 16 bits were enough
for everybody,

Be fair. I bought a book showing thousands of Unicode glyphs, including
pages and pages of Han ideographs. There were plenty of holes for
future growth. At the time I thought it was overkill. When I started
my career, character sets had 64 glyphs, including control chars.
Later it was considered "extravagant" to use lower case since it took
so much longer to print. In the very early days, each installation
designed its own local character set. I recall sitting in one such
meeting, and Vern Detwiler (later of MacDonald Detwiler) explaining
the virtues of a new code called ASCII.
--
Roedy Green Canadian Mind Products
http://mindprod.com
To err is human, but to really foul things up requires a computer.
~ Farmer's Almanac
It is breathtaking how a misplaced comma in a computer program can
shred megabytes of data in seconds.
 
A

Arne Vajhøj

Because they did not exist at the time Java was invented. Extended
literals were tacked on to the 16-bit internal scheme in a somewhat
half-hearted way. To go to full 32-bit internally would gobble RAM
hugely.

Java does not have 32-bit string literals, like C-style code points,
e.g. \U0001d504. Note the capital U vs. the usual \ud504. I wrote the
SurrogatePair applet (see http://mindprod.com/applet/surrogatepair.html)
to convert C-style code points to arcane surrogate pairs, to let you
use 32-bit Unicode glyphs in your programs.
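
For comparison, the standard Character.toChars method can do the same
conversion the applet performs; a minimal sketch, with a made-up class
name:

    public class CodePointToSurrogates {
        public static void main(String[] args) {
            int codePoint = 0x1D504;                    // MATHEMATICAL FRAKTUR CAPITAL A
            char[] pair = Character.toChars(codePoint); // { high surrogate, low surrogate }
            // Prints the two UTF-16 code units as \\u escapes, ready to
            // paste into a Java string literal (here: uD835 then uDD04).
            System.out.printf("\\u%04X\\u%04X%n", (int) pair[0], (int) pair[1]);
        }
    }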

Personally, I don’t see the point of any great rush to support 32-bit
Unicode. The new symbols will rarely be used. Consider what’s there.
The only ones I would conceivably use are musical symbols and
Mathematical Alphanumeric Symbols (especially the German black letters
so favoured in real analysis). The rest I can’t imagine ever using
unless I took up a career in anthropology, e.g. Linear B syllabary (I
have not a clue what it is), Linear B ideograms (looks like symbols
for categorising cave petroglyphs), Aegean Numbers (counting with
stones and sticks), Old Italic (looks like Phoenician), Gothic
(medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian
(George Bernard Shaw’s phonetic script), Osmanya (Somalian), Cypriot
syllabary, Byzantine musical symbols (looks like Arabic), Musical
Symbols, Tai Xuan Jing symbols (truncated I-Ching), CJK
extensions (Chinese/Japanese/Korean) and tags (letters with blank
“price tags”).

Most Western people never use them.

But that does not mean much, as we got our stuff in the low code points.

The relevant question is whether Chinese/Japanese/Korean use the
>= 64K code points.

Arne
 
A

Arne Vajhøj

Be fair. I bought a book showing thousands of Unicode glyphs, including
pages and pages of Han ideographs. There were plenty of holes for
future growth. At the time I thought it was overkill. When I started
my career, character sets had 64 glyphs, including control chars.
Later it was considered "extravagant" to use lower case since it took
so much longer to print. In the very early days, each installation
designed its own local character set. I recall sitting in one such
meeting, and Vern Detwiler (later of MacDonald Detwiler) explaining
the virtues of a new code called ASCII.

Impressive that he wanted to discuss that with a 12-year-old.

Arne
 
A

Arne Vajhøj

But you, or another programmer, might work for an anthropologist. The
computer is a universal machine, after all. If a programming language can’t
support that universality, what good is it?

The idea that a single programming language needs to support
everything is not a good one.

Maybe Java is just not the right language for anthropology.

If they had known that Unicode would go beyond 64K, then they
probably would have come up with a different solution.

But they did not.

And we live with it.

Arne
 
A

Arne Vajhøj

Perhaps char will be redefined as 32 bits, or a new unsigned 32-bit
echar type will be invented.

It is an intractable problem. Consider logic that uses indexOf and
substring with character-index arithmetic. Most of it would go insane
if you threw a few 32-bit chars in there. You need something that
simulates an array of 32-bit chars to the programmer.
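
A small demonstration of that breakage, using the surrogate pair for
U+1D504 mentioned earlier (the class name is made up):

    public class SupplementaryIndexing {
        public static void main(String[] args) {
            String s = "A\uD835\uDD04B";                          // A, U+1D504, B
            System.out.println(s.length());                       // 4 code units, not 3
            System.out.println(s.codePointCount(0, s.length()));  // 3 code points
            System.out.println(Integer.toHexString(s.charAt(1))); // d835, a lone high surrogate
            System.out.println(s.indexOf('B'));                   // 3, not 2
        }
    }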

I don't think they can come up with a solution that both provides
good support for the high code points and keeps old code
running unchanged.

echar and EString would keep old stuff running, but would
completely blow up the entire API.

Arne
 
R

Roedy Green

I am, however, at a loss to suggest a practical alternative!

What might happen is strings are nominally 32-bit.

You could probably come up with a very rapid compression scheme,
similar to UTF-8 but with a bit more compression, that could be
applied to strings at garbage collection time if they have not been
referenced since the last GC sweep.

Strings are immutable. This admits some other flavours of
"compression".

If the high three bytes of every character in the string are 0, store
the string UNCOMPRESSED, as a string of bytes. All the indexOf indexing
arithmetic works identically. This behaviour is hidden inside the
JVM. The String class knows nothing about it. It is an implementation
detail of 32-bit strings.

If the high two bytes of every character are 0, store the string
uncompressed as a string of unsigned shorts.

If there are any one-bits in the high two bytes, store it as a string of
unsigned ints.

Strings are what you gobble up your RAM with. If we start supporting
32-bit chars, we have to do something to compensate for the doubling
of RAM use.


Short-lived strings would still be 32-bit. They would only be
converted to the other forms if they have been sitting around for a
while. Interned strings would be immediately converted to canonical
form.
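
A rough sketch of that storage selection as a standalone helper; the
class and method names are hypothetical, not real JVM internals:

    // Hypothetical illustration only.
    final class PackedChars {
        // Returns the narrowest array that can hold the given code points:
        // byte[] if every value fits in 8 bits, char[] if every value fits
        // in 16 bits, otherwise the full int[].
        static Object pack(int[] codePoints) {
            int max = 0;
            for (int cp : codePoints) {
                max = Math.max(max, cp);
            }
            if (max <= 0xFF) {
                byte[] narrow = new byte[codePoints.length];
                for (int i = 0; i < codePoints.length; i++) {
                    narrow[i] = (byte) codePoints[i];
                }
                return narrow;
            }
            if (max <= 0xFFFF) {
                char[] medium = new char[codePoints.length];
                for (int i = 0; i < codePoints.length; i++) {
                    medium[i] = (char) codePoints[i];
                }
                return medium;
            }
            return codePoints.clone();
        }
    }

Choosing the narrowest element width keeps pure-Latin-1 strings at one
byte per character and pure-BMP strings at two, while still allowing full
32-bit storage when a string actually needs it.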

--
Roedy Green Canadian Mind Products
http://mindprod.com
To err is human, but to really foul things up requires a computer.
~ Farmer's Almanac
It is breathtaking how a misplaced comma in a computer program can
shred megabytes of data in seconds.
 
R

Roedy Green

But you, or another programmer, might work for an anthropologist. The
computer is a universal machine, after all. If a programming language can’t
support that universality, what good is it?

Of course. It is just that PERSONALLY I am not likely to use many of these
character sets. They are not important for business, so I doubt Oracle
makes supporting them a high priority.

I get a strange pleasure out of poking around the odd corners of the
Unicode glyphs, just admiring the art of various cultures in designing
their alphabets, often baffled as to why they would design so many
letters almost identically. I'd love to have an excuse to paint with
these glyphs. The glyphs that fascinate me most are Arabic, which look
to have rules of typography that boggle the Western mind.


--
Roedy Green Canadian Mind Products
http://mindprod.com
To err is human, but to really foul things up requires a computer.
~ Farmer's Almanac
It is breathtaking how a misplaced comma in a computer program can
shred megabytes of data in seconds.
 
A

Arne Vajhøj

Of course. It is just that PERSONALLY I am not likely to use many of these
character sets. They are not important for business, so I doubt Oracle
makes supporting them a high priority.

Sure about that?

Some claim that the economies of China, Japan and South Korea
are pretty important for business.

The question is whether the high code points are important
for those. I don't know.

Arne
 
A

Arne Vajhøj

What might happen is strings are nominally 32-bit.

You could probably come up with a very rapid compression scheme,
similar to UTF-8 but with a bit more compression, that could be
applied to strings at garbage collection time if they have not been
referenced since the last GC sweep.

Strings are immutable. This admits some other flavours of
"compression".

If the high three bytes of every character in the string are 0, store
the string UNCOMPRESSED, as a string of bytes. All the indexOf indexing
arithmetic works identically. This behaviour is hidden inside the
JVM. The String class knows nothing about it. It is an implementation
detail of 32-bit strings.

If the high two bytes of every character are 0, store the string
uncompressed as a string of unsigned shorts.

If there are any one-bits in the high two bytes, store it as a string of
unsigned ints.

Strings are what you gobble up your RAM with. If we start supporting
32-bit chars, we have to do something to compensate for the doubling
of RAM use.


Short-lived strings would still be 32-bit. They would only be
converted to the other forms if they have been sitting around for a
while. Interned strings would be immediately converted to canonical
form.

indexOf works fine with compression, but substring and charAt become
rather expensive.

Arne
 
L

Lawrence D'Oliveiro

I get a strange pleasure out of poking around the odd corners of the
Unicode glyphs ...

You’re not the only one. :)
... just admiring the art of various cultures in designing their
alphabets, often baffled as to why they would design so many letters
almost identically.

There seem to be an awful lot of cases of adapting a letter from one
alphabet for a completely different purpose in another. Look at the
correspondences between Cyrillic and Roman, just for example: V → B, S → C,
that kind of thing.
The glyphs that fascinate me most are Arabic, which look to have rules of
typography that boggle the Western mind.

And Arabic script was adopted by a whole lot of different languages which
had sounds that Arabic did not. So they had to make up their own letters,
most commonly by adding different numbers of dots to the existing shapes.
 
J

Joshua Cranmer

Yeah, I didn’t realize it was spelled out that way in the original language
spec. What a short-sighted decision.

It would have been stupider not to have specified a guaranteed size for
char. Take C (+ POSIX), where sizes are only very loosely defined, and
you very quickly get non-portable code. Yes, you can in theory change the
size of, say, time_t independently of other types, but it doesn't do you
much good if half the C code assumes sizeof(time_t) ==
sizeof(int). Pinning down the sizes of the types was a _very good_ move
on Java's part.
Why was there a need to define the size of a character at all? Even in the
early days of the unification of Unicode and ISO-10646, there was already
provision for UCS-4. Did they really think that could safely be ignored?

Knowing the results of other properly Unicode-aware code from the first
days of Unicode, I believe that Unicode quite heavily gave the impression
that "Unicode == 16 bits". Java is not the only major platform to be
bitten by Unicode now being 32 bits; the Windows platform has 16-bit
characters embedded into it.
 
M

markspace

Yeah. But that's not quite the same thing, is it? What with OOP and all.


Fair enough.

Since it's not possible to add new methods to an interface without
breaking all existing subclasses, I have to assume that is why
CharSequence was never modified.

The Lambda project for Java has been working on closures. They've also
proposed extension methods/defender methods to allow Java interfaces to
be modified. I think the best chance of getting CharSequence modified
would be through that mechanism when it becomes available.
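
For illustration, a hypothetical sketch of that idea using default
interface methods as they later shipped in Java 8 (the interface and
method names below are made up, not a real proposal):

    // Hypothetical: a CharSequence-like interface gaining a
    // code-point-aware method with a default body, so existing
    // implementations keep compiling.
    interface CodePointSequence extends CharSequence {
        default int codePointLength() {
            // Character.codePointCount accepts any CharSequence.
            return Character.codePointCount(this, 0, length());
        }
    }

Here the new method lives on a separate sub-interface only so the sketch
is self-contained; the actual change would add the default method to
CharSequence itself, and existing implementations would keep compiling
unchanged.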

I'm not sure off hand who is working on the extension methods. It might
be a good idea to contact them about getting CharSequence modified along
with whatever else they'll be doing.
 
