A reply for rusi (FSR)


Thomas 'PointedEars' Lahn

Chris said:
The ECMAScript spec says that strings are stored and represented in
UTF-16.

No, it does not (which Edition?). It says in Edition 5.1:

| 8.4 The String Type
|
| The String type is the set of all finite ordered sequences of zero or more
| 16-bit unsigned integer values (“elements”). […] Each element is regarded
| as occupying a position within the sequence. These positions are indexed
| with nonnegative integers. The first element (if any) is at position 0,
| the next element (if any) at position 1, and so on. The length of a
| String is the number of elements (i.e., 16-bit values) within it.
|
| […]
| When a String contains actual textual data, each element is considered to
| be a single UTF-16 code unit. Whether or not this is the actual storage
| format of a String, the characters within a String are numbered by
| their initial code unit element position as though they were represented
| using UTF-16. All operations on Strings (except as otherwise stated) treat
| them as sequences of undifferentiated 16-bit unsigned integers; they do
| not ensure the resulting String is in normalised form, nor do they ensure
| language-sensitive results.
|
| NOTE
| The rationale behind this design was to keep the implementation of Strings
| as simple and high-performing as possible. The intent is that textual data
| coming into the execution environment from outside (e.g., user input, text
| read from a file or received over the network, etc.) be converted to
| Unicode Normalised Form C before the running program sees it. Usually this
| would occur at the same time incoming text is converted from its original
| character encoding to Unicode (and would impose no additional overhead).
| Since it is recommended that ECMAScript source code be in Normalised Form
| C, string literals are guaranteed to be normalised (if source text is
| guaranteed to be normalised), as long as they do not contain any Unicode
| escape sequences.
You can see the same thing in Javascript too. Here's a little demo I
just knocked together:

<script>
function foo()
{
var txt=document.getElementById("in").value;
var msg="";
for (var i=0;i<txt.length;++i)
  msg+="["+i+"]: "+txt.charCodeAt(i)+" "+txt.charCodeAt(i).toString(16)+"\n";
document.getElementById("out").value=msg;
}
</script>
<input id=in><input type=button onclick="foo()"
value="Show"><br><textarea id=out rows=25 cols=80></textarea>

What an awful piece of code.
Give it an ASCII string

You mean a string of Unicode characters that can also be represented with
the US-ASCII encoding. There are no "ASCII strings" in conforming
ECMAScript implementations. And a string of Unicode characters with code
points within the BMP will suffice already.
and you'll see, as expected, one index (based on string indexing or
charCodeAt, same thing) for each character. Same if it's all BMP. But put
an astral character in and you'll see 00.00.d8.00/24 (oh wait, CIDR
notation doesn't work in Unicode) come up. I raised this issue on the
Google V8 list and on the ECMAScript list (e-mail address removed), and was
basically told that since JavaScript has been buggy for so long, there's
no chance of ever making it bug-free:

https://mail.mozilla.org/pipermail/es-discuss/2012-December/027384.html

You misunderstand, and I am not buying Rick's answer. The problem is not
that String values are defined as units of 16 bits. The problem is that the
length of a primitive String value in ECMAScript, and the position of a
character, is defined in terms of 16-bit units instead of characters. There
is no bug, because ECMAScript specifies that Unicode characters beyond the
Basic Multilingual Plane (BMP) need not be supported:

| 2 Conformance
|
| A conforming implementation of this Standard shall interpret characters in
| conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC
| 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form,
| implementation level 3. If the adopted ISO/IEC 10646-1 subset is not
| otherwise specified, it is presumed to be the BMP subset, collection 300.
| If the adopted encoding form is not otherwise specified, it is presumed to
| be the UTF-16 encoding form.

But they can:

| A conforming implementation of ECMAScript is permitted to provide
| additional types, values, objects, properties, and functions beyond those
| described in this specification. In particular, a conforming
| implementation of ECMAScript is permitted to provide properties not
| described in this specification, and values for those properties, for
| objects that are described in this specification.
|
| A conforming implementation of ECMAScript is permitted to support program
| and regular expression syntax not described in this specification. In
| particular, a conforming implementation of ECMAScript is permitted to
| support program syntax that makes use of the “future reserved words”
| listed in 7.6.1.2 of this specification.

People have found ways to make this work in ECMAScript implementations. For
example, it is possible to scan a normalized string for lead surrogates:

String.fromCharCode = (function () {
  var _fromCharCode = String.fromCharCode;
  var span;

  return function () {
    var a = [];

    for (var i = 0, len = arguments.length; i < len; ++i)
    {
      var arg = arguments[i];
      var ch;

      if (arg > 0xFFFF)
      {
        if (typeof span == "undefined")
        {
          span = document.createElement("span");
        }

        span.innerHTML = "&#" + arg + ";";
        ch = span.firstChild.nodeValue;
      }
      else
      {
        ch = _fromCharCode(arg);
      }

      a.push(ch);
    }

    return a.join("");
  };
}());

/* "ð„¢" (U+1D122 MUSICAL SYMBOL F CLEF) */
var sFClef = String.fromCharCode(0x1D122);

String.prototype.getLength = function () {
  return (this.match(/[\uD800-\uDBFF][^\uD800-\uDBFF]|[\S\s]/g)
          || []).length;
};

/* 1 */
sFClef.getLength()

(String.prototype.charAt() etc. are left as an exercise to the reader.)

Tested in Chromium 25.0.1364.160 Debian 7.0 (186726), which according to
Wikipedia should feature V8 3.15.11.5.

But yes, there should be native support for Unicode characters with code
points beyond the BMP, and evidently that does _not_ require a second
language; just a few tweaks to the algorithms.
Fortunately for Python, there are version numbers, and policies that
permit bugs to actually get fixed. (Which is why, for instance, Debian
Squeeze still ships Python 2.6 rather than upgrading to 2.7 - in case
some script is broken by that change.

Debian already ships Python 3.1 in Stable, disproving your argument.
Can't do that with web browsers.)

Yes, you could. It has been done before.
As of Python 3.3, all Pythons function the same way: it's
semantically a "wide build" (UTF-32), but with a memory usage
optimization. That's how it needs to be.

It is _not_ necessary to use the memory-expensive UTF-32 or a memory-cheaper
mixed encoding to represent characters beyond the BMP. UTF-32 would be more
runtime-efficient than any other encoding for such strings, though, because
you could divide by 32 for the length and would not have to find lead
surrogates to determine a character's position.
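
To put numbers on that difference, here is a quick Python 3.3+ illustration
(my own example, not part of the spec discussion): len() counts code points,
a length defined in 16-bit units counts the astral character twice, and with
UTF-32 a plain division recovers the code-point count.

py> s = "a\U0001D122b"                  # U+1D122 lies outside the BMP
py> len(s)                              # code points (Python 3.3+)
3
py> len(s.encode("utf-16-le")) // 2     # 16-bit code units, the ECMAScript-style count
4
py> len(s.encode("utf-32-le")) // 4     # with UTF-32, division alone suffices
3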
 

Mark Lawrence

Chris Angelico wrote:

Thomas and Chris, would the two of you be kind enough to explain to
morons such as myself how all the ECMAScript stuff relates to Python's
unicode as implemented via PEP 393 as you've lost me, easily done I know.
 

Chris Angelico

No, it does not (which Edition?). It says in Edition 5.1:

Okay, I was sloppy in my terminology. A language will seldom, if ever,
specify the actual storage. But it does specify a representation (to
the script) of UTF-16, and I seriously cannot imagine any reason for
an implementation to store a string in any other way, given that
string indexing is specifically based on UTF-16:
| The length of a
| String is the number of elements (i.e., 16-bit values) within it.
|
| […]
| When a String contains actual textual data, each element is considered to
| be a single UTF-16 code unit. Whether or not this is the actual storage
| format of a String, the characters within a String are numbered by
| their initial code unit element position as though they were represented
| using UTF-16.

So, yes, it could be stored in some other way, but in terms of what I
was saying (comparing against Python 3.2 and 3.3), it's still a
specification that doesn't allow for the change that Python did. If
narrow builds are all you compare against (as jmf does), then Python
3.2 is exactly like ECMAScript, and Python 3.3 isn't.
You can see the same thing in Javascript too. Here's a little demo I
just knocked together:

<script>
function foo()
{
var txt=document.getElementById("in").value;
var msg="";
for (var i=0;i<txt.length;++i)
  msg+="["+i+"]: "+txt.charCodeAt(i)+" "+txt.charCodeAt(i).toString(16)+"\n";
document.getElementById("out").value=msg;
}
</script>
<input id=in><input type=button onclick="foo()"
value="Show"><br><textarea id=out rows=25 cols=80></textarea>

What an awful piece of code.

Ehh, it's designed to be short, not beautiful. Got any serious
criticisms of it? It demonstrates what I'm talking about without being
a page of code.
You mean a string of Unicode characters that can also be represented with
the US-ASCII encoding. There are no "ASCII strings" in conforming
ECMAScript implementations. And a string of Unicode characters with code
points within the BMP will suffice already.

You can get a string of ASCII characters and paste them into the entry
field. They'll be turned into Unicode characters before the script
sees them. But yes, okay, my terminology was a bit sloppy.
You misunderstand, and I am not buying Rick's answer. The problem is not
that String values are defined as units of 16 bits. The problem is that the
length of a primitive String value in ECMAScript, and the position of a
character, is defined in terms of 16-bit units instead of characters. There
is no bug, because ECMAScript specifies that Unicode characters beyond the
Basic Multilingual Plane (BMP) need not be supported:

So what you're saying is that an ES implementation is allowed to be
even buggier than I described, and that's somehow a justification?
People have found ways to make this work in ECMAScript implementations. For
example, it is possible to scan a normalized string for lead surrogates:

And it's possible to write a fully conforming Unicode handler in C,
using char[] and relying on (say) UTF-8 encoding. That has nothing to
do with the language actually providing facilities.
But yes, there should be native support for Unicode characters with code
points beyond the BMP, and evidently that does _not_ require a second
language; just a few tweaks to the algorithms.

No, it requires either a complete change of the language, or the
acceptance that O(1) operations can now become O(n) on the length of
the string (if the string is left in UTF-16 but indexed in Unicode),
or the creation of a new user-space data type (which then has to be
converted any time it's given to any standard library function).
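
As a purely hypothetical sketch of the user-space data type option (names and
details invented here, written in Python for brevity): a wrapper that keeps
the UTF-16 code units but indexes by code point, paying a scan on every
access.

class U16String:
    def __init__(self, text):
        # Store the UTF-16 code units (no BOM), much as a JS engine might.
        self.units = list(memoryview(text.encode("utf-16-le")).cast("H"))

    def __len__(self):
        # Count every unit that is not a trail surrogate.
        return sum(1 for u in self.units if not 0xDC00 <= u <= 0xDFFF)

    def __getitem__(self, index):
        count = -1
        i = 0
        while i < len(self.units):
            u = self.units[i]
            width = 2 if 0xD800 <= u <= 0xDBFF else 1
            count += 1
            if count == index:
                raw = b"".join(v.to_bytes(2, "little")
                               for v in self.units[i:i + width])
                return raw.decode("utf-16-le")
            i += width
        raise IndexError(index)

U16String("a\U0001D122b")[1]   # the F clef, found by scanning, not arithmetic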
Debian already ships Python 3.1 in Stable, disproving your argument.

Separate branch. Debian stable ships one from each branch; Debian
unstable does, too (2.7.3 and 3.2.3). Same argument applies to each,
though - even Debian unstable hasn't yet introduced Python 3.3, in
case it breaks stuff. Argument not disproved.
Yes, you could. It has been done before.

Not easily. Assuming you can't make one a perfect super/subset of the
other (as with "use strict"), it needs to be done as a completely
separate language. Now, maybe it's time the <script> tag got versioned
(again? what happened to language="javascript1.2"?), but the normal
way for scripts to be put onto a page doesn't allow version tagging,
and especially, embedding code into other attributes doesn't make that
easy:

It is _not_ necessary to use the memory-expensive UTF-32 or a memory-cheaper
mixed encoding to represent characters beyond the BMP. UTF-32 would be more
runtime-efficient than any other encoding for such strings, though, because
you could divide by 32 for the length and would not have to find lead
surrogates to determine a character's position.

Of course it's not the only way to represent all of Unicode. But when
you provide string indexing (charCodeAt), programmers will assume it
is cheap, and casually index strings (from both ends and maybe the
middle too). A system in which string indexing isn't O(1) is going to
perform highly suboptimally with many common string operations, so its
programmers would be forced to learn its foibles or pay the cost.

ChrisA
 

Chris Angelico

Thomas and Chris, would the two of you be kind enough to explain to morons
such as myself how all the ECMAScript stuff relates to Python's unicode as
implemented via PEP 393 as you've lost me, easily done I know.

Sure. Here's the brief version: It's all about how a string is exposed
to a script.

* Python 3.2 Narrow gives you UTF-16. Non-BMP characters count twice.
* Python 3.2 Wide gives you UTF-32. Each character counts once.
* Python 3.3 gives you UTF-32, but will store it as compactly as possible.
* ECMAScript specifies the semantics of Python 3.2 Narrow.

Python 3.2 was either buggy or inefficient. (Generally, Windows builds
were buggy and Linux builds were inefficient, but you could pick at
compilation time.) String indexing followed obvious rules, as long as
everything fitted inside UCS-2, or you paid the
four-bytes-per-character price of a wide build. Otherwise, stuff went
off-kilter. PEP 393 fixed the matter, and the arguments were about
implementation, efficiency, and so on - but (far as I know) nobody
ever argued that the semantics of UTF-16 strings should be kept.
That's the difference with ES - that behaviour, peculiar though it be,
is actually mandated by the spec. I have banged my head against it at
work (amazingly, PHP's complete lack of native Unicode support is
actually easier to work with there - though mainly I just throw the
stuff at PostgreSQL, which will throw an error back if anything's
wrong); it's an insane mandate. But it's part of the spec, and it
can't be changed now.

ChrisA
 

Thomas 'PointedEars' Lahn

Chris said:
Okay, I was sloppy in my terminology. A language will seldom, if ever,
specify the actual storage. But it does specify a representation (to
the script) of UTF-16,

No, it does not.
and I seriously cannot imagine any reason for an implementation to store a
string in any other way, given that string indexing is specifically based
on UTF-16:

Non sequitur.
| The length of a String is the number of elements (i.e., 16-bit values)
| within it.
|
| […]
| When a String contains actual textual data, each element is considered
| to
| be a single UTF-16 code unit. Whether or not this is the actual
| storage format of a String, the characters within a String are numbered
| by their initial code unit element position as though they were
| represented using UTF-16.

So, yes, it could be stored in some other way, but in terms of what I
was saying (comparing against Python 3.2 and 3.3), it's still a
specification that doesn't allow for the change that Python did.

Yes, it does. You must not have been reading or understanding what I
quoted.
You can see the same thing in Javascript too. Here's a little demo I
just knocked together:

<script>
function foo()
{
var txt=document.getElementById("in").value;
var msg="";
for (var i=0;i<txt.length;++i)
  msg+="["+i+"]: "+txt.charCodeAt(i)+" "+txt.charCodeAt(i).toString(16)+"\n";
document.getElementById("out").value=msg;
}
</script>
<input id=in><input type=button onclick="foo()"
value="Show"><br><textarea id=out rows=25 cols=80></textarea>

What an awful piece of code.

Ehh, it's designed to be short, not beautiful. Got any serious
criticisms of it?

Better not here, lest another “moron” complain.
It demonstrates what I'm talking about without being a page of code.

It could have been written to be readable and efficient without that.
You can get a string of ASCII characters and paste them into the entry
field.

Not likely these days, no.
They'll be turned into Unicode characters before the script
sees them.

They will have become Windows-1252 or even Unicode characters long before.
But yes, okay, my terminology was a bit sloppy.

It still is.
So what you're saying is that an ES implementation is allowed to be
even buggier than I described, and that's somehow a justification?

No, I am saying that you have no clue what you are talking about.
But yes, there should be native support for Unicode characters with code
points beyond the BMP, and evidently that does _not_ require a second
language; just a few tweaks to the algorithms.

No, it requires either a complete change of the language, […]

No, it does not. Get yourself informed.
Not easily.

You have still no clue what you are talking about. Get yourself informed at
least about the (deprecated/obsolete) “language” and the (standards-
compliant) “type” attribute of SCRIPT/“script” elements before you post on
this again.
 

rusi

Thomas and Chris, would the two of you be kind enough to explain to
morons such as myself how all the ECMAScript stuff relates to Python's
unicode as implemented via PEP 393 as you've lost me, easily done I know.

The unicode standard is language-agnostic.
Unicode implementations exist within a language x implementation x C-
compiler implementation x … -- Notice the gccs in Andriy's
comparison. Do they signify?

$ python3.2
Python 3.2.3 (default, Jun 25 2012, 22:55:05)
[GCC 4.6.3] on linux2

$ python3.3
Python 3.3.0 (default, Sep 29 2012, 15:35:49)
[GCC 4.7.1] on linux


The number of actual python implementations is small -- 2.7, 3.1, 3.2,
3.3 -- at most enlarged with wides and narrows; The number of possible
implementations is large (in principle infinite) -- a small example of
a point in design-space that is not explored: eg

There are 17 planes x 2^16 chars in a plane
< 32 x 2^16
= 2^5 x 2^16
= 2^21

ie wide unicode (including the astral planes) can fit into 21 bits
ie 3 wide-chars can fit into 64 bit slot rather than 2.
Is this option worth considering? I've no idea, and I would wager that
no one does until some trials are done
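
Just to make the 3-per-64-bits idea concrete, a toy packing sketch in Python
(entirely hypothetical; no implementation I know of does this):

def pack3(cp0, cp1, cp2):
    # Three 21-bit code points fit into bits 0..62 of a 64-bit word.
    for cp in (cp0, cp1, cp2):
        assert 0 <= cp <= 0x10FFFF
    return cp0 | (cp1 << 21) | (cp2 << 42)

def unpack3(word):
    mask = (1 << 21) - 1
    return (word & mask, (word >> 21) & mask, (word >> 42) & mask)

word = pack3(ord("a"), 0x1D122, ord("b"))
assert word < 2**64
assert unpack3(word) == (0x61, 0x1D122, 0x62)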

So… Coming back to your question… Checking what other languages are
doing speeds up the dream->design->implement->performance-check cycle
 

rusi

Sure. Here's the brief version: It's all about how a string is exposed
to a script.

* Python 3.2 Narrow gives you UTF-16. Non-BMP characters count twice.
* Python 3.2 Wide gives you UTF-32. Each character counts once.
* Python 3.3 gives you UTF-32, but will store it as compactly as possible.

Framing issue here (made famous by en.wikipedia.org/wiki/
George_Lakoff)

When one uses words like 'compact' 'flexible' etc it loads the dice in
favour of 3.3 choices.
And ignores that 3.3 trades time for space.
 

Mark Lawrence

Framing issue here (made famous by en.wikipedia.org/wiki/
George_Lakoff)

When one uses words like 'compact' 'flexible' etc it loads the dice in
favour of 3.3 choices.
And ignores that 3.3 trades time for space.

As stated in PEP 393 so what's all the fuss about?
 

rusi

You have still no clue what you are talking about.  Get yourself informed at
least about the (deprecated/obsolete) “language” and the (standards-
compliant) “type” attribute of SCRIPT/“script” elements before you post on
this again.

An emotional 'PointedEars'?
Have I now dropped into an alternate universe?
 

Steven D'Aprano

Sure. Here's the brief version: It's all about how a string is exposed
to a script.

* Python 3.2 Narrow gives you UTF-16. Non-BMP characters count twice.
* Python 3.2 Wide gives you UTF-32. Each character counts once.
* Python 3.3 gives you UTF-32, but will store it as compactly as
possible.
* ECMAScript specifies the semantics of Python 3.2 Narrow.

And just for the record:

Unicode actually doesn't define what a "character" is. Instead, it talks
about "code points", but for our purposes we can gloss over the
differences and pretend that they are almost the same thing, except where
noted. Some code points represent characters, some represent non-
characters, and some are currently unused.
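
A quick way to see the distinction from Python (the particular code points
are just examples I chose):

py> import unicodedata
py> unicodedata.category("\u00e4")   # an assigned letter
'Ll'
py> unicodedata.category("\u0378")   # a currently unassigned code point
'Cn'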

UTF-16 is a *variable length* storage mechanism that says each code point
takes either two or four bytes. Since Unicode includes more code points
than will fit in two bytes, UTF-16 includes a mechanism for dealing with
the additional code points:

* The first 65,536 code points are defined as the "Basic Multilingual
Plane", or BMP. Each code point in the BMP is represented in UTF-16 by a
16-bit value.

* The remaining 16 sets of 65,536 code points are defined as
"Supplementary Multilingual Planes", or SMPs. Each code point in a SMP is
represented by two 16 bit values, called a "surrogate pair". The "lead
surrogate" will be in the range 0xD800...0xDBFF and the "trail surrogate"
will be in the range 0xDC00...0xDFFF.
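
The mapping between a supplementary code point and its surrogate pair is
simple arithmetic; here is a small Python sketch of it (the helper names are
mine):

def to_surrogate_pair(cp):
    # Map a code point in U+10000...U+10FFFF to its UTF-16 surrogate pair.
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

def from_surrogate_pair(lead, trail):
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

to_surrogate_pair(0x1D122)            # (0xD834, 0xDD22)
from_surrogate_pair(0xD834, 0xDD22)   # 0x1D122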

The disadvantage here is that you can't tell how far into a string you
need to go to get to (say) the 1000th character (code point). If all of
the first 1000 code points are in the BMP, then you can jump directly to
byte offset 2000. If all of them are in a SMP, then you can jump directly
to byte offset 4000. But since you don't usually know how many SMP code
points are in the string, you have to walk through the string:


# Pseudo-code to return the code-point in string at index.
offset = 0  # Offset in pairs of bytes.
counter = 0
while offset < length of string counted in pairs of bytes:
    if string[offset] in 0xD800...0xDBFF:
        # Lead surrogate of a surrogate pair.
        if counter == index:
            return string[offset:offset+2]  # Both halves of the pair.
        else:
            counter += 1
            offset += 2  # Skip the trail surrogate as well.
    elif string[offset] in 0xDC00...0xDFFF:
        # Trail surrogate found outside of a surrogate pair.
        raise Error
    else:
        # BMP code point.
        if counter == index:
            return string[offset]
        else:
            counter += 1
            offset += 1


What a mess! Slow and annoying to get right. Not surprisingly, most
implementations of UTF-16 don't do this, and Python is one of them.
Instead, they assume that all code points take up the same space, and
consequently they let you create *invalid Unicode strings* by splitting a
surrogate pair:


This is in Python 3.2 narrow build:

py> s = chr(70000)
py> len(s)
2
py> a = s[0]
py> a == s
False
py> print(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud804' in
position 0: surrogates not allowed


Oops! We have created an invalid Unicode string. A wide build will fix
this, because it uses UTF-32 instead.
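
For comparison, the same session on a wide build (or on Python 3.3 and later)
behaves as you would hope:

py> s = chr(70000)
py> len(s)
1
py> s[0] == s
True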

UTF-32 is a *fixed width* storage mechanism where every code point takes
exactly four bytes. Since the entire Unicode range will fit in four
bytes, that ensures that every code point is covered, and there is no
need to walk the string every time you perform an indexing operation. But
it means that if you're one of the 99.9% of users who mostly use
characters in the BMP, your strings take twice as much space as
necessary. If you only use Latin1 or ASCII, your strings take four times
as much space as necessary.

So a Python wide build uses more memory, but gets string processing
right. Python 3.3 has the best of both worlds, fixing the invalid string
processing and avoiding memory waste.
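
You can watch PEP 393 doing this from the interpreter; the exact byte counts
vary with version and platform, but the per-character cost only grows when
the string actually needs wider code units:

py> import sys
py> sys.getsizeof("x" * 10000)            # roughly 1 byte per character, plus a header
py> sys.getsizeof("\u0424" * 10000)       # roughly 2 bytes per character
py> sys.getsizeof("\U0001D122" * 10000)   # roughly 4 bytes per character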


One of the complications here is that often people conflate UTF-16 and
UCS-2. It isn't entirely clear to me, but it seems that ECMAScript may be
doing that. UCS-2 specifies that *all characters* are represented by
*exactly* 16 bits (two bytes), and hence it is completely unable to deal
with the Supplementary Multilingual Planes at all.

If I have read the ECMAScript spec correctly, it *completely ignores* the
issue of surrogate pairs, and so is underspecified. Instead it talks
about normalised form, which is entirely unrelated. Normalisation relates
to the idea that many characters can be represented in two or more ways.
For example, the character "ä" can be represented in at least two forms:

Normalised form: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

Canonical decomposition: U+0061 LATIN SMALL LETTER A + U+0308 COMBINING
DIAERESIS

So both of these two strings represent the same letter, even though the
second uses two code points and the first only one:

py> a = "\N{LATIN SMALL LETTER A WITH DIAERESIS}"
py> b = "a\N{COMBINING DIAERESIS}"

Arguably, they should print the same way too, although I think that will
depend on how smart your terminal is, and/or on the font you use.
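
They do compare unequal as code-point sequences, though, and normalising
either one makes them match; a quick check with the unicodedata module:

py> import unicodedata
py> a == b
False
py> unicodedata.normalize("NFC", b) == a
True
py> unicodedata.normalize("NFD", a) == b
True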

But I digress. The ECMAScript spec carefully specifies that it makes no
guarantees about normalisation, which is right and proper for a language,
but it says nothing about surrogates, and that is very poor.

So as I said, I suspect that ECMAScript is actually referring to UCS-2
when it mentions UTF-16. If I'm right, that's pretty lousy.
 

Steven D'Aprano

And ignores that 3.3 trades time for space.

So what? Lists, dicts and sets trade time for space: they are generally
over-allocated to ensure a certain level of performance. The language
designers are perfectly permitted to make that choice. If somebody wants
to make another choice they can design their own language, or write their
own data structures, or put in a bug report and hope to influence the
language designers to change their minds. Why should strings be treated
any differently?
 

Steven D'Aprano

The unicode standard is language-agnostic. Unicode implementations exist
within a language x implementation x C-compiler implementation x … --
Notice the gccs in Andriy's comparison. Do they signify?

They should not. Ideally, the behaviour of Python should be identical
regardless of the compiler used to build the Python interpreter.

In practice, this is not necessarily the case. One compiler might
generate more efficient code than another. But aside from *performance*,
the semantics of what Python does should be identical, except where noted
as "implementation dependent".

The number of actual python implementations is small -- 2.7, 3.1, 3.2,
3.3 -- at most enlarged with wides and narrows; The number of possible
implementations is large (in principle infinite)

IronPython and Jython will, if I understand correctly, inherit their
string implementations from .Net and Java.

-- a small example of a point in design-space that is not explored: eg

There are 17 planes x 2^16 chars in a plane < 32 x 2^16
= 2^5 x 2^16
= 2^21

ie wide unicode (including the astral planes) can fit into 21 bits ie 3
wide-chars can fit into 64 bit slot rather than 2. Is this option worth
considering? I've no idea, and I would wager that no one does until some
trials are done

As I understand it, modern CPUs and memory chips are optimized for
dealing with one of two things:

- single bytes;

- even numbers of bytes, e.g. 16 bits, 32 bits, 64 bits, ...

but not odd numbers of bytes, e.g. 24 bits, 40 bits, 72 bits, ...

So while you might save memory by using "UTF-24" instead of UTF-32, it
would probably be slower because you would have to grab three bytes at a
time instead of four, and the hardware probably does not directly support
that.
 

Roy Smith

Steven D'Aprano said:
UTF-32 is a *fixed width* storage mechanism where every code point takes
exactly four bytes. Since the entire Unicode range will fit in four
bytes, that ensures that every code point is covered, and there is no
need to walk the string every time you perform an indexing operation. But
it means that if you're one of the 99.9% of users who mostly use
characters in the BMP, your strings take twice as much space as
necessary. If you only use Latin1 or ASCII, your strings take four times
as much space as necessary.

I suspect that eventually, UTF-32 will win out. I'm not sure when
"eventually" is, but maybe sometime in the next 10-20 years.

When I was starting out, the computer industry had a variety of
character encodings designed to take up less than 8 bits per character.
Sixbit, Rad-50, BCD, and so on. Each of these added complexity and took
away character set richness, but saved a few bits. At the time, memory
was so expensive and so precious, it was worth it.

Over the years, memory became cheaper, address spaces grew from 16 to 32
to 64 bits, and the pressure to use richer character sets kept
increasing. So, now we're at the point where people are (mostly) using
Unicode, but are still arguing about which encoding to use because the
"best" complexity/space tradeoff isn't obvious.

At some point in the future, memory will be so cheap, and so ubiquitous,
that people will be wondering why us neanderthals bothered worrying
about trying to save 16 bits per character. Of course, by then, we'll
be migrating to Mongocode and arguing about UTF-64 :)
 

rusi

I suspect that eventually, UTF-32 will win out.  I'm not sure when
"eventually" is, but maybe sometime in the next 10-20 years.

There is an article by Tim O'Reilly IIRC that talks of a certain
prognostication that went wrong.
[If someone knows this article please give me the link]

The gist as I remember it was:
First there were audio cassettes and LPs.
Then came CDs with far better fidelity.
As Moore's law went its relentless way, the audio industry put its
hopes into formats that would double CD quality. Whereas the public
went with mp3s, i.e. a distinctly lower-quality format, because putting
a thousand CDs into my pocket beats the pants off some super-duper
new hi-fi CD.
So while Moore's law takes its course, public demand and therefore big
money and therefore new standards may go some other way, including
reverse.

I believe that there are many things about unicode that are less than
satisfactory. Some are downright asinine like the 'prime-real-estate'
devoted to the control characters and never used.

In short, I am not betting on UTF-32.
Of course the reverse side also is there: Some of the world's most un-
optimal standards are also the most ubiquitous, like the qwerty
keyboard.
 

Roy Smith

rusi said:
I believe that there are many things about unicode that are less than
satisfactory. Some are downright asinine like the 'prime-real-estate'
devoted to the control characters and never used.

Ah, but in UTF-32, all real-estate is the same price :)
 

jmfauth

------

utf-32 is already here. You are all most probably [*]
using it without noticing it. How? By using OpenType fonts,
not to mention the text-processing applications that use them.
Why? Because there is no other way to do it.

[*] Depending on the font, the internal table(s), e.g. the "cmap" table,
are in utf-16 or utf-32.

jmf
 

Roy Smith

Neil Hodgson said:
Low-level string manipulation often deals with blocks larger than
an individual character for speed. Generally 32 or 64 bits at a time
using the CPU, or 128 or 256 bits using the vector unit. Then there may be
entry/exit code to handle initial alignment to a block boundary and
dealing with a smaller than block-size tail.

Duff's Device!
 
