wchar_t

Discussion in 'C Programming' started by James Brown, Nov 17, 2005.

  1. James Brown

    James Brown Guest

    could someone please tell me when the wchar_t type was introduced into
    the C language (and with what version)? perhaps it was introduced
    as an extension by a lot of compiler vendors before it became official?

    I am also interested in finding out what first prompted the introduction of
    this type -
    was it Unicode or did wchar_t happen before Unicode came into existence?

    thanks,
    James
     
    James Brown, Nov 17, 2005
    #1

  2. James Brown wrote:
    > could someone please tell me when the wchar_t type was introduced into
    > the C language (and with what version)


    Normative Addendum 1 (1995).

    --
    A+

    Emmanuel Delahaye
     
    Emmanuel Delahaye, Nov 17, 2005
    #2

  3. in comp.lang.c i read:

    >could someone please tell me when the wchar_t type was introduced into
    >the C language (and with what version)


    in the original standard, in 1989. though it was less than useful until
    amd1 was adopted in 1995, and some might say remains less than successful.
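
    A minimal sketch of what that amendment added around the C89 wchar_t
    type (an illustration, not part of the original post; it assumes a
    hosted C95-or-later implementation and a working locale -- mbstowcs is
    C89, the <wchar.h> calls are Amendment 1):

        #include <locale.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <wchar.h>

        int main(void)
        {
            const char *mb = "hello";   /* multibyte string in the current locale */
            wchar_t wide[32];
            size_t n;

            setlocale(LC_ALL, "");      /* pick up the user's multibyte encoding */

            n = mbstowcs(wide, mb, sizeof wide / sizeof wide[0]);
            if (n == (size_t)-1)
                return EXIT_FAILURE;    /* invalid multibyte sequence */

            wprintf(L"%ls is %lu wide characters\n",
                    wide, (unsigned long)wcslen(wide));
            return 0;
        }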

    --
    a signature
     
    those who know me have no need of my name, Nov 18, 2005
    #3
  4. James Brown

    Larry Jones Guest

    James Brown <dont_bother> wrote:
    >
    > I am also interested in finding out what first prompted the introduction of
    > this type -
    > was it Unicode or did wchar_t happen before Unicode came into existence?


    It was large character sets in general. At the time, the prevalent
    large character sets and encodings were IBM's DBCS (the double-byte
    version of EBCDIC), JIS 208, JIS 212, ISO 2022, SJIS, and EUC. Work had
    begun on what would become ISO 10646, but it was caught up in political
    and technical turmoil between those who insisted that 32 bits were
    required and those who thought that 16 were more than enough and far
    more efficient. The latter camp had just broken away and started work
    on a competing standard, Unicode. (Fortunately for everyone, cooler
    heads prevailed and ISO 10646 and Unicode were eventually harmonized to
    the point that most people now think they're the same thing.)

    -Larry Jones

    Let's pretend I already feel terrible about it, and that you
    don't need to rub it in any more. -- Calvin
     
    Larry Jones, Nov 18, 2005
    #4
  5. James Brown

    P.J. Plauger Guest

    <> wrote in message
    news:...

    > James Brown <dont_bother> wrote:
    >>
    >> I am also interested in finding out what first prompted the introduction
    >> of
    >> this type -
    >> was it Unicode or did wchar_t happen before Unicode came into existence?

    >
    > It was large character sets in general. At the time, the prevalent
    > large character sets and encodings were IBM's DBCS (the double-byte
    > version of EBCDIC), JIS 208, JIS 212, ISO 2022, SJIS, and EUC. Work had
    > begun on what would become ISO 10646, but it was caught up in political
    > and technical turmoil between those who insisted that 32 bits were
    > required and those who thought that 16 were more than enough and far
    > more efficient. The latter camp had just broken away and started work
    > on a competing standard, Unicode. (Fortunately for everyone, cooler
    > heads prevailed and ISO 10646 and Unicode were eventually harmonized to
    > the point that most people now think they're the same thing.)


    Right. And the people who thought that 16 bits were more than enough
    and far more efficient are *now* convinced that 21 bits are more
    than enough and far more efficient. I give 'em five years, tops.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 18, 2005
    #5
  6. James Brown

    Paul Hsieh Guest

    P.J. Plauger wrote:
    > <> wrote in message
    > > James Brown <dont_bother> wrote:
    > >> I am also interested in finding out what first prompted the introduction
    > >> of this type - was it Unicode or did wchar_t happen before Unicode came into
    > >> existence?

    > >
    > > It was large character sets in general. At the time, the prevalent
    > > large character sets and encodings were IBM's DBCS (the double-byte
    > > version of EBCDIC), JIS 208, JIS 212, ISO 2022, SJIS, and EUC. Work had
    > > begun on what would become ISO 10646, but it was caught up in political
    > > and technical turmoil between those who insisted that 32 bits were
    > > required and those who thought that 16 were more than enough and far
    > > more efficient. The latter camp had just broken away and started work
    > > on a competing standard, Unicode. (Fortunately for everyone, cooler
    > > heads prevailed and ISO 10646 and Unicode were eventually harmonized to
    > > the point that most people now think they're the same thing.)

    >
    > Right. And the people who thought that 16 bits were more than enough
    > and far more efficient are *now* convinced that 21 bits are more
    > than enough and far more efficient. I give 'em five years, tops.


    I'll take the other side of any bet you care to make based on that
    statement. (Certainly I'll be recording this message for the
    archives.)

    BTW, the people who thought 32 bits (actually 31 bits) was the way to
    go *also* agree that (almost) 21 bits are more than enough. Less than
    17 bits' worth of code points are in use today (dominated by the East
    Asian characters), and the growth rate appears to be no worse than a
    thousand new characters added per year. The kinds of things they are
    considering these days are invented character sets (like an
    accessibility alphabet called "Blissymbolics", or the script used for
    Klingon and Elvish in the Lord of the Rings series ...) and really
    obscure historical symbols (apparently "Old Hungarian" used an
    alphabet that survives today only among a few specialist historians
    in Hungary). The growth looks very asymptotic to me.

    The problem with the 16-bit people (i.e., Microsoft, Sun, and I think
    IBM) is that they were so stupid as to think that Asia wouldn't really
    need complete character sets. They were basically being passively
    racist. But, of course, money talks and there is a lot of commerce in
    and with East Asia, so they had to be accommodating. It turns out that
    17 bits appears to be the right answer to get them all, but leaving no
    room for expansion at all is clearly insane. Having 21 bits means they
    literally have more than 15 times as much space left over as what
    they are currently using (again, remembering that they're already
    covering the really "big" East Asian character sets).

    The only remaining controversy (that I can tell) is the aliasing of
    characters between the three major east asian languages. From what I'm
    told, people in those countries don't seem to care about the subtle
    problems that causes (you can't quote one language within another
    unless you use some meta data, like a font change), and have gone full
    steam ahead with dropping Big5 and adopting Unicode pretty pervasively.

    You think they'll run out in 5 years? Personally, I think they're
    done.

    --
    Paul Hsieh
    http://www.pobox.com/~qed/
    http://bstring.sf.net/
     
    Paul Hsieh, Nov 19, 2005
    #6
  7. James Brown

    P.J. Plauger Guest

    <> wrote in message
    news:...

    > The only remaining controversy (that I can tell) is the aliasing of
    > characters between the three major east asian languages. From what I'm
    > told, people in those countries don't seem to care about the subtle
    > problems that causes (you can't quote one language within another
    > unless you use some meta data, like a font change), and have gone full
    > steam ahead with dropping Big5 and adopting Unicode pretty pervasively.
    >
    > You think they'll run out in 5 years? Personally, I think they're
    > done.


    Here's a coarse scale or two, just from personal experience.

    -- Number of address bits required to address a "large" memory:

    1960 15 IBM 7090
    1970 20 IBM 360
    1980 25 VAX 11/780
    1990 30 various
    2000 35 various

    -- Number of bits required to represent a (commonly used)
    character set:

    1960 6 numerous vendor-specific codes
    1970 7 7-bit ASCII
    1980 8 extended ASCII
    1990 16 DBCS and others
    2000 21 Unicode

    I could make a similar table of "barely adequate" communication
    speeds, which also continue to expand exponentially.

    So long as you think in terms of linear increases in demand
    for bytes or characters, it's easy to believe at each stage
    that you're through expanding. After all, you currently have
    a bit of headroom, and what possible need can there be for
    much larger programs/character sets?

    I personally can't imagine that people will ever want to
    define common attribute bits for, say:

    -- roman, italic, bold, underscore
    -- red, green, blue
    -- point size
    -- font

    But if we did, each attribute bit would double the number
    of effective character codes, wouldn't it?

    Nor can I imagine that a large government like China might
    thumb its nose at an international standard and, say,
    require a parallel set of many ISO 10646 codes.

    For over 40 years I've been reading regular articles by
    pundits who explain why larger/faster hardware is a waste
    of time and will never sell. They've all been wrong. And
    the further back in time you look, the greater the redshift
    in the predictions.

    So, you may well be right that the need for larger
    character sets has finally come to an end. I'll wait
    and see. Meanwhile, I make sure that the code I write
    will work with 32- (not 31-) bit character sets. With
    any luck, the code will have adequate capacity until
    I retire...

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 19, 2005
    #7
  8. James Brown

    James Brown Guest

    <> wrote in message
    news:...
    > James Brown <dont_bother> wrote:
    >>
    >> I am also interested in finding out what first prompted the introduction
    >> of
    >> this type -
    >> was it Unicode or did wchar_t happen before Unicode came into existence?

    >
    > It was large character sets in general. At the time, the prevalent
    > large character sets and encodings were IBM's DBCS (the double-byte
    > version of EBCDIC), JIS 208, JIS 212, ISO 2022, SJIS, and EUC. Work had
    > begun on what would become ISO 10646, but it was caught up in political
    > and technical turmoil between those who insisted that 32 bits were
    > required and those who thought that 16 were more than enough and far
    > more efficient. The latter camp had just broken away and started work
    > on a competing standard, Unicode. (Fortunately for everyone, cooler
    > heads prevailed and ISO 10646 and Unicode were eventually harmonized to
    > the point that most people now think they're the same thing.)
    >
    > -Larry Jones
    >
    > Let's pretend I already feel terrible about it, and that you
    > don't need to rub it in any more. -- Calvin


    thanks! (to everyone) for the very informative answers.

    cheers,
    James
     
    James Brown, Nov 19, 2005
    #8
  9. James Brown

    Skarmander Guest

    [OT] Re: wchar_t

    I'll mark it OT, since we've left C behind quite a bit by now.

    P.J. Plauger wrote:
    > <> wrote in message
    > news:...
    >
    >
    >>The only remaining controversy (that I can tell) is the aliasing of
    >>characters between the three major east asian languages. From what I'm
    >>told, people in those countries don't seem to care about the subtle
    >>problems that causes (you can't quote one language within another
    >>unless you use some meta data, like a font change), and have gone full
    >>steam ahead with dropping Big5 and adopting Unicode pretty pervasively.
    >>
    >>You think they'll run out in 5 years? Personally, I think they're
    >>done.

    >
    >
    > Here's a coarse scale or two, just from personal experience.
    >
    > -- Number of address bits required to address a "large" memory:
    >
    > 1960 15 IBM 7090
    > 1970 20 IBM 360
    > 1980 25 VAX 11/780
    > 1990 30 various
    > 2000 35 various
    >

    Nice, but this misses a point: there is an upper limit. Address bits
    will not continue to grow indefinitely, because there is an upper limit
    to the amount of information that will fit in the universe. Or maybe
    there isn't, but then we're talking a radical shift in physics, which
    may happen but doesn't allow for fair comparison anymore.

    > -- Number of bits required to represent a (commonly used)
    > character set:
    >
    > 1960 6 numerous vendor-specific codes
    > 1970 7 7-bit ASCII
    > 1980 8 extended ASCII
    > 1990 16 DBCS and others
    > 2000 21 Unicode
    >
    > I could make a similar table of "barely adequate" communication
    > speeds, which also continue to expand exponentially.
    >

    But again: it can't go on forever. The question here, therefore, is
    whether we've reached the end of the line, not whether exponential
    expansion is happening.

    > So long as you think in terms of linear increases in demand
    > for bytes or characters, it's easy to believe at each stage
    > that you're through expanding. After all, you currently have
    > a bit of headroom, and what possible need can there be for
    > much larger programs/character sets?
    >

    Don't think this question hasn't been asked, unlike those people who
    asserted that "640K ought to be enough for anybody" (which Bill Gates
    famously never said) or "16 bits ought to be enough, since it's better
    than wasting 32 bits". Unicode doesn't say "21 bits ought to be enough
    for anybody". It can say "21 bits is enough for every character known to
    man", because it is. Unlike memory, communication speed and a host of
    other things that keep growing, there is a conceivable upper limit, and
    it is not that unreasonable to state we're close to it.

    > I personally can't imagine that people will ever want to
    > define common attribute bits for, say:
    >
    > -- roman, italic, bold, underscore
    > -- red, green, blue
    > -- point size
    > -- font
    >
    > But if we did, each attribute bit would double the number
    > of effective character codes, wouldn't it?
    >


    That's why Unicode doesn't work that way, and no character set ever has.
    They encode *characters*, not *glyphs*. A glyph is what you see on your
    screen, and it may have many nice properties by which it is affected,
    including the formatting characteristics you describe. But a Roman
    capital letter A is a Roman capital letter A, no matter what style,
    color, size or font it happens to be displayed in. Being able to leave
    these things unstated will always remain useful.

    Actually, "glyph sets" were (and probably still are) in common use for
    display on dumb terminals with hardwired character sets (and probably
    some applications for not so dumb terminals, too). Remember when the
    character set was 7-bit ASCII and the terminals extended this to an
    8-bit glyph set with the upper bit meaning "reverse video"? That's this.

    The point is, effective comparison stops being useful at this point,
    because you've shifted the way you look at what a code point represents.
    As the Unicode FAQ itself states:

    "Both Unicode and ISO 10646 have policies in place that formally limit
    future code assignment to the integer range that can be expressed with
    current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e.
    other UTFs) can represent larger integers, these policies mean that all
    encoding forms will always represent the same set of characters. Over a
    million possible codes is far more than enough for the goal of Unicode
    of encoding characters, not glyphs. Unicode is not designed to encode
    arbitrary data. If you wanted, for example, to give each 'instance of a
    character on paper throughout history' its own code, you might need
    trillions or quadrillions of such codes; noble as this effort might be,
    you would not use Unicode for such an encoding."

    Here's a more interesting thing to think about than adding "blink" bits:
    suppose we encounter extraterrestrial cultures one day, and we want to
    synch character sets eventually... *Then* Unicode may become
    insufficient. But I don't think it would be fair to blame the current
    standard for that.

    > Nor can I imagine that a large government like China might
    > thumb its nose at an international standard and, say,
    > require a parallel set of many ISO 10646 codes.
    >

    It already thumbs its nose at it to some extent. Unicode is still viewed with
    great suspicion in some parts of the Eastern world, and alternate
    character sets continue to be in use. But the Chinese government can
    require of ISO 10646 what it wants; it's not likely to get it if it
    can't be supported by technical requirements, as opposed to politics.
    Maybe you can slip in one character that's spurious that way, but not a
    few thousand. Maybe when the Chinese achieve global domination and
    abolish our preposterous 21-bit standards, but not before.

    > For over 40 years I've been reading regular articles by
    > pundits who explain why larger/faster hardware is a waste
    > of time and will never sell. They've all been wrong. And
    > the further back in time you look, the greater the redshift
    > in the predictions.
    >

    These arguments do not cleanly translate to character sets, your little
    tables notwithstanding. The upper limit may not be 21 bits, but if
    that's not the upper limit, it's pretty close to it in orders of
    magnitude. If people one day decide to abandon the concept of "character
    set" and go crazy stuffing all sorts of attributes in it (adopting
    "glyph sets"), that's a clear change in application, unlike increased
    hardware capacity. It will be fueled by the *ability* to use such sets
    efficiently, not the *need* to do this.

    > So, you may well be right that the need for larger
    > character sets has finally come to an end. I'll wait
    > and see. Meanwhile, I make sure that the code I write
    > will work with 32- (not 31-) bit character sets. With
    > any luck, the code will have adequate capacity until
    > I retire...
    >

    Fortunately for you, writing code that can handle both 21-bit and 32-bit
    character sets is hardly a challenge, given the current state of
    computer hardware. Even if Unicode had to grow someday (which would have
    to mean a new standard, of course), it wouldn't exactly be hard to
    implement, at least not as far as code point size is concerned.

    S.
     
    Skarmander, Nov 19, 2005
    #9
  10. James Brown

    Paul Hsieh Guest

    P.J. Plauger wrote:
    > <> wrote in message
    > > The only remaining controversy (that I can tell) is the aliasing of
    > > characters between the three major east asian languages. From what I'm
    > > told, people in those countries don't seem to care about the subtle
    > > problems that causes (you can't quote one language within another
    > > unless you use some meta data, like a font change), and have gone full
    > > steam ahead with dropping Big5 and adopting Unicode pretty pervasively.
    > >
    > > You think they'll run out in 5 years? Personally, I think they're
    > > done.

    >
    > Here's a coarse scale or two, just from personal experience.
    >
    > -- Number of address bits required to address a "large" memory:
    >
    > 1960 15 IBM 7090
    > 1970 20 IBM 360
    > 1980 25 VAX 11/780
    > 1990 30 various
    > 2000 35 various
    >
    > -- Number of bits required to represent a (commonly used)
    > character set:
    >
    > 1960 6 numerous vendor-specific codes


    Used only by computer scientists. (Commercial computing being
    non-existent.)

    > 1970 7 7-bit ASCII


    Used only in English-speaking countries.

    > 1980 8 extended ASCII


    Used only in English-speaking and *some* European countries.

    > 1990 16 DBCS and others


    A nonsensical hack.

    > 2000 21 Unicode


    Used in 100% of all computer-using countries (and built to scale to
    those that don't yet).

    The only potential for future growth here will come from the SETI
    project.

    > I could make a similar table of "barely adequate" communication
    > speeds, which also continue to expand exponentially.
    >
    > So long as you think in terms of linear increases in demand
    > for bytes or characters, it's easy to believe at each stage
    > that you're through expanding. After all, you currently have
    > a bit of headroom, and what possible need can there be for
    > much larger programs/character sets?


    There is nowhere left to scale to, and the headroom is overkill. We
    would have to add at least 16 languages of similar complexity to the
    East Asian ones before the encoding space was at risk.

    > I personally can't imagine that people will ever want to
    > define common attribute bits for, say:
    >
    > -- roman, italic, bold, underscore
    > -- red, green, blue
    > -- point size
    > -- font
    >
    > But if we did, each attribute bit would double the number
    > of effective character codes, wouldn't it?


    So you haven't read anything about Unicode at all, have you? Unicode
    does *not* specify meta-information. Those kinds of data will never be
    put into the Unicode standard, and are not considered part of the text
    data that Unicode specifies.

    This also betrays an ignorance of what Unicode is specifying. Do you
    think it makes sense to have the accent of one character in a different
    font or size than its base character? Even if you wanted to encode
    this (which I think the East Asians may need in some cases of
    multi-language applications), such metadata would obviously be encoded
    as escaped *modes*. This is easily done in the "private use area"
    ranges in application-specific ways. But most people use meta-display
    formatting languages, like HTML, the OpenDocument format, or MS Word,
    or something like that, to encode such things today.

    > Nor can I imagine that a large government like China might
    > thumb its nose at an international standard and, say,
    > require a parallel set of many ISO 10646 codes.


    Why would they do this? The closest thing to China setting policy on
    anything regarding computing standards is their adoption of Red Flag
    Linux. Linux uses Unicode as its internationalization mechanism. I
    don't think China wants to give up on the commerce that relies on this
    standardization (i.e., all of it.)

    > For over 40 years I've been reading regular articles by
    > pundits who explain why larger/faster hardware is a waste
    > of time and will never sell. They've all been wrong. And
    > the further back in time you look, the greater the redshift
    > in the predictions.


    That is because they always underestimate the scale and growth of the
    problem being solved. By analogy, you are suggesting that human
    languages and the character sets we use will keep growing over time
    in an exponential way similar to the growth of programming
    applications.

    > So, you may well be right that the need for larger
    > character sets has finally come to an end. I'll wait
    > and see. Meanwhile, I make sure that the code I write
    > will work with 32- (not 31-) bit character sets.


    Are you going to invent your own standard? UTF-32 encodes 31 bits (the
    top bit is assumed to be 0, otherwise an encoding error can be
    assumed). UTF-8 encodes at most 31 bits (this is a physical encoding
    limitation). And UTF-16 encodes a little under 21 bits (again, a
    physical encoding limitation). The only *valid* encodings are the
    intersection of these, which is essentially the UTF-16 range.
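
    For illustration (not from the original post), a sketch of where that
    "a little under 21 bits" comes from: code points above U+FFFF have to
    be split into a UTF-16 surrogate pair, which can reach at most
    U+10FFFF. The helper below is hypothetical, not any standard API.

        #include <stdio.h>

        /* encode one Unicode scalar value as UTF-16; returns the number of
           16-bit code units written (0 if the value cannot be encoded) */
        static int utf16_encode(unsigned long cp, unsigned short out[2])
        {
            if (cp > 0x10FFFFul || (cp >= 0xD800ul && cp <= 0xDFFFul))
                return 0;                    /* out of range, or a bare surrogate */
            if (cp < 0x10000ul) {
                out[0] = (unsigned short)cp; /* fits in one code unit (the BMP) */
                return 1;
            }
            cp -= 0x10000ul;                 /* 20 bits left to spread over a pair */
            out[0] = (unsigned short)(0xD800ul | (cp >> 10));     /* high surrogate */
            out[1] = (unsigned short)(0xDC00ul | (cp & 0x3FFul)); /* low surrogate */
            return 2;
        }

        int main(void)
        {
            unsigned short u[2];
            int n = utf16_encode(0x1D11Eul, u);  /* MUSICAL SYMBOL G CLEF */

            if (n == 2)
                printf("U+1D11E -> %04X %04X\n", (unsigned)u[0], (unsigned)u[1]);
            return 0;
        }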

    > [...] With any luck, the code will have adequate capacity until I retire...


    Also, just arbitrarily deciding that "characters are 32 bits" is less
    than useful to people who actually want to encode and use Unicode
    data. For example, string comparison and collation cannot be done
    with a simple byte comparison, and character counts do not correspond
    to the length of the encoded data. If you don't encode actual Unicode
    semantics (i.e., you use plain wchar_t instead), then "adequate" is
    not a word anyone is going to apply to your implementation.
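
    As a concrete (and again hypothetical) illustration of the gap between
    encoded length and character count, here is a UTF-8 code point
    counter; note that even this count still ignores combining sequences,
    which is part of why collation cannot be a simple byte compare.

        #include <stdio.h>
        #include <string.h>

        /* count code points in a UTF-8 string by skipping the continuation
           bytes, i.e. those of the form 10xxxxxx */
        static size_t utf8_codepoints(const char *s)
        {
            size_t count = 0;

            for (; *s != '\0'; s++)
                if (((unsigned char)*s & 0xC0) != 0x80)
                    count++;
            return count;
        }

        int main(void)
        {
            const char *s = "na\xC3\xAFve";  /* "naive" with a diaeresis: 6 bytes */

            printf("bytes: %lu, code points: %lu\n",
                   (unsigned long)strlen(s),
                   (unsigned long)utf8_codepoints(s));
            return 0;
        }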

    --
    Paul Hsieh
    http://www.pobox.com/~qed/
    http://bstring.sf.net/
     
    Paul Hsieh, Nov 19, 2005
    #10
  11. In article <>,
    <> wrote:
    >And UTF-16 encodes a little under 21 bits


    A little over 20 bits would be more accurate: 2^20 + 2^16.

    -- Richard
     
    Richard Tobin, Nov 19, 2005
    #11
  12. James Brown

    P.J. Plauger Guest

    <> wrote in message
    news:...

    >> But if we did, each attribute bit would double the number
    >> of effective character codes, wouldn't it?

    >
    > So you haven't read anything about Unicode at all have you?


    Actually, I have.

    > Unicode
    > does *not* specify meta-information. Those kinds of data will never be
    > put into the Unicode standard, and are not considered part of the text
    > data that Unicode specifies.


    What, never? You may very well be right.

    > This also belies an ignorance of what Unicode is specifying. Do you
    > think it makes sense to have the accent of one character in a different
    > font or size than its base character?


    Does it make sense to have several different ways to express the
    same "character", some involving multiple codes in arbitrary order?
    Particularly when there's a one-element version that does the job?
    Who would do a thing like that in an international standard?

    >> Nor can I imagine that a large government like China might
    >> thumb its nose at an international standard and, say,
    >> require a parallel set of many ISO 10646 codes.

    >
    > Why would they do this? The closest thing to China setting policy on
    > anything regarding computing standards is their adoption of Red Flag
    > Linux. Linux uses Unicode as its internationalization mechanism. I
    > don't think China wants to give up on the commerce that relies on this
    > standardization (i.e., all of it.)


    That's not what I've heard.

    >> For over 40 years I've been reading regular articles by
    >> pundits who explain why larger/faster hardware is a waste
    >> of time and will never sell. They've all been wrong. And
    >> the further back in time you look, the greater the redshift
    >> in the predictions.

    >
    > That is because they always underestimate the scale and growth in the
    > problem being solved.


    Uh huh.

    > By analogy you are suggesting that human
    > languages and the character sets we use will be increasing over time in
    > an increasing and exponential way similar to the growth of programming
    > applications.


    Yep.

    >> So, you may well be right that the need for larger
    >> character sets has finally come to an end. I'll wait
    >> and see. Meanwhile, I make sure that the code I write
    >> will work with 32- (not 31-) bit character sets.

    >
    > Are you going to invent your own standard?


    No. But I've already invented my own worst-case *machinery*
    for handling a variety of standards. Different thing.

    > UTF-32 encodes 31 bits (the
    > top bit is assumed to be 0, otherwise an encoding error can be
    > assumed). UTF-8 encodes at most 31 bits (this is a physical encoding
    > limitation). And UTF-16 encodes a little under 21 bits (again, a
    > physical encoding limitation). The only *valid* encodings are the
    > intersection of these which is essentially the UTF-16 encoding.


    At the moment, yes. A few years ago, it was UCS-2.

    >> [...] With any luck, the code will have adequate capacity until I
    >> retire...

    >
    > Also, just arbitrarily thinking "characters are 32 bits" are less that
    > useful to people who actually want to encode and use Unicode data.


    May be true. I didn't say I was.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 19, 2005
    #12
  13. James Brown

    P.J. Plauger Guest

    "Richard Tobin" <> wrote in message
    news:dlo79b$r11$...

    > In article <>,
    > <> wrote:
    >>And UTF-16 encodes a little under 21 bits

    >
    > A little over 20 bits would be more accurate: 2^20 + 2^16.


    Right. And the minimum number of real bits needed to express
    20.087463 bits is...?
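
    (Spelled out: ceil(log2(2^20 + 2^16)) = ceil(20.087463) = 21.)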

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 19, 2005
    #13
  14. James Brown

    Paul Hsieh Guest

    Richard Tobin wrote:
    > In article <>,
    > <> wrote:
    > >And UTF-16 encodes a little under 21 bits

    >
    > A little over 20 bits would be more accurate: 2^20 + 2^16.


    Ah yes, I misremembered this. But you forgot to subtract out the
    escape hole itself:

    2^20 + 2^16 - 2*2^(20/2).

    Then you can take into account that U+FFFF is always illegal, and
    that only one of the two encodings, (xFFFE) or (xFEFF), can be legal
    in any single given stream (what this means is that, once decoded,
    U+FEFF is legal (and a basically content-free code point), while
    U+FFFE is not):

    2^20 + 2^16 - 2^11 - 2.

    All these complications come from UTF-16, but they have to be adopted
    by the other encodings (except the 0xFEFF nonsense) just to make them
    all consistent.
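
    Working those numbers through (an illustrative snippet, not from the
    original post):

        #include <stdio.h>

        int main(void)
        {
            long code_space = (1L << 20) + (1L << 16); /* 2^20 + 2^16 = 1,114,112 */
            long surrogates = 2L * (1L << 10);         /* 2 * 2^10   =     2,048 */

            printf("code points reachable by UTF-16:  %ld\n", code_space);
            printf("minus the surrogate escape hole:  %ld\n", code_space - surrogates);
            printf("minus U+FFFE and U+FFFF as well:  %ld\n", code_space - surrogates - 2);
            return 0;
        }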

    Then I don't know how you want to count the unassigned code points
    that are clearly within the range of certain code point categories.
    You *know* that those values will never be assigned and never have
    meaning, but they are not explicitly marked as illegal.

    --
    Paul Hsieh
    http://www.pobox.com/~qed/
    http://bstring.sf.net/
     
    Paul Hsieh, Nov 19, 2005
    #14
  15. In article <>,
    P.J. Plauger <> wrote:

    >>>And UTF-16 encodes a little under 21 bits

    >>
    >> A little over 20 bits would be more accurate: 2^20 + 2^16.

    >
    >Right. And the minimum number of real bits needed to express
    >20.087463 bits is...?


    What is your point?

    -- Richard
     
    Richard Tobin, Nov 19, 2005
    #15
  16. James Brown

    P.J. Plauger Guest

    Re: [OT] Re: wchar_t

    "Skarmander" <> wrote in message
    news:437f5677$0$11075$4all.nl...

    > I'll mark it OT, since we've left C behind quite a bit by now.


    Not entirely, since this discussion goes to the very heart of
    why we (X3J11) made wchar_t a flexible type, much to the dismay
    of the various jingoists who know what the *right* representation
    should be. (Hint: they don't all agree.)

    >> Here's a coarse scale or two, just from personal experience.
    >>
    >> -- Number of address bits required to address a "large" memory:
    >>
    >> 1960 15 IBM 7090
    >> 1970 20 IBM 360
    >> 1980 25 VAX 11/780
    >> 1990 30 various
    >> 2000 35 various
    >>

    > Nice, but this misses a point: there is an upper limit.


    Okay, what *is* that upper limit? That *was* the point.
    Does anybody dare freeze it now?

    > Address bits will
    > not continue to grow indefinitely, because there is an upper limit to the
    > amount of information that will fit in the universe. Or maybe there isn't,
    > but then we're talking a radical shift in physics, which may happen but
    > doesn't allow for fair comparison anymore.


    Good. Now tell me the practical upper limit that we can use
    to standardize the all-singing, all-dancing physical address
    for now and all future times.

    >> -- Number of bits required to represent a (commonly used)
    >> character set:
    >>
    >> 1960 6 numerous vendor-specific codes
    >> 1970 7 7-bit ASCII
    >> 1980 8 extended ASCII
    >> 1990 16 DBCS and others
    >> 2000 21 Unicode
    >>
    >> I could make a similar table of "barely adequate" communication
    >> speeds, which also continue to expand exponentially.
    >>

    > But again: it can't go on forever. The question here, therefore, is
    > whether we've reached the end of the line, not whether exponential
    > expansion is happening.


    Yes, that's *exactly* the question I raised.

    >> So long as you think in terms of linear increases in demand
    >> for bytes or characters, it's easy to believe at each stage
    >> that you're through expanding. After all, you currently have
    >> a bit of headroom, and what possible need can there be for
    >> much larger programs/character sets?
    >>

    > Don't think this question hasn't been asked,


    I indeed *don't* think that. In fact, I believe I said something
    quite along those lines.

    > unlike those people who
    > asserted that "640K ought to be enough for anybody" (which Bill Gates
    > famously never said) or "16 bits ought to be enough, since it's better
    > than wasting 32 bits". Unicode doesn't say "21 bits ought to be enough for
    > anybody". It can say "21 bits is enough for every character known to man",
    > because it is. Unlike memory, communication speed and a host of other
    > things that keep growing, there is a conceivable upper limit, and it is
    > not that unreasonable to state we're close to it.


    It may not be unreasonable, but I maintain that, on the basis of
    history, it's wildly optimistic. IIRC, SC2/WG2 (the ISO committee
    corresponding to the Unicode Consortium) even saw fit to pass
    a resolution that UTF-16 will forever more be adequate to express
    all expansions of ISO 10646 (the ISO standard corresponding to
    Unicode). I consider that either a) a mark of remarkable self
    confidence, or b) whistling in the dark. Take your pick.

    >> I personally can't imagine that people will ever want to
    >> define common attribute bits for, say:
    >>
    >> -- roman, italic, bold, underscore
    >> -- red, green, blue
    >> -- point size
    >> -- font
    >>
    >> But if we did, each attribute bit would double the number
    >> of effective character codes, wouldn't it?
    >>

    >
    > That's why Unicode doesn't work that way, and no character set ever has.
    > They encode *characters*, not *glyphs*.


    I do understand that. Admittedly, the example of one possible
    cause for exponential expansion was a lightning rod.

    > The point is, effective comparison stops being useful at this point,
    > because you've shifted the way you look at what a code point represents.
    > As the Unicode FAQ itself states:
    >
    > "Both Unicode and ISO 10646 have policies in place that formally limit
    > future code assignment to the integer range that can be expressed with
    > current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other
    > UTFs) can represent larger integers, these policies mean that all
    > encoding forms will always represent the same set of characters. Over a
    > million possible codes is far more than enough for the goal of Unicode of
    > encoding characters, not glyphs. Unicode is not designed to encode
    > arbitrary data. If you wanted, for example, to give each 'instance of a
    > character on paper throughout history' its own code, you might need
    > trillions or quadrillions of such codes; noble as this effort might be,
    > you would not use Unicode for such an encoding."


    So I did recall correctly. The question I raised, however, was whether Unicode can
    resist the inevitable pressures to grow beyond their currently
    self-imposed barrier of 1,114,112 codes. Again IIRC, the Unicode
    Consortium parted company with SC2/WG2 years ago because the former
    body was convinced that 65,536 codes would be enough and the latter
    was intent on leaving room for 2^31. Microsoft and Sun backed that
    play, with Windows and Java (among other products) and now they
    have to wrestle with the inconvenience of UTF-16. BTW, I haven't
    noticed anybody in the Unicode camp blushing at their earlier
    hubris.

    >> Nor can I imagine that a large government like China might
    >> thumb its nose at an international standard and, say,
    >> require a parallel set of many ISO 10646 codes.
    >>

    > It already thumbs its nose to some extent. Unicode is still viewed with
    > great suspicion in some parts of the Eastern world, and alternate
    > character sets continue to be in use. But the Chinese government can
    > require of ISO 10646 what it wants; it's not likely to get it if it can't
    > be supported by technical requirements, as opposed to politics.


    Oh, my, I think you really believe that. When "politics" is backed
    by the odd billion dollars worth of contracts, you'd be surprised
    what it can get.

    >> For over 40 years I've been reading regular articles by
    >> pundits who explain why larger/faster hardware is a waste
    >> of time and will never sell. They've all been wrong. And
    >> the further back in time you look, the greater the redshift
    >> in the predictions.
    >>

    > These arguments do not cleanly translate to character sets, your little
    > tables notwithstanding. The upper limit may not be 21 bits, but if that's
    > not the upper limit, it's pretty close to it in orders of magnitude.


    Okay. My "argument" was that 21 bits will not long prove to be enough.
    Just one order of magnitude will be enough to blow UTF-16 to kingdom
    come. And that was my point.

    >> So, you may well be right that the need for larger
    >> character sets has finally come to an end. I'll wait
    >> and see. Meanwhile, I make sure that the code I write
    >> will work with 32- (not 31-) bit character sets. With
    >> any luck, the code will have adequate capacity until
    >> I retire...
    >>

    > Fortunately for you, writing code that can handle both 21-bit and 32-bit
    > character sets is hardly a challenge, given the current state of computer
    > hardware. Even if Unicode had to grow someday (which would have to mean a
    > new standard, of course), it wouldn't exactly be hard to implement, at
    > least not as far as code point size is concerned.


    Also my point. Having just survived several years of UTF-16
    jingoism, however, I expect to be ungracious if Unicode does
    indeed have to issue a new standard that leaves UTF-16 in the
    same rest home as UCS-2. I also hope to remain intellectually
    honest enough to issue a mea culpa in five years if I prove
    to be wrong.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 19, 2005
    #16
  17. James Brown

    P.J. Plauger Guest

    "Richard Tobin" <> wrote in message
    news:dlocs3$sq6$...

    > In article <>,
    > P.J. Plauger <> wrote:
    >
    >>>>And UTF-16 encodes a little under 21 bits
    >>>
    >>> A little over 20 bits would be more accurate: 2^20 + 2^16.

    >>
    >>Right. And the minimum number of real bits needed to express
    >>20.087463 bits is...?

    >
    > What is your point?


    That saying "a little over 20 bits" rather than 21 bits
    is asinine nit picking.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 19, 2005
    #17
  18. In article <>,
    P.J. Plauger <> wrote:
    >That saying "a little over 20 bits" rather than 21 bits
    >is asinine nit picking.


    It wasn't "21 bits", it was "a little under 21 bits".
    I thought that might reflect a misunderstanding by the poster.

    My posting was intended to be helpful, yours was just rude.

    -- Richard
     
    Richard Tobin, Nov 20, 2005
    #18
  19. Re: [OT] Re: wchar_t

    In article <> "P.J. Plauger" <> writes:
    > "Skarmander" <> wrote in message
    > news:437f5677$0$11075$4all.nl...

    ....
    > >> -- Number of bits required to represent a (commonly used)
    > >> character set:
    > >>
    > >> 1960 6 numerous vendor-specific codes
    > >> 1970 7 7-bit ASCII
    > >> 1980 8 extended ASCII
    > >> 1990 16 DBCS and others
    > >> 2000 21 Unicode
    > >>
    > >> I could make a similar table of "barely adequate" communication
    > >> speeds, which also continue to expand exponentially.


    I may note that 6, 7 and 8 bits were barely adequate, even in those
    years. That is why there was a plethora of standards. With the
    current version of Unicode, under 100,000 positions are filled with
    characters, so it could be done in 17 bits (but actually three planes
    are used for it). The base plane has slightly over 50,000 code points
    used, plane 1 only something less than 2500 (used for archaic scripts)
    and plane 2 under 45,000 (used for archaic Chinese characters). It is
    likely that all current and past scripts will fit into those 21 bits,
    and it is unlikely that new scripts will be invented. So the need
    for more than 21 bits is unlikely to come up.

    > > Unlike memory, communication speed and a host of other
    > > things that keep growing, there is a conceivable upper limit, and it is
    > > not that unreasonable to state we're close to it.

    >
    > It may not be unreasonable, but I maintain that, on the basis of
    > history, it's wildly optimistic. IIRC, SC2/WG2 (the ISO committee
    > corresponding to the Unicode Consortium) even saw fit to pass
    > a resolution that UTF-16 will forever more be adequate to express
    > all expansions of ISO 10646 (the ISO standard corresponding to
    > Unicode). I consider that either a) a mark of remarkable self
    > confidence, or b) whistling in the dark. Take your pick.


    It is possible to estimate the number of symbols used in current and
    archaic scripts. And you can be pretty confident that that number will
    not grow very much in time.

    > >> Nor can I imagine that a large government like China might
    > >> thumb its nose at an international standard and, say,
    > >> require a parallel set of many ISO 10646 codes.
    > >>

    > > It already thumbs its nose to some extent. Unicode is still viewed with
    > > great suspicion in some parts of the Eastern world, and alternate
    > > character sets continue to be in use. But the Chinese government can
    > > require of ISO 10646 what it wants; it's not likely to get it if it can't
    > > be supported by technical requirements, as opposed to politics.

    >
    > Oh, my, I think you really believe that. When "politics" is backed
    > by the odd billion dollars worth of contracts, you'd be surprised
    > what it can get.


    Strangely enough, most Chinese code points are derived from Taiwanese
    standards (actually the vast majority).

    > > These arguments do not cleanly translate to character sets, your little
    > > tables notwithstanding. The upper limit may not be 21 bits, but if that's
    > > not the upper limit, it's pretty close to it in orders of magnitude.

    >
    > Okay. My "argument" was that 21 bits will not long prove to be enough.
    > Just one order of magnitude will be enough to blow UTF-16 to kingdom
    > come. And that was my point.


    Well, as 18 bits would currently encode all defined code points in
    Unicode with ease, we still have a long way to go. In the history of
    Unicode, the large increases were with 3.0 (an increase of 10,307
    code points) and with 3.1 (an increase of 44,978 code points, mostly
    plane 2). The initial set had 29,929 code points (1991). Other
    increases have been on the order of 4,000 to 5,000 (the initial
    years) and 1,000 to 1,500 (since 2001). It is really pretty certain
    that no script will be found that demands an increase as large as the
    increases of 1999 or 2001. With the current growth, Unicode will be
    filled by version 40.0 or something like that, in about 180 years.

    > Also my point. Having just survived several years of UTF-16
    > jingoism, however, I expect to be ungracious if Unicode does
    > indeed have to issue a new standard that leaves UTF-16 in the
    > same rest home as UCS-2. I also hope to remain intellectually
    > honest enough to issue a mea culpa in five years if I prove
    > to be wrong.


    The probability that UTF-16 is not enough in five years is 0. Even
    if the number of assigned code points doubles every five years (which
    is faster growth than in the first 15 years), it will be sufficient
    until 2020.
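
    (As a rough check with the figures above: just under 100,000 code
    points assigned in 2005, doubled three times by 2020, is still under
    800,000, comfortably below the roughly 1.1 million values UTF-16 can
    reach.)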
    --
    dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
    home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
     
    Dik T. Winter, Nov 20, 2005
    #19
  20. In article <> "P.J. Plauger" <> writes:
    ....
    > Does it make sense to have several different ways to express the
    > same "character", some involving multiple codes in arbitrary order?
    > Particularly when there's a one-element version that does the job?
    > Who would do a thing like that in an international standard?


    There is such a thing as round-trip compatibility with other
    standards, meaning that if there is another standard, each code in
    that other standard should translate to a single code point in
    Unicode and should not use multiple codes. For the CJK set these are
    the compatibility regions. It is a bit unlucky that, because of this,
    an a with acute can be encoded both as a single code and as the code
    for the letter a followed by the non-spacing acute. But actually the
    former corresponds to ISO-8859 and the latter to ASCII. We are lucky
    that there is no Indian standard that encoded the different ligatures
    used in Devanagari (or any of the other Indian scripts). Also there
    is no Arabic standard that had different encodings for the letters
    depending on position. Korean grew large in Unicode because, although
    it uses an alphabetic script, there *was* a coding that encoded
    syllables. So I think there are good reasons.
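
    A small sketch of the two spellings in question (a hypothetical
    example, not from the original post; it assumes wchar_t holds ISO
    10646 code points, as a compiler defining __STDC_ISO_10646__
    promises):

        #include <stdio.h>
        #include <wchar.h>

        int main(void)
        {
            wchar_t precomposed[] = { 0x00E1, 0 };         /* a with acute, one code */
            wchar_t decomposed[]  = { 0x0061, 0x0301, 0 }; /* 'a' + combining acute  */

            /* the same text to a reader, yet the arrays compare unequal */
            printf("wcscmp: %d, lengths: %lu and %lu\n",
                   wcscmp(precomposed, decomposed),
                   (unsigned long)wcslen(precomposed),
                   (unsigned long)wcslen(decomposed));
            return 0;
        }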
    --
    dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
    home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
     
    Dik T. Winter, Nov 20, 2005
    #20
