The need of Unicode types in C++0x

James Kanze · Oct 2, 2008

Yannick Tremblay wrote:

[...]

True, but I think Unicode locales could be implemented for characters
only, leaving the rest unchanged (as they are).

For example:

would change only the character set, keeping the rest of the
locale settings as they are, whether they were previously
defined or are the default ones.

That's not quite how locales work. What I think you're talking
about is a UTF16 codecvt facet. And there are ways of
constructing a locale by copying another locale, just replacing a
single facet. Of course, the ctype facet is also affected; part
of the problem in doing this cleanly is that abstractions that
we'd like to keep separate get mixed up. (Note that this can be
a problem even within a pure Unicode environment. Something
like toupper( 'i' ) is locale dependent, and will return a
different character in a Turkish locale.)
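
To make the "copy a locale, replace one facet" idea concrete, here is a minimal sketch. It is not taken from the posts above: the facet `TurkishLikeCtype` and the helper `withTurkishLikeCtype` are hypothetical names invented for illustration, and it replaces the ctype<wchar_t> facet rather than a codecvt facet simply because the effect is easier to observe.

```cpp
#include <cassert>
#include <locale>

// Hypothetical facet (name invented for this sketch): a wide-character
// ctype whose upper-casing of 'i' follows the Turkish rule, i -> U+0130.
// Every other character is handled by the inherited implementation.
class TurkishLikeCtype : public std::ctype<wchar_t> {
protected:
    wchar_t do_toupper(wchar_t c) const override {
        return c == L'i' ? static_cast<wchar_t>(0x0130)
                         : std::ctype<wchar_t>::do_toupper(c);
    }

    // Range overload, overridden too so both entry points agree.
    const wchar_t* do_toupper(wchar_t* low, const wchar_t* high) const override {
        assert(low <= high);
        for (; low != high; ++low) {
            *low = do_toupper(*low);
        }
        return high;
    }
};

// Copies `base` and replaces only its ctype<wchar_t> facet. The new
// locale takes ownership of the facet object allocated here.
inline std::locale withTurkishLikeCtype(const std::locale& base) {
    return std::locale(base, new TurkishLikeCtype);
}
```

With this, std::toupper( L'i', withTurkishLikeCtype( std::locale::classic() ) ) yields U+0130 instead of L'I', while the remaining facets (numpunct, for instance) are the ones copied from the base locale.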

James Kanze · Oct 2, 2008

James Kanze wrote:

How would they "break" their existing code base, by adding
some additional locales and even changing the size of wchar_t?

Adding locales is no problem. Changing the size, or anything
involving the behavior of wchar_t breaks real code. Some of the
code is probably poorly written, but convincing your customers
that they are idiots doesn't sell many compilers.

Ioannis Vranos · Oct 2, 2008

James said:
Adding locales is no problem. Changing the size, or anything
involving the behavior of wchar_t breaks real code. Some of the
code is probably poorly written, but convincing your customers
that they are idiots doesn't sell many compilers.

OK, but if their badly-written code is broken, they will fix it.

Hendrik Schober · Oct 2, 2008

Ioannis said:
OK, but if their badly-written code is broken, they will fix it.

For most of the past ten years I have written code that
had to be compiled using half a dozen compiler/std lib
combinations on so many platforms. We had the very same
code carry UTF-8 strings on some Linux versions, UTF-16
on Windows, and UTF-32 on OSX and some other Unices. We
have learned to deal with all data types being platform-
dependent and our code needing to adapt.
Still, if your vendor does something stupid (like when VC
suddenly started to throw tens of thousands of useless
warnings for a 2 MLoC code base that used to compile clean),
you're doomed.
And this isn't any different when you got yourself into
the trouble. Even if you know that, 15 years ago, some
people did something stupid (people who had long left the
company when you came, when the company was a very different
one, and the code has been bought several times over since),
it doesn't mean that, now that you have several MLoC
relying on a specific size of some built-in type, you can
spend several man-years fixing this and take another two
releases until the dust has settled and all the bugs you
introduced doing so are fixed. While that would be nice
to do, the customers won't pay for it.

C++ has always respected the gazillions of lines of legacy
code real-world projects have. That's probably a reason
for its success.

Schobi

Hendrik Schober · Oct 2, 2008

James said:
James Kanze wrote: [...]

In what encoding format? And what if the "usual" encoding for
wstring isn't Unicode (the case on many Unix platforms).

<curious>
What are those implementations using for 'wchar_t'?
</curious>

EUC. EUC (= Extended Unix Code) is originally a multi-byte
code, but exists as a 32 bit code as well, see
http://docs.sun.com/app/docs/doc/802-1950/6i5us7asn?l=en&a=view.
It's apparently the standard encoding for wchar_t under Solaris
and HP/UX, and perhaps elsewhere as well. Thus, LATIN SMALL
LETTER E WITH ACUTE has the code 0x00E9 in Unicode, but
0x30000069 under Solaris. (``printf( "%04x\n", (unsigned
int)L'é' )'' -- the compiler apparently recognizes my
LC_CTYPE=iso_8859_1 locale for the file input.)
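
A portable program can at least ask whether wchar_t is promised to hold Unicode code points before relying on values like the one above. The following is only a sketch: the helper names `wcharIsIso10646` and `wideCharHex` are made up for illustration, and it assumes the implementation defines the standard macro __STDC_ISO_10646__ when (and only when) it makes that promise.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: true only if the implementation promises that
// wchar_t values are ISO 10646 (Unicode) code points.
inline bool wcharIsIso10646() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: formats the numeric value of a wide character
// as hexadecimal, in the same spirit as the printf in the post.
inline std::string wideCharHex(wchar_t c) {
    char buf[32];
    const int n = std::snprintf(buf, sizeof buf, "0x%04lx",
                                static_cast<unsigned long>(c));
    assert(n > 0 && n < static_cast<int>(sizeof buf));
    (void)n;  // silence the unused warning when NDEBUG is defined
    return buf;
}
```

On an implementation where wcharIsIso10646() returns false, the string produced by wideCharHex( L'é' ) is whatever the native wide encoding dictates, which is exactly why code should not assume 0x00e9 there.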

Thanks!

Schobi

James Kanze · Oct 3, 2008

James Kanze wrote:

OK, but if their badly-written code is broken, they will fix
it.

I don't guess you've ever worked in industry. The authors of
the code will claim that it's the compiler which is broken, and
find one which accepts it.

And of course, some of the code that would break probably isn't
broken. If you have no real portability requirements, and you
have a guarantee that wchar_t contains EUC, what's wrong with
programming against that? And you have that guarantee.

Practically speaking, it's easy to add new features: about the
only thing adding char32_t et al. can break is code which used
those names as identifiers. The standard and the vendor
specifications, on the other hand, are a contract, which you
really can't change without wreaking havoc. And if you're a
vendor, losing sales.
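
As an illustration of why separate types sidestep the wchar_t problem, here is a small sketch written against the C++0x char16_t and char32_t types. The function name `toUtf16` is hypothetical, and the code assumes its argument is a valid Unicode scalar value. Nothing in it depends on the size or encoding of wchar_t, so it means the same thing on every platform mentioned in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16.
// Precondition (an assumption of this sketch): cp is a valid scalar
// value, i.e. at most 0x10FFFF and not in the surrogate range.
inline std::u16string toUtf16(char32_t cp) {
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

For example, toUtf16( 0xE9 ) gives the single code unit 0x00E9, and toUtf16( 0x1F600 ) gives the surrogate pair 0xD83D 0xDE00. Existing code that never mentions char16_t or char32_t is unaffected.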

James Kanze · Oct 3, 2008

[...]

C++ has always respected the gazillions of lines of legacy
code real-world projects have. That's probably a reason
for its success.

Were it only so. One of the reasons why there was so much
interest in Java was because it was so difficult to write
portable C++, and because the language was felt to be changing
under you. We've had to rework quite a bit of code, including
reorganizing some, because of two phase look-up, and the
differences between the classical iostream and the standard one
have caused more than a few problems as well.
