UTF-16 & wchar_t: the 2nd worst thing about C++

  • Thread starter Steven T. Hatton

Steven T. Hatton

This is one of the first obstacles I encountered when getting started with
C++. I found that everybody had their own idea of what a string is. There
was std::string, QString, xercesc::XMLString, etc. There are also char,
wchar_t, QChar, XMLCh, etc., for character representation. Coming from
Java where a String is a String is a String, that was quite a shock.

Well, I'm back to looking at this, and it still isn't pretty. I've found
what appears to be a way to go between QString and XMLCh. XMLCh is
reported to be UTF-16. QString is documented to be the same. QString
provides very convenient functions for 'converting'[*] between NTBS const
char*, std::string and QString[**]. So, using QString as an intermediary,
I can construct a std::string from a const XMLCh* NTBS, and a const XMLCh*
NTBS from a std::string.

My question is whether I can do this without the QString intermediary. That
is, given a UTF-16 NTBS, can I construct a std::string representing the
same characters? And given a std::string, can I convert it to a UTF-16
NTBS? I have been told that some wchar_t implementations are UTF-16, and
some are not.
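
Since std::string is just a sequence of char, it can certainly hold UTF-8
bytes; the real question is who does the transcoding. For concreteness,
here is a minimal sketch of the UTF-16 to UTF-8 direction, assuming the
input is well-formed UTF-16 held in a 16-bit unsigned code-unit type (as
XMLCh usually is); the names are mine, and unpaired surrogates are simply
skipped rather than reported. The reverse direction is analogous.

#include <string>

typedef unsigned short Char16;   // assumption: a 16-bit unsigned type, like XMLCh usually is

// Minimal sketch: convert a null-terminated, well-formed UTF-16 sequence
// to a UTF-8 encoded std::string.  Unpaired surrogates are skipped.
std::string utf16_to_utf8 (const Char16* s)
{
    std::string out;
    while (*s)
    {
        unsigned long cp = *s++;

        if (cp >= 0xD800 && cp <= 0xDBFF && *s >= 0xDC00 && *s <= 0xDFFF)
            cp = 0x10000 + ((cp - 0xD800) << 10) + (*s++ - 0xDC00);  // surrogate pair
        else if (cp >= 0xD800 && cp <= 0xDFFF)
            continue;                                                // unpaired surrogate

        if (cp < 0x80)                       // 1 byte: ASCII
            out += char (cp);
        else if (cp < 0x800)                 // 2 bytes
        {
            out += char (0xC0 | (cp >> 6));
            out += char (0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)               // 3 bytes
        {
            out += char (0xE0 | (cp >> 12));
            out += char (0x80 | ((cp >> 6) & 0x3F));
            out += char (0x80 | (cp & 0x3F));
        }
        else                                 // 4 bytes
        {
            out += char (0xF0 | (cp >> 18));
            out += char (0x80 | ((cp >> 12) & 0x3F));
            out += char (0x80 | ((cp >> 6) & 0x3F));
            out += char (0x80 | (cp & 0x3F));
        }
    }
    return out;
}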

My reading of the ISO/IEC 14882:2003 is that implementations must support
the UTF-16 character set[***], but are not required to use UTF-16 encoding.
The proper way to express a member of the UTF-16 character set is to use
the form \UNNNNNNNN[****], where NNNNNNNN is the hexadecimal value of the
character, or \uNNNN, where NNNN is the character short name of a
universal-character-name whose value is \U0000NNNN, unless the character is
a member of the basic character set, or the hexadecimal value of the
character is less than 0x20, or the hexadecimal value of the character is
in the range 0x7f-0x9f (inclusive). Members of the UTF-16 character set
which are also members of the basic character set are to be expressed using
their literal symbol in an L-prefixed character literal, or an L-prefixed
string literal.

This tells me that the UTF-16 defined by the Xerces XMLCh does not conform
to the definition of the extended character set of a C++ implementation.
http://xml.apache.org/xerces-c/apiDocs/XMLUniDefs_8hpp-source.html

Is my understanding of this situation correct?

UTF-16 seems to be a good candidate for a lingua franca of runtime character
encodings. UTF-8 makes more sense in the context of data transmission and
storage. As far as I can tell, implementations are not required by the C++
Standard to provide a UTF-16 string class. Is that correct?

[*] Here 'converting' means either type conversion or constructing a new
variable to hold a different encoding of the same character sequence.

[**] Leaving aside unanswered questions, such as whether knowing that both
are encoded as UTF-16 is sufficient to assume the representations are identical.

[***] Here UTF-16 is used as a synonym for UCS-2, described in ISO/IEC 10646
"Universal Multiple-Octet Coded Character Set", though there may be subtle
differences.
[****] Note that the long form is spelled with an uppercase 'U'
(\UNNNNNNNN) and the short form with a lowercase 'u' (\uNNNN).

So what is the worst thing about C++? '#'
 

Steven T. Hatton

Steven said:
UTF-16 seems to be a good candidate for a lingua franca of runtime character
encodings. UTF-8 makes more sense in the context of data transmission and
storage. As far as I can tell, implementations are not required by the C++
Standard to provide a UTF-16 string class. Is that correct?

To answer my own question, that appears to be partially correct. The
implementation must provide support for the subset of UTF-16 required by
all of its supported locales.

I asked for some clarification as to why Xerces-C uses a different data type
than wchar_t to hold UTF-16. One person responded by arguing that the
standard does not specify that wchar_t is 16 bits. He pointed out that GCC
uses a 32 bit data type for wchar_t. In practice, does that matter? I
mean, in the real world, will any system use a different amount of physical
memory for a 32 bit data type than for a 16 bit type?
 

Greg

Steven said:
To answer my own question, that appears to be partially correct. The
implementation must provide support for the subset of UTF-16 required by
all of its supported locales.

I asked for some clarification as to why Xerces-C uses a different data type
than wchar_t to hold UTF-16. One person responded by arguing that the
standard does not specify that wchar_t is 16 bits. He pointed out that GCC
uses a 32 bit data type for wchar_t. In practice, does that matter? I
mean, in the real world, will any system use a different amount of physical
memory for a 32 bit data type than for a 16 bit type?

Yes. Clearly every "real world" system will use twice as much memory
storing the 32-bit character strings as it would storing equivalent
strings with 16-bit characters. Just as one would expect that
operations on the 4-byte character strings will require twice the
number of cycles as equivalent operations on the 2-byte strings. What
would be the basis for thinking otherwise?

So essentially you end up with strings that are twice the size and that
are twice as slow as 16-bit strings containing identical content.

Greg
 

Steven T. Hatton

Greg said:
Yes. Clearly every "real world" system will use twice as much memory
storing the 32-bit character strings as it would storing equivalent
strings with 16-bit characters. Just as one would expect that
operations on the 4-byte character strings will require twice the
number of cycles as equivalent operations on the 2-byte strings. What
would be the basis for thinking otherwise?

On a 32 bit system the unit of data processed by each instruction is 32
bits. That means that storing two 16 bit values in one 32 bit word would
require some kind of packing and unpacking. Perhaps I am wrong, but my
understanding is that such processor overhead is typically not expended
when dealing with in-memory data.
So essentially you end up with strings that are twice the size and that
are twice as slow as 16-bit strings containing identical content.

Do you have any benchmark examples to demonstrate this?
 

Jakob Bieling

Steven T. Hatton said:
This is one of the first obstacles I encountered when getting started
with C++. I found that everybody had their own idea of what a string
is. There was std::string, QString, xercesc::XMLString, etc. There

A string in C++ is an std::string and nothing else. All the QString
and XMLString stuff you found are just reinventions of the wheel. They
might (and should, imho) have used the Standard C++ std::string type.
are also char, wchar_t, QChar, XMLCh, etc., for character

Same here. According to the Standard a char is "large enough to
store any member of the implementation's basic character set" (3.9.1/1)
and a wchar_t "is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales" (3.9.1/5).
representation. Coming from Java where a String is a String is a
String, that was quite a shock.

You can't keep people from reinventing the wheel .. :|

I am not going into the problems you described with those
reinventions, as I am not familiar with those. But converting strings in
C++ to wide-strings and back could be done this way:

#include <string>

// Widen/narrow by copying each code unit; the iterator-range constructor
// converts char <-> wchar_t element by element.
inline std::wstring widen (std::string const& s)
{
    return std::wstring (s.begin (), s.end ());
}

inline std::string narrow (std::wstring const& s)
{
    return std::string (s.begin (), s.end ());
}

int main (int argc, char* argv [])
{
    std::string s = "hello world";
    std::wstring s1 = widen (s);
    std::string s2 = narrow (s1);
}

Now I am sure you can create similar functions to convert from and
to those other string types.
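
One caveat, though: the iterator-range constructors above convert each code
unit with a plain char <-> wchar_t conversion, so widen/narrow only
round-trip safely for characters of the basic (essentially ASCII) set. For
anything else a locale-aware conversion is needed. A rough sketch using the
C library functions (the _mb names are mine, and error handling is minimal):

#include <clocale>   // setlocale
#include <cstdlib>   // mbstowcs, wcstombs, MB_CUR_MAX
#include <string>
#include <vector>

// Convert using the current C locale's multibyte encoding (e.g. UTF-8 in a
// UTF-8 locale).  Returns an empty string on an invalid input sequence.
inline std::wstring widen_mb (std::string const& s)
{
    std::vector<wchar_t> buf (s.size () + 1);
    std::size_t n = std::mbstowcs (&buf[0], s.c_str (), buf.size ());
    return n == std::size_t (-1) ? std::wstring () : std::wstring (&buf[0], n);
}

inline std::string narrow_mb (std::wstring const& s)
{
    std::vector<char> buf (s.size () * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs (&buf[0], s.c_str (), buf.size ());
    return n == std::size_t (-1) ? std::string () : std::string (&buf[0], n);
}

int main ()
{
    std::setlocale (LC_ALL, "");            // pick up the environment's locale
    std::wstring w = widen_mb ("hello world");
    std::string  n = narrow_mb (w);
}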

If you need to use all three of those string types, my advice is to
make the Standard string types your internal string types (those you
work with) and not litter your code with non-standard string classes.
Then convert your std::string/std::wstring objects to whatever string
class is required.

regards
 

peter koch

Greg said:
Yes. Clearly every "real world" system will use twice as much memory
storing the 32-bit character strings as it would storing equivalent
strings with 16-bit characters. Just as one would expect that
operations on the 4-byte character strings will require twice the
number of cycles as equivalent operations on the 2-byte strings. What
would be the basis for thinking otherwise?

I cannot see how using a variable-length character set can be faster
than using a fixed-size one. In my opinion, UTF-16 gives you nothing
compared to UTF-8 unless you totally ignore the fact that the data might
be encoded. But if you do that, how are you going to react to input
from the surrounding world?

/Peter
 

Pete Becker

Steven said:
On a 32 bit system the unit of data processed by each instruction is 32
bits. That means that storing two 16 bit values in one 32 bit word would
require some kind of packing and unpacking. Perhaps I am wrong, but my
understanding is that such processor overhead is typically not expended
when dealing with in-memory data.

On most systems, an 8-bit byte is the fundamental addressable storage
unit. Simple char arrays use one byte per char. When wchar_t is 16 bits
it occupies two bytes. When it's 32 bits it occupies four bytes. Try it:

#include <cstdio>

int main()
{
    char cvalues[2];
    std::printf("%p %p\n", (void*)&cvalues[0], (void*)&cvalues[1]); // consecutive addresses, 1 byte apart
    short svalues[2]; // assuming 16-bit short
    std::printf("%p %p\n", (void*)&svalues[0], (void*)&svalues[1]); // consecutive addresses, 2 bytes apart
    return 0;
}
 

Steven T. Hatton

Jakob said:
A string in C++ is an std::string and nothing else. All the QString
and XMLString stuff you found are just reinventions of the wheel. They
might (and should, imho) have used the Standard C++ std::string type.


Same here. According to the Standard a char is "large enough to
store any member of the implementation's basic character set" (3.9.1/1)
and a wchar_t "is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales" (3.9.1/5).


You can't keep people from reinventing the wheel .. :|

I am not going into the problems you described with those
reinventions, as I am not familiar with those.
<quote url="http://xml.apache.org/xerces-c/build-misc.html#XMLChInfo">
What should I define XMLCh to be?

XMLCh should be defined to be a type suitable for holding a utf-16 encoded
(16 bit) value, usually an unsigned short.

All XML data is handled within Xerces-C++ as strings of XMLCh characters.
Regardless of the size of the type chosen, the data stored in variables of
type XMLCh will always be utf-16 encoded values.

Unlike XMLCh, the encoding of wchar_t is platform dependent. Sometimes it is
utf-16 (AIX, Windows), sometimes ucs-4 (Solaris, Linux), sometimes it is
not based on Unicode at all (HP/UX, AS/400, system 390).

Some earlier releases of Xerces-C++ defined XMLCh to be the same type as
wchar_t on most platforms, with the goal of making it possible to pass
XMLCh strings to library or system functions that were expecting wchar_t
parameters. This approach has been abandoned because of

* Portability problems with any code that assumes that the types of XMLCh
and wchar_t are compatible

* Excessive memory usage, especially in the DOM, on platforms with 32 bit
wchar_t.

* utf-16 encoded XMLCh is not always compatible with ucs-4 encoded wchar_t
on Solaris and Linux. The problem occurs with Unicode characters with
values greater than 64k; in ucs-4 the value is stored as a single 32 bit
quantity. With utf-16, the value will be stored as a "surrogate pair" of
two 16 bit values. Even with XMLCh equated to wchar_t, xerces will still
create the utf-16 encoded surrogate pairs, which are illegal in ucs-4
encoded wchar_t strings.
</quote>

inline std::string narrow (std::wstring const& s)
{
    return std::string (s.begin (), s.end ());
}

Can I rely on that to convert all UTF-32 to UTF-8?
Now I am sure you can create similar functions to convert from and
to those other string types.

As I've already indicated, QString provides conversion functions. Xerces
also provides a transcode function, but it is not as easy to use.
Moreover, transcoding is expensive.
If you need to use all three of those string types, my advice is to
make the Standard string types your internal string types (those you
work with) and not litter your code with non-standard string classes.
Then convert your std::string/std::wstring objects to whatever string
class is required.

With Xerces, that really is not an option.
 

Steven T. Hatton

peter said:
I cannot see how using a variable-length character set can be faster
than using a fixed-size one. In my opinion, UTF-16 gives you nothing
compared to UTF-8 unless you totally ignore the fact that the data might
be encoded. But if you do that, how are you going to react to input
from the surrounding world?

I don't understand what you mean here. If my understanding is correct, UTF-8
uses different numbers of bytes for different characters. UTF-16 does that
far less. UTF-32 doesn't do it at all.
 

Jakob Bieling

Steven T. Hatton said:
Jakob Bieling wrote:

Can I rely on that to convert all UTF-32 to UTF-8?

No, the output will only be a string of single-byte characters
(non-UTF). Thus you will lose information.
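
A short illustration of the loss, using the euro sign; the conversion
from wchar_t to char simply cannot represent the value (the exact result
of the truncation is implementation-dependent):

#include <iostream>
#include <string>

int main ()
{
    std::wstring w (1, wchar_t (0x20AC));   // U+20AC, the euro sign
    std::string  n (w.begin (), w.end ());  // what narrow () does internally
    // 0x20AC does not fit in a char; on a typical implementation only the
    // low byte survives, so the original character cannot be recovered.
    std::cout << std::hex << int (static_cast<unsigned char> (n[0])) << '\n';  // prints "ac"
}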

You should disregard my comment about using
std::wstring/std::string. I was not aware of the complexity of UTF until
a few minutes ago and should not have answered at such length with my
half-knowledge of it.

regards
 

Pete Becker

Steven said:
I don't understand what you mean here. If my understanding is correct, UTF-8
uses different numbers of bytes for different characters. UTF-16 does that
far less. UTF-32 doesn't do it at all.

Yes, but "far less" is not zero, and code that deals with UTF-8 or
UTF-16 has to be aware of the possibility of multi-character encodings.
Code for UTF-32 does not, so it's far simpler. For example, if you're
moving around in an array of characters, moving N characters in UTF-32
is just a pointer adjustment. Moving N characters in UTF-8 or UTF-16
requires examining every character along the way.
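
To make that concrete, here is roughly what "advance N characters" looks
like in each representation. The typedefs are assumptions on my part,
since C++ has no dedicated Unicode code-unit types, and the input is
assumed to be well-formed.

#include <cstddef>

typedef unsigned char  Utf8Unit;    // assumption: 8-bit code units
typedef unsigned short Utf16Unit;   // assumption: 16-bit code units
typedef unsigned long  Utf32Unit;   // assumption: at least 32-bit code units

// UTF-32: advancing N characters is plain pointer arithmetic.
inline const Utf32Unit* advance_utf32 (const Utf32Unit* p, std::size_t n)
{
    return p + n;
}

// UTF-16: every code unit must be checked for a leading surrogate.
inline const Utf16Unit* advance_utf16 (const Utf16Unit* p, std::size_t n)
{
    while (n--)
        p += (*p >= 0xD800 && *p <= 0xDBFF) ? 2 : 1;
    return p;
}

// UTF-8: every lead byte must be examined to find the sequence length.
inline const Utf8Unit* advance_utf8 (const Utf8Unit* p, std::size_t n)
{
    while (n--)
    {
        if      (*p < 0x80) p += 1;   // ASCII
        else if (*p < 0xE0) p += 2;   // 2-byte sequence
        else if (*p < 0xF0) p += 3;   // 3-byte sequence
        else                p += 4;   // 4-byte sequence
    }
    return p;
}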
 

Steven T. Hatton

Pete said:
On most systems, an 8-bit byte is the fundamental addressable storage
unit. Simple char arrays use one byte per char. When wchar_t is 16 bits
it occupies two bytes. When it's 32 bits it occupies four bytes. Try it:

#include <cstdio>

int main()
{
    char cvalues[2];
    std::printf("%p %p\n", (void*)&cvalues[0], (void*)&cvalues[1]); // consecutive addresses, 1 byte apart
    short svalues[2]; // assuming 16-bit short
    std::printf("%p %p\n", (void*)&svalues[0], (void*)&svalues[1]); // consecutive addresses, 2 bytes apart
    return 0;
}

But that doesn't tell me what's going on in terms of physical storage. An
octet may be the smallest addressable unit of storage, but that doesn't
mean it is the smallest retrievable unit of storage. Data is not moved
around in 8-bit chunks. The smallest chunk of data that gets moved between
registers on a 32-bit processor is a 32 bit word.

I've been trying to contrive some kind of a test, but my implementation
seems to be trying to outsmart me by reusing some of the data I'm feeding
it.
 

Steven T. Hatton

Pete said:
Yes, but "far less" is not zero, and code that deals with UTF-8 or
UTF-16 has to be aware of the possibility of multi-character encodings.
Code for UTF-32 does not, so it's far simpler. For example, if you're
moving around in an array of characters, moving N characters in UTF-32
is just a pointer adjustment. Moving N characters in UTF-8 or UTF-16
requires examining every character along the way.
Well, if I happen to know the particular subset of UTF-16 I'm dealing with
will not have any second plane (IIRC) characters, then I can ignore the
fact that some UTF-16 is, indeed, multi-unit. At one point I was under the
impression that this is what UCS-2 was about, but I don't believe that is
correct.
 

peter koch

Steven said:
Jakob Bieling wrote:
[snip]

XMLCh should be defined to be a type suitable for holding a utf-16 encoded
(16 bit) value, usually an unsigned short.

All XML data is handled within Xerces-C++ as strings of XMLCh characters.
Regardless of the size of the type chosen, the data stored in variables of
type XMLCh will always be utf-16 encoded values.

Unlike XMLCh, the encoding of wchar_t is platform dependent. Sometimes it is
utf-16 (AIX, Windows), sometimes ucs-4 (Solaris, Linux), sometimes it is
not based on Unicode at all (HP/UX, AS/400, system 390).

Some earlier releases of Xerces-C++ defined XMLCh to be the same type as
wchar_t on most platforms, with the goal of making it possible to pass
XMLCh strings to library or system functions that were expecting wchar_t
parameters. This approach has been abandoned because of

* Portability problems with any code that assumes that the types of XMLCh
and wchar_t are compatible

* Excessive memory usage, especially in the DOM, on platforms with 32 bit
wchar_t.

* utf-16 encoded XMLCh is not always compatible with ucs-4 encoded wchar_t
on Solaris and Linux. The problem occurs with Unicode characters with
values greater than 64k; in ucs-4 the value is stored as a single 32 bit
quantity. With utf-16, the value will be stored as a "surrogate pair" of
two 16 bit values. Even with XMLCh equated to wchar_t, xerces will still
create the utf-16 encoded surrogate pairs, which are illegal in ucs-4
encoded wchar_t strings.
</quote>
inline std::string narrow (std::wstring const& s)
{
    return std::string (s.begin (), s.end ());
}

Can I rely on that to convert all UTF-32 to UTF-8?

You cannot, and I do not believe the code above is correct (no
error-detection).
Still it is (in my opinion, I am not sure this is required by the
standard) a bad idea to store UTF-8 or UTF-16 data in a
std::basic_string.
I would expect for a standard string s that s[n] gives the character at
position n and that s.size() gives me the number of characters in that
string. For encoded strings this is simply wrong.
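
Concretely, with a short UTF-8 example (the two bytes 0xC3 0xA9 are the
standard UTF-8 encoding of U+00E9, 'é'):

#include <iostream>
#include <string>

int main ()
{
    // "é!" stored as UTF-8: U+00E9 encodes as the two bytes 0xC3 0xA9.
    std::string s ("\xC3\xA9!");

    std::cout << s.size () << '\n';   // prints 3 (code units), not 2 (characters)
    std::cout << std::hex << int (static_cast<unsigned char> (s[1])) << '\n';
                                      // prints "a9": the middle byte of 'é', not a character
}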

/Peter
As I've already indicated, QString provides conversion functions. Xerces
also provides a transcode function, but it is not as easy to use.
Moreover, transcoding is expensive.

Is it that bad? I would expect most conversions from one character set
to another to be relatively fast - most likely bounded by the memory
bandwidth available.
With Xerces, that really is not an option.

If the Xerces interface is inadequate in that direction, you should
provide a wrapper to Xerces - converting (most likely) your UCS-4
characters to the internal Xerces format on the way in to Xerces, and
converting the other way when reading Xerces data.

/Peter
 

Phlip

Pete said:
Yes, but "far less" is not zero, and code that deals with UTF-8 or
UTF-16 has to be aware of the possibility of multi-character encodings.
Code for UTF-32 does not

I thought all UTFs had multi-character rules. I'm aware UTF-32 will only
need them after we add >4 billion characters to Unicode. Maybe after
meeting a couple thousand alien species, each with a diverse culture...

(Furthermore, the task of moving N glyphs raises its ugly head, because some
are composites...)

Steven said:
So what is the worst thing about C++? '#'

Anyone who learns only part of a language, and its styles and idioms, will
have the potential to abuse some feature. You could say the same thing
about 'if' statements. They have a great potential for abuse. Yet you don't
often read posts here bragging "I know better than to abuse 'if'
statements!!"

Except me. ;-)
 

Pete Becker

Steven said:
But that doesn't tell me what's going on in terms of physical storage.

Sure it does. It tells you the addresses where those things are stored.

An octet may be the smallest addressable unit of storage, but that doesn't
mean it is the smallest retrievable unit of storage. Data is not moved
around in 8-bit chunks.

On many systems it is. But, granted, when you're talking "normal"
desktop systems, you're generally dealing with 32-bit bus widths.
The smallest chunk of data that gets moved between
registers on a 32-bit processor is a 32 bit word.

Even if it's true, it doesn't matter. Register to register moves are
fast. It sounds like you're trying to micro-optimize for cache behavior.
I much prefer to leave that to the compiler writers. They know a great
deal more about it than I do.
 

Pete Becker

peter said:
Still it is (in my opinion, I am not sure this is required by the
standard) a bad idea to store UTF-8 or UTF-16 data in a
std::basic_string.
I would expect for a standard string s that s[n] gives the character at
position n and that s.size() gives me the number of characters in that
string. For encoded strings this is simply wrong.

I agree: basic_string has no knowledge of variable-length encodings. It
won't give the behavior you want when your text is encoded in UTF-8,
UTF-16, shift-JIS, or any other variable-length encoding.

The answer for shift-JIS is to translate to wide characters and use
basic_string<wchar_t>. For UTF-8 or UTF-16, use a 32-bit character type.
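
On Linux or Solaris, where wchar_t is 32 bits, std::wstring already is
such a type. A quick sanity check, a sketch that assumes a 32-bit-wchar_t
platform:

#include <cassert>
#include <string>

int main ()
{
    // Assumes a platform where wchar_t is 32 bits (Linux, Solaris), so that
    // std::wstring is effectively a UCS-4/UTF-32 string: one element per
    // character, no surrogate pairs.  On Windows/AIX this does not hold.
    std::wstring s;
    s += wchar_t (0x1D11E);           // U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP)
    s += wchar_t (0x0041);            // 'A'

    assert (s.size () == 2);          // two characters, two elements
    assert (s[0] == wchar_t (0x1D11E));
}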
 

Steven T. Hatton

Pete said:
Sure it does. It tells you the addresses where those things are stored.

Indeed. I did not look closely enough at the example. Now that I think
about it, an array /will/ store data contiguously. I'm not sure what
happens to individual integer values, or characters.

Even if it's true, it doesn't matter. Register to register moves are
fast. It sounds like you're trying to micro-optimize for cache behavior.
I much prefer to leave that to the compiler writers. They know a great
deal more about it than I do.

I'm trying to understand the rationale for Xerces using their own XMLCh
rather than wchar_t. Their argument is that it requires much less storage
than the 4-byte wchar_t used on Unix/Linux implementations. It would
appear that they are correct. But that puts us back to the question of
whether they need to examine every character while bumping pointers.
 

Pete Becker

Steven said:
I'm trying to understand the rationale for Xerces using their own XMLCh
rather than wchar_t.
Okay.

Their argument is that it requires much less storage
than the 4-byte wchar_t used on Unix/Linux implementations. It would
appear that they are correct. But that puts us back to the question of
whether they need to examine every character while bumping pointers.

Yup. That's the tradeoff. They may have decided to ignore that
possibility. That's what Java did, because at the time, all of Unicode
fit in 16 bits. When Unicode grew, they had to hack in support for
variable-width characters.
 

Steven T. Hatton

Pete said:
Steven T. Hatton wrote:

Yup. That's the tradeoff. They may have decided to ignore that
possibility. That's what Java did, because at the time, all of Unicode
fit in 16 bits. When Unicode grew, they had to hack in support for
variable-width characters.

I'm pretty sure they bit the bullet and went all the way. That's probably
why transcoding to and from XMLCh is so expensive. Once it's in their
internal form (UTF-16), I suspect there really aren't that many instances
where they need to worry about bumping pointers per character. They surely
don't need it to determine two sequences are equal. If they happen to find
they are not, then they resort to the more expensive operations at the
point of divergence.
 
