Steven T. Hatton
This is one of the first obstacles I encountered when getting started with
C++. I found that everybody had their own idea of what a string is. There
were std::string, QString, xercesc::XMLString, etc. There are also char,
wchar_t, QChar, XMLCh, etc., for character representation. Coming from
Java where a String is a String is a String, that was quite a shock.
Well, I'm back to looking at this, and it still isn't pretty. I've found
what appears to be a way to go between QString and XMLCh. XMLCh is
reported to be UTF-16. QString is documented to be the same. QString
provides very convenient functions for 'converting'[*] between NTBS const
char*, std::string and QString[**]. So, using QString as an intermediary,
I can construct a std::string from a const XMLCh* NTBS, and a const XMLCh*
NTBS from a std::string.
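For concreteness, here is the round trip I mean, as a sketch. It assumes
Qt 4 (for QString::fromUtf16) and that XMLCh and ushort share the same
16-bit representation; the helper names are my own:

    #include <QString>
    #include <string>
    #include <xercesc/util/XercesDefs.hpp>  // XMLCh

    // Hypothetical helpers for the QString round trip. Assumes XMLCh
    // and ushort are layout-compatible 16-bit types.
    std::string xmlChToStd(const XMLCh* src)
    {
        // fromUtf16 copies the NTBS up to the first U+0000. Note that
        // Qt 4's toStdString goes through the 8-bit (Latin-1 by
        // default) conversion, so characters outside that range will
        // not survive the trip.
        QString q = QString::fromUtf16(reinterpret_cast<const ushort*>(src));
        return q.toStdString();
    }

    const XMLCh* stdToXmlCh(const std::string& src, QString& storage)
    {
        // The returned pointer is an NTBS owned by 'storage'; it stays
        // valid only while that QString is alive and unmodified.
        storage = QString::fromStdString(src);
        return reinterpret_cast<const XMLCh*>(storage.utf16());
    }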
My question is whether I can do this without the QString intermediary. That
is, given a UTF-16 NTBS, can I construct a std::string representing the
same characters? And given a std::string, can I convert it to a UTF-16
NTBS? I have been told that some wchar_t implementations are UTF-16, and
some are not.
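One route I know of that avoids QString entirely is Xerces' own
XMLString::transcode, though it converts to and from the local code page
rather than guaranteeing UTF-8, so it only answers part of the question.
A sketch:

    #include <string>
    #include <xercesc/util/XMLString.hpp>

    // Sketch using Xerces' built-in transcoder; no Qt involved.
    // transcode targets the local code page, which need not be UTF-8.
    std::string toStdString(const XMLCh* src)
    {
        char* raw = xercesc::XMLString::transcode(src);  // heap-allocated copy
        std::string result(raw ? raw : "");
        xercesc::XMLString::release(&raw);               // free Xerces' buffer
        return result;
    }

    XMLCh* toXMLCh(const std::string& src)
    {
        // Caller must eventually pass the result to XMLString::release.
        return xercesc::XMLString::transcode(src.c_str());
    }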
My reading of ISO/IEC 14882:2003 is that implementations must support
the UTF-16 character set[***], but are not required to use UTF-16 encoding.
The proper way to express a member of the UTF-16 character set is the
universal-character-name form \UNNNNNNNN[****], where NNNNNNNN is the
character's hexadecimal value, or the short form \uNNNN, where NNNN is the
character short name, equivalent to a universal-character-name whose value
is \U0000NNNN. Such a name may not be used if the character is a member of
the basic character set, if the hexadecimal value of the character is less
than 0x20, or if the hexadecimal value of the character is in the range
0x7f-0x9f (inclusive). Members of the UTF-16 character set which are also
members of the basic character set are to be expressed using their literal
symbol in an L-prefixed character literal or an L-prefixed string literal.
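A concrete illustration of those rules (my own example, not taken from
the standard):

    // 'c', 'a', 'f' are in the basic character set, so they appear
    // literally; U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is not, so
    // it must be written as a universal-character-name.
    const wchar_t* cafe = L"caf\u00E9";

    // Ill-formed under the rules above: \u0041 designates 'A', which
    // is a member of the basic character set.
    // const wchar_t* bad = L"\u0041";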
This tells me that the UTF-16 used by Xerces' XMLCh does not conform to
the definition of the extended character set of a C++ implementation.
http://xml.apache.org/xerces-c/apiDocs/XMLUniDefs_8hpp-source.html
Is my understanding of this situation correct?
UTF-16 seems to be a good candidate for a lingua franca of runtime character
encodings. UTF-8 makes more sense in the context of data transmission and
storage. As far as I can tell, implementations are not required by the C++
Standard to provide a UTF-16 string class. Is that correct?
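The nearest thing I can see is instantiating basic_string over a 16-bit
code-unit type. A sketch, with the caveat that the standard only mandates
char_traits specializations for char and wchar_t, so this relies on an
implementation-supplied generic char_traits:

    #include <string>
    #include <xercesc/util/XercesDefs.hpp>  // XMLCh

    // Not a standard-mandated facility. This gives UTF-16 storage and
    // the usual basic_string operations, but no encoding awareness:
    // indexing and length are in 16-bit code units, not characters.
    typedef std::basic_string<XMLChar> XMLChString;

    // Usage: XMLChString s(someXmlChNtbs);  // copies up to the 0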
[*] Here 'converting' means either type conversion or constructing a new
variable to hold a different encoding of the same character sequence.
[**] Leaving aside unanswered questions such as whether knowing that both
are encoded as UTF-16 is sufficient to assume the representations are
identical.
[***] Here UTF-16 is used as a synonym for UCS-2 described in ISO/IEC 10646
"Universal Multiple-Octet Coded Character Set", though there may be subtle
differences.
[****] The case of the 'u' is significant: the grammar requires a lowercase
'u' in the short form \uNNNN and an uppercase 'U' in \UNNNNNNNN.
So what is the worst thing about C++? '#'