Character set


Amit Kumar

Stroustrup says: "A variable of type 'char' can hold a character of
the implementation's character set."

I have numerous doubts related to character sets.

What do we mean by 'implementation's character set'? Do we mean
'Operating System' by implementation?
If yes, then is the notion of char independent of compiler (i.e.
dependent only on OS)?

Does Win XP japanese have a different character set than Win XP
english? If yes then what are those character sets? How can I find the
name of these character sets?

What is platform encoding? Will the platform encoding change if I
change the locale of the os?

I know these questions might be very easy for most of you, but I am
seriously confused. Can you please suggest some good books or
online documents?
 

Alf P. Steinbach

* Amit Kumar:
Stroustrup says: "A variable of type 'char' can hold a character of
the implementation's character set."

I have numerous doubts related to character sets.

What do we mean by 'implementation's character set'? Do we mean
'Operating System' by implementation?
If yes, then is the notion of char independent of compiler (i.e.
dependent only on OS)?

Does Win XP japanese have a different character set than Win XP
english? If yes then what are those character sets? How can I find the
name of these character sets?

What is platform encoding? Will the platform encoding change if I
change the locale of the os?

I know these questions might be very easy for most of you, but I am
seriously confused. Can you please suggest some good books or
online documents?

The C++0x draft standard is available in PDF format from the committee pages.

The various character sets are discussed at the very start.


Cheers & hth.,

- Alf
 

Ron

Stroustrup says: "A variable of type 'char' can hold a character of
the implementation's character set."

I have numerous doubts related to character sets.

Your doubts are well founded. The standard says the implementation's
basic character set, which will depend on the compiler and the
operating system.

C and C++ are schizoid in their idea of what the meaning of "char" is,
unfortunately. It is both the smallest addressable unit of storage as
well as the container for a character from the basic set. This means
that practically (and this applies to *ALL* versions of any Windows
compiler I've seen) the basic char set is still 8 bits no matter how
much you'd like it to be larger. Even US Windows is natively a 16-bit
UNICODE machine. However, just as with any implementation that deals
with native wider characters, they kludge the C++ basic character set
to 8 bits and then have to deal with the omissions in a half a dozen
places where C++ assumes everything representable as a string can be
converted to a null-terminated string of char (presumably by some
multibyte encoding).
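
For the record, a minimal sketch showing both roles of char on a given
compiler; the values in the comments assume a typical desktop
implementation:

#include <climits>   // CHAR_BIT
#include <iostream>

int main()
{
    // sizeof is measured in chars, so sizeof(char) is 1 by definition;
    // CHAR_BIT says how many bits that one addressable unit has.
    std::cout << "sizeof(char) = " << sizeof(char) << "\n";   // always 1
    std::cout << "CHAR_BIT     = " << CHAR_BIT     << "\n";   // 8 on Windows, Unix, Mac
}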
 

Amit Kumar

Hi Alf, Ron and Andy,
Thanks a lot for your valuable inputs.
[Ron]: Even US Windows is natively a 16-bit UNICODE machine.
[Andy]:If you are writing for Windows only I would advise you to use wchar_t
throughout

The 'W' variants of the Windows APIs take UTF-16 encoded
null-terminated strings and the 'A' variants require platform-encoded
null-terminated strings (and not UTF-8 encoded strings; AFAIK).

The question arises: Can I really use wchar_t to store a UTF-16
encoded character and std::wstring to store a UTF-16 encoded string?

Stroustrup: "The size of wchar_t is implementation defined and large
enough to hold the largest character set supported by the
implementation's locale."

Since it is not guaranteed that wchar_t is 16 bits, I cannot simply
store a UTF-16 string in std::wstring and call .c_str() to obtain a
UTF16* for a Windows utf-16 based API.

An even more frustrating and annoying thing is that I cannot even
store a UTF-8 string in std::string. Why? Because std::string is
std::basic_string<char>, and char is not guaranteed to be 8 bits
(though it is practically always 8 bits, as pointed out by Ron).
 

Alf P. Steinbach

* Amit Kumar:
Hi Alf, Ron and Andy,
Thanks a lot for your valuable inputs.
[Ron]: Even US Windows is natively a 16-bit UNICODE machine.
[Andy]:If you are writing for Windows only I would advise you to use wchar_t
throughout

The 'W' variants of the Windows APIs take UTF-16 encoded
null-terminated strings and the 'A' variants require platform-encoded
null-terminated strings (and not UTF-8 encoded strings; AFAIK).

The question arises: Can I really use wchar_t to store a UTF-16
encoded character

In Windows, yes provided you're limiting yourself to the Basic Multilingual
Plane, the "BMP", which essentially is the original 16-bit Unicode.

In Windows a wchar_t is 16 bits.

This is due to historical reasons (Microsoft was among the founders of the
Unicode Consortium, IIRC).

and std::wstring to store a UTF-16 encoded string?

Yes, and without the above mentioned limitation.

Stroustrup: "The size of wchar_t is implementation defined and large
enough to hold the largest character set supported by the
implementation's locale."

Since it is not guaranteed that wchar_t is 16 bits,

In practice wchar_t is 16 bits or larger on any platform, and in Windows it's
exactly 16 bits.

I cannot simply
store a UTF-16 string in std::wstring and call .c_str() to obtain a
UTF16* for a Windows utf-16 based API.

Happily that's incorrect. :)

However, note that Windows uses three different wide string representations:
ordinary zero-terminated strings, string buffers with separate length, and so
called B-strings (Basic language strings), where you have a pointer to the first
wchar_t following a string length field which as I recall is 16 bits. The
B-strings are created by SysAllocString & friends.

Microsoft's C++ compiler, Visual C++, supports B-strings and other Windows
specific types (including an intrusive smart pointer for COM objects) via some
run-time library types.
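
By the way, a minimal sketch of the std::wstring-to-'W'-API path
mentioned above (Windows only; assumes <windows.h>, a link against
user32, and text limited to the BMP):

#include <windows.h>
#include <string>

int main()
{
    // On Windows wchar_t is 16 bits, so a std::wstring holds UTF-16 code
    // units and .c_str() yields the zero-terminated LPCWSTR the W APIs want.
    std::wstring text = L"wide text, BMP characters only";
    ::MessageBoxW(NULL, text.c_str(), L"wstring demo", MB_OK);
    return 0;
}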

An even more frustrating and annoying thing is that I cannot even
store a UTF-8 string in std::string.

Happily that's also incorrect.

Why? Because std::string is
std::basic_string<char>, and char is not guaranteed to be 8 bits

And happily :), that's also incorrect. 'char' is indeed guaranteed to be at
least 8 bits. See the FAQ for that and other guarantees.

(though it is practically always 8 bits, as pointed out by Ron).

*Hark*. As far as I can see Ron did not make any such mistake.
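
As for holding UTF-8 in a std::string: a std::string is just a sequence
of char-sized bytes, so UTF-8 code units fit fine. A minimal sketch (the
literal spells out the bytes of 'é' explicitly, to avoid source-encoding
issues):

#include <string>
#include <cstdio>

int main()
{
    std::string utf8 = "h\xC3\xA9llo";                     // "héllo": 'é' is the bytes 0xC3 0xA9
    std::printf("%u bytes:", (unsigned)utf8.size());       // 6 bytes for 5 characters
    for (std::string::size_type i = 0; i != utf8.size(); ++i)
        std::printf(" %02X", (unsigned)(unsigned char)utf8[i]);   // the raw byte values
    std::printf("\n");
}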


Cheers & hth.,

- Alf
 

Amit Kumar

Why? Because std::string is
And happily :), that's also incorrect. 'char' is indeed guaranteed to be at
least 8 bits. See the FAQ for that and other guarantees

I was aware that char is guaranteed to be at least 8 bits. I was
thinking of situations where char can be more than 8 bits.

Suppose 'char' is 16 bits on a machine and we want to call a method
which expects a utf8char*

void myMethod(utf8char*);

where utf8char is 8 bits.

If we have a std::string whose every character is 16 bits then how
will we pass that string to this method?

myMethod(str.c_str()) will give a compilation error (if str is of type
std::string).
myMethod((utf8char*)str.c_str()) will give a junk result.

maybe this will work:

const char* cstr = str.c_str();
std::vector<utf8char> buff;
for (size_t i = 0; cstr[i] != '\0'; ++i)   // index the string, not the pointer itself
    buff.push_back((utf8char)cstr[i]);
buff.push_back(0);                         // keep the result null-terminated

myMethod(&buff[0]);

But isn't this awkward? In this case we are paying the penalty for
using bigger chars than needed.
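
A terser variant of the same narrowing copy, still assuming the
hypothetical utf8char type and myMethod above, builds the buffer
straight from the string's iterator range:

#include <string>
#include <vector>

typedef unsigned char utf8char;            // hypothetical 8-bit code unit type
void myMethod(utf8char*) { /* stub */ }    // hypothetical API, stubbed for the sketch

void call(const std::string& str)
{
    std::vector<utf8char> buff(str.begin(), str.end());  // each char narrowed to utf8char
    buff.push_back(0);                                    // keep the result null-terminated
    myMethod(&buff[0]);
}

int main()
{
    call("example");
}

Either way, the extra copy is the price paid for a char wider than the
API's code unit.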

Also, I was wondering about any implementation on which char is not 8
bits. I have tried gcc and cl.exe on Windows and gcc on Mac. These have
an 8-bit char.
 

Alf P. Steinbach

* Amit Kumar:
I was aware that char is guaranteed to be at least 8 bits. I was
thinking of situations where char can be more than 8 bits.

Suppose 'char' is 16 bits on a machine and we want to call a method
which expects a utf8char*

void myMethod(utf8char*);

where utf8char is 8 bits.

That is not possible.

'char' is the smallest addressable unit in C and C++.

You cannot have a type that is smaller than size 1.


Cheers & hth.,

- Alf
 

James Kanze

Stroustrup says: "A variable of type 'char' can hold a
character of the implementation's character set."
I have numerous doubts related to character sets.
What do we mean by 'implementation's character set'? Do we
mean 'Operating System' by implementation? If yes, then is
the notion of char independent of compiler (i.e. dependent
only on OS)?

In this case (supposing that Stroustrup is trying to express
exactly what the standard says), the character set in question
is the "implementation's basic execution character set". Here
"implementation" means the C++ implementation, not the operating
system; in C++, the basic execution character set consists of
exactly 100 characters: the 96 basic characters used by C++,
plus '\a', '\b', '\r' and '\0'. Anything else is part of the
extended execution character set.

The standard imposes a couple of constraints on the characters of the
basic character set: that they be representable in a single char (no
multibyte characters), and that they have non-negative values when
stored in a char (no high bit set if char is signed). It makes
no such requirements for the extended character set, however.
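
A small self-check of those two constraints (only a handful of the 96
basic characters are sampled here):

#include <cassert>

int main()
{
    // Members of the basic execution character set must fit in one char
    // and have non-negative values there, even if plain char is signed.
    const char basics[] = "AZaz09 +-*/%{}[]#\"'";
    for (const char* p = basics; *p != '\0'; ++p)
        assert(*p >= 0);
    assert('\0' == 0);   // and '\0' is required to be encoded as zero
    return 0;
}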
Does Win XP japanese have a different character set than Win
XP english? If yes then what are those character sets? How can
I find the name of these character sets?

That's all very implementation defined. The Windows boxes I
have access to all use ISO 8859-1 as the default extended
execution character set; I don't know if that's universal,
however, or if it can be changed dynamically (by changing the
locale).
What is platform encoding? Will the platform encoding change
if I change the locale of the os?

What IS the platform encoding? And how can you determine it
within the program? There are no simple and portable answers.
For wchar_t, Windows and AIX use UTF-16, I think; Linux uses
UTF-32 and Solaris UCS. The first two regardless of the locale.
For char, on the other hand, the encoding may (and probably
does) depend on the locale. But the locale only determines how
the functions in the C++ library (and probably other libraries)
interpret the encoding; it has no effect outside of the program:
if you write UTF-8 to a file, and then send the file to a
printer which supposes ISO 8859-1, the printer will interpret
the bytes as ISO 8859-1, regardless of the locale which was
active when you wrote the file.
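
One way to see that the locale only changes the interpretation, never
the bytes themselves, is a sketch like the following (which named
locales exist is platform-specific, so it just switches between "C" and
whatever the environment selects):

#include <cctype>
#include <clocale>
#include <cstdio>

int main()
{
    unsigned char byte = 0xE9;   // 'é' in ISO 8859-1; to the machine, just a number

    std::setlocale(LC_ALL, "C");
    std::printf("\"C\" locale:     isalpha(0xE9) = %d\n", std::isalpha(byte) != 0);

    std::setlocale(LC_ALL, "");  // whatever the user's environment says
    std::printf("native locale:  isalpha(0xE9) = %d\n", std::isalpha(byte) != 0);

    // The classification may differ between the two calls, but the value
    // 0xE9 itself never changes; encodings live in the interpretation.
    return 0;
}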
I know these questions might be very easy for most of you

No they're not. They're a source of problems for even the most
experienced programmers.
but I am seriously confused.

Just remember that the machine itself doesn't know anything
about character encodings. A char is just so many bits (usually
8), with a numeric value, and nothing more. It's the individual
programs which "interpret" the numeric value as a character, and
thus define the encoding. It's a question of the conventions
used by each of the programs to ensure that they interpret the
numeric values in the same way---C/C++ use locale to condition
how the standard library interprets such encodings, and hopefully
any other libraries will do the same. Beyond that, different
systems have different conventions, which are more or less
respected. (Unix, for example, defines a number of environment
variables which map directly to the categories in <locale.h>;
programs are supposed to respect these. Except that for
display, the terminal windows will use the encoding of the font
which has been selected, rather than one determined from the
environment variables.)
Can you please suggest some good books or online documents?

Regretfully, I don't know of any that cover everything. Perhaps
the one that comes the closest is "Fontes et Codages", by Yannis
Haralambous, but it's mainly concerned with display
considerations, and much less with program internal
considerations; it's also very Unicode oriented. (I'm pretty
sure that the book has been translated into English. Check
O'Reilly's pages, searching for the author, and you should find
it.)
 

James Kanze

Your doubts are well founded. The standard says the
implementation's basic character set which will depend on the
compiler and the operating system.

Yes and no. It says that the basic execution character set will
consist of exactly 100 characters, and it lists them. If
furthermore guarantees that they all have one byte
representations, and that they will be encoded with a positive
number when stored in a char (which means that if char is an 8
bit signed type, they will be in the range 0...127). It doesn't
say anything about the actual encoding, however (except that all
of the 100 characters must be distinct, and that '\0' must be
encoded 0); implementations have used EBCDIC, for example.
C and C++ are schizoid in their idea of what the meaning of
"char" is, unfortunately. It is both the smallest
addressable unit of storage as well as the container for a
character from the basic set. This means practically (and
this applies to *ALL* versions of any windows compiler I've
seen) the basic char set is still 8 bits no matter how much
you'd like it to be larger.

But why would you want it to be larger? All encodings I know
have an 8 bit (or less) encoding form, which can be used. And
when you need fixed length representations, that's what wchar_t
was conceived for (although for historical reasons, it also uses
multi-element encodings under Windows and AIX).
 

James Kanze

Hi Alf, Ron and Andy,
Thanks a lot for your valuable inputs.
[Ron]: Even US Windows is natively a 16-bit UNICODE machine.
[Andy]:If you are writing for Windows only I would advise you to use wchar_t
throughout
The 'W' variants of the Windows APIs take UTF-16 encoded
null-terminated strings and the 'A' variants require
platform-encoded null-terminated strings (and not UTF-8 encoded
strings; AFAIK).

At least on the Windows machines I use, the 8 bit encodings are
ISO 8859-1 (not UTF-8). Note, however, that like Unix, there
are a lot of interfaces which do no more than copy the bytes,
without interpreting them. It may be impossible, for example, to
create a filename in UTF-8, but you can certainly write and read
UTF-8 to and from the file.
The question arises: Can I really use wchar_t to store a
UTF-16 encoded character and std::wstring to store a UTF-16
encoded string?
Stroustrup: "The size of wchar_t is implementation defined and
large enough to hold the largest character set supported by the
implementation's locale."
Since it is not guaranteed that wchar_t is 16 bits, I cannot
simply store a UTF-16 string in std::wstring and call .c_str()
to obtain a UTF16* for a Windows utf-16 based API.

You almost certainly can if you're under Windows. And code
which calls Windows UTF-16 based APIs isn't going to be portable
elsewhere anyway.
An even more frustrating and annoying thing is that I cannot even
store a UTF-8 string in std::string.

Of course you can. I do it all the time. (Technically, there
is a slight problem if char is 8 bit signed, since the
conversion of an unsigned value, like 0xC3, to signed is
implementation defined, but in practice, no implementation would
dare break this: if the "conversion" doesn't just copy the bits,
the implementation will certainly make char unsigned if it has
only 8 bits.)
Why? Because std::string is std::basic_string<char>, and char
is not guaranteed to be 8 bits (though it is practically
always 8 bits, as pointed out by Ron).

char is guaranteed to be at least 8 bits. If it is more, you
can still store 8 bit values in it. The only possible problem
is signedness, and the conversion of a value in the range
0x80-0xFF to the signed char, and in practice, you're certainly
safe here as well.
 

Pascal J. Bourguignon

Amit Kumar said:
Hi Alf, Ron and Andy,
Thanks a lot for your valuable inputs.
[Ron]: Even US Windows is natively a 16-bit UNICODE machine.
[Andy]:If you are writing for Windows only I would advise you to use wchar_t
throughout

The 'W' variants of the Windows APIs take UTF-16 encoded
null-terminated strings and the 'A' variants require platform-encoded
null-terminated strings (and not UTF-8 encoded strings; AFAIK).

The question arises: Can I really use wchar_t to store a UTF-16
encoded character and std::wstring to store a UTF-16 encoded string?

Stroustrup: "The size of wchar_t is implementation defined and large
enough to hold the largest character set supported by the
implementation's locale."

Since it is not guaranteed that wchar_t is 16 bits, I cannot simply
store a UTF-16 string in std::wstring and call .c_str() to obtain a
UTF16* for a Windows utf-16 based API.

An even more frustrating and annoying thing is that I cannot even
store a UTF-8 string in std::string. Why? Because std::string is
std::basic_string<char>, and char is not guaranteed to be 8 bits
(though it is practically always 8 bits, as pointed out by Ron).


The key word here is "implementation defined".

You must decide whether you want to write code that is "implementation
dependant", in which case you must read the documentation of the
implementation you depend on, and may use all the implementation
specific features you want, or whether you want to write portable code.

In the latter case, you will still have the difficulty that it is not
specified exactly how many bits the types char, short, int, and long
may hold. Happily, there are now standard typedefs in stdint.h that
are guaranteed to hold a minimum number of bits and are easier to use
than these basic types:

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t,
uint64_t contain exactly the number of bits indicated. However these
types may not be available when the underlying machine has a different
word size.

So you will rather use:

int_least8_t, uint_least8_t, int_least16_t, uint_least16_t,
int_least32_t, uint_least32_t, int_least64_t, uint_least64_t.

These types always exist, and are guaranteed to hold at least the
indicated number of bits.

So now you can write:

typedef std::basic_string<uint_least16_t> utf16_string;
typedef std::basic_string<uint_least32_t> unicode_string;

and be sure that these strings will allow you to hold any UTF-16 code
unit or Unicode code point.

Anything less is bound to give you porting headaches.
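
A minimal sketch of the uint_least16_t approach (using std::vector here,
since std::basic_string<uint_least16_t> may additionally require a
std::char_traits specialization on some library implementations):

#include <stdint.h>    // uint_least16_t; <cstdint> in C++0x
#include <vector>
#include <cstdio>

int main()
{
    // uint_least16_t always exists and holds at least 16 bits, so it can
    // carry any UTF-16 code unit regardless of the platform's word size.
    std::vector<uint_least16_t> utf16;
    utf16.push_back(0x0041);    // 'A'
    utf16.push_back(0xD835);    // high surrogate...
    utf16.push_back(0xDC00);    // ...low surrogate: together, a code point outside the BMP
    std::printf("%u code units\n", (unsigned)utf16.size());
    return 0;
}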


(It's unfortunate that there's no uint_least21_t type that could save
some bits on some machines, but really, it's unfortunate there's no
way to just tell C/C++ what range we want our integers to be:
typedef integer<0,0x10ffff> unicode;
/* integer template left as an exercise for the reader */)
 

Bart van Ingen Schenau

Amit said:
I was aware that char is guaranteed to be at least 8 bits. I was
thinking of situations where char can be more than 8 bits.

Suppose 'char' is 16 bits on a machine and we want to call a method
which expects a utf8char*

void myMethod(utf8char*);

where utf8char is 8 bits.

That situation cannot arise, as char is by definition the smallest
(non-bitfield) type that is supported.

Also, I was wondering about any implementation on which char is not 8
bits. I have tried gcc and cl.exe on Windows and gcc on Mac. These have
an 8-bit char.

The size of char (in bits) depends mostly on the hardware you are using.
To get a char that is larger than 8 bits, you are moving to more
'exotic' systems, like DSPs and mainframes.

Bart v Ingen Schenau
 

James Kanze

* Amit Kumar:
In Windows, yes provided you're limiting yourself to the Basic
Multilingual Plane, the "BMP", which essentially is the
original 16-bit Unicode.

Do you really have to limit yourself to the BMP? It was my
understanding that modern Windows handles surrogates correctly.
(Most of the time, it shouldn't be an issue anyway, as a
surrogate compares and copies just like any other character.)
In Windows a wchar_t is 16 bits.
This is due to historical reasons (Microsoft was among the
founders of the Unicode Consortium, IIRC).

So was Sun, but Unicode is still not the default encoding for
either char (ISO 8859-1, rather than UTF-8) or wchar_t (EUC,
rather than UTF-32). :) (Sun was actually involved before
Microsoft, although Microsoft seems to have endorsed it more.)
Yes, and without the above mentioned limitation.
In practice wchar_t is 16 bits or larger on any platform, and
in Windows it's exactly 16 bits.

I wouldn't be sure of that; it wouldn't surprise me if some
embedded platforms had smaller wchar_t. 16 bits or more does
seem a safe bet for any platform where you're likely to want to
manipulate text, however.

[...]
Happily that's also incorrect.
And happily :), that's also incorrect. 'char' is indeed
guaranteed to be at least 8 bits. See the FAQ for that and
other guarantees.

Technically, CHAR_MAX can be (and often is) 127, and attempting
to assign a larger value (e.g. 0xC3, the first byte of 'é' in
UTF-8) to it has implementation defined results. Practically,
assigning values in the range 0...UCHAR_MAX is such an
established idiom that no implementation would dare break it.
If for some reason, converting an unsigned char to char and back
does not preserve value (likely the case if the machine doesn't
use 2's complement), then char will almost certainly be unsigned
(and on such exotic machines, might have more than 8 bits as
well). So in practice, I don't think you have to worry.
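
A sketch of that round trip, for anyone who wants to see what their own
compiler does with it:

#include <climits>
#include <cstdio>

int main()
{
    unsigned char original = 0xC3;           // first byte of 'é' in UTF-8
    char c = (char)original;                 // implementation-defined if CHAR_MAX is 127...
    unsigned char back = (unsigned char)c;   // ...but value-preserving on real compilers

    std::printf("CHAR_MAX = %d, round trip ok: %s\n",
                CHAR_MAX, back == original ? "yes" : "no");
    return 0;
}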
 

Pascal J. Bourguignon

James Kanze said:
I wouldn't be sure of that; it wouldn't surprise me if some
embedded platforms had smaller wchar_t. 16 bits or more does
seem a safe bet for any platform where you're likely to want to
manipulate text, however.

Like displaying either:
打印机准备
Принтер готов
เครื่องพิมพ์พร้อม
or
PRINTER READY
depending on user configuration on the embedded LCD display of a printer...
 

Alf P. Steinbach

* James Kanze:
Do you really have to limit yourself to the BMP? It was my
understanding that modern Windows handles surrogates correctly.
(Most of the time, it shouldn't be an issue anyway, as a
surrogate compares and copies just like any other character.)

I'm assuming Amit meant a single wchar_t, since Amit then goes on to ask
additionally about std::wstring. And for a single wchar_t: yes.

But also yes, modern Windows generally handles surrogates correctly.

I'm not sure whether that holds completely generally, however. In NT4, and
perhaps also Windows 2000, conversion from Windows ANSI Western to Unicode, using
the Windows API routine for that, didn't as I recall convert non-Latin-1
characters correctly (e.g. the Euro sign €, long dashes, and roundish
flyspecks-whatever). I don't think I've checked this again for Windows XP, but
problems in one aspect of Unicode handling indicate possible problems elsewhere too.


[snip]
Technically, CHAR_MAX can be (and often is) 127, and attempting
to assign a larger value (e.g. 0xC3, the first byte of 'é' in
UTF-8) to it has implementation defined results. Practically,
assigning values in the range 0...UCHAR_MAX is such an
established idiom that no implementation would dare break it.
If for some reason, converting an unsigned char to char and back
does not preserve value (likely the case if the machine doesn't
use 2's complement), then char will almost certainly be unsigned
(and on such exotic machines, might have more than 8 bits as
well). So in practice, I don't think you have to worry.

And even formally it's simply not an issue if one does not convert by assignment.

The standard's §3.9/1 guarantees that copying a POD to a sequence of 'char' or
'unsigned char', and back, yields the original value, so any string of 'unsigned
char', say computed by a simple to-UTF-8 conversion, can be copied to a string
of 'char' by reinterpreting the bitpatterns (e.g. via memcpy), value-preserving.

And since §3.9/1 does not specify the manner of copying, it implies that
assignment of 'char' to 'char' always preserves the bitpattern.
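
A sketch of that bit-pattern copy (assuming, as holds in practice and is
guaranteed in C++0x, that the std::string buffer is contiguous):

#include <cstring>
#include <string>
#include <cstdio>

int main()
{
    // Bytes as they would come out of some to-UTF-8 conversion, as unsigned char.
    const unsigned char utf8[] = { 0x68, 0xC3, 0xA9 };   // "hé"

    // Copy the bit patterns into chars; per 3.9/1 this preserves the values,
    // and no per-element signed/unsigned conversion is involved.
    std::string s(sizeof utf8, '\0');
    std::memcpy(&s[0], utf8, sizeof utf8);

    std::printf("%u bytes copied\n", (unsigned)s.size());
    return 0;
}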


- Alf
 

James Kanze

* James Kanze:
I'm not sure whether that holds completely generally, however.
In NT4 and perhaps also Windows 2000 conversion from Windows
ANSI Western to Unicode, using the Windows API routine for
that, didn't as I recall convert non-Latin 1 characters
correctly (e.g. the Euro sign €, long dashes, and roundish
flyspecks-whatever). I don't think I've checked this again for
Windows XP, but problems in one aspect Unicode handling
indicates possible problems elsewhere too.

I'm not sure, but I seem to remember that some versions of
Windows (or maybe it was just some versions of the C++ library
under Windows) simply zero extended the char to get a wchar_t.
This would mean that char was in fact always interpreted as ISO
8859-1. (Caveat: the only Windows machines I've worked on are
those sold in France and Germany. Where ISO 8859-1 was more or
less the standard encoding for bytes. Other national versions
might also work differently.)
[snip]
Technically, CHAR_MAX can be (and often is) 127, and attempting
to assign a larger value (e.g. 0xC3, the first byte of 'é' in
UTF-8) to it has implementation defined results. Practically,
assigning values in the range 0...UCHAR_MAX is such an
established idiom that no implementation would dare break it.
If for some reason, converting an unsigned char to char and back
does not preserve value (likely the case if the machine doesn't
use 2's complement), then char will almost certainly be unsigned
(and on such exotic machines, might have more than 8 bits as
well). So in practice, I don't think you have to worry.
And even formally it's simply not an issue if one does not
convert by assignment.
The standard's §3.9/1 guarantees that copying a POD to a
sequence of 'char' or 'unsigned char', and back, yields the
original value, so any string of 'unsigned char', say computed
by a simple to-UTF-8 conversion, can be copied to a string of
'char' by reinterpreting the bitpatterns (e.g. via memcpy),
value-preserving.

Good point.

In practice, there's just so much code around that assigns
values in the range 0...UCHAR_MAX to char's, that no
implementation would dare make this not value preserving.
(Things like:

char* p = ... ;
int ch = getchar() ;
while ( ch != EOF && ch != '\n' ) {
    *p = ch ;
    ++ p ;
    ch = getchar() ;
}

were, and probably still are, a standard idiom in C, since the
days of K&R 1. Despite the fact that according to the C
standard, the result of the assignment "*p = ch" is
implementation defined, and may even result in an implementation
defined signal being raised.)
And since §3.9/1 does not specify the manner of copying, it
implies that assignment of 'char' to 'char' always preserves
the bitpattern.

The assignment of char to char. On some implementations, the
assignment of signed char to unsigned char is required to modify
the bit pattern; if char were signed on such implementations,
you could get some strange behavior.

There's an interesting difference between C and C++ here as
well: in C, char can definitely have trapping representations
(if it is in fact signed). And in both languages, char can have
two different bit patterns compare equal (again, if it is
signed). In practice, because of the above considerations of
usability, any implementation where this might occur (signed
values are not 2's complement) will make plain char unsigned, to
avoid any problems.
 

Alf P. Steinbach

* James Kanze:
* James Kanze:
I'm not sure whether that holds completely generally, however.
In NT4 and perhaps also Windows 2000 conversion from Windows
ANSI Western to Unicode, using the Windows API routine for
that, didn't as I recall convert non-Latin 1 characters
correctly (e.g. the Euro sign €, long dashes, and roundish
flyspecks-whatever). I don't think I've checked this again for
Windows XP, but problems in one aspect Unicode handling
indicates possible problems elsewhere too.

I'm not sure, but I seem to remember that some versions of
Windows (or maybe it was just some versions of the C++ library
under Windows) simply zero extended the char to get a wchar_t.
This would mean that char was in fact always interpreted as ISO
8859-1. (Caveat: the only Windows machines I've worked on are
those sold in France and Germany. Where ISO 8859-1 was more or
less the standard encoding for bytes. Other national versions
might also work differently.)
[snip]
Why? Because std::string is std::basic_string<char>, and
char is not guaranteed to be 8 bits
And happily :), that's also incorrect. 'char' is indeed
guaranteed to be at least 8 bits. See the FAQ for that and
other guarantees.
Technically, CHAR_MAX can be (and often is) 127, and attempting
to assign a larger value (e.g. 0xC3, the first byte of 'é' in
UTF-8) to it has implementation defined results. Practically,
assigning values in the range 0...UCHAR_MAX is such an
established idiom that no implementation would dare break it.
If for some reason, converting an unsigned char to char and back
does not preserve value (likely the case if the machine doesn't
use 2's complement), then char will almost certainly be unsigned
(and on such exotic machines, might have more than 8 bits as
well). So in practice, I don't think you have to worry.
And even formally it's simply not an issue if one does not
convert by assignment.
The standard's §3.9/1 guarantees that copying a POD to a
sequence of 'char' or 'unsigned char', and back, yields the
original value, so any string of 'unsigned char', say computed
by a simple to-UTF-8 conversion, can be copied to a string of
'char' by reinterpreting the bitpatterns (e.g. via memcpy),
value-preserving.

Good point.

In practice, there's just so much code around that assigns
values in the range 0...UCHAR_MAX to char's, that no
implementation would dare make this not value preserving.
(Things like:

char* p = ... ;
int ch = getchar() ;
while ( ch != EOF && ch != '\n' ) {
    *p = ch ;
    ++ p ;
    ch = getchar() ;
}

were, and probably still are, a standard idiom in C, since the
days of K&R 1. Despite the fact that according to the C
standard, the result of the assignment "*p = ch" is
implementation defined, and may even result in an implementation
defined signal being raised.)
And since §3.9/1 does not specify the manner of copying, it
implies that assignment of 'char' to 'char' always preserves
the bitpattern.

The assignment of char to char.

Yes, this concerns possible decay (bit-pattern change) within a std::string (the
OP's question), e.g. within an assignment.

Essentially, there will be no such decay unless the std::string implementation
does type conversion.

On some implementations, the
assignment of signed char to unsigned char is required to modify
the bit pattern; if char were signed on such implementations,
you could get some strange behavior.

There's an interesting difference between C and C++ here as
well: in C, char can definitely have trapping representations
(if it is in fact signed). And in both languages, char can have
two different bit patterns compare equal (again, if it is
signed). In practice, because of the above considerations of
usability, any implementation where this might occur (signed
values are not 2's complement) will make plain char unsigned, to
avoid any problems.

Summing up, there's not a practical problem and there's not a formal problem. :)


- Alf
 

Richard Herring

Alf P. Steinbach said:
However, note that Windows uses three different wide string
representations: ordinary zero-terminated strings, string buffers with
separate length, and so called B-strings (Basic language strings),
where you have a pointer to the first wchar_t following a string length
field which as I recall is 16 bits.
32.

The B-strings are created by SysAllocString & friends.

Microsoft's C++ compiler, Visual C++, supports B-strings and other
Windows specific types (including an intrusive smart pointer for COM
objects) via some run-time library types.
 
