Encoding of character literals

Lauri Alanko · Nov 3, 2011

Hello.

I find C99's language on internationalization features particularly
hard to decipher, so I'd appreciate some clarifications.

I'm particularly interested in the relationship of the execution
character set encoding of character literals and character string
literals, and its relationship with locale encodings and the wchar_t
encoding.

Firstly, is it possible for a locale to use a different encoding for
the basic execution character set than the compiler uses for the
literals? If not, doesn't this mean that the locale system (and the C
standard) are insufficient in an environment where both ASCII- and
EBCDIC-based encodings can be used?

If it is possible for a locale to use a completely different encoding,
then how can ordinary character literals and character string literals
be converted to the locale's encoding? It is of course possible to
convert between wide characters and the locale, but how do I convert
from the literal encoding into the wchar_t encoding?

Is it perhaps guaranteed that ((wchar_t) 'a' == L'a')? I haven't seen
any text to suggest this, and this would mean that an implementation
couldn't use EBCDIC for character literals and UCS-4 for wchar_t. But
maybe someone can give a definitive answer?

It is of course possible to define the mapping manually:

wchar_t char_to_wchar[] = {
    ['a'] = L'a',
    ['b'] = L'b',
    // ... etc. for all of the portable basic character set
};

But this seems like a horrible, redundant hack that I wouldn't like to
use except as a last resort. Is something like this really necessary
in order to print out character string literals correctly in all
locales?
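For comparison with the table approach, here is a minimal sketch of the locale-based route, using the standard mbstowcs function. The helper name widen_literal is hypothetical. The sketch assumes that the literal's bytes are valid multibyte characters in the current locale, which is exactly the property this question asks about, so it illustrates the approach rather than settling the matter.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a narrow string to wide characters
   using the current locale's multibyte encoding.
   Returns 0 on success, -1 if the input is not valid in the current
   locale or does not fit in dst together with its terminator. */
static int widen_literal(const char *src, wchar_t *dst, size_t dst_len)
{
    if (dst_len == 0)
        return -1;

    size_t n = mbstowcs(dst, src, dst_len);

    /* (size_t)-1 signals an invalid multibyte sequence; n == dst_len
       means the output was truncated and is not null-terminated. */
    if (n == (size_t)-1 || n >= dst_len)
        return -1;

    return 0;
}
```

A caller would typically run setlocale(LC_ALL, "") first and then pass a buffer large enough for the literal plus its terminating null wide character.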

Lauri

James Kuyper · Nov 3, 2011

On 11/03/2011 04:41 PM, Lauri Alanko wrote:
....

Is it perhaps guaranteed that ((wchar_t) 'a' == L'a')?

I'm no expert on internationalization - as a US programmer I've never
had any need to worry about it. However, that question at least I can
answer:

C99 says, in effect, that the above expression is guaranteed to be true
if the implementation does not pre-define __STDC_MB_MIGHT_NEQ_WC__ (7.17p2).
6.10.8p1 seems to indicate that definition of the macro with a value of
1 is mandatory - but that might be an example of poor wording or a
misinterpretation on my part. It seems inconsistent with the "if" in 7.17p2.
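As a rough illustration of the conditional guarantee, the sketch below tests for the macro, which the standard spells __STDC_MB_MIGHT_NEQ_WC__. The function name mb_eq_wc_guaranteed is hypothetical. This only reports what the implementation promises; it does not by itself say anything about locales.

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation promises that
   every member of the basic character set has the same value as a
   char and as a wchar_t, and 0 if it makes no such promise. */
static int mb_eq_wc_guaranteed(void)
{
#ifdef __STDC_MB_MIGHT_NEQ_WC__
    return 0;   /* values might differ; a plain cast is not portable */
#else
    return 1;   /* (wchar_t)'a' == L'a' holds for the basic set */
#endif
}
```

When the function returns 0, the cast may still happen to work on a given platform, but portable code would have to go through a conversion function such as btowc instead.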

Harald van Dĳk · Nov 3, 2011

C99 says, in effect, that the above expression is guaranteed to be true
if the implementation does not pre-define __STDC_MB_MIGHT_NEQ_WC__ (7.17p2).
6.10.8p1 seems to indicate that definition of the macro with a value of
1 is mandatory - but that might be an example of poor wording or a
misinterpretation on my part. It seems inconsistent with the "if" in 7.17p2.

It's supposed to be in 6.10.8p2, see DR #333, or a draft of C1x in
which this has been corrected.

Lauri Alanko · Nov 10, 2011

It's supposed to be in 6.10.8p2, see DR #333, or a draft of C1x in
which this has been corrected.

Thanks, that is useful. So C99 mandates that for the basic character
set, chars and the corresponding wchar_t's have the same integer
value, and C1x makes this guarantee conditional on the presence of the
macro.

But is btowc guaranteed to honor this equality in all locales? And, if
__STDC_MB_MIGHT_NEQ_WC__ is defined, and btowc is the only way to convert a
char to wchar_t, is it guaranteed to work correctly on integer
character constants (from the basic character set) in all locales?
That is, is (btowc('a') == L'a') going to be true in all
implementations in all legit locales? And if not, how

The corner case I'm thinking of is of course the situation where the
native encoding used by integer character literals is EBCDIC, but
wchar_t uses UCS-4, and the current locale is ASCII-based. So one
cannot cast from integer character literals to wchar_t, but one also
cannot use locale-dependent conversion functions. Is this a situation
that standard C is even able to support?
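One way to probe this empirically on a given implementation is sketched below: it compares btowc(c) against the corresponding wide literal for each member of the basic character set, in whatever locale is current at the time of the call. The names BASIC_CHARS, WIDEN and count_btowc_mismatches are hypothetical. A result of zero on one platform and locale is only an observation, not evidence of a guarantee by the standard.

```c
#include <wchar.h>

/* The printable members of the basic execution character set. */
#define BASIC_CHARS "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 !\"#%&'()*+,-./:;<=>?[\\]^_{|}~"

/* Turn a narrow string literal into the matching wide literal. */
#define WIDEN_(s) L ## s
#define WIDEN(s) WIDEN_(s)

/* Counts the characters for which btowc() disagrees with the
   compiler's wide literal, using the locale in effect when called. */
static int count_btowc_mismatches(void)
{
    static const char narrow[] = BASIC_CHARS;
    static const wchar_t wide[] = WIDEN(BASIC_CHARS);
    int mismatches = 0;

    for (size_t i = 0; narrow[i] != '\0'; i++) {
        /* btowc expects an unsigned char value (or EOF). */
        if (btowc((unsigned char)narrow[i]) != (wint_t)wide[i])
            mismatches++;
    }
    return mismatches;
}
```

Calling setlocale(LC_ALL, "") or setlocale(LC_ALL, "some-locale-name") before count_btowc_mismatches() would show whether a particular locale on a particular system preserves the equality.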

Lauri

lawrence.jones · Nov 10, 2011

Lauri Alanko said:
Thanks, that is useful. So C99 mandates that for the basic character
set, chars and the corresponding wchar_t's have the same integer
value, and C1x makes this guarantee conditional on the presence of the
macro.

No, there was a production error in N1256 which put the macro in the
wrong paragraph; it was always supposed to have been conditional.
