Encoding of character literals

Discussion in 'C Programming' started by Lauri Alanko, Nov 3, 2011.

  1. Lauri Alanko

    Lauri Alanko Guest

    Hello.

    I find C99's language on internationalization features particularly
    hard to decipher, so I'd appreciate some clarifications.

    I'm particularly interested in how the execution character set
    encoding used for character literals and character string literals
    relates to the locale encodings and to the wchar_t encoding.

    Firstly, is it possible for a locale to use a different encoding for
    the basic execution character set than the compiler uses for the
    literals? If not, doesn't this mean that the locale system (and the C
    standard) are insufficient in an environment where both ASCII- and
    EBCDIC-based encodings can be used?

    If it is possible for a locale to use a completely different encoding,
    then how can ordinary character literals and character string literals
    be converted to the locale's encoding? It is of course possible to
    convert between wide characters and the locale, but how do I convert
    from the literal encoding into the wchar_t encoding?

    Is it perhaps guaranteed that ((wchar_t) 'a' == L'a')? I haven't seen
    any text to suggest this, and this would mean that an implementation
    couldn't use EBCDIC for character literals and UCS-4 for wchar_t. But
    maybe someone can give a definitive answer?

    It is of course possible to define the mapping manually:

    wchar_t char_to_wchar[] = {
    ['a'] = L'a',
    ['b'] = L'b',
    // ... etc for all of the portable basic character set
    };

    But this seems like a horrible, redundant hack that I wouldn't want to
    use except as a last resort. Is something like this really necessary
    in order to print out character string literals correctly in all
    locales?
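For comparison with the manual table, here is a minimal sketch of the locale-dependent route through mbstowcs. The helper name literal_to_wide is hypothetical. It assumes that the bytes of the narrow literal are valid in the multibyte encoding of the current LC_CTYPE locale, which is exactly the assumption the question above is about, so it illustrates the concern rather than resolving it.

```c
#include <stdlib.h>  /* mbstowcs */
#include <wchar.h>   /* wchar_t, wcscmp for callers */

/* Hypothetical helper: converts the narrow string `src` to wide characters
   under the current LC_CTYPE locale and always null-terminates `dst`.
   Returns the number of wide characters written (excluding the terminator),
   or (size_t)-1 if `dst_len` is 0 or `src` is not valid in the locale. */
size_t literal_to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    if (dst_len == 0)
        return (size_t)-1;

    /* Leave room for the terminator; mbstowcs does not add one
       when it stops because the output limit was reached. */
    size_t n = mbstowcs(dst, src, dst_len - 1);
    if (n == (size_t)-1)
        return n;

    dst[n] = L'\0';
    return n;
}
```

Called at program startup (that is, in the "C" locale) with a literal such as "abc", this is expected to yield a wide string equal to L"abc" on common ASCII-based implementations.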


    Lauri
    Lauri Alanko, Nov 3, 2011
    #1

  2. James Kuyper

    James Kuyper Guest

    On 11/03/2011 04:41 PM, Lauri Alanko wrote:
    ....
    > Is it perhaps guaranteed that ((wchar_t) 'a' == L'a')?


    I'm no expert on internationalization - as a US programmer I've never
    had any need to worry about it. However, that question at least I can
    answer:

    C99 says, in effect, that the above expression is guaranteed to be true
    if the implementation does not pre-define __STDC_MB_MIGHT_NEQ_WC__ (7.17p2).
    6.10.8p1 seems to indicate that definition of the macro with a value of
    1 is mandatory - but that might be an example of poor wording or a
    misinterpretation on my part. It seems inconsistent with the "if" in 7.17p2.
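The conditional guarantee can be checked on a given implementation with a sketch like the following. The macro is spelled __STDC_MB_MIGHT_NEQ_WC__ in the standard text; the function names are hypothetical, and only a sample of the basic character set is compared, for brevity.

```c
#include <stddef.h>  /* wchar_t, size_t */

/* Returns 1 if the implementation pre-defines __STDC_MB_MIGHT_NEQ_WC__,
   i.e. it makes no promise that 'a' and L'a' have equal values. */
int mb_might_neq_wc_defined(void)
{
#ifdef __STDC_MB_MIGHT_NEQ_WC__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if every sampled basic character has the same value in a
   narrow string literal and in the corresponding wide string literal. */
int basic_chars_match(void)
{
    static const char    narrow[] = "abcxyzABCXYZ0123456789 _+-*/";
    static const wchar_t wide[]   = L"abcxyzABCXYZ0123456789 _+-*/";
    size_t i;

    for (i = 0; narrow[i] != '\0'; i++) {
        if ((wchar_t)narrow[i] != wide[i])
            return 0;
    }
    return 1;
}
```

On a conforming implementation, mb_might_neq_wc_defined() || basic_chars_match() should evaluate to 1: either the macro is defined, or the values agree.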
    James Kuyper, Nov 3, 2011
    #2

  3. On Nov 3, 10:09 pm, James Kuyper wrote:
    > C99 says, in effect, that the above expression is guaranteed to be true
    > if the implementation does not pre-define __STDC_MB_MIGHT_NEQ_WC__ (7.17p2).
    > 6.10.8p1 seems to indicate that definition of the macro with a value of
    > 1 is mandatory - but that might be an example of poor wording or a
    > misinterpretation on my part. It seems inconsistent with the "if" in 7.17p2.


    It's supposed to be in 6.10.8p2, see DR #333, or a draft of C1x in
    which this has been corrected.
    Harald van Dijk, Nov 3, 2011
    #3
  4. Lauri Alanko

    Lauri Alanko Guest

    Harald van Dijk wrote:
    > On Nov 3, 10:09 pm, James Kuyper wrote:
    > > C99 says, in effect, that the above expression is guaranteed to be true
    > > if the implementation does not pre-define __STDC_MB_MIGHT_NEQ_WC__ (7.17p2).
    > > 6.10.8p1 seems to indicate that definition of the macro with a value of
    > > 1 is mandatory - but that might be an example of poor wording or a
    > > misinterpretation on my part. It seems inconsistent with the "if" in 7.17p2.

    >
    > It's supposed to be in 6.10.8p2, see DR #333, or a draft of C1x in
    > which this has been corrected.


    Thanks, that is useful. So C99 mandates that for the basic character
    set, chars and the corresponding wchar_t's have the same integer
    value, and C1x makes this guarantee conditional on the presence of the
    macro.

    But is btowc guaranteed to honor this equality in all locales? And, if
    __STDC_MB_MIGHT_NEQ_WC__ is defined, and btowc is the only way to convert a
    char to wchar_t, is it guaranteed to work correctly on integer
    character constants (from the basic character set) in all locales?
    That is, is (btowc('a') == L'a') going to be true in all
    implementations in all legit locales? And if not, how

    The corner case I'm thinking of is of course the situation where the
    native encoding used by integer character literals is EBCDIC, but
    wchar_t uses UCS-4, and the current locale is ASCII-based. So one
    cannot cast from integer character literals to wchar_t, but one also
    cannot use locale-dependent conversion functions. Is this a situation
    that standard C is even able to support?
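The btowc question can at least be probed empirically. The sketch below (the helper name btowc_agrees is hypothetical) selects a locale and compares btowc of a narrow constant against the matching wide constant. It only reports what one implementation does in one locale; it says nothing about what the standard guarantees.

```c
#include <locale.h>  /* setlocale, LC_CTYPE */
#include <wchar.h>   /* btowc, wint_t, WEOF */

/* Hypothetical helper: returns 1 if btowc maps the narrow character `c`
   to the wide character `wc` in the locale named `locale_name`, 0 if it
   does not, and -1 if the locale could not be selected.
   Note: this changes the program's LC_CTYPE locale and does not restore it. */
int btowc_agrees(const char *locale_name, char c, wchar_t wc)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    /* btowc expects an unsigned char value (or EOF) passed as int. */
    wint_t w = btowc((unsigned char)c);
    return w != WEOF && w == (wint_t)wc;
}
```

For example, btowc_agrees("C", 'a', L'a') is expected to return 1 on common ASCII-based implementations; the EBCDIC-literal, UCS-4 wchar_t, ASCII-locale combination described above is the case where the result is in doubt.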


    Lauri
    Lauri Alanko, Nov 10, 2011
    #4
  5. Lauri Alanko

    Guest

    Lauri Alanko wrote:
    > Harald van Dijk wrote:
    > >
    > > It's supposed to be in 6.10.8p2, see DR #333, or a draft of C1x in
    > > which this has been corrected.

    >
    > Thanks, that is useful. So C99 mandates that for the basic character
    > set, chars and the corresponding wchar_t's have the same integer
    > value, and C1x makes this guarantee conditional on the presence of the
    > macro.


    No, there was a production error in N1256 which put the macro in the
    wrong paragraph; it was always supposed to have been conditional.
    --
    Larry Jones

    I hate being good. -- Calvin
    , Nov 10, 2011
    #5
