Initializing a character array with a string literal?

Discussion in 'C Programming' started by Jef Driesen, Mar 15, 2010.

  1. Jef Driesen

    Jef Driesen Guest

    Hi,

    Sorry for cross-posting to c.l.c and c.l.c++, but I would like to know
    the answer in both C and C++.

    I know it is possible to initialize a character array with a string literal:

    char str[] = "hello";

    which is often more convenient than having to write:

    char str[] = {'h', 'e', 'l', 'l', 'o', 0};

    But in my case the array is not a real string but a byte array. Hence I
    don't want the terminating null character, and I use unsigned char for
    the data type. Now, is it allowed to write this:

    unsigned char str[5] = "hello";

    It works fine with gcc (in C code), but with msvc (in C++ code) I get an
    error "C2117: array bounds overflow".

    So I wonder if this construct is allowed, and whether there is a
    difference in C and C++.

    Thanks,

    Jef
     
    Jef Driesen, Mar 15, 2010
    #1

  2. This is one of those places where C and C++ differ. C allows the
    construct, C++ doesn't. No doubt comp.lang.c++ can suggest some
    workarounds.
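
    To illustrate the C side, here is a minimal sketch (the names magic and
    magic_size are made up for this example). In C, a character array may be
    initialized by a string literal that fills it exactly, in which case the
    terminating null is simply not stored. C++ requires room for the
    terminator and rejects the same declaration.

```c
#include <stddef.h>

/* Valid C: the array has room for exactly five characters, so the
   literal's terminating null is not stored. Ill-formed in C++. */
static const unsigned char magic[5] = "hello";

/* Number of bytes in the array: 5, with no hidden terminator. */
size_t magic_size(void)
{
    return sizeof magic;
}
```

    Because no terminator is stored, magic must not be passed to functions
    such as strlen that expect a null-terminated string.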
     
    Nick Keighley, Mar 15, 2010
    #2

  3. * Nick Keighley:
    The easiest in C++ is just to accept the superfluous nullbyte and write

    unsigned char bytes[] = "hello";

    Cheers & hth.,

    - Alf
     
    Alf P. Steinbach, Mar 15, 2010
    #3
  4. Jef Driesen

    Jef Driesen Guest

    That works, but it's a little bit inconvenient because I use
    sizeof(bytes) in a few places.
     
    Jef Driesen, Mar 15, 2010
    #4
  5. This makes me think that you want to use the array for "binary
    purposes", like writing it to a socket or to a binary file, so that it
    leaves the boundaries of the system. In that case, the above
    initialization is not portable, because it initializes str[0] .. str[4]
    to platform-dependent values.

    If your execution character set is ASCII-based, the above will amount to

    char unsigned str[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

    If your execution character set is EBCDIC-based, it will amount to

    char unsigned str[5] = { 0x88u, 0x85u, 0x93u, 0x93u, 0x96u };

    Even if you're sure that the execution character set will be
    ASCII-based, the byte array form is much clearer on the issue, in my
    opinion.
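
    One way to see the dependence, as a rough sketch (the helper name
    hello_is_ascii is hypothetical), is to compare the literal against the
    explicit ASCII values at run time. On an ASCII-based implementation the
    comparison succeeds; on an EBCDIC-based one it would not.

```c
#include <string.h>

/* "hello" spelled out as ASCII code values. */
static const unsigned char ascii_hello[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

/* Returns 1 if the execution character set encodes "hello" the way
   ASCII does, 0 otherwise. */
int hello_is_ascii(void)
{
    return memcmp("hello", ascii_hello, sizeof ascii_hello) == 0;
}
```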

    lacos
     
    Ersek, Laszlo, Mar 15, 2010
    #5
  6. * Jef Driesen:
    OK.

    C++ solution for that:

    typedef unsigned char Byte;
    typedef Byte ByteArr5[5];

    Byte data[] = "hello";
    ByteArr5& bytes = reinterpret_cast<ByteArr5&>( data );

    The awkwardness implies that you're working at cross-purposes with the language,
    though. E.g. perhaps the size should be a named constant. Or perhaps use a
    std::vector or Boost::array or whatever. Or perhaps this part should really be
    written in pure C and just accessed from C++. Something.


    Cheers, & still hth.,

    - Alf
     
    Alf P. Steinbach, Mar 15, 2010
    #6
  7. Jef Driesen

    Jef Driesen Guest

    It is indeed used as binary data, but the contents happen to be ASCII
    data (and a number of zero bytes too, so it's definitely not usable as a
    null terminated string). The reason why I like the string literal, is
    that it makes the initialization a lot easier to read. If I see

    unsigned char str[5] = "hello";
    unsigned char str[5] = {0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu};

    it's not immediately clear that the second variant is equal to "hello".

    But I have to admit I didn't know that the character 'h' is not always
    equal to 0x68. I assumed that for characters in the ASCII range this is
    safe?
     
    Jef Driesen, Mar 15, 2010
    #7
  8. Jef Driesen

    Jef Driesen Guest

    The code is actually written in C. But it uses a number of C99 features
    (such as variable declarations that are not at the top of a block) that
    are not supported by the msvc C compiler, so I compile it as C++ code.

    Thus adjusting my sizeof's is a less ugly solution in my case.
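
    A sketch of what that adjustment could look like (BYTES_LEN and
    payload_length are hypothetical names): keep the null-terminated
    initialization, which both C and C++ accept, and subtract one wherever
    the payload size is needed.

```c
#include <stddef.h>

/* Accepted by both C and C++: six elements, five payload bytes plus
   the terminating null. */
static const unsigned char bytes[] = "hello";

/* Payload length without the terminator. */
#define BYTES_LEN (sizeof bytes - 1)

size_t payload_length(void)
{
    return BYTES_LEN;
}
```

    Using BYTES_LEN in place of sizeof(bytes) keeps the subtraction in one
    spot instead of scattering "- 1" through the code.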
     
    Jef Driesen, Mar 15, 2010
    #8
  9. Tom St Denis

    Tom St Denis Guest

    Maybe the solution is to refactor your code so you can declare your
    local variables at the top of block scope? Just saying...

    Tom
     
    Tom St Denis, Mar 15, 2010
    #9
  10. Kaz Kylheku

    Kaz Kylheku Guest

    If your execution character set is EBCDIC based, it means that
    you haven't yet managed to install Linux on that old IBM junker you
    got at the swap meet.

    Writing { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu } instead of "hello", in the
    name of portability to EBCDIC is egregiously moronic, and a disservice
    to whoever is signing your paycheck.
     
    Kaz Kylheku, Mar 15, 2010
    #10
  11. And rightfully so, because the second variant does *not* equal "hello"
    on an EBCDIC execution character set, for example.

    I can offer no solution that is really pleasing to the eye. At best:

    /* "hello" encoded in ASCII */
    const char unsigned hello[] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

    You describe your network protocol as sequences of specific octets. The
    first variant doesn't initialize the array to specific octets if you
    don't restrict further aspects of your environment. Of course you can
    say that the program only works correctly on ASCII-based execution
    character sets (I guess that covers the vast majority of systems today).
    I just wanted to make you aware of your reliance on the basic execution
    character set being encoded in ASCII.

    (I used the word "octet" above. I usually check #if 8 == CHAR_BIT and
    abort compilation with #error if char doesn't have exactly 8 bits. All
    of C89, C99, SUSv1 and SUSv2 permit bigger bytes theoretically. Even if
    no actual system with bytes wider than 8 bits might exist that also
    supports the BSD sockets interface, I like to spell out this dependency
    of my code explicitly.)
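
    The check described above could be written roughly like this (the
    accessor bits_per_byte is only there to give the example something to
    call):

```c
#include <limits.h>

/* Abort compilation on implementations where a byte is not an octet. */
#if 8 != CHAR_BIT
#error "this code assumes 8-bit bytes (CHAR_BIT == 8)"
#endif

/* Hypothetical accessor: reports the byte width the code was built for. */
int bits_per_byte(void)
{
    return CHAR_BIT;
}
```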

    I'd venture that it is safe on most systems today. Perhaps you'll want to
    document your dependence on the ASCII encoding of the basic execution
    character set, instead of changing the code.
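
    If a comment alone feels too weak, one possible way to make the
    documented assumption checkable is a compile-time assertion. This is
    only a sketch using the old negative-array-size trick, which works in
    C89 as well as C99; the typedef name is made up.

```c
/* Compile-time record of the ASCII assumption: the array size becomes
   -1, which the compiler must reject, if any character used in "hello"
   does not have its ASCII value in the execution character set. */
typedef char assume_ascii_hello[
    ('h' == 0x68 && 'e' == 0x65 && 'l' == 0x6C && 'o' == 0x6F) ? 1 : -1];

/* With the assumption enforced, the readable literal can stay. */
static const unsigned char hello[] = "hello";
```

    The character constants are evaluated as ordinary constant expressions,
    so they reflect the execution character set rather than whatever the
    preprocessor uses for #if.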

    lacos
     
    Ersek, Laszlo, Mar 15, 2010
    #11
  12. It is not without precedent, though.

    $ less bzip2-1.0.5/CHANGES

    ----v----
    1.0.2
    ~~~~~

    [...]

    * Hard-code header byte values, to give correct operation on platforms
    using EBCDIC as their native character set (IBM's OS/390).
    (Leland Lucius)

    [...]
    ----^----

    I agree that documenting reliance on ASCII may be a better way to go
    than diminishing the readability of the source for a dubious increase in
    portability. Being aware of the issue is useful in any case, IMHO.

    lacos
     
    Ersek, Laszlo, Mar 15, 2010
    #12
  13. Not to debate your point any further, but I'd like to add the following:


    1. In C99, __STDC_ISO_10646__ defined by the implementation implies,
    AFAICT, that "hello" will in fact translate to { 0x68u, 0x65u, 0x6Cu,
    0x6Cu, 0x6Fu } (and possibly a trailing \0 if space allows). I think
    this can be derived from 6.10.8p2, 6.4.5p3, 6.4.4.4p11 and 5.2.1.2p1:

    char unsigned s[5] = "hello";
    = { 'h', 'e', 'l', 'l', 'o' };
    = { L'h', L'e', L'l', L'l', L'o' };
    = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };
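
    A small probe for this, offered as a sketch rather than a proof (the
    function name check_iso_10646 is hypothetical): when the macro is
    defined, both 'h' and L'h' are expected to be 0x68 under the reasoning
    above. When it is not defined, nothing is promised either way.

```c
#include <wchar.h>

/* Returns  1 if __STDC_ISO_10646__ is defined and both 'h' and L'h'
              have the value 0x68,
            0 if the macro is defined but the values differ,
           -1 if the macro is not defined. */
int check_iso_10646(void)
{
#ifdef __STDC_ISO_10646__
    return ('h' == 0x68 && L'h' == 0x68) ? 1 : 0;
#else
    return -1;
#endif
}
```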


    2. It seems to me that all four versions of the SUS published till now
    were explicitly written with EBCDIC in mind.

    v1:
    System Interface Definitions
    Issue 4, Version 2
    4.4 Character Set Description File
    paragraph 7

    ----v----
    The charmap file was introduced to resolve problems with the portability
    of, especially, /localedef/ sources. This document set assumes that the
    portable character set is constant across all locales, but does not
    prohibit implementations from supporting two incompatible codings, such
    as both ASCII and EBCDIC. Such dual-support implementations should have
    all charmaps and /localedef/ sources encoded using one portable character
    set, in effect cross-compiling for the other environment. [...]
    ----^----

    v2:
    http://www.opengroup.org/onlinepubs/007908775/xbd/charset.html#tag_001_004

    v3:
    http://www.opengroup.org/onlinepubs/000095399/xrat/xbd_chap06.html#tag_01_06_01

    v4:
    http://www.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap06.html#tag_21_06_01

    v4 is POSIX:2008 too, thus not very old. Citing the linked-to passage of
    the v4 rationale:

    ----v----
    A.6.1 Portable Character Set

    The portable character set is listed in full so there is no dependency
    on the ISO/IEC 646:1991 standard (or historically ASCII) encoded
    character set, although the set is identical to the characters defined
    in the International Reference version of the ISO/IEC 646:1991 standard.

    [...]

    The statement about invariance in codesets for the portable character
    set is worded to avoid precluding implementations where multiple
    incompatible codesets are available (for instance, ASCII and EBCDIC).
    [...]
    ----^----

    I hoarded all this stuff together because your post made me ponder
    whether these standards I care about do require ASCII-based encodings
    from a conforming implementation. They seem not to.

    I'm not obsessed with EBCDIC per se. I generally care that my
    assumptions about the environment -- not guaranteed by relevant
    standards -- are *conscious*.

    lacos
     
    Ersek, Laszlo, Mar 15, 2010
    #13
  14. Jef Driesen

    Jef Driesen Guest

    Those declarations are not the only C99 feature I'm using. Refactoring
    is an option, but there are a lot more urgent items on my todo list if
    you know what I mean.

    For now, knowing that there is a difference between C and C++, using a
    null terminated string works in both cases and is not that ugly to deal
    with.
     
    Jef Driesen, Mar 15, 2010
    #14
  15. Jorgen Grahn

    Jorgen Grahn Guest

    ["Followup-To:" header set to comp.lang.c.]
    For me it's the other way around -- add C99 declarations, and the
    biggest reason for refactoring goes away.

    But does MSVC really not support C99 features which are as fundamental
    as this one? I have no experience with that compiler, but I find it
    hard to believe. An old version?

    /Jorgen
     
    Jorgen Grahn, Mar 26, 2010
    #15
  16. Default User

    Default User Guest

    MS has not been particularly receptive towards C99. The version of the
    C compiler in MSVC 2005 doesn't support that feature.



    Brian
     
    Default User, Mar 26, 2010
    #16
