Initializing a character array with a string literal?

Jef Driesen · Mar 15, 2010

Hi,

Sorry for cross-posting to c.l.c and c.l.c++, but I would like to know
the answer in both C and C++.

I know it is possible to initialize a character array with a string literal:

char str[] = "hello";

which is often more convenient than having to write:

char str[] = {'h', 'e', 'l', 'l', 'o', 0};

But in my case the array is not a real string but a byte array. Hence I
don't want the terminating null character, and I use unsigned char for
the data type. Now, is it allowed to write this:

unsigned char str[5] = "hello";

It works fine with gcc (in C code), but with msvc (in C++ code) I get an
error "C2117: array bounds overflow".

So I wonder if this construct is allowed, and whether there is a
difference in C and C++.

Thanks,

Jef

Nick Keighley · Mar 15, 2010

This is one of those places where C and C++ differ. C allows the
construct, C++ doesn't. No doubt comp.lang.c++ can suggest some
workarounds.
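
As a minimal sketch of the difference described above (the array names and the helper same_payload are made up for illustration), this compiles as C but not as C++, because of the first declaration:

```c
#include <assert.h>
#include <string.h>

/* Valid C: the literal has exactly as many characters as the array has
   elements, so the terminating null character is simply not stored.
   A C++ compiler is required to reject this declaration. */
static const unsigned char exact[5] = "hello";

/* Valid in both C and C++: the size is deduced from the literal and
   therefore includes the terminating null character. */
static const unsigned char deduced[] = "hello";

/* Returns 1 when both arrays start with the same five payload bytes. */
static int same_payload(void)
{
    assert(sizeof exact == 5);
    assert(sizeof deduced == 6);
    return memcmp(exact, deduced, sizeof exact) == 0;
}
```

The only difference between the two arrays is the extra zero byte at deduced[5].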

Alf P. Steinbach · Mar 15, 2010

* Nick Keighley:

This is one of those places where C and C++ differ. C allows the
construct, C++ doesn't. No doubt comp.lang.c++ can suggest some
workarounds.

The easiest in C++ is just to accept the superfluous nullbyte and write

unsigned char bytes[] = "hello";

Cheers & hth.,

- Alf

Jef Driesen · Mar 15, 2010

That works, but it's a little bit inconvenient because I use
sizeof(bytes) in a few places.
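
One way to keep the sizeof uses in a single place is a small macro that subtracts the terminator. This is only a sketch: PAYLOAD_SIZE and payload_size are hypothetical names, and the macro is only meaningful for real arrays initialized from a string literal, not for pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of payload bytes in an array initialized
   from a string literal, i.e. everything except the implicit terminating
   null character. Do not apply it to a pointer. */
#define PAYLOAD_SIZE(arr) (sizeof(arr) - 1)

/* Accepted by both C and C++ compilers; sizeof bytes is 6. */
static const unsigned char bytes[] = "hello";

/* Hypothetical consumer that needs the length without the terminator. */
static size_t payload_size(void)
{
    assert(sizeof bytes == 6);
    return PAYLOAD_SIZE(bytes);
}
```

Code that previously used sizeof(bytes) would use PAYLOAD_SIZE(bytes) instead, so the adjustment lives in one definition.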

Ersek, Laszlo · Mar 15, 2010

But in my case the array is not a real string but a byte array. Hence I
don't want the terminating null character, and I use unsigned char for
the data type. Now, is it allowed to write this:

unsigned char str[5] = "hello";

This makes me think that you want to use the array for "binary
purposes", like writing it to a socket or to a binary file, so that it
leaves the boundaries of the system. In that case, the above
initialization is not portable, because it initializes str[0] .. str[4]
to platform-dependent values.

If your execution character set is ASCII-based, the above will amount to

char unsigned str[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

If your execution character set is EBCDIC-based, it will amount to

char unsigned str[5] = { 0x88u, 0x85u, 0x93u, 0x93u, 0x96u };

Even if you're sure that the execution character set will be
ASCII-based, the byte array form is much clearer on the issue, in my
opinion.

lacos
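
If the string literal form is kept, the ASCII assumption can be stated in code instead of only in a comment. The following is a sketch under that assumption; assume_ascii_h and hello_is_ascii are hypothetical names, and the compile-time check covers a single character rather than the whole basic character set. A typedef is used rather than #if because the value of a character constant inside #if is not guaranteed to match its value in ordinary expressions.

```c
#include <assert.h>

/* C89-compatible compile-time check: the array size becomes -1, which
   the compiler must diagnose, when 'h' is not encoded as 0x68. */
typedef char assume_ascii_h['h' == 0x68 ? 1 : -1];

static const unsigned char hello[] = "hello";

/* Returns 1 when every payload byte of hello has its ASCII value. */
static int hello_is_ascii(void)
{
    static const unsigned char ascii[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };
    unsigned i;

    assert(sizeof hello == sizeof ascii + 1);
    for (i = 0; i < sizeof ascii; ++i) {
        if (hello[i] != ascii[i])
            return 0;
    }
    return 1;
}
```

On an EBCDIC-based implementation the typedef would stop the build, which makes the dependency visible at the earliest possible point.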

Alf P. Steinbach · Mar 15, 2010

* Jef Driesen:

That works, but it's a little bit inconvenient because I use
sizeof(bytes) in a few places.

OK.

C++ solution for that:

typedef unsigned char Byte;
typedef Byte ByteArr5[5];

Byte data[] = "hello";
ByteArr5& bytes = reinterpret_cast<ByteArr5&>( data );

The awkwardness implies that you're working at cross-purposes with the language,
though. E.g. perhaps the size should be a named constant. Or perhaps use a
std::vector or boost::array or whatever. Or perhaps this part should really be
written in pure C and just accessed from C++. Something.

Cheers, & still hth.,

- Alf

Jef Driesen · Mar 15, 2010

But in my case the array is not a real string but a byte array. Hence I
don't want the terminating null character, and I use unsigned char for
the data type. Now, is it allowed to write this:

unsigned char str[5] = "hello";

This makes me think that you want to use the array for "binary
purposes", like writing it to a socket or to a binary file, so that it
leaves the boundaries of the system. In that case, the above
initialization is not portable, because it initializes str[0] .. str[4]
to platform-dependent values.

If your execution character set is ASCII-based, the above will amount to

char unsigned str[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

If your execution character set is EBCDIC-based, it will amount to

char unsigned str[5] = { 0x88u, 0x85u, 0x93u, 0x93u, 0x96u };

Even if you're sure that the execution character set will be
ASCII-based, the byte array form is much clearer on the issue, in my
opinion.

It is indeed used as binary data, but the content happens to be ASCII
data (and a number of zero bytes too, so it's definitely not usable as a
null-terminated string). The reason why I like the string literal is
that it makes the initialization a lot easier to read. If I see

unsigned char str[5] = "hello";
unsigned char str[5] = {0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu};

it's not immediately clear that the second variant equals "hello".

But I have to admit I didn't know that the character 'h' is not always
equal to 0x68. I assumed that for characters in the ASCII range this is
safe?
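
Data with embedded zero bytes can still be written as a string literal, as long as the length is taken from sizeof and not from strlen. A minimal sketch, with a made-up record layout and hypothetical helper names:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record: "AB", two zero bytes, then "CD". The three-digit
   octal escape \000 is used because an octal escape ends after at most
   three digits, whereas a hex escape such as \x00 would also absorb the
   following 'C' and 'D' as hex digits. */
static const unsigned char record[] = "AB\000\000CD";

/* sizeof counts every byte, including the embedded zeros and the
   implicit terminator, so the payload is sizeof - 1 bytes. */
static size_t record_size(void)
{
    assert(sizeof record == 7);
    return sizeof record - 1;
}

/* strlen stops at the first zero byte and therefore under-reports. */
static size_t record_strlen(void)
{
    return strlen((const char *)record);
}
```

Here record_size() yields 6 while record_strlen() yields 2, which is why such an array is not usable as a null-terminated string.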

Jef Driesen · Mar 15, 2010

The awkwardness implies that you're working at cross-purposes with the language,
though. E.g. perhaps the size should be a named constant. Or perhaps use a
std::vector or boost::array or whatever. Or perhaps this part should really be
written in pure C and just accessed from C++. Something.

The code is actually written in C. But it uses a number of C99 features
(such as variable declarations that are not at the top of a block) that
are not supported by the msvc C compiler, so I compile it as C++ code.

Thus adjusting my sizeof expressions is the less ugly solution in my case.

Tom St Denis · Mar 15, 2010

Maybe the solution is to refactor your code so you can declare your
local variables at the top of block scope? Just saying...

Tom

Kaz Kylheku · Mar 15, 2010

If your execution character set is ASCII-based, the above will amount to

char unsigned str[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

If your execution character set is ASCII based, it means that
you haven't yet managed to install Linux on that old IBM junker you
got at the swap meet.

Writing { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu } instead of "hello", in the
name of portability to EBCDIC is egregiously moronic, and a disservice
to whoever is signing your paycheck.

Ersek, Laszlo · Mar 15, 2010

Jef Driesen said:
It is indeed used as binary data, but the content happens to be ASCII
data (and a number of zero bytes too, so it's definitely not usable as a
null-terminated string). The reason why I like the string literal is
that it makes the initialization a lot easier to read. If I see

unsigned char str[5] = "hello";
unsigned char str[5] = {0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu};

it's not immediately clear that the second variant equals "hello".

And rightfully so, because the second variant does *not* equal "hello"
on an EBCDIC execution character set, for example.

I can offer no solution that is really pleasing to the eye. At best:

/* "hello" encoded in ASCII */
const char unsigned hello[] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

You describe your network protocol as sequences of specific octets. The
first variant doesn't initialize the array to specific octets if you
don't restrict further aspects of your environment. Of course you can
say that the program only works correctly on ASCII-based execution
character sets (I guess that covers the vast majority of systems today).
I just wanted to make you aware of your reliance on the basic execution
character set being encoded in ASCII.

(I used the word "octet" above. I usually check #if 8 == CHAR_BIT and
abort compilation with #error if char doesn't have exactly 8 bits. All
of C89, C99, SUSv1 and SUSv2 permit bigger bytes theoretically. Even if
no actual system with bytes wider than 8 bits might exist that also
supports the BSD sockets interface, I like to spell out this dependency
of my code explicitly.)
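
The CHAR_BIT check mentioned above can be sketched as follows. The error message text and the helper octet_max are illustrative only, not taken from any particular code base.

```c
#include <assert.h>
#include <limits.h>

/* Abort compilation on platforms where a byte is not an octet. */
#if CHAR_BIT != 8
#error "this code assumes 8-bit bytes (octets)"
#endif

/* With the check above in place, unsigned char holds exactly one octet,
   so its maximum value is 255. */
static unsigned octet_max(void)
{
    assert(UCHAR_MAX == 255);
    return UCHAR_MAX;
}
```

Unlike a character constant, CHAR_BIT is an ordinary integer constant expression, so testing it in #if is well defined.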

But I have to admit I didn't know that the character 'h' is not always
equal to 0x68. I assumed that for characters in the ASCII range this is
safe?

I'd risk saying it is safe on most systems today. Perhaps you'll want to
document your dependence on the ASCII encoding of the basic execution
character set, instead of changing the code.

lacos

Ersek, Laszlo · Mar 15, 2010

If your execution character set is ASCII based, it means that
you haven't yet managed to install Linux on that old IBM junker you
got at the swap meet.

Writing { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu } instead of "hello", in the
name of portability to EBCDIC is egregiously moronic, and a disservice
to whoever is signing your paycheck.

It is not without precedent, though.

$ less bzip2-1.0.5/CHANGES

----v----
1.0.2
~~~~~

[...]

* Hard-code header byte values, to give correct operation on platforms
using EBCDIC as their native character set (IBM's OS/390).
(Leland Lucius)

[...]
----^----

I agree that documenting reliance on ASCII may be a better way to go
than diminishing the readability of the source for a dubious increase in
portability. Being aware of the issue is useful in any case, IMHO.

lacos

Ersek, Laszlo · Mar 15, 2010

If your execution character set is ASCII based, it means that
you haven't yet managed to install Linux on that old IBM junker you
got at the swap meet.

Writing { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu } instead of "hello", in the
name of portability to EBCDIC is egregiously moronic, and a disservice
to whoever is signing your paycheck.

Not to debate your point any further, but I'd like to add the following:

1. In C99, __STDC_ISO_10646__ defined by the implementation implies,
AFAICT, that "hello" will in fact translate to { 0x68u, 0x65u, 0x6Cu,
0x6Cu, 0x6Fu } (and possibly a trailing \0 if space allows). I think
this can be derived from 6.10.8p2, 6.4.5p3, 6.4.4.4p11 and 5.2.1.2p1:

char unsigned s[5] = "hello";
                   = { 'h', 'e', 'l', 'l', 'o' };
                   = { L'h', L'e', L'l', L'l', L'o' };
                   = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

2. It seems to me that all four versions of the SUS published till now
were explicitly written with EBCDIC in mind.

v1:
System Interface Definitions
Issue 4, Version 2
4.4 Character Set Description File
paragraph 7

----v----
The charmap file was introduced to resolve problems with the portability
of, especially, /localedef/ sources. This document set assumes that the
portable character set is constant across all locales, but does not
prohibit implementations from supporting two incompatible codings, such
as both ASCII and EBCDIC. Such dual-support implementations should have
all charmaps and /localedef/ sources encoded using one portable character
set, in effect cross-compiling for the other environment. [...]
----^----

v2:
http://www.opengroup.org/onlinepubs/007908775/xbd/charset.html#tag_001_004

v3:
http://www.opengroup.org/onlinepubs/000095399/xrat/xbd_chap06.html#tag_01_06_01

v4:
http://www.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap06.html#tag_21_06_01

v4 is POSIX:2008 too, thus not very old. Citing the linked-to passage of
the v4 rationale:

----v----
A.6.1 Portable Character Set

The portable character set is listed in full so there is no dependency
on the ISO/IEC 646:1991 standard (or historically ASCII) encoded
character set, although the set is identical to the characters defined
in the International Reference version of the ISO/IEC 646:1991 standard.

[...]

The statement about invariance in codesets for the portable character
set is worded to avoid precluding implementations where multiple
incompatible codesets are available (for instance, ASCII and EBCDIC).
[...]
----^----

I hoarded all this stuff together because your post made me ponder
whether these standards I care about do require ASCII-based encodings
from a conforming implementation. They seem not to.

I'm not obsessed with EBCDIC per se. I generally care that my
assumptions about the environment -- not guaranteed by relevant
standards -- are *conscious*.

lacos

Jef Driesen · Mar 15, 2010

Maybe the solution is to refactor your code so you can declare your
local variables at the top of block scope? Just saying...

Those declarations are not the only C99 feature I'm using. Refactoring
is an option, but there are a lot more urgent items on my todo list if
you know what I mean.

For now, knowing that there is a difference between C and C++, using a
null terminated string works in both cases and is not that ugly to deal
with.

Jorgen Grahn · Mar 26, 2010

["Followup-To:" header set to comp.lang.c.]

Maybe the solution is to refactor your code so you can declare your
local variables at the top of block scope? Just saying...

For me it's the other way around -- add C99 declarations, and the
biggest reason for refactoring goes away.

But does MSVC really not support C99 features which are as fundamental
as this one? I have no experience with that compiler, but I find it
hard to believe. An old version?

/Jorgen

Default User · Mar 26, 2010

MS has not been particularly receptive towards C99. The version of the
C compiler in MSVC 2005 doesn't support that feature.

Brian
