Initializing a character array with a string literal?

Jef Driesen

Hi,

Sorry for cross-posting to c.l.c and c.l.c++, but I would like to know
the answer in both C and C++.

I know it is possible to initialize a character array with a string literal:

char str[] = "hello";

which is often more convenient than having to write:

char str[] = {'h', 'e', 'l', 'l', 'o', 0};

But in my case the array is not a real string but a byte array. Hence I
don't want the terminating null character, and I use unsigned char for
the data type. Now, is it allowed to write this:

unsigned char str[5] = "hello";

It works fine with gcc (in C code), but with msvc (in C++ code) I get an
error "C2117: array bounds overflow".

So I wonder if this construct is allowed, and whether there is a
difference in C and C++.

Thanks,

Jef
 
Nick Keighley

This is one of those places where C and C++ differ. C allows the
construct, C++ doesn't. No doubt comp.lang.c++ can suggest some
workarounds.
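
The difference can be seen in a minimal C sketch. The helper names below are made up for illustration. In C, the terminating null is stored only if there is room for it; C++ rejects the first declaration because the literal supplies six initializers (five characters plus the terminator) for five elements.

```c
#include <stddef.h>

/* Valid C: exactly five elements, so the terminating null is dropped.
   Ill-formed in C++, where all six characters of the literal must fit. */
static const unsigned char exact[5] = "hello";

/* Valid in both C and C++: the size is deduced as six, terminator included. */
static const unsigned char with_nul[] = "hello";

/* Hypothetical helpers that report the array sizes. */
size_t exact_size(void)    { return sizeof exact; }
size_t with_nul_size(void) { return sizeof with_nul; }
```

Compiled as C, exact_size() gives 5 and with_nul_size() gives 6. Some compilers can warn about the first form because the array no longer holds a terminated string.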
 
Alf P. Steinbach

The easiest in C++ is just to accept the superfluous null byte and write

unsigned char bytes[] = "hello";

Cheers & hth.,

- Alf
 
Jef Driesen

That works, but it's a little bit inconvenient because I use
sizeof(bytes) in a few places.
 
Ersek, Laszlo

But in my case the array is not a real string but a byte array. Hence I
don't want the terminating null character, and I use unsigned char for
the data type. Now, is it allowed to write this:

unsigned char str[5] = "hello";

This makes me think that you want to use the array for "binary
purposes", like writing it to a socket or to a binary file, so that it
leaves the boundaries of the system. In that case, the above
initialization is not portable, because it initializes str[0] .. str[4]
to platform-dependent values.

If your execution character set is ASCII-based, the above will amount to

char unsigned str[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

If your execution character set is EBCDIC-based, it will amount to

char unsigned str[5] = { 0x88u, 0x85u, 0x93u, 0x93u, 0x96u };

Even if you're sure that the execution character set will be
ASCII-based, the byte array form is much clearer on the issue, in my
opinion.

lacos
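
The point about platform-dependent values can be checked at run time. This is only a sketch: the array and function names are hypothetical, and the result depends on the execution character set of whatever compiler builds it.

```c
#include <string.h>

/* Values chosen by the compiler's execution character set. */
static const unsigned char from_literal[] = "hello";

/* The ASCII octets for "hello", fixed regardless of platform. */
static const unsigned char from_octets[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

/* Returns 1 if the literal was encoded as ASCII, 0 otherwise. */
int literal_matches_ascii(void)
{
    return memcmp(from_literal, from_octets, sizeof from_octets) == 0;
}
```

On an ASCII-based implementation the function returns 1; on an EBCDIC-based one the two arrays would differ and it would return 0.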
 
Alf P. Steinbach

That works, but it's a little bit inconvenient because I use
sizeof(bytes) in a few places.

OK.

C++ solution for that:

typedef unsigned char Byte;
typedef Byte ByteArr5[5];

Byte data[] = "hello";
ByteArr5& bytes = reinterpret_cast<ByteArr5&>( data );

The awkwardness implies that you're working at cross-purposes with the language,
though. E.g. perhaps the size should be a named constant. Or perhaps use a
std::vector or Boost::array or whatever. Or perhaps this part should really be
written in pure C and just accessed from C++. Something.


Cheers, & still hth.,

- Alf
 
Jef Driesen

But in my case the array is not a real string but a byte array. Hence I
don't want the terminating null character, and I use unsigned char for
the data type. Now, is it allowed to write this:

unsigned char str[5] = "hello";

This makes me think that you want to use the array for "binary
purposes", like writing it to a socket or to a binary file, so that it
leaves the boundaries of the system. In that case, the above
initialization is not portable, because it initializes str[0] .. str[4]
to platform-dependent values.

If your execution character set is ASCII-based, the above will amount to

char unsigned str[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

If your execution character set is EBCDIC-based, it will amount to

char unsigned str[5] = { 0x88u, 0x85u, 0x93u, 0x93u, 0x96u };

Even if you're sure that the execution character set will be
ASCII-based, the byte array form is much clearer on the issue, in my
opinion.

It is indeed used as binary data, but the contents happen to be ASCII
data (and a number of zero bytes too, so it's definitely not usable as a
null terminated string). The reason why I like the string literal, is
that it makes the initialization a lot easier to read. If I see

unsigned char str[5] = "hello";
unsigned char str[5] = {0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu};

it's not immediately clear that the second variant is equal to "hello".

But I have to admit I didn't know that the character 'h' is not always
equal to 0x68. I assumed that for characters in the ASCII range this is
safe?
 
Jef Driesen

The awkwardness implies that you're working at cross-purposes with the language,
though. E.g. perhaps the size should be a named constant. Or perhaps use a
std::vector or Boost::array or whatever. Or perhaps this part should really be
written in pure C and just accessed from C++. Something.

The code is actually written in C. But it uses a number of C99 features
(such as variable declarations that are not at the top of a block) that
are not supported by the msvc C compiler, so I compile it as C++ code.

Thus adjusting my sizeof's is a less ugly solution in my case.
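
One way to adjust the sizeof expressions in a single place is a small macro. This is a sketch only: LITERAL_LEN, packet and packet_len are hypothetical names, and the macro is valid only for arrays initialized from a string literal with a deduced size.

```c
#include <stddef.h>

/* Hypothetical helper: payload length of an array initialized from a
   string literal, excluding the implicit terminating null byte. */
#define LITERAL_LEN(arr) (sizeof(arr) - 1)

/* Binary data with an embedded zero byte; compiles as C and as C++. */
static const unsigned char packet[] = "he\0lo";

/* Number of payload bytes: five, although sizeof packet is six. */
size_t packet_len(void)
{
    return LITERAL_LEN(packet);
}
```

Because the length comes from sizeof rather than strlen, embedded zero bytes are counted correctly.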
 
Tom St Denis

The code is actually written in C. But it uses a number of C99 features
(such as variable declarations that are not at the top of a block) that
are not supported by the msvc C compiler, so I compile it as C++ code.

Thus adjusting my sizeof's is a less ugly solution in my case.

Maybe the solution is to refactor your code so you can declare your
local variables at the top of block scope? Just saying...

Tom
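
For illustration, this is the kind of change such a refactoring involves. The two functions are hypothetical and do the same work; only the placement of the loop variable's declaration differs.

```c
#include <stddef.h>

/* C99 style: the loop variable is declared inside the for statement. */
unsigned sum_c99(const unsigned char *buf, size_t len)
{
    unsigned total = 0;
    for (size_t i = 0; i < len; i++)
        total += buf[i];
    return total;
}

/* C89 style: every declaration sits at the top of the block. */
unsigned sum_c89(const unsigned char *buf, size_t len)
{
    unsigned total = 0;
    size_t i;
    for (i = 0; i < len; i++)
        total += buf[i];
    return total;
}
```

The second form is accepted by a C89-only compiler as well as by C99 and C++ compilers.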
 
Kaz Kylheku

If your execution character set is ASCII-based, the above will amount to

char unsigned str[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

If your execution character set is ASCII based, it means that
you haven't yet managed to install Linux on that old IBM junker you
got at the swap meet.

Writing { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu } instead of "hello", in the
name of portability to EBCDIC is egregiously moronic, and a disservice
to whoever is signing your paycheck.
 
Ersek, Laszlo

Jef Driesen said:
It is indeed used as binary data, but the contents happen to be ASCII
data (and a number of zero bytes too, so it's definitely not usable as a
null terminated string). The reason why I like the string literal, is
that it makes the initialization a lot easier to read. If I see

unsigned char str[5] = "hello";
unsigned char str[5] = {0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu};

it's not immediately clear that the second variant is equal to "hello".

And rightfully so, because the second variant does *not* equal "hello"
on an EBCDIC execution character set, for example.

I can offer no solution that is really pleasing to the eye. At best:

/* "hello" encoded in ASCII */
const char unsigned hello[] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

You describe your network protocol as sequences of specific octets. The
first variant doesn't initialize the array to specific octets if you
don't restrict further aspects of your environment. Of course you can
say that the program only works correctly on ASCII-based execution
character sets (I guess that covers the vast majority of systems today).
I just wanted to make you aware of your reliance on the basic execution
character set being encoded in ASCII.

(I used the word "octet" above. I usually check #if 8 == CHAR_BIT and
abort compilation with #error if char doesn't have exactly 8 bits. All
of C89, C99, SUSv1 and SUSv2 permit bigger bytes theoretically. Even if
no actual system with bytes wider than 8 bits might exist that also
supports the BSD sockets interface, I like to spell out this dependency
of my code explicitly.)

But I have to admit I didn't know that the character 'h' is not always
equal to 0x68. I assumed that for characters in the ASCII range this is
safe?

I'd risk saying it is safe on most systems today. Perhaps you'll want to
document your dependence on the ASCII encoding of the basic execution
character set, instead of changing the code.

lacos
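
Both dependencies mentioned above (8-bit bytes and an ASCII-based execution character set) can be spelled out in the source so that a build on an unsuitable platform fails early. A minimal sketch, with hypothetical names; the character check uses a negative-array-size typedef rather than #if, since character constants in preprocessor expressions are not guaranteed to have their execution character set values.

```c
#include <limits.h>

/* Refuse to compile unless a byte is exactly an octet. */
#if CHAR_BIT != 8
#error "this code assumes 8-bit bytes"
#endif

/* C89-compatible compile-time check: the array size becomes -1, and
   compilation fails, if 'h' is not encoded as in ASCII. */
typedef char assume_ascii_h[('h' == 0x68) ? 1 : -1];

/* Run-time form of the same assumption for the letters in "hello". */
int hello_is_ascii(void)
{
    return 'h' == 0x68 && 'e' == 0x65 && 'l' == 0x6C && 'o' == 0x6F;
}
```

Only the letters used here are checked; a real project might test a few more characters or simply document the assumption.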
 
Ersek, Laszlo

If your execution character set is ASCII-based, the above will amount to

char unsigned str[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

If your execution character set is ASCII based, it means that
you haven't yet managed to install Linux on that old IBM junker you
got at the swap meet.

Writing { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu } instead of "hello", in the
name of portability to EBCDIC is egregiously moronic, and a disservice
to whoever is signing your paycheck.

It is not without example, though.

$ less bzip2-1.0.5/CHANGES

----v----
1.0.2
~~~~~

[...]

* Hard-code header byte values, to give correct operation on platforms
using EBCDIC as their native character set (IBM's OS/390).
(Leland Lucius)

[...]
----^----

I agree that documenting reliance on ASCII may be a better way to go
than diminishing the readability of the source for a dubious increase in
portability. Being aware of the issue is useful in any case, IMHO.

lacos
 
Ersek, Laszlo

If your execution character set is ASCII-based, the above will amount to

char unsigned str[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

If your execution character set is ASCII based, it means that
you haven't yet managed to install Linux on that old IBM junker you
got at the swap meet.

Writing { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu } instead of "hello", in the
name of portability to EBCDIC is egregiously moronic, and a disservice
to whoever is signing your paycheck.

Not to debate your point any further, but I'd like to add the following:


1. In C99, __STDC_ISO_10646__ defined by the implementation implies,
AFAICT, that "hello" will in fact translate to { 0x68u, 0x65u, 0x6Cu,
0x6Cu, 0x6Fu } (and possibly a trailing \0 if space allows). I think
this can be derived from 6.10.8p2, 6.4.5p3, 6.4.4.4p11 and 5.2.1.2p1:

char unsigned s[5] = "hello";
= { 'h', 'e', 'l', 'l', 'o' };
= { L'h', L'e', L'l', L'l', L'o' };
= { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };


2. It seems to me that all four versions of the SUS published till now
were explicitly written with EBCDIC in mind.

v1:
System Interface Definitions
Issue 4, Version 2
4.4 Character Set Description File
paragraph 7

----v----
The charmap file was introduced to resolve problems with the portability
of, especially, /localedef/ sources. This document set assumes that the
portable character set is constant across all locales, but does not
prohibit implementations from supporting two incompatible codings, such
as both ASCII and EBCDIC. Such dual-support implementations should have
all charmaps and /localedef/ sources encoded using one portable character
set, in effect cross-compiling for the other environment. [...]
----^----

v2:
http://www.opengroup.org/onlinepubs/007908775/xbd/charset.html#tag_001_004

v3:
http://www.opengroup.org/onlinepubs/000095399/xrat/xbd_chap06.html#tag_01_06_01

v4:
http://www.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap06.html#tag_21_06_01

v4 is POSIX:2008 too, thus not very old. Citing the linked-to passage of
the v4 rationale:

----v----
A.6.1 Portable Character Set

The portable character set is listed in full so there is no dependency
on the ISO/IEC 646:1991 standard (or historically ASCII) encoded
character set, although the set is identical to the characters defined
in the International Reference version of the ISO/IEC 646:1991 standard.

[...]

The statement about invariance in codesets for the portable character
set is worded to avoid precluding implementations where multiple
incompatible codesets are available (for instance, ASCII and EBCDIC).
[...]
----^----

I hoarded all this stuff together because your post made me ponder
whether these standards I care about do require ASCII-based encodings
from a conforming implementation. They seem not to.

I'm not obsessed with EBCDIC per se. I generally care that my
assumptions about the environment -- not guaranteed by relevant
standards -- are *conscious*.

lacos
 
Jef Driesen

Maybe the solution is to refactor your code so you can declare your
local variables at the top of block scope? Just saying...

Those declarations are not the only C99 feature I'm using. Refactoring
is an option, but there are a lot more urgent items on my todo list if
you know what I mean.

For now, knowing that there is a difference between C and C++, using a
null terminated string works in both cases and is not that ugly to deal
with.
 
Jorgen Grahn

["Followup-To:" header set to comp.lang.c.]
Maybe the solution is to refactor your code so you can declare your
local variables at the top of block scope? Just saying...

For me it's the other way around -- add C99 declarations, and the
biggest reason for refactoring goes away.

But does MSVC really not support C99 features which are as fundamental
as this one? I have no experience with that compiler, but I find it
hard to believe. An old version?

/Jorgen
 
Default User

Jorgen said:
For me it's the other way around -- add C99 declarations, and the
biggest reason for refactoring goes away.

But does MSVC really not support C99 features which are as fundamental
as this one? I have no experience with that compiler, but I find it
hard to believe. An old version?

MS has not been particularly receptive towards C99. The version of the
C compiler in MSVC 2005 doesn't support that feature.



Brian
 
