Multicharacter literals

Richard Smith

I recently encountered some C++ code that made use of multicharacter
literals -- that is, something that looks like a character literal,
but contains more than one character:

int i = 'foo';

I must admit, I hadn't realised that C++ still allowed these and had
assumed they went the way of implicit int and K&R-style function
declarations. The standard tells me that, unsurprisingly, their
representation is implementation-defined (and so does the C standard), so
my questions here are not about what the standard requires (nor
whether I should be using them), but rather what implementations
commonly do and why.

Using GCC on i386, I find that

'foo' == ('f' << 16 | 'o' << 8 | 'o');

Because i386 is little-endian, this implies it lays out the literal as
"oof\0", and this is confirmed if I look at the object code
generated. I must admit, this surprised me. Certainly this choice is
permitted, and it's easiest for the compiler to parse as it's just a
base-256 integer. But the only sensible reason I can think of for
using multicharacter literals is when doing binary I/O. Short strings
the length of the machine word exist in a number of binary formats --
e.g. "\x7FELF" at the start of an ELF binary, or labels like "RIFF"
and "WAVE" in the WAV audio format. If I were writing in assembly, I
might well convert these manually to 32-bit integers and then simply
dump them; and I can possibly imagine wanting to do that in C or C++
when writing low-level code. But if I do that with GCC's
multicharacter literals, they have the wrong byte order: I would have
to dump 'EVAW' instead of 'WAVE'.

It seems unlikely that GCC would make an inconvenient implementation
choice for no good reason, so presumably, then, there is (or once was)
another use for these that's eluding me. Can anyone suggest what it
is?

Richard
 
Öö Tiib

I recently encountered some C++ code that made use of multicharacter
literals -- that is, something that looks like a character literal,
but contains more than one character:

int i = 'foo';

I must admit, I hadn't realised that C++ still allowed these and had
assumed they went the way of implicit int and K&R-style function
declarations. The standard tells me that, unsurprisingly, their
representation is implementation-defined (and so does the C standard), so
my questions here are not about what the standard requires (nor
whether I should be using them), but rather what implementations
commonly do and why.

....

That you found out further on. There are really no other reasons to use it (and there never have been) but to confuse the heck out of a novice maintainer.

Newer versions of compilers compile it into the same data and instructions that they always did. Some compilers may issue warnings, and that is it, because there may be some legacy code that uses it for whatever reason.

The same thing likely happens with implicit int and K&R function declarations: despite their having been kicked out of the standards, at least some compilers still compile them and issue warnings. Legacy code is too sacred to touch.

It is left up to the development process (with its possible coding standards, tools and code reviews) to decide how to address usage of all such features.
 
Richard Smith

...

That you found out further on. There are really no other reasons to use it (and there never have been) but to confuse the heck out of a novice maintainer.

Well, clearly that's not true. Compiler writers don't decide to add
functionality simply "to confuse the heck out of a novice
maintainer."

Multicharacter literals go back to the late 1960s in the B language; C
inherited them from B, and C++ from C. It's easy to see why they
existed in B. For one thing, there was no char type: everything was
an int, even if you only cared about the lowest 8 bits. Optimising
for code size was also far more important than today, and if you were
used to writing in assembler, you'd be used to putting small strings
as immediates. If you look in the B manual, you'll see examples of
multicharacter literals used in this way: effectively, optimised very
short strings.

However, GCC's (perfectly legal) implementation choice doesn't allow
that usage. As you point out, compiler writers don't break
compatibility with old code for no reason, yet here, somewhere along
the line, a compiler vendor evidently decided to implement
multicharacter literals in a way that broke their use as small
strings. It would have been trivial to have implemented them on a
little-endian machine so that they worked as short strings. So I can
only assume there was some other use of multicharacter literals that
was more important to keep working. I am curious as to what that
other, more important use was.

Richard
 

Öö Tiib

Well, clearly that's not true. Compiler writers don't decide to add
functionality simply "to confuse the heck out of a novice
maintainer."

You likely misunderstood what I meant. I meant that I do not see any other reasons "to use it" [in modern C++ code]. The reason why it is (and possibly will remain forever) in the C++ language I discussed further on. Thanks for adding the history about B etc.
As you point out, compiler writers don't break
compatibility with old code for no reason, yet here, somewhere along
the line, a compiler vendor evidently decided to implement
multicharacter literals in a way that broke their use as small
strings.

I somehow do not believe that there is an ultra-profitable way to use those strange literals. For all cases there must exist less obscure, more elegant and portable code that produces exactly the same compiled binary.
 

Richard Damon

I recently encountered some C++ code that made use of multicharacter
literals -- that is, something that looks like a character literal,
but contains more than one character:

int i = 'foo';

I must admit, I hadn't realised that C++ still allowed these and had
assumed they went the way of implicit int and K&R-style function
declarations. The standard tells me that, unsurprisingly, their
representation is implementation-defined (and so does the C standard), so
my questions here are not about what the standard requires (nor
whether I should be using them), but rather what implementations
commonly do and why.

Using GCC on i386, I find that

'foo' == ('f' << 16 | 'o' << 8 | 'o');

Because i386 is little-endian, this implies it lays out the literal as
"oof\0", and this is confirmed if I look at the object code
generated. I must admit, this surprised me. Certainly this choice is
permitted, and it's easiest for the compiler to parse as it's just a
base-256 integer. But the only sensible reason I can think of for
using multicharacter literals is when doing binary I/O. Short strings
the length of the machine word exist in a number of binary formats --
e.g. "\x7FELF" at the start of an ELF binary, or labels like "RIFF"
and "WAVE" in the WAV audio format. If I were writing in assembly, I
might well convert these manually to 32-bit integers and then simply
dump them; and I can possibly imagine wanting to do that in C or C++
when writing low-level code. But if I do that with GCC's
multicharacter literals, they have the wrong byte order: I would have
to dump 'EVAW' instead of 'WAVE'.

It seems unlikely that GCC would make an inconvenient implementation
choice for no good reason, so presumably, then, there is (or once was)
another use for these that's eluding me. Can anyone suggest what it
is?

Richard

It looks like GCC has decided to implement multicharacter literals as a
form of "base-256" number, which is actually a common use for this sort
of thing. This makes 't' the same as '\0\0\0t' instead of 't\0\0\0',
which, if you think about it, is the required meaning for a single-character
literal. Since a single-character literal MUST place its value in the
low-order byte, it makes sense to keep this up. It also means that
two-character literals are 16-bit values, four-character literals are
32-bit values, and, if you want to allow them, eight-character literals
are 64-bit values.

Any encoding of strings that puts them in memory order on little-endian
machines breaks this very useful property.

Since any program that directly reads binary files with multi-byte
fields needs to worry about endianness issues anyway, it shouldn't be
THAT hard to have the program treat such a header code as a "big-endian
int" and do the byte reversal on fetching, allowing the comparison to
be done against the "natural" format for multicharacter literals.
 
