Multicharacter literals

Richard Smith

I recently encountered some C++ code that made use of multicharacter
literals -- that is, something that looks like a character literal,
but contains more than one character:

int i = 'foo';

I must admit, I hadn't realised that C++ still allowed these and had
assumed they went the way of implicit int and K&R-style function
declarations. The standard tells me that, unsurprisingly, their
representation is implementation-defined (and so does the C standard), so
my questions here are not about what the standard requires (nor
whether I should be using them), but rather what implementations
commonly do and why.

Using GCC on i386, I find that

'foo' == ('f' << 16 | 'o' << 8 | 'o');

Because i386 is little-endian, this implies it lays out the literal as
"oof\0", and this is confirmed if I look at the object code
generated. I must admit, this surprised me. Certainly this choice is
permitted, and it's easiest for the compiler to parse as it's just a
base-256 integer. But the only sensible reason I can think of for
using multicharacter literals is when doing binary I/O. Short strings
the length of the machine word exist in a number of binary formats --
e.g. "\x7FELF" at the start of an ELF binary, or labels like "RIFF"
and "WAVE" in the WAV audio format. If I were writing in assembly, I
might well convert these manually to 32-bit integers and then simply
dump them; and I can possibly imagine wanting to do that in C or C++
when writing low-level code. But if I do that with GCC's
multicharacter literals, they have the wrong byte order: I would have
to dump 'EVAW' instead of 'WAVE'.

It seems unlikely that GCC would make an inconvenient implementation
choice for no good reason, so presumably, then, there is (or once was)
another use for these that's eluding me. Can anyone suggest what it
is?

Richard
 
Öö Tiib

I recently encountered some C++ code that made use of multicharacter
literals -- that is, something that looks like a character literal,
but contains more than one character:

int i = 'foo';

I must admit, I hadn't realised that C++ still allowed these and had
assumed they went the way of implicit int and K&R-style function
declarations. The standard tells me that, unsurprisingly, their
representation is implementation-defined (and so does the C standard), so
my questions here are not about what the standard requires (nor
whether I should be using them), but rather what implementations
commonly do and why.

....

That you found out further on. There are really no other reasons to use it (and there never have been) but to confuse the heck out of a novice maintainer.

Newer versions of compilers compile it into the same data and instructions that they always did. Some compilers may issue warnings, and that is it, because there may be some legacy code that uses it for whatever reason.

The same thing likely happens with implicit int and K&R function declarations: despite their having been kicked out of the standards, at least some compilers still compile them and issue warnings. Legacy code is too sacred to touch.

It is left up to the development process (with its possible coding standards, tools and code reviews) to decide how to address usage of all such features.
 
Richard Smith

...

That you found out further on. There are really no other reasons to use it (and there never have been) but to confuse the heck out of a novice maintainer.

Well, clearly that's not true. Compiler writers don't decide to add
functionality simply "to confuse the heck out of a novice
maintainer."

Multicharacter literals go back to the late 1960s in the B language; C
inherited them from B, and C++ from C. It's easy to see why they
existed in B. For one thing, there was no char type: everything was
an int, even if you only cared about the lowest 8 bits. Optimising
for code size was also far more important than today, and if you were
used to writing in assembler, you'd be used to putting small strings
as immediates. If you look in the B manual, you'll see examples of
multicharacter literals used in this way: effectively, optimised very
short strings.

However, GCC's (perfectly legal) implementation choice doesn't allow
that usage. As you point out, compiler writers don't break
compatibility with old code for no reason, yet here, somewhere along
the line, a compiler vendor evidently decided to implement
multicharacter literals in a way that broke their use as small
strings. It would have been trivial to have implemented them on a
little-endian machine so that they worked as short strings. So I can
only assume there was some other use of multicharacter literals that
was more important to keep working. I am curious as to what that
other, more important use was.

Richard
 

Öö Tiib

Well, clearly that's not true. Compiler writers don't decide to add
functionality simply "to confuse the heck out of a novice
maintainer."

You likely misunderstood what I meant. I meant that I do not see any other reasons "to use it" [in modern C++ code]. The reason why it is (and possibly will remain forever) in the C++ language I discussed further on. Thanks for adding the history about B etc.
As you point out, compiler writers don't break
compatibility with old code for no reason, yet here, somewhere along
the line, a compiler vendor evidently decided to implement
multicharacter literals in a way that broke their use as small
strings.

I somehow do not believe that there is an ultra-profitable way to use those strange literals. For all cases there must exist less obscure, more elegant and portable code that produces exactly the same compiled binary.
 

Richard Damon

I recently encountered some C++ code that made use of multicharacter
literals -- that is, something that looks like a character literal,
but contains more than one character:

int i = 'foo';

I must admit, I hadn't realised that C++ still allowed these and had
assumed they went the way of implicit int and K&R-style function
declarations. The standard tells me that, unsurprisingly, their
representation is implementation-defined (and so does the C standard), so
my questions here are not about what the standard requires (nor
whether I should be using them), but rather what implementations
commonly do and why.

Using GCC on i386, I find that

'foo' == ('f' << 16 | 'o' << 8 | 'o');

Because i386 is little-endian, this implies it lays out the literal as
"oof\0", and this is confirmed if I look at the object code
generated. I must admit, this surprised me. Certainly this choice is
permitted, and it's easiest for the compiler to parse as it's just a
base-256 integer. But the only sensible reason I can think of for
using multicharacter literals is when doing binary I/O. Short strings
the length of the machine word exist in a number of binary formats --
e.g. "\x7FELF" at the start of an ELF binary, or labels like "RIFF"
and "WAVE" in the WAV audio format. If I were writing in assembly, I
might well convert these manually to 32-bit integers and then simply
dump them; and I can possibly imagine wanting to do that in C or C++
when writing low-level code. But if I do that with GCC's
multicharacter literals, they have the wrong byte order: I would have
to dump 'EVAW' instead of 'WAVE'.

It seems unlikely that GCC would make an inconvenient implementation
choice for no good reason, so presumably, then, there is (or once was)
another use for these that's eluding me. Can anyone suggest what it
is?

Richard

It looks like GCC has decided to implement multicharacter literals as a
form of "base-256" number, which is actually a common use for this sort
of thing. This makes 't' the same as '\0\0\0t' instead of 't\0\0\0',
which, if you think about it, is the required meaning for a single-character
literal. Since a single-character literal MUST place its value in the
low-order byte, it makes sense to keep this up. It also means that
two-character literals are 16-bit values, four-character literals are
32-bit values, and, if you want to allow them, eight-character literals
are 64-bit values.

Any encoding of strings that puts them in memory order on little-endian
machines breaks this very useful property.

Since any program that directly reads binary files with multi-byte
fields needs to worry about endianness issues anyway, it shouldn't be
THAT hard to have the program treat such a header code as a "big-endian
int" and do the byte reversal on fetching, allowing the comparison to
be done against the "natural" format for multicharacter literals.
 
