Multicharacter literals

Discussion in 'C++' started by Richard Smith, Aug 22, 2012.

  1. I recently encountered some C++ code that made use of multicharacter
    literals -- that is, something that looks like a character literal,
    but contains more than one character:

    int i = 'foo';

    I must admit, I hadn't realised that C++ still allowed these and had
    assumed they went the way of implicit int and K&R-style function
    declarations. The standard tells me that, unsurprisingly, their
    representation is implementation-defined (and so does the C standard), so
    my questions here are not about what the standard requires (nor
    whether I should be using them), but rather what implementations
    commonly do and why.

    Using GCC on i386, I find that

    'foo' == ('f' << 16 | 'o' << 8 | 'o');

    Because i386 is little-endian, this implies it lays out the literal as
    "oof\0", and this is confirmed if I look at the object code
    generated. I must admit, this surprised me. Certainly this choice is
    permitted, and it's easiest for the compiler to parse as it's just a
    base-256 integer. But the only sensible reason I can think of for
    using multicharacter literals is when doing binary I/O. Short strings
    the length of the machine word exist in a number of binary formats --
    e.g. "\x7FELF" at the start of an ELF binary, or labels like "RIFF"
    and "WAVE" in the WAV audio format. If I were writing in assembly, I
    might well convert these manually to 32-bit integers and then simply
    dump them; and I can possibly imagine wanting to do that in C or C++
    when writing low-level code. But if I do that with GCC's
    multicharacter literals, they have the wrong byte order: I would have
    to dump 'EVAW' instead of 'WAVE'.
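
    A minimal sketch of the byte-order point above, not taken from the thread. It builds the value the post reports for GCC (first character most significant) with explicit shifts rather than a multicharacter literal, then copies the object's bytes out in memory order. The names kWave, raw_bytes and host_is_little_endian are hypothetical.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// The base-256 value described in the post: 'W' ends up most significant.
// Built with shifts so it does not depend on any compiler's literal rules.
constexpr std::uint32_t kWave =
    (std::uint32_t('W') << 24) | (std::uint32_t('A') << 16) |
    (std::uint32_t('V') << 8)  |  std::uint32_t('E');

// The four bytes of v in the order a raw dump (e.g. fwrite of &v) would emit.
inline std::string raw_bytes(std::uint32_t v) {
    char buf[sizeof v];
    std::memcpy(buf, &v, sizeof v);
    return std::string(buf, sizeof v);
}

// True when the least significant byte is stored first.
inline bool host_is_little_endian() {
    const std::uint16_t one = 1;
    unsigned char first;
    std::memcpy(&first, &one, 1);
    return first == 1;
}
```

    On a little-endian host raw_bytes(kWave) comes out reversed, which is the inconvenience described; on a big-endian host the same value dumps in reading order.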

    It seems unlikely that GCC would make an inconvenient implementation
    choice for no good reason, so presumably, then, there is (or once was)
    another use for these that's eluding me. Can anyone suggest what it
    is?

    Richard
    Richard Smith, Aug 22, 2012
    #1

  2. Öö Tiib Guest

    On Wednesday, August 22, 2012 7:35:57 PM UTC+3, Richard Smith wrote:
    > I recently encountered some C++ code that made use of multicharacter
    > literals -- that is, something that looks like a character literal,
    > but contains more than one character:
    >
    > int i = 'foo';
    >
    > I must admit, I hadn't realised that C++ still allowed these and had
    > assumed they went the way of implicit int and K&R-style function
    > declarations. The standard tells me that, unsurprisingly, their
    > representation implementation-defined (and so does the C standard), so
    > my questions here are not about what the standard requires (nor
    > whether I should be using them), but rather what implementations
    > commonly do and why.


    ....

    That much you found out yourself. There is really no reason to use them, and there never has been, other than to confuse the heck out of a novice maintainer.

    Newer compiler versions compile them into the same data and instructions they always did. Some compilers may issue warnings, and that is all, because there may be legacy code that uses them for whatever reason.

    The same likely happens with implicit int and K&R function declarations: although they have been removed from the standards, at least some compilers still compile them and issue warnings. Legacy code is too sacred to touch.

    It is left up to the development process (with its coding standards, tools and code reviews) to decide how to address the use of such features.
    Öö Tiib, Aug 22, 2012
    #2

  3. On Aug 22, 6:40 pm, Öö Tiib <> wrote:
    > On Wednesday, August 22, 2012 7:35:57 PM UTC+3, Richard Smith wrote:
    > > I recently encountered some C++ code that made use of multicharacter
    > > literals -- that is, something that looks like a character literal,
    > > but contains more than one character:

    >
    > >   int i = 'foo';

    >
    > > I must admit, I hadn't realised that C++ still allowed these and had
    > > assumed they went the way of implicit int and K&R-style function
    > > declarations. The standard tells me that, unsurprisingly, their
    > > representation implementation-defined (and so does the C standard), so
    > > my questions here are not about what the standard requires (nor
    > > whether I should be using them), but rather what implementations
    > > commonly do and why.

    >
    > ...
    >
    > That you further found out. There are really no other reasons to use it (and have never been) but to confuse the heck out of a novice maintainer.


    Well, clearly that's not true. Compiler writers don't decide to add
    functionality simply "to confuse the heck out of a novice
    maintainer."

    Multicharacter literals go back to the late 1960s in the B language; C
    inherited them from B, and C++ from C. It's easy to see why they
    existed in B. For one thing, there was no char type: everything was
    an int, even if you only cared about the lowest 8 bits. Optimising
    for code size was also far more important than today, and if you were
    used to writing in assembler, you'd be used to putting small strings
    as immediates. If you look in the B manual, you'll see examples of
    multicharacter literals used in this way: effectively, optimised very
    short strings.

    However, GCC's (perfectly legal) implementation choice doesn't allow
    that usage. As you point out, compiler writers don't break
    compatibility with old code for no reason, yet here, somewhere along
    the line, a compiler vendor evidently decided to implement
    multicharacter literals in a way that broke their use as small
    strings. It would have been trivial to have implemented them on a
    little-endian machine so that they worked as short strings. So I can
    only assume there was some other use of multicharacter literals that
    was more important to keep working. I am curious as to what that
    other, more important use was.
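
    For illustration only, here is one way to get the "short string" behaviour in portable C++ without relying on any compiler's multicharacter literal rules. It is a sketch, not something proposed in the thread, and memory_order_tag and dump are hypothetical names. The tag is built by copying the characters into the integer, so its in-memory bytes match the string whatever the host byte order.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Build a 32-bit tag whose bytes in memory are exactly the four characters
// given (the argument must be a four-character string literal).
inline std::uint32_t memory_order_tag(const char (&s)[5]) {
    std::uint32_t v;
    std::memcpy(&v, s, sizeof v);  // copies the 4 characters, not the '\0'
    return v;
}

// The bytes a raw write of the tag would produce.
inline std::string dump(std::uint32_t v) {
    char buf[sizeof v];
    std::memcpy(buf, &v, sizeof v);
    return std::string(buf, sizeof v);
}
```

    The numeric value of such a tag differs between little-endian and big-endian hosts; only the byte sequence is fixed, which is the property a file-format label needs.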

    Richard
    Richard Smith, Aug 22, 2012
    #3
  4. Öö Tiib Guest

    On Wednesday, August 22, 2012 9:18:53 PM UTC+3, Richard Smith wrote:
    > On Aug 22, 6:40 pm, Öö Tiib <> wrote:
    > > On Wednesday, August 22, 2012 7:35:57 PM UTC+3, Richard Smith wrote:
    > > > I recently encountered some C++ code that made use of multicharacter
    > > > literals -- that is, something that looks like a character literal,
    > > > but contains more than one character:

    > >
    > > >   int i = 'foo';

    > >
    > > > I must admit, I hadn't realised that C++ still allowed these and had
    > > > assumed they went the way of implicit int and K&R-style function
    > > > declarations. The standard tells me that, unsurprisingly, their
    > > > representation implementation-defined (and so does the C standard), so
    > > > my questions here are not about what the standard requires (nor
    > > > whether I should be using them), but rather what implementations
    > > > commonly do and why.

    > >
    > > ...
    > >
    > > That you further found out. There are really no other reasons to use it (and have never been) but to confuse the heck out of a novice maintainer.

    >
    >
    > Well, clearly that's not true. Compiler writers don't decide to add
    > functionality simply "to confuse the heck out of a novice
    > maintainer."


    You likely misunderstood what I meant. I meant that I do not see any other reason "to use it" [in modern C++ code]. I went on to discuss the reason why it is (and possibly forever will be) in the C++ language. Thanks for adding the history about B etc.

    > As you point out, compiler writers don't break
    > compatibility with old code for no reason, yet here, somewhere along
    > the line, a compiler vendor evidently decided to implement
    > multicharacter literals in a way that broke their use as small
    > strings.


    I somehow do not believe that there is an ultra-profitable way to use those strange literals. In every case there must be less obscure, more elegant and more portable code that produces exactly the same compiled binary.
    Öö Tiib, Aug 22, 2012
    #4
  5. On 8/22/12 12:35 PM, Richard Smith wrote:
    > I recently encountered some C++ code that made use of multicharacter
    > literals -- that is, something that looks like a character literal,
    > but contains more than one character:
    >
    > int i = 'foo';
    >
    > I must admit, I hadn't realised that C++ still allowed these and had
    > assumed they went the way of implicit int and K&R-style function
    > declarations. The standard tells me that, unsurprisingly, their
    > representation implementation-defined (and so does the C standard), so
    > my questions here are not about what the standard requires (nor
    > whether I should be using them), but rather what implementations
    > commonly do and why.
    >
    > Using GCC on i386, I find that
    >
    > 'foo' == ('f' << 16 | 'o' << 8 | 'o');
    >
    > Because i386 is little-endian, this implies it lays out the literal as
    > "oof\0", and this is confirmed if I look at the object code
    > generated. I must admit, this surprised me. Certainly this choice is
    > permitted, and it's easiest for the compiler to parse as it's just a
    > base-256 integer. But the only sensible reason I can think of for
    > using multicharacter literals is when doing binary I/O. Short strings
    > the length of the machine word exist in a number of binary formats --
    > e.g. "\x7FELF" at the start of an ELF binary, or labels like "RIFF"
    > and "WAVE" in the WAV audio format. If I were writing in assembly, I
    > might well convert these manually to 32-bit integers and then simply
    > dump them; and I can possibly imagine wanting to do that in C or C++
    > when writing low-level code. But if I do that with GCC's
    > multicharacter literals, they have the wrong byte order: I would have
    > to dump 'EVAW' instead of 'WAVE'.
    >
    > It seems unlikely that GCC would make an inconvenient implementation
    > choice for no good reason, so presumably, then, there is (or once was)
    > another use for these that's eluding me. Can anyone suggest what it
    > is?
    >
    > Richard
    >


    It looks like GCC has decided to implement multicharacter literals as a
    form of "Base 256" numbers, which is actually a common use for this sort
    of thing. This makes 't' the same as '\0\0\0t' instead of 't\0\0\0', which,
    if you think about it, is the required meaning for a single character
    literal. Since a single character literal MUST place its value in the
    bottom of the value, it makes sense to keep this up. It also means that
    two character literals are 16 bit values, 4 character literals are 32
    bit values, and if you want to allow them, 8 character literals are 64
    bit values.

    Any encoding of string that puts them in memory order on little endian
    machines breaks this very useful property.
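
    The base-256 reading can be modelled in a few lines. This is only a sketch of the arithmetic described above, with a hypothetical helper named pack; it does not use actual multicharacter literals, whose values remain implementation-defined.

```cpp
#include <cstddef>
#include <cstdint>

// Interpret the characters of a string literal as base-256 digits, first
// character most significant. N counts the terminating '\0', which is skipped.
template <std::size_t N>
constexpr std::uint64_t pack(const char (&s)[N], std::size_t i = 0,
                             std::uint64_t acc = 0) {
    return i + 1 < N
               ? pack(s, i + 1, (acc << 8) | static_cast<unsigned char>(s[i]))
               : acc;
}

// A single character keeps its ordinary value in the low byte.
static_assert(pack("t") == static_cast<std::uint64_t>('t'), "low byte");
// Leading zero digits do not change the number.
static_assert(pack("\0\0\0t") == pack("t"), "leading zeros");
// n characters fit in 8*n bits.
static_assert(pack("ab") <= 0xFFFFu, "two characters fit in 16 bits");
static_assert(pack("RIFF") <= 0xFFFFFFFFu, "four characters fit in 32 bits");
```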

    Since any program which directly reads binary files with multi-byte
    fields needs to worry about endian issues, it shouldn't be THAT hard to
    have the program consider this header code as a "big endian int" and
    thus do the byte reversal on fetching, thus allowing the comparison to
    be done to the "natural" format for multicharacter literals.
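
    A rough sketch of that suggestion, assuming the header bytes are already in a buffer. read_be32, be_tag and looks_like_riff are hypothetical names; be_tag computes the base-256 value that the thread reports GCC giving to a four-character literal, so the comparison does not depend on any compiler's literal rules.

```cpp
#include <cstdint>

// Read four bytes as a big-endian 32-bit integer, independent of host order.
inline std::uint32_t read_be32(const unsigned char* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

// Base-256 value of four characters, first character most significant.
constexpr std::uint32_t be_tag(char a, char b, char c, char d) {
    return (std::uint32_t(static_cast<unsigned char>(a)) << 24) |
           (std::uint32_t(static_cast<unsigned char>(b)) << 16) |
           (std::uint32_t(static_cast<unsigned char>(c)) << 8)  |
            std::uint32_t(static_cast<unsigned char>(d));
}

// Does the buffer start with the "RIFF" label?
inline bool looks_like_riff(const unsigned char* header) {
    return read_be32(header) == be_tag('R', 'I', 'F', 'F');
}
```

    Because the bytes are combined with shifts rather than reinterpreted in place, the check gives the same answer on little-endian and big-endian hosts.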
    Richard Damon, Aug 25, 2012
    #5
