Endianness macros

Discussion in 'C Programming' started by James Harris, Aug 23, 2013.

  1. James Harris

    James Harris Guest

    (Related to a separate post about htons etc)

    In endian.h gcc includes some useful names under the protection of #ifdef
    __USE_BSD such as

    # if __BYTE_ORDER == __LITTLE_ENDIAN
    # define htobe16(x) __bswap_16 (x)
    # define htole16(x) (x)
    # define be16toh(x) __bswap_16 (x)
    # define le16toh(x) (x)

    Whether or not gcc can be persuaded to expose them in the environment I am
    working in, such names are not present in all the compilers I am using. I
    therefore
    need to set up some similar operations and can see some 'interesting' issues
    over defining them. I am sure that this kind of thing is an oft-asked
    question so rather than just asking for suggestions I'll write up what I
    have been considering and would much appreciate feedback. I do have some
    specific issues in mind.

    First and foremost, there seems to be no practical way for the
    *preprocessor* to detect the endianness of the target machine. If so, the
    options seem to be either to select different endiannesses in the code as in

    if little endian
    ...
    else if big endian
    ...
    else
    ...

    or, alternatively, to specify the endianness when the code is compiled. I am
    thinking that because each target machine would be different the object code
    would have to be different for each. (Some machines such as Arm can operate
    in either mode.) So it would be reasonable to produce different object
    files. The compiler output directories would have to include the name of the
    target architecture so that a given piece of source code could compile to
    each target. Even if the object code included if-endianness tests such as
    those above, only one branch of each such test would ever be used on a given
    machine (in a given mode).

    I think I could specify the endianness of the target by either including a
    build-specific header file or by passing a symbol definition when the
    compiler is invoked. If so, is either approach a generically better one to
    take or is there another way to get the same effect?
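    The second option might look like the following minimal sketch. The macro names TARGET_LITTLE_ENDIAN and TARGET_BIG_ENDIAN are invented for illustration; a real build would pass one of them on the compiler command line (e.g. -DTARGET_LITTLE_ENDIAN) or define it in a build-specific header.

```c
#include <stdint.h>

/* Hypothetical build-time switch: for this sketch, default to a
 * little-endian build when nothing was passed in. */
#ifndef TARGET_BIG_ENDIAN
#define TARGET_LITTLE_ENDIAN 1
#endif

/* Helper that swaps the two bytes of a 16-bit value. */
#define SWAP16_(x) ((uint16_t)((((uint16_t)(x)) >> 8) | (((uint16_t)(x)) << 8)))

#ifdef TARGET_LITTLE_ENDIAN
#define HTOLE16(x) ((uint16_t)(x))   /* no-op on the matching target */
#define HTOBE16(x) SWAP16_(x)
#else
#define HTOLE16(x) SWAP16_(x)
#define HTOBE16(x) ((uint16_t)(x))   /* no-op on the matching target */
#endif
```

    Either way the choice is frozen into the object code, which fits the idea of per-target output directories.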

    Second, the use of macros is good since, as above, operations that have no
    effect can clearly cost nothing at execution time. But why are the above
    macro names not capitalised? I usually take capitals as a warning that
    something that looks like a function call is really a macro and I need to be
    careful about the type of parameter that is used. Are those names
    uncapitalised because they are always safe to use as macros?
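    For what it's worth, the warning that capitals usually carry can be seen in a naive swap macro (name invented for this sketch): the argument appears twice in the expansion, so an argument with side effects would be evaluated twice.

```c
#include <stdint.h>

/* Evaluates x twice: BSWAP16(*p++) would increment p twice, which is
 * exactly the hazard a capitalised name is meant to flag. */
#define BSWAP16(x) ((uint16_t)((((uint16_t)(x)) >> 8) | (((uint16_t)(x)) << 8)))
```

    The glibc names may stay lowercase because they are specified as function-like interfaces that implementations may also provide as real functions.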

    Third, on architectures where bytes have to be swapped, C code - as with
    most HLL code - can be awkward. I tried to illustrate that in the related
    post mentioned at the outset. What alternatives are there to writing the
    code in C? I have seen headers include inline assembly for byte swapping but
    I don't like making C code so unportable. If it's C it should be C! So I am
    thinking to either write the long-winded code in C or to have the macro call
    a function that is implemented by a separate assembly routine. For what I am
    doing there will be a separate assembly layer for each target anyway so it's
    not a big departure from what the rest of the code does.

    In summary, I would have

    a macro to read a 16-bit little endian value
    a macro to read a 16-bit big endian value

    ditto for writing the values, ditto for any other defined integer types.
    Possibly I should have a macro for reading a PDP-endian 32-bit value too, if
    I wanted to do the job properly ;-)

    The idea is that these macros would be no-ops on the matching architectures
    and calls to separate functions where the architecture doesn't match, and
    that the choice of which family of macros to use would be controlled by
    something specified at compile time.
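    One hedged alternative to the per-target scheme is to write the readers byte by byte, so the same C is correct on any host without knowing its endianness (compilers commonly reduce these to a single load, plus a swap where needed). The names read_le16/read_be16 are invented here:

```c
#include <stdint.h>

/* Read a 16-bit little-endian value from a byte buffer. */
static inline uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 16-bit big-endian value from a byte buffer. */
static inline uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}
```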

    How does that lot sound?

    James
     
    James Harris, Aug 23, 2013
    #1

  2. James Harris

    Eric Sosman Guest

    A build-specific header has much to recommend it. You will
    probably find other stuff to put there in addition to endianness
    goodies, including regrettable but necessary things like

    #ifdef FROBOZZ_MAGIC_C
    #include <stdio.h>
    #undef fflush
    #define fflush workAroundFrobozzFflushBug
    #endif
    I dunno. In light of the __USE_BSD test, perhaps the names
    are mandated by BSD. Ask the header's authors.
    With C99 or later, an `inline' C function is attractive: Safer
    than a macro (no fears about argument side-effects), and quite likely
    faster than an external assembly function (no call-and-return needed).
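    A minimal sketch of that suggestion (function names invented): inline functions evaluate their argument exactly once, so side effects in the caller's expression are safe, and on a matching architecture the compiler can still optimise the call away.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
static inline uint16_t bswap16(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}

/* Reverse the four bytes of a 32-bit value. */
static inline uint32_t bswap32(uint32_t x)
{
    return ((x >> 24) & 0x000000ffu)
         | ((x >>  8) & 0x0000ff00u)
         | ((x <<  8) & 0x00ff0000u)
         | ((x << 24) & 0xff000000u);
}
```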
    Although htons() and the like are sanctified by long usage,
    I personally feel they're symptoms of a leaked abstraction. Data
    format conversion belongs "at the edge," not strewn throughout
    the program code. Also, the fact that some of these calls may be
    no-ops on some machines makes their omission (or redundant use!)
    impossible to detect by testing on those machines: They are bugs
    waiting to happen.

    They're not as bad as gets(), but they're worth avoiding in
    the body of the code. Use them for format conversion at the edge
    if you like, but don't make the rest of your code worry about
    what is, isn't, may be, or might not be in a foreign form.
     
    Eric Sosman, Aug 23, 2013
    #2

  3. James Harris

    Jorgen Grahn Guest

    Yes! Because if they are strewn throughout your code, that means that
    integers which aren't really integers are strewn throughout your data
    structures.

    The bugs can be surprisingly subtle.

    /Jorgen
     
    Jorgen Grahn, Aug 23, 2013
    #3
  4. James Harris

    Eric Sosman Guest

    One of the sneakiest I personally ran across involved code that
    carefully hton'ed a value before stuffing it into a buffer. What's
    wrong with that? Well, the caller had *already* hton'ed the data!
    And since hton*() were no-ops on the BigEndian development system,
    testing didn't reveal any problem ...
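    A sketch of that bug, with invented names: swap16() stands in for hton*() on a little-endian host. On the big-endian development machine it would compile to a no-op, so the redundant second conversion below would be invisible in testing.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for hton*() on a little-endian host. */
static uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}

/* BUG: the caller has already converted port_net to network order,
 * so swapping it again puts host-order bytes in the buffer. */
static void put_port(unsigned char *buf, uint16_t port_net)
{
    uint16_t twice = swap16(port_net);   /* second, redundant swap */
    memcpy(buf, &twice, sizeof twice);
}
```

    Since a double swap is the identity, the bytes come out in host order, which only matters once the code runs on a machine of the other endianness.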
     
    Eric Sosman, Aug 23, 2013
    #4
  5. James Harris

    Joe Pfeiffer Guest

    Absolutely. I was stunned to learn that the arguments to the various
    TCP/IP calls had to be in network order. That's just crazy -- the
    library I see as an application programmer ought to use my platform's
    data formats. Conversions ought to happen within the library.

    I'll note that we're now wandering pretty far off from the C language.
     
    Joe Pfeiffer, Aug 24, 2013
    #5
  6. Nearly any code that has to deal with binary network or disk data will
    need something like this. I've seen dozens of variants, and all were
    functionally equivalent to what you propose above.

    It would be handy if some standards body would take on the problem and
    give us standard functions for this purpose so that every project/team
    doesn't have to reinvent this wheel. POSIX solved the big-endian data
    problem, with ntohl() et al, but they ignored the plethora of
    little-endian wire and file formats emanating from the Wintel world.

    S
     
    Stephen Sprunk, Aug 24, 2013
    #6
  7. Most projects I've worked with seem to do endianness-handling as soon as
    the data comes in from (or right before it goes out to) the network or
    disk, which is as close to the edge as one can get without a formal
    presentation layer.

    S
     
    Stephen Sprunk, Aug 24, 2013
    #7
  8. OTOH, it forces _everyone_ who uses the sockets library to learn about
    endianness issues, which is not necessarily a bad thing; many will never
    have seen or thought about such issues before--and would otherwise go on
    to write code that doesn't account for them properly.

    S
     
    Stephen Sprunk, Aug 24, 2013
    #8
  9. James Harris

    Lew Pitcher Guest

    Sorry, Steve, but I disagree with that last statement.

    POSIX didn't "solve the big-endian data problem, with ntohl()", and they
    didn't ignore "the plethora of little-endian wire and file formats".

    Instead, they solved the "endian" problem by standardizing on big-endian
    over the wire, and further standardizing on the sizes of network big-endian
    data. It is regrettable that others ignored those standards and implemented
    the plethora of confusing "little-endian" formats we see today.

    But, you know what they say about standards: "The nice thing about standards
    is that you have so many to choose from."

    C'est la vie.
     
    Lew Pitcher, Aug 24, 2013
    #9
  10. James Harris

    Ian Collins Guest

    If you are working down at that low a level, you should be aware of the
    issues. I guess a lot of the code originated on big-endian systems, so
    byte order wasn't such a big issue. I do wonder sometimes how many
    wasted cycles and nasty bugs would have been avoided if Intel had
    followed Motorola's lead and adopted a big-endian architecture.
     
    Ian Collins, Aug 24, 2013
    #10
  11. James Harris

    Joe Pfeiffer Guest

    There were little-endian external data formats long enough ago that I'd
    be very surprised to learn they don't predate POSIX.
     
    Joe Pfeiffer, Aug 25, 2013
    #11
  12. James Harris

    Joe Pfeiffer Guest

    Opening a socket in C isn't what I regard as a particularly low level.
    My application code should no more need to be aware of the endianness of
    the protocol than it should the order of the fields in the header.

    Yes, many wasted cycles and nasty bugs would have been avoided if all
    architectures had the same endianness -- either one. Either Motorola's
    or the Correct one.
     
    Joe Pfeiffer, Aug 25, 2013
    #12
  13. James Harris

    Ian Collins Guest

    Ah, you said "various TCP/IP calls" and I assumed you were talking about
    low level IP code, not the socket layer. Even at the socket level, the
    library doesn't know what the data is or where it comes from.
     
    Ian Collins, Aug 25, 2013
    #13
  14. James Harris

    Joe Pfeiffer Guest

    To take a specific example, there's no reason I can see that the bind()
    call should require the IP address and port number to be in network
    order.
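    Concretely, the complaint is that the conversion leaks into application code: a minimal sketch (helper name invented) of filling a sockaddr_in for bind(), where the caller must pre-convert the port and address itself.

```c
#include <arpa/inet.h>   /* htons, htonl, ntohs */
#include <netinet/in.h>  /* struct sockaddr_in, INADDR_ANY */
#include <stdint.h>
#include <string.h>

static void fill_addr(struct sockaddr_in *addr, uint16_t port)
{
    memset(addr, 0, sizeof *addr);
    addr->sin_family = AF_INET;
    addr->sin_port = htons(port);          /* omit this and a little-endian
                                              host binds port 36895 (0x901F)
                                              instead of 8080 (0x1F90) */
    addr->sin_addr.s_addr = htonl(INADDR_ANY);
}
```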
     
    Joe Pfeiffer, Aug 26, 2013
    #14
  15. James Harris

    Ian Collins Guest

    I guess most of these interfaces were defined when UNIX systems were
    almost exclusively big-endian. Even now, many socket functions (such as
    bind) use a generic object for their parameters, so it makes sense for
    the caller to provide data that the functions don't have to alter.
     
    Ian Collins, Aug 26, 2013
    #15
  16. James Harris

    Joe Pfeiffer Guest

    Given that the first widely distributed Unix was on a PDP-11, and
    networking was added in 4.2bsd which to the best of my knowledge was
    only available on a VAX at that time, that's not a likely explanation.

    That the same people were implementing both the network stack and the
    first applications using it, and didn't put a lot of thought into the
    details of the interface, strikes me as a much likelier one.
     
    Joe Pfeiffer, Aug 26, 2013
    #16
  17. James Harris

    Les Cargill Guest

    Technically speaking, nothing little endian was ever a standard
    *at all* other than a de facto standard. The IETF declared all
    Internet Protocol things to be big endian.

    Little endian is an example of "a mistake once made must be propagated
    at all costs."
     
    Les Cargill, Aug 26, 2013
    #17
  18. James Harris

    Joe Pfeiffer Guest

    That certainly depends on what you mean by a standard. The GIF file
    format, for instance, uses a little-endian representation of multi-byte
    values (yes, I do realize that's not as old as the IP standards). Yes,
    IETF declared all IP things to be big endian; that didn't declare
    everything else in the universe to be big-endian.
    There's no significant difference between them. Big-endian is
    infinitesimally easier for people to read; little-endian can be
    preferred for the equally irrelevant improvement in internal
    consistency.
     
    Joe Pfeiffer, Aug 26, 2013
    #18
  19. James Harris

    Les Cargill Guest

    I'd call that a de facto standard - an implementation came first,
    then the publication of the format. Same thing with RIFF formats.

    The assumption is that both ends of the "wire" will be
    little-endian machines, so the internals of the format don't matter.

    It's a good start. :)
    So you're saying it doesn't matter?
     
    Les Cargill, Aug 26, 2013
    #19
  20. James Harris

    Joe Pfeiffer Guest

    Yes, that's exactly what I'm saying.
     
    Joe Pfeiffer, Aug 26, 2013
    #20
