Endianness macros



James Harris

(Related to a separate post about htons etc)

In endian.h gcc includes some useful names under the protection of #ifdef
__USE_BSD such as

# if __BYTE_ORDER == __LITTLE_ENDIAN
# define htobe16(x) __bswap_16 (x)
# define htole16(x) (x)
# define be16toh(x) __bswap_16 (x)
# define le16toh(x) (x)

Whether or not gcc can be persuaded to expose them in the environment I am
working in, such names are not available in all the compilers I am using. I therefore
need to set up some similar operations and can see some 'interesting' issues
over defining them. I am sure that this kind of thing is an oft-asked
question so rather than just asking for suggestions I'll write up what I
have been considering and would much appreciate feedback. I do have some
specific issues in mind.

First and foremost, there seems to be no practical way for the
*preprocessor* to detect the endianness of the target machine. If so, the
options seem to be either to select different endiannesses in the code as in

if little endian
...
else if big endian
...
else
...

or, alternatively, to specify the endianness when the code is compiled. I am
thinking that because each target machine would be different the object code
would have to be different for each. (Some machines such as Arm can operate
in either mode.) So it would be reasonable to produce different object
files. The compiler output directories would have to include the name of the
target architecture so that a given piece of source code could compile to
each target. Even if the object code included if-endianness tests such as
those above, only one branch of each such test would ever be used on a given
machine (in a given mode).

I think I could specify the endianness of the target by either including a
build-specific header file or by passing a symbol definition when the
compiler is invoked. If so, is either approach a generically better one to
take or is there another way to get the same effect?
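Both mechanisms can be sketched together. The symbol name TARGET_LITTLE_ENDIAN below is illustrative, not standard: a real build would pass it on the compiler command line (cc -DTARGET_LITTLE_ENDIAN=1) or define it in a per-target config header on the include path.

```c
#include <stdint.h>

/* TARGET_LITTLE_ENDIAN is a hypothetical build-specific symbol.  It
   could arrive from the command line (-DTARGET_LITTLE_ENDIAN=1) or
   from a per-target "config.h".  Default it here so this sketch
   compiles on its own. */
#ifndef TARGET_LITTLE_ENDIAN
#define TARGET_LITTLE_ENDIAN 1
#endif

#if TARGET_LITTLE_ENDIAN
#define LE16TOH(x) (x)   /* value already in host order: no-op */
#else
#define LE16TOH(x) ((uint16_t)(((uint16_t)(x) >> 8) | ((uint16_t)(x) << 8)))
#endif
```

Either way the selection happens at preprocessing time, so the unused branch never reaches the object code.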

Second, the use of macros is good since, as above, operations that have no
effect can clearly cost nothing at execution time. But why are the above
macro names not capitalised? I usually take capitals as a warning that
something that looks like a function call is really a macro and I need to be
careful about the type of parameter that is used. Are those names
uncapitalised because they are always safe to use as macros?

Third, on architectures where bytes have to be swapped, C code - as with
most HLL code - can be awkward. I tried to illustrate that in the related
post mentioned at the outset. What alternatives are there to writing the
code in C? I have seen headers include inline assembly for byte swapping but
I don't like making C code so unportable. If it's C it should be C! So I am
thinking to either write the long-winded code in C or to have the macro call
a function that is implemented by a separate assembly routine. For what I am
doing there will be a separate assembly layer for each target anyway so it's
not a big departure from what the rest of the code does.
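For reference, the "long-winded" pure-C option is not actually that long. One sketch of a 32-bit swap:

```c
#include <stdint.h>

/* Byte-reverse a 32-bit value in plain C.  No assembly and no
   knowledge of the host's byte order needed: the shifts and masks mean
   the same thing on every conforming implementation. */
static uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) <<  8) |
           ((x & 0x00FF0000u) >>  8) |
           ((x & 0xFF000000u) >> 24);
}
```

Modern compilers typically recognise this pattern and emit a single byte-swap instruction, which weakens the case for a separate assembly routine.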

In summary, I would have

a macro to read a 16-bit little endian value
a macro to read a 16-bit big endian value

ditto for writing the values, ditto for any other defined integer types.
Possibly I should have a macro for reading a PDP-endian 32-bit value too, if
I wanted to do the job properly ;-)

The idea is that these macros would be no-ops on the matching architectures
and calls to separate functions where the architecture doesn't match, and
that the choice of which family of macros to use would be controlled by
something specified at compile time.
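A sketch of the read side. When the value is fetched from a byte buffer rather than from an already-loaded integer, the accessor can be written so that one definition is correct on every host, with the compiler reducing it to a plain load where the orders match:

```c
#include <stdint.h>

/* Read a 16-bit value stored little-endian in a byte buffer. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)((uint16_t)p[0] | ((uint16_t)p[1] << 8));
}

/* Read a 16-bit value stored big-endian in a byte buffer. */
static uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | (uint16_t)p[1]);
}
```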

How does that lot sound?

James
 
Ad

Advertisements

E

Eric Sosman

(Related to a separate post about htons etc)

In endian.h gcc includes some useful names under the protection of #ifdef
__USE_BSD such as
[...]

I think I could specify the endianness of the target by either including a
build-specific header file or by passing a symbol definition when the
compiler is invoked. If so, is either approach a generically better one to
take or is there another way to get the same effect?

A build-specific header has much to recommend it. You will
probably find other stuff to put there in addition to endianness
goodies, including regrettable but necessary things like

#ifdef FROBOZZ_MAGIC_C
#include <stdio.h>
#undef fflush
#define fflush workAroundFrobozzFflushBug
#endif

Second, the use of macros is good since, as above, operations that have no
effect can clearly cost nothing at execution time. But why are the above
macro names not capitalised? I usually take capitals as a warning that
something that looks like a function call is really a macro and I need to be
careful about the type of parameter that is used. Are those names
uncapitalised because they are always safe to use as macros?

I dunno. In light of the __USE_BSD test, perhaps the names
are mandated by BSD. Ask the header's authors.

Third, on architectures where bytes have to be swapped, C code - as with
most HLL code - can be awkward. I tried to illustrate that in the related
post mentioned at the outset. What alternatives are there to writing the
code in C? I have seen headers include inline assembly for byte swapping but
I don't like making C code so unportable. If it's C it should be C! So I am
thinking to either write the long-winded code in C or to have the macro call
a function that is implemented by a separate assembly routine. For what I am
doing there will be a separate assembly layer for each target anyway so it's
not a big departure from what the rest of the code does.

With C99 or later, an `inline' C function is attractive: Safer
than a macro (no fears about argument side-effects), and quite likely
faster than an external assembly function (no call-and-return needed).

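A sketch of that inline approach:

```c
#include <stdint.h>

/* C99 inline byte swap.  Unlike a macro, the argument is evaluated
   exactly once, so swap16(*p++) does what it looks like it does. */
static inline uint16_t swap16(uint16_t x)
{
    return (uint16_t)((uint16_t)(x >> 8) | (uint16_t)(x << 8));
}
```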
In summary, I would have

a macro to read a 16-bit little endian value
a macro to read a 16-bit big endian value

ditto for writing the values, ditto for any other defined integer types.
Possibly I should have a macro for reading a PDP-endian 32-bit value too, if
I wanted to do the job properly ;-)

The idea is that these macros would be no-ops on the matching architectures
and calls to separate functions where the architecture doesn't match, and
that the choice of which family of macros to use would be controlled by
something specified at compile time.

How does that lot sound?

Although htons() and the like are sanctified by long usage,
I personally feel they're symptoms of a leaked abstraction. Data
format conversion belongs "at the edge," not strewn throughout
the program code. Also, the fact that some of these calls may be
no-ops on some machines makes their omission (or redundant use!)
impossible to detect by testing on those machines: They are bugs
waiting to happen.

They're not as bad as gets(), but they're worth avoiding in
the body of the code. Use them for format conversion at the edge
if you like, but don't make the rest of your code worry about
what is, isn't, may be, or might not be in a foreign form.
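One way to picture "at the edge" (the message layout here is invented for illustration): raw wire bytes are decoded exactly once, into a struct of host-order integers, and nothing downstream ever touches wire order again.

```c
#include <stdint.h>

/* Hypothetical wire message: two 16-bit big-endian fields. */
struct msg {
    uint16_t id;    /* host order, always */
    uint16_t len;   /* host order, always */
};

/* The only place wire order exists: the boundary decoder. */
static void decode_msg(const unsigned char wire[4], struct msg *out)
{
    out->id  = (uint16_t)(((uint16_t)wire[0] << 8) | wire[1]);
    out->len = (uint16_t)(((uint16_t)wire[2] << 8) | wire[3]);
}
```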
 

Jorgen Grahn

Although htons() and the like are sanctified by long usage,
I personally feel they're symptoms of a leaked abstraction. Data
format conversion belongs "at the edge," not strewn throughout
the program code.

Yes! Because if they are strewn throughout your code, that means that
integers which aren't really integers are strewn throughout your data
structures.

The bugs can be surprisingly subtle.

/Jorgen
 

Eric Sosman

Yes! Because if they are strewn throughout your code, that means that
integers which aren't really integers are strewn throughout your data
structures.

The bugs can be surprisingly subtle.

One of the sneakiest I personally ran across involved code that
carefully hton'ed a value before stuffing it into a buffer. What's
wrong with that? Well, the caller had *already* hton'ed the data!
And since hton*() were no-ops on the BigEndian development system,
testing didn't reveal any problem ...
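The arithmetic behind why that bug hides so well: a 16-bit byte swap is its own inverse, so on a little-endian host the two conversions cancel and the value goes out in *host* order, while on a big-endian host each conversion is a no-op anyway. A sketch:

```c
#include <stdint.h>

/* What an hton-style conversion does on a little-endian host. */
static uint16_t swap16(uint16_t x)
{
    return (uint16_t)((uint16_t)(x >> 8) | (uint16_t)(x << 8));
}
```

One swap produces correct wire order; two swaps restore host order, which is exactly the wrong thing on the wire, yet looks fine on a big-endian test machine.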
 

Joe Pfeiffer

Jorgen Grahn said:
Yes! Because if they are strewn throughout your code, that means that
integers which aren't really integers are strewn throughout your data
structures.

The bugs can be surprisingly subtle.

Absolutely. I was stunned to learn that the arguments to the various
TCP/IP calls had to be in network order. That's just crazy -- the
library I see as an application programmer ought to use my platform's
data formats. Conversions ought to happen within the library.

I'll note that we're now wandering pretty far off from the C language.
 

Stephen Sprunk

In summary, I would have

a macro to read a 16-bit little endian value
a macro to read a 16-bit big endian value

ditto for writing the values, ditto for any other defined integer
types. Possibly I should have a macro for reading a PDP-endian 32-bit
value too, if I wanted to do the job properly ;-)

The idea is that these macros would be no-ops on the matching
architectures and calls to separate functions where the architecture
doesn't match, and that the choice of which family of macros to use
would be controlled by something specified at compile time.

How does that lot sound?

Nearly any code that has to deal with binary network or disk data will
need something like this. I've seen dozens of variants, and all were
functionally equivalent to what you propose above.

It would be handy if some standards body would take on the problem and
give us standard functions for this purpose so that every project/team
doesn't have to reinvent this wheel. POSIX solved the big-endian data
problem, with ntohl() et al, but they ignored the plethora of
little-endian wire and file formats emanating from the Wintel world.
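In the meantime, the little-endian accessors are easy enough to roll by hand. One sketch of the write side, correct on any host:

```c
#include <stdint.h>

/* Store a 32-bit value into a buffer in little-endian order,
   regardless of the host's own byte order. */
static void write_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)( v        & 0xFF);
    p[1] = (unsigned char)((v >>  8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
    p[3] = (unsigned char)((v >> 24) & 0xFF);
}
```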

S
 

Stephen Sprunk

Although htons() and the like are sanctified by long usage, I
personally feel they're symptoms of a leaked abstraction. Data
format conversion belongs "at the edge," not strewn throughout the
program code. Also, the fact that some of these calls may be no-ops
on some machines makes their omission (or redundant use!) impossible
to detect by testing on those machines: They are bugs waiting to
happen.

They're not as bad as gets(), but they're worth avoiding in the body
of the code. Use them for format conversion at the edge if you like,
but don't make the rest of your code worry about what is, isn't, may
be, or might not be in a foreign form.

Most projects I've worked with seem to do endianness-handling as soon as
the data comes in from (or right before it goes out to) the network or
disk, which is as close to the edge as one can get without a formal
presentation layer.

S
 

Stephen Sprunk

Absolutely. I was stunned to learn that the arguments to the various
TCP/IP calls had to be in network order. That's just crazy -- the
library I see as an application programmer ought to use my platform's
data formats. Conversions ought to happen within the library.

OTOH, it forces _everyone_ who uses the sockets library to learn about
endianness issues, which is not necessarily a bad thing; many will not
have ever seen or thought about such issues before--and would go on to
write code that doesn't account for it properly otherwise.

S
 

Lew Pitcher

Nearly any code that has to deal with binary network or disk data will
need something like this. I've seen dozens of variants, and all were
functionally equivalent to what you propose above.

It would be handy if some standards body would take on the problem and
give us standard functions for this purpose so that every project/team
doesn't have to reinvent this wheel. POSIX solved the big-endian data
problem, with ntohl() et al, but they ignored the plethora of
little-endian wire and file formats emanating from the Wintel world.

Sorry, Steve, but I disagree with that last statement.

POSIX didn't "solve the big-endian data problem, with ntohl()", and they
didn't ignore "the plethora of little-endian wire and file formats".

Instead, they solved the "endian" problem by standardizing on big-endian
over the wire, and further standardizing on the sizes of network big-endian
data. It is regrettable that others ignored those standards and implemented
the plethora of confusing "little-endian" formats we see today.

But, you know what they say about standards: "The nice thing about standards
is that you have so many to choose from."

C'est la vie.
 

Ian Collins

Joe said:
Absolutely. I was stunned to learn that the arguments to the various
TCP/IP calls had to be in network order. That's just crazy -- the
library I see as an application programmer ought to use my platform's
data formats. Conversions ought to happen within the library.

If you are working down at that low a level, you should be aware of the
issues. I guess a lot of the code originated on big-endian systems, so
byte order wasn't such a big issue. I do wonder sometimes how many
wasted cycles and nasty bugs would have been avoided if Intel had
followed Motorola's lead and adopted a big-endian architecture.
 

Joe Pfeiffer

Lew Pitcher said:
Sorry, Steve, but I disagree with that last statement.

POSIX didn't "solve the big-endian data problem, with ntohl()", and they
didn't ignore "the plethora of little-endian wire and file formats".

Instead, they solved the "endian" problem by standardizing on big-endian
over the wire, and further standardizing on the sizes of network big-endian
data. It is regrettable that others ignored those standards and implemented
the plethora of confusing "little-endian" formats we see today.

There were little-endian external data formats long enough ago that I'd
be very surprised to learn they don't predate POSIX.
 

Joe Pfeiffer

Ian Collins said:
If you are working down at that low a level, you should be aware of
the issues. I guess a lot of the code originated on big-endian
systems, so byte order wasn't such a big issue. I do wonder sometimes
how many wasted cycles and nasty bugs would have been avoided if Intel
had followed Motorola's lead and adopted a big-endian architecture.

Opening a socket in C isn't what I regard as a particularly low level.
My application code should no more need to be aware of the endianness of
the protocol than it should the order of the fields in the header.

Yes, many wasted cycles and nasty bugs would have been avoided if all
architectures had the same endianness -- either one. Either Motorola's
or the Correct one.
 

Ian Collins

Joe said:
Opening a socket in C isn't what I regard as a particularly low level.
My application code should no more need to be aware of the endianness of
the protocol than it should the order of the fields in the header.

Ah, you said "various TCP/IP calls" and I assumed you were talking about
low level IP code, not the socket layer. Even at the socket level, the
library doesn't know what the data is or where it comes from.
 

Joe Pfeiffer

Ian Collins said:
Ah, you said "various TCP/IP calls" and I assumed you were talking
about low level IP code, not the socket layer. Even at the socket
level, the library doesn't know what the data is or where it comes
from.

To take a specific example, there's no reason I can see that the bind()
call should require the IP address and port number to be in network
order.
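For readers who haven't met this: filling in the address argument for bind() looks like the sketch below, and the conversions to network (big-endian) order are the caller's responsibility. Leave out the htons() and, on a little-endian host, the program silently binds to a different port.

```c
#include <string.h>
#include <stdint.h>
#include <arpa/inet.h>    /* POSIX: htons, htonl, ntohs */
#include <netinet/in.h>   /* POSIX: struct sockaddr_in, INADDR_ANY */

/* Build the address argument for bind().  The htons()/htonl() calls
   are the caller's job, which is exactly Joe's complaint. */
static struct sockaddr_in make_listen_addr(uint16_t port)
{
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof sa);
    sa.sin_family      = AF_INET;
    sa.sin_port        = htons(port);  /* forgetting this hides on BE hosts */
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    return sa;
}
```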
 

Ian Collins

Joe said:
To take a specific example, there's no reason I can see that the bind()
call should require the IP address and port number to be in network
order.

I guess most of these interfaces were defined when UNIX systems were
almost exclusively big-endian. Even now, many socket functions (such as
bind) use a generic object for their parameters, so it makes sense for
the caller to provide data that the functions don't have to alter.
 

Joe Pfeiffer

Ian Collins said:
I guess most of these interfaces were defined when UNIX systems were
almost exclusively big-endian. Even now, many socket functions (such
as bind) use a generic object for their parameters, so it makes sense
for the caller to provide data that the functions don't have to alter.

Given that the first widely distributed Unix was on a PDP-11, and
networking was added in 4.2bsd which to the best of my knowledge was
only available on a VAX at that time, that's not a likely explanation.

That the same people were implementing both the network stack and the
first applications using it, and didn't put a lot of thought into the
details of the interface, strikes me as a much likelier one.
 

Les Cargill

Joe said:
There were little-endian external data formats long enough ago that I'd
be very surprised to learn they don't predate POSIX.

Technically speaking, nothing little endian was ever a standard
*at all* other than a de facto standard. The IETF declared all
Internet Protocol things to be big endian.

Little endian is an example of "a mistake once made must be propagated
at all costs."
 

Joe Pfeiffer

Les Cargill said:
Technically speaking, nothing little endian was ever a standard
*at all* other than a de facto standard. The IETF declared all
Internet Protocol things to be big endian.

That certainly depends on what you mean by a standard. The GIF file
format, for instance, uses a little-endian representation of multi-byte
values (yes, I do realize that's not as old as the IP standards). Yes,
IETF declared all IP things to be big endian; that didn't declare
everything else in the universe to be big-endian.
Little endian is an example of "a mistake once made must be propagated
at all costs."

There's no significant difference between them. Big-endian is
infinitesimally easier for people to read; little-endian can be
preferred for the equally irrelevant improvement in internal
consistency.
 

Les Cargill

Joe said:
That certainly depends on what you mean by a standard. The GIF file
format, for instance, uses a little-endian representation of multi-byte
values (yes, I do realize that's not as old as the IP standards).

I'd call that a de facto standard - an implementation came first,
then the publication of the format. Same thing with RIFF formats.

The assumption is that both ends of the "wire" will be
little-endian machines, so the internals of the format don't matter.

Yes,
IETF declared all IP things to be big endian; that didn't declare
everything else in the universe to be big-endian.

It's a good start. :)
There's no significant difference between them. Big-endian is
infinitesimally easier for people to read; little-endian can be
preferred for the equally irrelevant improvement in internal
consistency.

y oSr'oyas egniy ti seod t'nttam ?re
 

Joe Pfeiffer

Les Cargill said:
I'd call that a de facto standard - an implementation came first,
then the publication of the format. Same thing with RIFF formats.

The assumption is that both ends of the "wire" will be
little-endian machines, so the internals of the format don't matter.



It's a good start. :)


y oSr'oyas egniy ti seod t'nttam ?re

Yes, that's exactly what I'm saying.
 
