Union test for endianness


Bhasker Penta

One way to test for endianness is to use a union:

void endianTest()
{
    union // sizeof(int) == 4
    {
        int i;
        char ch[4];
    } U;

    U.i=0x12345678; // writing to int member
    if ( U.ch[0]==0x78 ) // reading from char member
        puts("\nLittle endian");
    else
        puts("\nBig endian");
}

Writing to one member of a union and reading from another member is
implementation-defined (K & R). This example is used for testing
endianness at c-faq.com. I know that gcc allows this. Is the above
snippet to test for endianness legal C or C++?
 

Stefan Ram

Bhasker Penta said:
U.i=0x12345678; // writing to int member
if ( U.ch[0]==0x78 ) // reading from char member
Writing to one member of a union and reading from another member is
implementation defined(K & R).

»When a value is stored in a member of an object of union type,
the bytes of the object representation that do not
correspond to that member but do correspond to other
members take unspecified values«
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
ISO/IEC 9899:1999 (E), 6.2.6.1#7

One also might cast a pointer to int into a pointer to char[],
but I assume, dereferencing this might also give unspecified
values in the best case or might result in undefined behavior
in the worst case?

Endianness is an implementation detail of a higher
programming language that the language wants to hide from
you (information hiding), because usually one does not need
to know it. One even can serialize and deserialize in either
a portable or an implementation specific manner without
knowing this.
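
As an illustration of serializing without knowing the host's byte order, here is a minimal sketch. The names put_u32_be and get_u32_be are made up for this example. It assumes uint32_t is available (it is optional in C99) and that each buffer element carries one octet. The byte order of the output is fixed by the shifts, not by how the host stores integers:

```c
#include <stdint.h>

/* Write v to buf in big-endian order. Only the value is used,
   never its in-memory layout, so the result is host independent. */
void put_u32_be(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)((v >> 24) & 0xFFu);
    buf[1] = (unsigned char)((v >> 16) & 0xFFu);
    buf[2] = (unsigned char)((v >> 8) & 0xFFu);
    buf[3] = (unsigned char)(v & 0xFFu);
}

/* Rebuild the value from four big-endian octets. */
uint32_t get_u32_be(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}
```

Neither function ever inspects the object representation of an integer, which is why no endianness test is needed.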

However, each specific C implementation is free to disclose
this implementation detail in its documentation.

For such purposes, it might be nice, if standard C would
define names for all the properties an autoconf script
usually determines, so that each C implementation could
predefine them.
 

Ian Collins

Bhasker Penta said:
U.i=0x12345678; // writing to int member
if ( U.ch[0]==0x78 ) // reading from char member
Writing to one member of a union and reading from another member is
implementation defined(K& R).

»When a value is stored in a member of an object of union type,
the bytes of the object representation that do not
correspond to that member but do correspond to other
members take unspecified values«
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
ISO/IEC 9899:1999 (E), 6.2.6.1#7
How is that relevant to the question, which assumes sizeof(int) == 4?
One also might cast a pointer to int into a pointer to char[],
but I assume, dereferencing this might also give unspecified
values in the best case or might result in undefined behavior
in the worst case?

Do what? How is that relevant to, well anything?
Endianess is an implementation detail of a higher
programming language that the language wants to hide from
you (information hiding), because usually one does not need
to know it. One even can serialize and deserialize in either
a portable or an implementation specific manner without
knowing this.

Whoever writes the serialisation code does need to know. If you need
to know the endianness, you are probably writing serialisation code!
 

Shao Miller

One way to test for endianess is to use a union:

void endianTest()
{
union // sizeof(int) == 4
{
int i;
char ch[4];
} U;

U.i=0x12345678; // writing to int member
if ( U.ch[0]==0x78 ) // reading from char member
puts("\nLittle endian");
else
puts("\nBig endian");
}

Please note that

One possible way to help to ensure that 'sizeof (int) == 4' and that you
have 8-bit bytes is to:

#define TT_ASSERT(message, test) \
typedef char (message)[(test) ? 1 : -1]

TT_ASSERT(INT_IS_NOT_4_BYTES, sizeof (int) == 4);
TT_ASSERT(NOT_8_BIT_BYTE, CHAR_BIT == 8);
Writing to one member of a union and reading from another member is
implementation defined(K& R).

As far as I know, if 'sizeof (int) == 4' as shown, you can certainly
read from each element of the 'U.ch' array. C doesn't guarantee that
'sizeof (int) == 4', of course.

Combined with the 'TT_ASSERT's above, you could have your union as:

union {
unsigned int i;
unsigned char ch[sizeof (unsigned int)];
} U;

(Note that the use of 'unsigned' attempts to avoid any potential sign
bit complications; the 'TT_ASSERT' might be better off matching, too.)
This example is used for testing
endianess @ c-faq.com. I know that gcc allows this. Is the above
snippet to test for endianess legal C or C++?

If you know that the implementation definitely uses an 8-bit byte, a
4-byte 'int', and that there are no padding bits and that '0x12345678'
is within the range of values for 'int', then I'd say yes for "legal C". :)
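
Putting the pieces above together, a minimal sketch might look like this. The function name first_byte_of_test_value is invented for the example, and the first assertion is written against 'unsigned int' to match the union, as suggested above. It still assumes there are no padding bits:

```c
#include <limits.h>

/* Compile-time check in the style of the TT_ASSERT macro above:
   a negative array size forces a diagnostic when the test is false. */
#define TT_ASSERT(message, test) \
    typedef char (message)[(test) ? 1 : -1]

TT_ASSERT(UINT_IS_NOT_4_BYTES, sizeof (unsigned int) == 4);
TT_ASSERT(NOT_8_BIT_BYTE, CHAR_BIT == 8);

/* Return the first byte (lowest address) of the object
   representation of 0x12345678u. */
unsigned int first_byte_of_test_value(void)
{
    union {
        unsigned int i;
        unsigned char ch[sizeof (unsigned int)];
    } U;

    U.i = 0x12345678u;   /* write through the integer member */
    return U.ch[0];      /* read through the character member */
}
```

A result of 0x78 is what a little-endian layout produces and 0x12 is what a big-endian layout produces. Other results are possible on less common byte orders.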
 

Bhasker Penta

One way to test for endianess is to use a union:
void endianTest()
{
     union     // sizeof(int) == 4
     {
         int i;
         char ch[4];
     } U;
     U.i=0x12345678; // writing to int member
     if ( U.ch[0]==0x78 )   // reading from char member
         puts("\nLittle endian");
     else
         puts("\nBig endian");
}

Please note that

One possible way to help to ensure that 'sizeof (int) == 4' and that you
have 8-bit bytes is to:

   #define TT_ASSERT(message, test) \
     typedef char (message)[(test) ? 1 : -1]

   TT_ASSERT(INT_IS_NOT_4_BYTES, sizeof (int) == 4);
   TT_ASSERT(NOT_8_BIT_BYTE, CHAR_BIT == 8);
Writing to one member of a union and reading from another member is
implementation defined(K&  R).

As far as I know, if 'sizeof (int) == 4' as shown, you can certainly
read from each element of the 'U.ch' array.  C doesn't guarantee that
'sizeof (int) == 4', of course.

Combined with the 'TT_ASSERT's above, you could have your union as:

   union {
       unsigned int i;
       unsigned char ch[sizeof (unsigned int)];
     } U;

(Note that the use of 'unsigned' attempts to avoid any potential sign
bit complications; the 'TT_ASSERT' might be better off matching, too.)
This example is used for testing
endianess @ c-faq.com. I know that gcc allows this. Is the above
snippet to test for endianess legal C or C++?

If you know that the implementation definitely uses an 8-bit byte, a
4-byte 'int', and that there are no padding bits and that '0x12345678'
is within the range of values for 'int', then I'd say yes for "legal C". :)

At least on my machine (Windows 7 64 bit) sizeof(int)==4,
sizeof(char)==1 and '0x12345678' is within 'int' limit. But the fact
is we are writing to int member and reading from (different) char
member. That doesn't go well with union rules. If it is legal in C
language to reinterpret the content of any object as a char array (or
char pointer), then I believe above snippet is technically correct C
code(I may be wrong).
E.g.
int i=0x12345678; // sizeof(int) == 4
char *p=(char *)&i;
if(*p==0x78) // reinterpreting int i through a char pointer
    puts("Little Endian");
else
    puts("Big Endian");
 

Shao Miller

One way to test for endianess is to use a union:
void endianTest()
{
union // sizeof(int) == 4
{
int i;
char ch[4];
} U;
U.i=0x12345678; // writing to int member
if ( U.ch[0]==0x78 ) // reading from char member
puts("\nLittle endian");
else
puts("\nBig endian");
}

Please note that

One possible way to help to ensure that 'sizeof (int) == 4' and that you
have 8-bit bytes is to:

#define TT_ASSERT(message, test) \
typedef char (message)[(test) ? 1 : -1]

TT_ASSERT(INT_IS_NOT_4_BYTES, sizeof (int) == 4);
TT_ASSERT(NOT_8_BIT_BYTE, CHAR_BIT == 8);
Writing to one member of a union and reading from another member is
implementation defined(K& R).

As far as I know, if 'sizeof (int) == 4' as shown, you can certainly
read from each element of the 'U.ch' array. C doesn't guarantee that
'sizeof (int) == 4', of course.

Combined with the 'TT_ASSERT's above, you could have your union as:

union {
unsigned int i;
unsigned char ch[sizeof (unsigned int)];
} U;

(Note that the use of 'unsigned' attempts to avoid any potential sign
bit complications; the 'TT_ASSERT' might be better off matching, too.)
This example is used for testing
endianess @ c-faq.com. I know that gcc allows this. Is the above
snippet to test for endianess legal C or C++?

If you know that the implementation definitely uses an 8-bit byte, a
4-byte 'int', and that there are no padding bits and that '0x12345678'
is within the range of values for 'int', then I'd say yes for "legal C". :)

At least on my machine (Windows 7 64 bit) sizeof(int)==4,
sizeof(char)==1 and '0x12345678' is within 'int' limit. But the fact
is we are writing to int member and reading from (different) char
member. That doesn't go well with union rules.

I believe it's quite all right. 6.5.2.3p3 has:

"A postfix expression followed by the . operator and an identifier
designates a member of a structure or union object. The value is that of
the named member, and is an lvalue if the first expression is an lvalue.
If the first expression has qualified type, the result has the
so-qualified version of the type of the designated member."

Since you are using your 'ch' array, its element type is a character
type, and there are no trap representations for character types. The
last-stored value for the union has an object representation[6.2.6.1p4]
and that representation is then used for 'ch'.

Which union rules are you worried about, in particular?
If it is legal in C
language to reinterpret the content of any object as a char array (or
char pointer), then I believe above snippet is technically correct C
code(I may be wrong).

"char array": Yes. "char pointer": I think you mean if it's accessed
via a pointer to a character type. Yes, that's quite often the case.

One of the guarantees of the character types is that all objects can
have all of their bits manipulated/inspected via access through a
character type. This is useful for copying, for example. Scalar types
other than character types might have trap representations, if I recall
correctly.

Another nice thing about character types is that they have the weakest
alignment requirement; a pointer to a character type can be cast from
any other pointer-to-object-type because the alignment is fine[6.3.2.3p7].
E.g.
int i=0x12345678; // sizeof(int) == 4
char *p=(char *)&i;
if(*p==0x78) // reinterpreting int i through a char pointer
    puts("Little Endian");
else
    puts("Big Endian");

Absolutely as legitimate as your previous code. :)

(Using the 'unsigned' variants is "nicer," in my opinion; no sign bit.)
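
To make the "inspect any object through a character type" guarantee concrete, here is a small sketch that formats the bytes of an arbitrary object in address order. The name format_bytes and its buffer convention are assumptions made for this example, and it assumes 8-bit bytes so that each byte prints as two hex digits:

```c
#include <stdio.h>
#include <stddef.h>

/* Format the object representation of *obj as "XX " groups in
   increasing address order. Stops early if out is too small.
   Returns the number of bytes formatted. */
size_t format_bytes(const void *obj, size_t size, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)obj;
    size_t i;

    if (outsize == 0)
        return 0;
    out[0] = '\0';
    /* Each byte needs 3 characters plus room for the terminating NUL. */
    for (i = 0; i < size && 3 * i + 4 <= outsize; i++)
        sprintf(out + 3 * i, "%02X ", (unsigned int)p[i]);
    return i;
}
```

Calling it as format_bytes(&i, sizeof i, buf, sizeof buf) with 'int i = 0x12345678;' shows the whole layout rather than just the first byte, which is more informative than a two-way test.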
 

Ben Bacarisse

China Blue Angels said:
One way to test for endianess is to use a union:

void endianTest()
{
union // sizeof(int) == 4
{
int i;
char ch[4];
} U;

U.i=0x12345678; // writing to int member
if ( U.ch[0]==0x78 ) // reading from char member
puts("\nLittle endian");
else
puts("\nBig endian");
}
You can use casts to avoid variables.

#define X (((union{int i; char ch[4];}){.i=0x12345678}).ch[0])
#define littleEndian (X==0x78)
#define bigEndian (X==0x01)

It's probably worth pointing out that this is (a) C99 and (b) has no
casts!

Since all that's needed is a test for two of the possible byte orders,
I'd avoid using a value that might not be a valid int:

#define X (((union{int i; char ch[sizeof(int)];}){.i=1}).ch[0])

The tests then become X and !X (so I'd use some other name).
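
For illustration, the suggestion above could be spelled out roughly as follows. The macro and function names are invented for this sketch, and 'unsigned' types are used as suggested elsewhere in the thread. It needs C99 for the compound literal and the designated initializer, and it is not valid C++:

```c
/* C99: a compound literal of an unnamed union type.
   No named variable and no cast are involved. */
#define FIRST_BYTE_OF_ONE \
    (((union { unsigned int i; \
               unsigned char ch[sizeof (unsigned int)]; }){ .i = 1u }).ch[0])

/* Nonzero when the lowest-addressed byte holds the low-order bits. */
int low_order_byte_first(void)
{
    return FIRST_BYTE_OF_ONE != 0;
}
```

Note that this only separates "low-order byte first" from everything else. If sizeof (unsigned int) is 1 the test is always true and says nothing about byte order.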
 

Stefan Ram

Shao Miller said:
Which union rules are you worried about, in particular?

One might worry about not knowing whether or where C actually
specifies the value of a certain member. For example, in

U.i=0x12345678; // writing to int member
if ( U.ch[0]==0x78 ) // reading from char member

or

int i=0x12345678; // sizeof(int) == 4
char *ch=(char *)&i;

, we assume that the value *ch is a »window« into the
in-memory representation of i. But does the C standard
actually require an implementation to behave this way
somewhere? If so, where?
One of the guarantees of the character types is that all objects can
have all of their bits manipulated/inspected via access through a
character type.

Yes, it would be nice to know, where one can find this.
In the best case, all the steps needed to prove that *ch
really has the semantics as intended above.
 

James Kuyper

One way to test for endianess is to use a union:

void endianTest()
{
union // sizeof(int) == 4
{
int i;
char ch[4];
} U;

U.i=0x12345678; // writing to int member
if ( U.ch[0]==0x78 ) // reading from char member
puts("\nLittle endian");
else
puts("\nBig endian");
}

There's a total of 24 possible byte orders for 4-byte integers, and a
few of the other 22 orders have in fact been used. The other 22 orders
are generically referred to as "middle-endian", and 5 of them would have
a value of 0x78 in ch[0]. I once found a web page listing the byte
orders that had actually been used, and citing specific machines on
which they had been used - unfortunately, I didn't save it, and have
been unable to locate it again. Big-endian and little-endian were
overwhelmingly the most common orders, but the two orders that were next
most common would set ch[] to {0x34, 0x12, 0x78, 0x56} or {0x56, 0x78,
0x34, 0x12}. One of those two orders (I'm not sure which) was the one
used on the PDP-11 where I did my first C programming. There were
several other orders also in actual use, though far less commonly even
than those two.
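
A sketch of a test that looks at all four bytes instead of only ch[0] might look like this. The enum and function names are made up for the example, the ORDER_PDP label follows the common description of the PDP-11 layout, and uint32_t is assumed to exist:

```c
#include <stdint.h>
#include <string.h>

enum byte_order { ORDER_LITTLE, ORDER_BIG, ORDER_PDP, ORDER_OTHER };

/* Classify a 4-byte object representation of the value 0x12345678. */
enum byte_order classify_bytes(const unsigned char b[4])
{
    static const unsigned char little[4] = {0x78, 0x56, 0x34, 0x12};
    static const unsigned char big[4]    = {0x12, 0x34, 0x56, 0x78};
    static const unsigned char pdp[4]    = {0x34, 0x12, 0x78, 0x56};

    if (memcmp(b, little, 4) == 0) return ORDER_LITTLE;
    if (memcmp(b, big, 4) == 0)    return ORDER_BIG;
    if (memcmp(b, pdp, 4) == 0)    return ORDER_PDP;
    return ORDER_OTHER;   /* any of the remaining permutations */
}

/* Classify the host by copying out the bytes of a 32-bit value. */
enum byte_order host_byte_order(void)
{
    uint32_t v = 0x12345678u;
    unsigned char b[4];

    if (sizeof v != sizeof b)   /* bytes wider than 8 bits */
        return ORDER_OTHER;
    memcpy(b, &v, sizeof b);
    return classify_bytes(b);
}
```

A layout such as {0x78, 0x12, 0x34, 0x56} has 0x78 in the first byte, so the original test would report "little endian" for it, while this version reports ORDER_OTHER.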
Writing to one member of a union and reading from another member is
implementation defined(K & R). This example is used for testing
endianess @ c-faq.com. I know that gcc allows this. Is the above
snippet to test for endianess legal C or C++?

Neither C nor C++ use the term legal. It contains no syntax errors, it
has no constraint violations, no diagnostics are required, and the
behavior is not undefined, according to the rules of either language. In
C++ it qualifies as "well-formed code". The closest comparable term in C
is "strictly conforming", but it doesn't qualify for that: it produces
different results on different platforms, which is the whole point of
this particular program, but such platform dependence is prohibited for
strictly conforming programs.
 

James Kuyper

On 06/17/2011 09:32 AM, James Kuyper wrote:
....
been unable to locate it again. Big-endian and little-endian were
overwhelmingly the most common orders, but the two orders that were next
most common would set ch[] to {0x34, 0x12, 0x78, 0x56} or {0x56, 0x78,
0x34, 0x12}.

Correction: the second one should have been {0x56, 0x78, 0x12, 0x34}.
 

Bhasker Penta

On 06/17/2011 09:32 AM, James Kuyper wrote:
...
been unable to locate it again. Big-endian and little-endian were
overwhelmingly the most common orders, but the two orders that were next
most common would set ch[] to {0x34, 0x12, 0x78, 0x56} or {0x56, 0x78,
0x34, 0x12}.

Correction: the second one should have been {0x56, 0x78, 0x12, 0x34}.

Nice info.
About the snippet, according to you, is the code platform dependent
even if one is reading from unsigned char members?
 

Willem

Ian Collins wrote:
) On 06/17/11 03:31 PM, Stefan Ram wrote:
)> Endianess is an implementation detail of a higher
)> programming language that the language wants to hide from
)> you (information hiding), because usually one does not need
)> to know it. One even can serialize and deserialize in either
)> a portable or an implementation specific manner without
)> knowing this.
)
) Who ever writes the serialisation code does need to know. If you need
) to know the endianess, you are probably writing serialisation code!

He just *specifically* stated that who ever writes the serialisation
code *DOES NOT NEED* to know, so your statement does not make sense.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 

Willem

China Blue Angels wrote:
) In article <[email protected]>,
) (e-mail address removed)-berlin.de (Stefan Ram) wrote:
)> »When a value is stored in a member of an object of union type,
)> the bytes of the object representation that do not
)> correspond to that member but do correspond to other
)> members take unspecified values«
)> ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
)
) All the bytes of i correspond to bytes of ch, and all the bytes of ch correspond
) to bytes of i.
)
) union // sizeof(int) == 4
) {
) int i;
) char ch[4];
) } U;

Or not. It's UNSPECIFIED.


 

James Kuyper

China Blue Angels wrote:
) In article <[email protected]>,
) (e-mail address removed)-berlin.de (Stefan Ram) wrote:
)> »When a value is stored in a member of an object of union type,
)> the bytes of the object representation that do not
)> correspond to that member but do correspond to other
)> members take unspecified values«
)> ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
)
) All the bytes of i correspond to bytes of ch, and all the bytes of ch correspond
) to bytes of i.
)
) union // sizeof(int) == 4
) {
) int i;
) char ch[4];
) } U;

Or not. It's UNSPECIFIED.

As far as I can tell, every single byte of the object representation of
U that corresponds to U.ch is also a byte that corresponds to U.i (and
vice versa). There are no relevant bytes that take on unspecified
values. Are you suggesting otherwise? If so, on what grounds?
 

James Kuyper

On 06/17/2011 09:32 AM, James Kuyper wrote:
...
been unable to locate it again. Big-endian and little-endian were
overwhelmingly the most common orders, but the two orders that were next
most common would set ch[] to {0x34, 0x12, 0x78, 0x56} or {0x56, 0x78,
0x34, 0x12}.

Correction: the second one should have been {0x56, 0x78, 0x12, 0x34}.

Nice info.
About the snippet, according to you, is the code platform dependent
even if one is reading from unsigned char members?

Of course - the purpose of the code is to produce platform dependent
behavior: it's supposed to report whether or not the platform is
big-endian or little-endian. While the code, as written, does not
correctly perform that test, the test it does perform is also platform
dependent.
 

Shao Miller

China Blue Angels wrote:
) In article<[email protected]>,
) (e-mail address removed)-berlin.de (Stefan Ram) wrote:
)> »When a value is stored in a member of an object of union type,
)> the bytes of the object representation that do not
)> correspond to that member but do correspond to other
)> members take unspecified values«
)> ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
)
) All the bytes of i correspond to bytes of ch, and all the bytes of ch correspond
) to bytes of i.
)
) union // sizeof(int) == 4
) {
) int i;
) char ch[4];
) } U;

Or not. It's UNSPECIFIED.

What is "it?" The behaviour? The values for each element of 'ch'?
Given that 'sizeof (int) == 4', the 'ch' array member exactly overlaps
the 'i' member. I don't follow you.
 

Keith Thompson

James Kuyper said:
China Blue Angels wrote:
) In article <[email protected]>,
) (e-mail address removed)-berlin.de (Stefan Ram) wrote:
)> »When a value is stored in a member of an object of union type,
)> the bytes of the object representation that do not
)> correspond to that member but do correspond to other
)> members take unspecified values«
)> ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
)
) All the bytes of i correspond to bytes of ch, and all the bytes of ch correspond
) to bytes of i.
)
) union // sizeof(int) == 4
) {
) int i;
) char ch[4];
) } U;

Or not. It's UNSPECIFIED.

As far as I can tell, every single byte of the object representation of
U that corresponds to U.ch is also a byte that corresponds to U.i (and
vice versa). There are no relevant bytes that take on unspecified
values. Are you suggesting otherwise? If so, on what grounds?

And since the representations of int and char are implementation-defined
(meaning the implementation must document them), you can tell, given the
implementation's documentation, what values are stored in ch when you
store a value in i (unless there are padding bits).
 

Shao Miller

Shao Miller said:
Which union rules are you worried about, in particular?

One might worry about not knowing whether or where C actually
specifies the value of a certain member. For example, in

U.i=0x12345678; // writing to int member
if ( U.ch[0]==0x78 ) // reading from char member

or

int i=0x12345678; // sizeof(int) == 4
char *ch=(char *)&i;

, we assume that the value *ch is a »window« into the
in-memory representation of i. But does the C standard
actually require an implementation to behave this way
somewhere? If so, where?

6.2.6.1p4 defines "object representation." The 'i' member and the union
itself have an object representation.

6.2.6.1p5 allows for an lvalue expression ('U.ch[0]') with a character
type (such as 'char') to read the stored value.

6.5p7 confirms that an lvalue expression with a character type can read
the stored value.
Yes, it would be nice to know, where one can find this.
In the best case, all the steps needed to prove that *ch
really has the semantics as intended above.

Please see above, plus:

6.2.6p3 states that the representation of 'unsigned char' is "pure binary."

6.2.6.2p1 states that 'unsigned char' is not divided into value bits and
padding bits. Since it has values, that leaves only value bits. That
means that there is no bit which cannot be accessed.

The treatment of 'signed char' and implementations where 'char' is akin
to 'signed char' simply has one of the bits being a sign bit[6.2.6.2p6].

Do you have alternative interpretations of these?
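
As a small illustration of the object-representation reading above, here is a sketch that copies an int's bytes into an 'unsigned char' array with memcpy and then copies them back. The function name is invented for the example. The point is only that the array holds exactly the bytes that make up the stored value:

```c
#include <string.h>

/* Copy an int's object representation into unsigned char[],
   then rebuild an int from those same bytes. */
int round_trip_through_bytes(int value)
{
    unsigned char bytes[sizeof (int)];
    int copy;

    memcpy(bytes, &value, sizeof value);  /* bytes now hold the representation */
    memcpy(&copy, bytes, sizeof copy);    /* same bytes, so the same value */
    return copy;
}
```

Reading bytes[0] after the first memcpy is equivalent, for the purposes of this thread, to reading U.ch[0] or *ch in the earlier snippets.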
 

Edward A. Falk

If I may...
void endianTest()
{
    union // sizeof(int) == 4
    {
        int i;
        char ch[4];
    } U;

    U.i=0x12345678; // writing to int member
    if ( U.ch[0]==0x78 ) // reading from char member
        puts("\nLittle endian");
    else if (U.ch[0]==0x34)
        puts("\nPDP-11 endian");
    else
        puts("\nBig endian");
}
 

Keith Thompson

christian.bau said:
Is there actually any guarantee in the C Standard that there is such a
thing as "byte order"? I thought with 32 bit ints stored in 32 bits
worth of bytes there would be 32! (32 factorial) possible bit orders?

C99 6.2.6.1 requires a "pure binary notation" for unsigned bit-fields
and objects of type unsigned char, with a footnote:

A positional representation for integers that uses the binary
digits 0 and 1, in which the values represented by successive
bits are additive, begin with 1, and are multiplied by successive
integral powers of 2, except perhaps the bit with the highest
position.

The words "positional" and "successive" imply to me that only two
bit orders are permitted for unsigned char. (Or perhaps just one;
it's not at all clear that there's even any meaning to the positions
of the bits beyond the values they represent.)

6.2.6.2, discussing the representation of unsigned integer types,
again uses the phrase "pure binary notation".

Each integer type is required to have the same values for its value
bits as the corresponding bits in the corresponding unsigned type.

Though it's not 100% clear what "successive" means. I suppose
it could just mean traversing the bits in order of the values they
represent, which isn't necessarily the same as either the order of
the bits in the constituent bytes or the physical order (if that's
even meaningful).

I think that either it permits 32 factorial bit orders for a 32-bit
integer, or it forbids PDP-11 middle-endian order (and I seriously
doubt that the latter was intended.)
 
