Union test for endianness

Discussion in 'C Programming' started by Bhasker Penta, Jun 17, 2011.

  1. One way to test for endianness is to use a union:

    #include <stdio.h>

    void endianTest(void)
    {
        union // sizeof(int) == 4
        {
            int i;
            char ch[4];
        } U;

        U.i = 0x12345678;      // writing to int member
        if (U.ch[0] == 0x78)   // reading from char member
            puts("\nLittle endian");
        else
            puts("\nBig endian");
    }

    Writing to one member of a union and reading from another member is
    implementation-defined (K&R). This example is used for testing
    endianness at c-faq.com. I know that gcc allows this. Is the above
    snippet to test for endianness legal C or C++?
     
    Bhasker Penta, Jun 17, 2011
    #1

  2. Bhasker Penta

    Stefan Ram Guest

    »When a value is stored in a member of an object of union type,
    the bytes of the object representation that do not
    correspond to that member but do correspond to other
    members take unspecified values«
    ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
    ISO/IEC 9899:1999 (E), 6.2.6.1#7

    One also might cast a pointer to int into a pointer to char[],
    but I assume, dereferencing this might also give unspecified
    values in the best case or might result in undefined behavior
    in the worst case?

    Endianness is an implementation detail of a higher
    programming language that the language wants to hide from
    you (information hiding), because usually one does not need
    to know it. One can even serialize and deserialize in either
    a portable or an implementation-specific manner without
    knowing this.
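
    A minimal sketch of the portable approach, assuming the wire
    format is four octets with the most significant one first and
    that the implementation provides the optional uint32_t type. The
    names put_u32_be and get_u32_be are hypothetical. Only shifts on
    values are used, so the host byte order never enters into it:

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Store v into buf[0..3], most significant octet first,
       regardless of how the host lays out a uint32_t in memory. */
    void put_u32_be(unsigned char *buf, uint32_t v)
    {
        assert(buf != NULL);
        buf[0] = (unsigned char)((v >> 24) & 0xFFu);
        buf[1] = (unsigned char)((v >> 16) & 0xFFu);
        buf[2] = (unsigned char)((v >> 8) & 0xFFu);
        buf[3] = (unsigned char)(v & 0xFFu);
    }

    /* Rebuild the value from four octets written by put_u32_be. */
    uint32_t get_u32_be(const unsigned char *buf)
    {
        assert(buf != NULL);
        return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
               ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
    }
    ```

    The same pair of functions should produce the same octets on a
    little-endian, big-endian or middle-endian host.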

    However, each specific C implementation is free to disclose
    this implementation detail in its documentation.

    For such purposes, it might be nice, if standard C would
    define names for all the properties an autoconf script
    usually determines, so that each C implementation could
    predefine them.
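
    Some implementations already do this as an extension. As far as
    I know, GCC and Clang predefine __BYTE_ORDER__ together with
    __ORDER_LITTLE_ENDIAN__ and __ORDER_BIG_ENDIAN__; these are not
    standard C, so the sketch below falls back to "unknown" when they
    are missing. The function name is hypothetical:

    ```c
    /* Report the byte order the compiler claims for its target,
       using predefined macros that are a compiler extension. */
    const char *compiler_reported_byte_order(void)
    {
    #if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
        defined(__ORDER_BIG_ENDIAN__)
    #  if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        return "little-endian";
    #  elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
        return "big-endian";
    #  else
        return "other";
    #  endif
    #else
        return "unknown (no predefined byte-order macro)";
    #endif
    }
    ```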
     
    Stefan Ram, Jun 17, 2011
    #2

  3. Bhasker Penta

    Ian Collins Guest

    How is that relevant to the question, which assumes sizeof(int) == 4?
    Do what? How is that relevant to, well, anything?
    Whoever writes the serialisation code does need to know. If you need
    to know the endianness, you are probably writing serialisation code!
     
    Ian Collins, Jun 17, 2011
    #3
  4. Bhasker Penta

    Shao Miller Guest

    Please note that

    One possible way to help to ensure that 'sizeof (int) == 4' and that you
    have 8-bit bytes is to:

    #include <limits.h> /* for CHAR_BIT */

    #define TT_ASSERT(message, test) \
        typedef char (message)[(test) ? 1 : -1]

    TT_ASSERT(INT_IS_NOT_4_BYTES, sizeof (int) == 4);
    TT_ASSERT(NOT_8_BIT_BYTE, CHAR_BIT == 8);

    As far as I know, if 'sizeof (int) == 4' as shown, you can certainly
    read from each element of the 'U.ch' array. C doesn't guarantee that
    'sizeof (int) == 4', of course.

    Combined with the 'TT_ASSERT's above, you could have your union as:

    union {
        unsigned int i;
        unsigned char ch[sizeof (unsigned int)];
    } U;

    (Note that the use of 'unsigned' attempts to avoid any potential sign
    bit complications; the 'TT_ASSERT' might be better off matching, too.)
    If you know that the implementation definitely uses an 8-bit byte, a
    4-byte 'int', and that there are no padding bits and that '0x12345678'
    is within the range of values for 'int', then I'd say yes for "legal C". :)
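
    Putting those pieces together, a sketch might look like the
    following. It assumes a 4-byte 'unsigned int' with 8-bit bytes
    (checked at compile time with the negative-array-size trick) and
    compares all four bytes rather than only the first. The enum and
    function names are made up for illustration:

    ```c
    #include <limits.h>

    /* Compilation fails with a negative array size if an assumption
       does not hold on this implementation. */
    typedef char assert_uint_is_4_bytes[(sizeof(unsigned int) == 4) ? 1 : -1];
    typedef char assert_byte_is_8_bits[(CHAR_BIT == 8) ? 1 : -1];

    enum byte_order { ORDER_LITTLE, ORDER_BIG, ORDER_OTHER };

    enum byte_order union_byte_order(void)
    {
        union {
            unsigned int i;
            unsigned char ch[sizeof(unsigned int)];
        } u;

        u.i = 0x12345678u; /* write the integer member */

        /* read the same bytes back through the unsigned char member */
        if (u.ch[0] == 0x78 && u.ch[1] == 0x56 &&
            u.ch[2] == 0x34 && u.ch[3] == 0x12)
            return ORDER_LITTLE;
        if (u.ch[0] == 0x12 && u.ch[1] == 0x34 &&
            u.ch[2] == 0x56 && u.ch[3] == 0x78)
            return ORDER_BIG;
        return ORDER_OTHER;
    }
    ```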
     
    Shao Miller, Jun 17, 2011
    #4
  5. At least on my machine (Windows 7 64-bit) sizeof(int)==4,
    sizeof(char)==1 and '0x12345678' is within the 'int' limit. But the fact
    is we are writing to the int member and reading from the (different) char
    member. That doesn't go well with union rules. If it is legal in the C
    language to reinterpret the content of any object as a char array (or
    char pointer), then I believe the above snippet is technically correct C
    code (I may be wrong).
    E.g.:

    int i = 0x12345678;      // sizeof(int) == 4
    char *p = (char *)&i;
    if (*p == 0x78)          // reinterpreting int i through a char pointer
        puts("Little Endian");
    else
        puts("Big Endian");
     
    Bhasker Penta, Jun 17, 2011
    #5
  6. Bhasker Penta

    Shao Miller Guest

    I believe it's quite all right. 6.5.2.3p3 has:

    "A postfix expression followed by the . operator and an identifier
    designates a member of a structure or union object. The value is that of
    the named member, and is an lvalue if the first expression is an lvalue.
    If the first expression has qualified type, the result has the
    so-qualified version of the type of the designated member."

    Since you are using your 'ch' array, its element type is a character
    type, and there are no trap representations for character types. The
    last-stored value for the union has an object representation[6.2.6.1p4]
    and that representation is then used for 'ch'.

    Which union rules are you worried about, in particular?
    "char array": Yes. "char pointer": I think you mean if it's accessed
    via a pointer to a character type. Yes, that's quite often the case.

    One of the guarantees of the character types is that all objects can
    have all of their bits manipulated/inspected via access through a
    character type. This is useful for copying, for example. Scalar types
    other than character types might have trap representations, if I recall
    correctly.

    Another nice thing about character types is that they have the weakest
    alignment requirement; a pointer to a character type can be cast from
    any other pointer-to-object-type because the alignment is fine[6.3.2.3p7].
    Absolutely as legitimate as your previous code. :)

    (Using the 'unsigned' variants is "nicer," in my opinion; no sign bit.)
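
    To illustrate the "inspect every byte through a character type"
    guarantee, here is a rough sketch that formats the object
    representation of any object as hex. It assumes 8-bit bytes (so
    each byte prints as two hex digits); the function name is
    hypothetical:

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Format the first `size` bytes of *obj as space-separated hex
       pairs in `out`. Returns how many bytes were formatted, which
       is less than `size` if `out` is too small. */
    size_t format_object_bytes(const void *obj, size_t size,
                               char *out, size_t out_size)
    {
        const unsigned char *p = (const unsigned char *)obj;
        size_t i, used = 0;

        assert(obj != NULL && out != NULL && out_size > 0);
        out[0] = '\0';
        for (i = 0; i < size; i++) {
            int n;
            if (out_size - used < 4) /* room for " xx" and '\0' */
                break;
            n = snprintf(out + used, out_size - used,
                         i ? " %02x" : "%02x", (unsigned)p[i]);
            if (n < 0)
                break;
            used += (size_t)n;
        }
        return i;
    }
    ```

    Passing the address of an 'unsigned int' holding 0x12345678 along
    with 'sizeof' that object shows the byte order of the
    implementation directly.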
     
    Shao Miller, Jun 17, 2011
    #6
  7. It's probably worth pointing out that this is (a) C99 and (b) has no
    casts!

    Since all that's needed is a test for two of the possible byte orders,
    I'd avoid using a value that might not be a valid int:

    #define X (((union{int i; char ch[sizeof(int)];}){.i=1}).ch[0])

    The tests then become X and !X (so I'd use some other name).
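
    A sketch of how that might be used, with a hypothetical name in
    place of X. It needs C99 (compound literal and designated
    initializer), and it only says whether the lowest-addressed byte
    of an int holds the least significant bits, so a zero result does
    not by itself prove big-endian:

    ```c
    /* Nonzero when the first byte of an int with value 1 is nonzero,
       i.e. the low-order byte is stored at the lowest address. */
    #define LOW_BYTE_FIRST \
        (((union { int i; char ch[sizeof(int)]; }){ .i = 1 }).ch[0])

    const char *low_byte_order_name(void)
    {
        return LOW_BYTE_FIRST ? "low byte first (little-endian style)"
                              : "low byte not first";
    }
    ```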
     
    Ben Bacarisse, Jun 17, 2011
    #7
  8. Bhasker Penta

    Stefan Ram Guest

    One might worry about not knowing whether or where C actually
    specifies the value of a certain member. For example, in

    U.i=0x12345678; // writing to int member
    if ( U.ch[0]==0x78 ) // reading from char member

    or

    int i=0x12345678; // sizeof(int) == 4
    char *ch=(char *)&i;

    , we assume that the value *ch is a »window« into the
    in-memory representation of i. But does the C standard
    actually require an implementation to behave this way
    somewhere? If so, where?
    Yes, it would be nice to know where one can find this.
    In the best case, all the steps needed to prove that *ch
    really has the semantics intended above.
     
    Stefan Ram, Jun 17, 2011
    #8
  9. Bhasker Penta

    James Kuyper Guest

    There's a total of 24 possible byte orders for 4-byte integers, and a
    few of the other 22 orders have in fact been used. The other 22 orders
    are generically referred to as "middle-endian", and 5 of them would have
    a value of 0x78 in ch[0]. I once found a web page listing the byte
    orders that had actually been used, and citing specific machines on
    which they had been used - unfortunately, I didn't save it, and have
    been unable to locate it again. Big-endian and little-endian were
    overwhelmingly the most common orders, but the two orders that were next
    most common would set ch[] to {0x34, 0x12, 0x78, 0x56} or {0x56, 0x78,
    0x34, 0x12}. One of those two orders (I'm not sure which) was the one
    used on the PDP-11 where I did my first C programming. There were
    several other orders also in actual use, though far less commonly even
    than those two.

    Neither C nor C++ uses the term "legal". The code contains no syntax
    errors, it has no constraint violations, no diagnostics are required, and the
    behavior is not undefined, according to the rules of either language. In
    C++ it qualifies as "well-formed code". The closest comparable term in C
    is "strictly conforming", but it doesn't qualify for that: it produces
    different results on different platforms, which is the whole point of
    this particular program, but such platform dependence is prohibited for
    strictly conforming programs.
     
    James Kuyper, Jun 17, 2011
    #9
  10. Bhasker Penta

    James Kuyper Guest

    On 06/17/2011 09:32 AM, James Kuyper wrote:
    ....
    Correction: the second one should have been {0x56, 0x78, 0x12, 0x34}.
     
    James Kuyper, Jun 17, 2011
    #10
  11. Nice info.
    About the snippet, according to you, is the code platform dependent
    even if one is reading from unsigned char members?
     
    Bhasker Penta, Jun 17, 2011
    #11
  12. Bhasker Penta

    Willem Guest

    Ian Collins wrote:
    ) On 06/17/11 03:31 PM, Stefan Ram wrote:
    )> Endianness is an implementation detail of a higher
    )> programming language that the language wants to hide from
    )> you (information hiding), because usually one does not need
    )> to know it. One even can serialize and deserialize in either
    )> a portable or an implementation specific manner without
    )> knowing this.
    )
    ) Whoever writes the serialisation code does need to know. If you need
    ) to know the endianness, you are probably writing serialisation code!

    He just *specifically* stated that whoever writes the serialisation
    code *DOES NOT NEED* to know, so your statement does not make sense.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Jun 17, 2011
    #12
  13. Bhasker Penta

    Willem Guest

    China Blue Angels wrote:
    ) In article <-berlin.de>,
    ) -berlin.de (Stefan Ram) wrote:
    )> »When a value is stored in a member of an object of union type,
    )> the bytes of the object representation that do not
    )> correspond to that member but do correspond to other
    )> members take unspecified values«
    )> ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
    )
    ) All the bytes of i correspond to bytes of ch, and all the bytes of ch correspond
    ) to bytes of i.
    )
    ) union // sizeof(int) == 4
    ) {
    ) int i;
    ) char ch[4];
    ) } U;

    Or not. It's UNSPECIFIED.


    SaSW, Willem
     
    Willem, Jun 17, 2011
    #13
  14. Bhasker Penta

    James Kuyper Guest

    As far as I can tell, every single byte of the object representation of
    U that corresponds to U.ch is also a byte that corresponds to U.i (and
    vice versa). There are no relevant bytes that take on unspecified
    values. Are you suggesting otherwise? If so, on what grounds?
     
    James Kuyper, Jun 17, 2011
    #14
  15. Bhasker Penta

    James Kuyper Guest

    Of course - the purpose of the code is to produce platform dependent
    behavior: it's supposed to report whether or not the platform is
    big-endian or little-endian. While the code, as written, does not
    correctly perform that test, the test it does perform is also platform
    dependent.
     
    James Kuyper, Jun 17, 2011
    #15
  16. Bhasker Penta

    Shao Miller Guest

    What is "it?" The behaviour? The values for each element of 'ch'?
    Given that 'sizeof (int) == 4', the 'ch' array member exactly overlaps
    the 'i' member. I don't follow you.
     
    Shao Miller, Jun 17, 2011
    #16
  17. And since the representations of int and char are implementation-defined
    (meaning the implementation must document them), you can tell, given the
    implementation's documentation, what values are stored in ch when you
    store a value in i (unless there are padding bits).
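
    Whether 'unsigned int' has padding bits can be estimated from
    <limits.h> alone. This is only a sketch with hypothetical function
    names: it counts the value bits in UINT_MAX and compares that with
    the size of the object representation in bits:

    ```c
    #include <limits.h>

    /* Number of value bits in unsigned int, found by shifting
       UINT_MAX right until nothing is left. */
    int uint_value_bits(void)
    {
        unsigned int max = UINT_MAX;
        int bits = 0;

        while (max != 0) {
            max >>= 1;
            bits++;
        }
        return bits;
    }

    /* Bits in the object representation that are not value bits. */
    int uint_padding_bits(void)
    {
        return (int)(sizeof(unsigned int) * CHAR_BIT) - uint_value_bits();
    }
    ```

    A result of zero from uint_padding_bits() means every bit seen
    through the char array contributes to the value of the integer.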
     
    Keith Thompson, Jun 17, 2011
    #17
  18. Bhasker Penta

    Shao Miller Guest

    6.2.6.1p4 defines "object representation." The 'i' member and the union
    itself have an object representation.

    6.2.6.1p5 allows for an lvalue expression ('U.ch[0]') with a character
    type (such as 'char') to read the stored value.

    6.5p7 confirms that an lvalue expression with a character type can read
    the stored value.
    Please see above, plus:

    6.2.6.1p3 states that the representation of 'unsigned char' is "pure binary."

    6.2.6.2p1 states that 'unsigned char' is not divided into value bits and
    padding bits. Since it has values, that leaves only value bits. That
    means that there is no bit which cannot be accessed.

    The treatment of 'signed char' and implementations where 'char' is akin
    to 'signed char' simply has one of the bits being a sign bit[6.2.6.2p6].

    Do you have alternative interpretations of these?
     
    Shao Miller, Jun 17, 2011
    #18
  19. If I may...
    else if (U.ch[0]==0x34)
        puts("\nPDP-11 endian");
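
    Checking only ch[0] cannot tell every order apart, so a fuller
    version would compare all four bytes. A sketch, assuming a 32-bit
    value of 0x12345678 stored in four 8-bit bytes and taking
    {0x34, 0x12, 0x78, 0x56} as the PDP-11 layout; the function names
    are hypothetical:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    /* Fails to compile unless uint32_t occupies exactly 4 bytes. */
    typedef char assume_u32_is_4_bytes[(sizeof(uint32_t) == 4) ? 1 : -1];

    /* Name the layout of the four bytes that hold 0x12345678. */
    const char *classify_byte_order(const unsigned char b[4])
    {
        static const unsigned char little[4] = { 0x78, 0x56, 0x34, 0x12 };
        static const unsigned char big[4]    = { 0x12, 0x34, 0x56, 0x78 };
        static const unsigned char pdp[4]    = { 0x34, 0x12, 0x78, 0x56 };

        assert(b != NULL);
        if (memcmp(b, little, 4) == 0) return "little-endian";
        if (memcmp(b, big, 4) == 0)    return "big-endian";
        if (memcmp(b, pdp, 4) == 0)    return "PDP-11 endian";
        return "other";
    }

    /* Classify the machine the code is running on. */
    const char *host_byte_order(void)
    {
        uint32_t v = 0x12345678u;
        unsigned char b[4];

        memcpy(b, &v, sizeof b); /* copy out the object representation */
        return classify_byte_order(b);
    }
    ```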
     
    Edward A. Falk, Jun 17, 2011
    #19
  20. C99 6.2.6.1 requires a "pure binary notation" for unsigned bit-fields
    and objects of type unsigned char, with a footnote:

    A positional representation for integers that uses the binary
    digits 0 and 1, in which the values represented by successive
    bits are additive, begin with 1, and are multiplied by successive
    integral powers of 2, except perhaps the bit with the highest
    position.

    The words "positional" and "successive" imply to me that only two
    bit orders are permitted for unsigned char. (Or perhaps just one;
    it's not at all clear that there's even any meaning to the positions
    of the bits beyond the values they represent.)

    6.2.6.2, discussing the representation of unsigned integer types,
    again uses the phrase "pure binary notation".

    Each integer type is required to have the same values for its value
    bits as the corresponding bits in the corresponding unsigned type.

    Though it's not 100% clear what "successive" means. I suppose
    it could just mean traversing the bits in order of the values they
    represent, which isn't necessarily the same as either the order of
    the bits in the constituent bytes or the physical order (if that's
    even meaningful).

    I think that either it permits 32 factorial bit orders for a 32-bit
    integer, or it forbids PDP-11 middle-endian order (and I seriously
    doubt that the latter was intended.)
     
    Keith Thompson, Jun 18, 2011
    #20
