question on union

Morris Dovey · Feb 13, 2008

Roman said:
What I don't get is how come that un.c[0] and un.c[1] both contain what has
been un.s initialized, i.e. 0x0102. Is it a feature of 'union'?
Why could not we use 'struct' to check how bytes are placed in memory ?

The elements in a structure all occupy separate and distinct
"pieces" of memory, but the elements of a union all occupy a
common "piece" of memory.

user923005 · Feb 13, 2008

Hello, Morris!
You wrote on Tue, 12 Feb 2008 20:32:28 -0500:

??>> What I don't get is how come that un.c[0] and un.c[1] both contain
??>> what has been un.s initialized, i.e. 0x0102. Is it a feature of
??>> 'union'? Why could not we use 'struct' to check how bytes are placed
??>> in memory ?

MD> The elements in a structure all occupy separate and distinct
MD> "pieces" of memory, but the elements of a union all occupy a
MD> common "piece" of memory.
What is the mechanics behind that? Say, in posted example at run-time un.s =
0x0102 and this value (0x0102) occupies some common memory. Is it CPU who is
in charge to lay out un.c value in the memory according to architecture?

The mechanics don't matter. But one possible implementation is to
simply alias several distinct data type addresses to the same address
in memory, with a pointer for each desired type. In the real world,
it probably won't happen that way, since it would be wasteful. More
likely, there is a table in memory somewhere describing the layout for
each distinct union type. That table will have the different ways to
interpret the different members of the union recorded.

In C, you don't have to worry about the physical mechanics of such a
thing. Trust the compiler writers to create a good implementation of
it. And if they don't, switch compilers.

Then why do both values look differently, in debugger:

(gdb) p/x un
$3 = {s = 0x102, c = {0x2, 0x1}}
(gdb)

Because you are interpreting one value as one type and the other value
as a different type.

Arthur · Feb 13, 2008

Hello, Morris!
You wrote on Tue, 12 Feb 2008 20:32:28 -0500:

??>> What I don't get is how come that un.c[0] and un.c[1] both contain
??>> what has been un.s initialized, i.e. 0x0102. Is it a feature of
??>> 'union'? Why could not we use 'struct' to check how bytes are placed
??>> in memory ?

MD> The elements in a structure all occupy separate and distinct
MD> "pieces" of memory, but the elements of a union all occupy a
MD> common "piece" of memory.
What is the mechanics behind that? Say, in posted example at run-time un.s =
0x0102 and this value (0x0102) occupies some common memory. Is it CPU who is
in charge to lay out un.c value in the memory according to architecture?
Then why do both values look differently, in debugger:

(gdb) p/x un
$3 = {s = 0x102, c = {0x2, 0x1}}
(gdb)

With best regards, Roman Mashak. E-mail: (e-mail address removed)

Hello! The CPU doesn't lay out a union according to its architecture;
in fact, your compiler does that when it compiles your program.

Suppose you define:
    union u_tag {
        int i;
        char c[sizeof(int)];
    } u;
and refer to it using
    u.i = 0x12;
    u.c[0] = 0x12;
the compiler will simply convert them into instructions like this:
    movl $0x12, _u
    movb $0x12, _u
The compiler uses the same symbol for u.i and u.c[0].

The reason they look different in your debugger is that
Intel CPUs are little-endian.
un.s is placed in memory like this:
    0x02 0x01
When referred to as un.s, it means a short int 0x0102, i.e. s = 0x102.
When referred to as un.c, it means an array of char, {0x02, 0x01}.

Please correct me if I made any mistakes. Have a good day!

Arthur · Feb 13, 2008

Hello, Arthur!
You wrote on Tue, 12 Feb 2008 21:59:55 -0800 (PST):

[skip]
Thanks for your explanations.

A> and refers it using
A> u.i = 0x12;
A> u.c[0] = 0x12;
A> the compiler will simply convert them into instructions like this:
A> movl $0x12, _u
A> movb $0x12, _u
A> The compiler uses same symbols for u.i and u.c[0].

A> The reason why they look different in your debugger is that
A> Intel CPUs use little-endian.
A> un.s is placed in memory like this:
A> 0x02 0x01
A> when referred as u.s, it means a short int 0x0102, i.e. s = 0x102
A> when referred as u.c, it means an array of char, {0x02, 0x01}
But both u.i and u.c are placed in memory on the same little-endian machine,
why do they look differently? I can't catch how it is done.

With best regards, Roman Mashak. E-mail: (e-mail address removed)

Hello! Your compiler records that un.s is a short int and un.c[] is an
array of char. When you compile your program with -g, it passes that
information to your debugger, so your debugger knows it too.

To understand why they look different, keep in mind that both un.c and
un.s are symbols that are simply addresses in memory (and the two
addresses are the same).

Suppose the union 'un' has been placed at address 0x80490d4. When you
execute 'un.s = 0x102;', the processor sets the byte located at
0x80490d4 to 0x02 and the one at 0x80490d5 to 0x01, since the Intel
CPU is little-endian.

When you refer to un.s, since sizeof(short) is 2 (on most 32-bit
systems), the CPU fetches the two bytes at 0x80490d4 (0x02) and
0x80490d5 (0x01) and combines them in little-endian order. That gives
0x0102, i.e., 0x102, just as your debugger reports.

But when you refer to un.c, since sizeof(char) is 1, the CPU fetches
the byte at 0x80490d4 (0x02) and presents it to the debugger, and then
fetches the next one (0x01). It doesn't combine them, so they appear
exactly as they are placed in memory, 0x02, 0x01, just as your
debugger reports.

Sorry for the lengthy explanation.

Martin Ambuhl · Feb 13, 2008

Roman said:
Hello,

I'm going through the "UNIX network programming" by R.Stevens and stuck with
the following code, determining the endianness of a host it is running on:

#include <stdio.h>
#include <stdlib.h>

#define CPU_VENDOR_OS "i686-pc-linux-gnu"

int main(void)
{
    union {
        short s;
        char c[sizeof(short)];
    } un;

    un.s = 0x0102;
    printf("%s: ", CPU_VENDOR_OS);
    if (sizeof(short) == 2) {
        if (un.c[0] == 1 && un.c[1] == 2)
            printf("big-endian\n");
        else if (un.c[0] == 2 && un.c[1] == 1)
            printf("little-endian\n");
        else
            printf("unknown\n");
    } else
        printf("sizeof(short) = %zu\n", sizeof(short));

    exit(0);
}

What I don't get is how come that un.c[0] and un.c[1] both contain what has
been un.s initialized, i.e. 0x0102. Is it a feature of 'union'?
Why could not we use 'struct' to check how bytes are placed in memory ?

The program is doing a very bad thing. The folks who "explained" why
this code "works" are doing you a disservice. The value of any union
member other than the last stored into is unspecified. _Never_ store
into one member of a union and attempt to access its value through
another except when accessing an identical common initial segment of
struct members. This is a special exception to the general rule that a
union can contain only one of its component values at a time. Storing
into one member and accessing another is attempting to have the union
contain more than one component value at a time.

You can accomplish the above with a non-union array into which you
memmove a value.

Mark Bluemel · Feb 13, 2008

Martin said:
Roman said:

Hello,

I'm going through the "UNIX network programming" by R.Stevens and
stuck with the following code, determining the endianness of a host it
is running on:

#include <stdio.h>
#include <stdlib.h>

#define CPU_VENDOR_OS "i686-pc-linux-gnu"

int main(void)
{
    union {
        short s;
        char c[sizeof(short)];
    } un;

    un.s = 0x0102;
    printf("%s: ", CPU_VENDOR_OS);
    if (sizeof(short) == 2) {
        if (un.c[0] == 1 && un.c[1] == 2)
            printf("big-endian\n");
        else if (un.c[0] == 2 && un.c[1] == 1)
            printf("little-endian\n");
        else
            printf("unknown\n");
    } else
        printf("sizeof(short) = %zu\n", sizeof(short));

    exit(0);
}

The program is doing a very bad thing.

I _think_ that's a bit of an overstatement. The program is not intended
to be totally portable C, given that it's included in a book on Unix
programming. I've just glanced at my copy of the book and from context
and comments in the text, it's clear that the gcc compiler is assumed.

Martin Ambuhl · Feb 13, 2008

Mark said:
Martin Ambuhl wrote:

I _think_ that's a bit of an overstatement. The program is not intended
to be totally portable C, given that it's included in a book on Unix
programming.

Then it belongs in a Unix newsgroup, not in comp.lang.c

I've just glanced at my copy of the book and from context
and comments in the text, it's clear that the gcc compiler is assumed.

And if gcc is significant, and if you are allergic to posting in the
Unix newsgroups, your second choice is a gnu newsgroup.

The people "answering" your question did you two gross disservices
1) they told you that undefined behavior was defined and
2) they led you to believe that off-topic posts were OK.

Mark Bluemel · Feb 13, 2008

Martin said:
Then it belongs in a Unix newsgroup, not in comp.lang.c

Perhaps, but the OP didn't realise that. (Note that I am not the OP).

The people "answering" your question did you two gross disservices

Again, it was not my question.

1) they told you that undefined behavior was defined and
2) they led you to believe that off-topic posts were OK.

I'm not convinced it was an off-topic post.
* The OP saw a piece of code, didn't quite "get it" and asked for
clarification.
* Some people answered his query inaccurately.
* You answered somewhat harshly, but accurately, as far as I know.
* I chose to add some further clarification.

To the Original Poster:

The program depends on behaviour which is not required by the C
standard, but which appears to be dependable in the context in
which the original author wrote it.

Following Martin's suggestion, the program could perhaps be better
written as :-

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CPU_VENDOR_OS "i686-pc-linux-gnu"

int main(void)
{
    short s;
    char c[sizeof(short)];

    s = 0x0102;
    printf("%s: ", CPU_VENDOR_OS);
    if (sizeof(short) == 2) {
        /* copy the bytes of s into the separate array c */
        memcpy(c, &s, sizeof(short));
        if (c[0] == 1 && c[1] == 2)
            printf("big-endian\n");
        else if (c[0] == 2 && c[1] == 1)
            printf("little-endian\n");
        else
            printf("unknown\n");
    } else
        printf("sizeof(short) = %zu\n", sizeof(short));

    exit(0);
}

Martin · Feb 13, 2008

_Never_ store into one member of a union and attempt to access its value through
another except when accessing an identical common initial segment of
struct members. This is a special exception to the general rule that a
union can contain only one of its component values at a time. Storing
into one member and accessing another is attempting to have the union
contain more than one component value at a time.

Does that mean that the answer to Summit's "C Programming FAQs"
Question 20.9 is wrong? Viz.:

union {
    int i;
    char c[sizeof(int)];
} x;

x.i = 1;

if (x.c[0] == 1)
    printf("little-endian\n");
else
    printf("big-endian\n");

Martin Ambuhl · Feb 13, 2008

Martin said:
_Never_ store into one member of a union and attempt to access its value through
another except when accessing an identical common initial segment of
struct members. This is a special exception to the general rule that a
union can contain only one of its component values at a time. Storing
into one member and accessing another is attempting to have the union
contain more than one component value at a time.

Does that mean that the answer to Summit's "C Programming FAQs"
Question 20.9 is wrong? Viz.:

union {
    int i;
    char c[sizeof(int)];
} x;

x.i = 1;

if (x.c[0] == 1)
    printf("little-endian\n");
else
    printf("big-endian\n");

Yes, it does. Notice that this snippet corresponds to Harbison &
Steele's example program in 6.1.2 "Byte Ordering". H&S correctly
introduce it with this text: "Here is a program that determines a
computer's byte ordering by using a union in a nonportable fashion."
The FAQ's reference is to an older edition of H&S, so I don't know if
that text was there. If that text was there, Steve ought not to have
suppressed it. In any case, if he keeps that example he ought to add
such a disclaimer. Nonportable uses of the language in the FAQ ought to
be flagged. Interestingly, the nonportable use of the language makes
this code worthless, since it is designed to tell you something about
nonportable aspects of an implementation.

Roman Mashak · Feb 13, 2008

Hello,

I'm going through the "UNIX network programming" by R.Stevens and stuck with
the following code, determining the endianness of a host it is running on:

#include <stdio.h>
#include <stdlib.h>

#define CPU_VENDOR_OS "i686-pc-linux-gnu"

int main(void)
{
    union {
        short s;
        char c[sizeof(short)];
    } un;

    un.s = 0x0102;
    printf("%s: ", CPU_VENDOR_OS);
    if (sizeof(short) == 2) {
        if (un.c[0] == 1 && un.c[1] == 2)
            printf("big-endian\n");
        else if (un.c[0] == 2 && un.c[1] == 1)
            printf("little-endian\n");
        else
            printf("unknown\n");
    } else
        printf("sizeof(short) = %zu\n", sizeof(short));

    exit(0);
}

What I don't get is how come that un.c[0] and un.c[1] both contain what has
been un.s initialized, i.e. 0x0102. Is it a feature of 'union'?
Why could not we use 'struct' to check how bytes are placed in memory ?

Thanks in advance!

With best regards, Roman Mashak. E-mail: (e-mail address removed)

Roman Mashak · Feb 13, 2008

Hello, Morris!
You wrote on Tue, 12 Feb 2008 20:32:28 -0500:

??>> What I don't get is how come that un.c[0] and un.c[1] both contain
??>> what has been un.s initialized, i.e. 0x0102. Is it a feature of
??>> 'union'? Why could not we use 'struct' to check how bytes are placed
??>> in memory ?

MD> The elements in a structure all occupy separate and distinct
MD> "pieces" of memory, but the elements of a union all occupy a
MD> common "piece" of memory.
What is the mechanics behind that? Say, in posted example at run-time un.s =
0x0102 and this value (0x0102) occupies some common memory. Is it CPU who is
in charge to lay out un.c value in the memory according to architecture?
Then why do both values look differently, in debugger:

(gdb) p/x un
$3 = {s = 0x102, c = {0x2, 0x1}}
(gdb)

With best regards, Roman Mashak. E-mail: (e-mail address removed)

Ben Bacarisse · Feb 13, 2008

Martin Ambuhl said:
Roman said:

I'm going through the "UNIX network programming" by R.Stevens and
stuck with the following code, determining the endianness of a host
it is running on:

#include <stdio.h>
#include <stdlib.h>

#define CPU_VENDOR_OS "i686-pc-linux-gnu"

int main(void)
{
    union {
        short s;
        char c[sizeof(short)];
    } un;

    un.s = 0x0102;
    printf("%s: ", CPU_VENDOR_OS);
    if (sizeof(short) == 2) {
        if (un.c[0] == 1 && un.c[1] == 2)
            printf("big-endian\n");
        else if (un.c[0] == 2 && un.c[1] == 1)
            printf("little-endian\n");
        else
            printf("unknown\n");
    } else
        printf("sizeof(short) = %zu\n", sizeof(short));

    exit(0);
}

What I don't get is how come that un.c[0] and un.c[1] both contain
what has been un.s initialized, i.e. 0x0102. Is it a feature of
'union'?
Why could not we use 'struct' to check how bytes are placed in memory ?

The program is doing a very bad thing. The folks who "explained" why
this code "works" are doing you a disservice. The value of any union
member other than the last stored into is unspecified.

Can you cite the prohibition? I thought it had been removed. There
is a footnote (yes, I know, non-normative) that states:

If the member used to access the contents of a union object is not
the same as the member last used to store a value in the object, the
appropriate part of the object representation of the value is
reinterpreted as an object representation in the new type as
described in 6.2.6 (a process sometimes called "type punning"). This
might be a trap representation.

(6.2.6 is the section on the representation of types.) Since unsigned
char can't have trap representations, I think the code above could be
re-written to stay within the letter of C99. The intent seems clear:
to allow type punning using a union.

Michael Mair · Feb 13, 2008

Ben said:
Martin Ambuhl said:

Roman said:

I'm going through the "UNIX network programming" by R.Stevens and
stuck with the following code, determining the endianness of a host
it is running on:

#include <stdio.h>
#include <stdlib.h>

#define CPU_VENDOR_OS "i686-pc-linux-gnu"

int main(void)
{
    union {
        short s;
        char c[sizeof(short)];
    } un;

    un.s = 0x0102;
    printf("%s: ", CPU_VENDOR_OS);
    if (sizeof(short) == 2) {
        if (un.c[0] == 1 && un.c[1] == 2)
            printf("big-endian\n");
        else if (un.c[0] == 2 && un.c[1] == 1)
            printf("little-endian\n");
        else
            printf("unknown\n");
    } else
        printf("sizeof(short) = %zu\n", sizeof(short));

    exit(0);
}

What I don't get is how come that un.c[0] and un.c[1] both contain
what has been un.s initialized, i.e. 0x0102. Is it a feature of
'union'?
Why could not we use 'struct' to check how bytes are placed in memory ?

The program is doing a very bad thing. The folks who "explained" why
this code "works" are doing you a disservice. The value of any union
member other than the last stored into is unspecified.

Can you cite the prohibition? I thought it had been removed. There
is a footnote (yes, I know, non-normative) that states:

If the member used to access the contents of a union object is not
the same as the member last used to store a value in the object, the
appropriate part of the object representation of the value is
reinterpreted as an object representation in the new type as
described in 6.2.6 (a process sometimes called "type punning"). This
might be a trap representation.

(6.2.6 is the section on the representation of types.) Since unsigned
char can't have trap representations, I think the code above could be
re-written to stay within the letter of C99. The intent seems clear:
to allow type punning using a union.

In the thread starting at
<[email protected]>
Tim Rentsch pointed out

,- From <[email protected]> --
My understanding is that the storing one member of a union in
different memory than another member was the result of unclear
language in the standard, and that the unclear language is
expected to be addressed through a TC. See:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm
`----

Cheers
Michael

Roman Mashak · Feb 13, 2008

Hello, Arthur!
You wrote on Tue, 12 Feb 2008 21:59:55 -0800 (PST):

[skip]
Thanks for your explanations.

A> and refers it using
A> u.i = 0x12;
A> u.c[0] = 0x12;
A> the compiler will simply convert them into instructions like this:
A> movl $0x12, _u
A> movb $0x12, _u
A> The compiler uses same symbols for u.i and u.c[0].

A> The reason why they look different in your debugger is that
A> Intel CPUs use little-endian.
A> un.s is placed in memory like this:
A> 0x02 0x01
A> when referred as u.s, it means a short int 0x0102, i.e. s = 0x102
A> when referred as u.c, it means an array of char, {0x02, 0x01}
But both u.i and u.c are placed in memory on the same little-endian machine,
why do they look differently? I can't catch how it is done.

With best regards, Roman Mashak. E-mail: (e-mail address removed)

Martin Ambuhl · Feb 14, 2008

Ben said:
Can you cite the prohibition?

Appendix J is "informative", but includes explicitly:

J.1 Unspecified behavior
1 The following are unspecified:
[...]
-- The value of a union member other than the last one stored into
(6.2.6.1).

Ben Bacarisse · Feb 14, 2008

Martin Ambuhl said:
Ben said:

Can you cite the prohibition?

Appendix J is "informative", but includes explicitly:

J.1 Unspecified behavior
1 The following are unspecified:
[...]
-- The value of a union member other than the last one stored into
(6.2.6.1).

Ah, right. I misunderstood your rather strong prohibition on
doing this type punning with a union. The behaviour is unspecified,
but so is the behaviour of your suggested alternative. Using memcpy
and inspecting the result will be no more specified than doing the
union trick. Is your objection to the union method stronger than
this?

Martin · Feb 14, 2008

Yes, it does. Notice that this snippet corresponds
to Harbison & Steele's example program in 6.1.2 "Byte Ordering".
H&S correctly introduce it with this text: "Here is a program
that determines a computer's byte ordering by using a union in
a nonportable fashion." The FAQ's reference is to an older
edition of H&S, so I don't know if that text was there. If that
text was there, Steve ought not to have suppressed it. In any
case, if he keeps that example he ought to add such a disclaimer.
Nonportable uses of the language in the FAQ ought to be flagged.
Interestingly, the nonportable use of the language makes this
code worthless, since it is designed to tell you something about
nonportable aspects of an implementation.

My copy of the book is dated 1996. I don't think there is a later
version.

In the book, as well as the union example I posted, there is also the
example as provided in the online FAQ, which uses a pointer. The
online FAQ and my edition of the book also cross-reference to Harbison
& Steele Sec. 6.1.2 pp. 163-4.

The introductory text you quote is not in my edition of the book.

Roman Mashak · Feb 14, 2008

Hello, Martin!
You wrote on Wed, 13 Feb 2008 05:31:07 -0500:

??>> I _think_ that's a bit of an overstatement. The program is not
??>> intended to be totally portable C, given that it's included in a book
??>> on Unix programming.

MA> Then it belongs in a Unix newsgroup, not in comp.lang.c

I thought the code rather belonged in a C forum, because there were no
Unix-specific calls. And this turned out to be true; I learned that such
behavior is undefined and not portable.
Thanks for your explanations.

??>> I've just glanced at my copy of the book and from context
??>> and comments in the text, it's clear that the gcc compiler is assumed.

MA> And if gcc is significant, and if you are allergic to posting in the
MA> Unix newsgroups, your second choice is a gnu newsgroup.

With best regards, Roman Mashak. E-mail: (e-mail address removed)
