Using a union to get different representation

C

Clint Olsen

I was wondering if it's considered undefined behavior to use a member of a
union when it wasn't initialized with that member. Example:

typedef unsigned long hval_t;

hval_t hval_init(void)
{
union hval_init_u { double dbl; hval_t hval; };
union hval_init_u phi = { (sqrt(5) + 1) / 2 }; /* golden ratio */

return phi.hval;
}

I saw on a thread here quite some time ago that it's reasonable to use a
char * to index into a double value and copy the bytes to an unsigned long,
but it seems much quicker and easier to do it this way.

I was wanting to use this as my default bit pattern for a hash function.
It doesn't have to be portable. It just needs to be somewhat random, and
this seemed like a reasonable way.

-Clint
 
M

Me

I was wondering if it's considered undefined behavior to use a member of a
union when it wasn't initialized with that member. Example:

typedef unsigned long hval_t;

hval_t hval_init(void)
{
union hval_init_u { double dbl; hval_t hval; };
union hval_init_u phi = { (sqrt(5) + 1) / 2 }; /* golden ratio */

return phi.hval;
}

I can't seem to find it in the C standard but here is text from the C++
standard: "In a union, at most one of the data members can be active at
any time, that is, the value of at most one of the data members can be
stored in a union at any time. [Note: one special guarantee is made in
order to simplify the use of unions: If a POD-union contains several
POD-structs that share a common initial sequence (9.2), and if an
object of this POD-union type contains one of the POD-structs, it is
permitted to inspect the common initial sequence of any of POD-struct
members; see 9.2. ]"

The C99 standard has almost exactly the same the initial common
sequence wording but not the active member wording, which is weird
because it is related to it. I'm taking the C++ wording as the correct
one, so I would say the above is undefined but works in practice due to
people abusing it so much.
I saw on a thread here quite some time ago that it's reasonable to use a
char * to index into a double value and copy the bytes to an unsigned long,
but it seems much quicker and easier to do it this way.

Don't be afraid to use memcpy. Most compilers implement it as an
intrinsic so it should generate the most efficient code (except one
that actually is guaranteed to work).
 
C

Clint Olsen

I can't seem to find it in the C standard but here is text from the C++
standard: "In a union, at most one of the data members can be active at
any time, that is, the value of at most one of the data members can be
stored in a union at any time. [Note: one special guarantee is made in
order to simplify the use of unions: If a POD-union contains several
POD-structs that share a common initial sequence (9.2), and if an object
of this POD-union type contains one of the POD-structs, it is permitted
to inspect the common initial sequence of any of POD-struct members; see
9.2. ]"

The C99 standard has almost exactly the same the initial common sequence
wording but not the active member wording, which is weird because it is
related to it. I'm taking the C++ wording as the correct one, so I would
say the above is undefined but works in practice due to people abusing it
so much.

Don't be afraid to use memcpy. Most compilers implement it as an
intrinsic so it should generate the most efficient code (except one
that actually is guaranteed to work).

I had considered that, but since it was short(er) to implement the bounds
checking for both the unsigned long and double in the for loop directly, I
just copied it manually. Ok, so I was too lazy to write a MIN macro :)
It's just very convenient with the union since it handles any length
differences for me.

I was thinking that perhaps writing and reading from different members
could cause a trap representation for the read value? It certainly
couldn't cause alignment problems as the union guarantees this shouldn't be
a problem.

Thanks for looking into this.

-Clint
 
M

Me

I was thinking that perhaps writing and reading from different members
could cause a trap representation for the read value? It certainly
couldn't cause alignment problems as the union guarantees this shouldn't be
a problem.

No combination of value bits is allowed to be a trap representation on
an unsigned integer but combinations of padding bits are (and the only
thing guaranteed not to have padding bits is unsigned char). In your
original post, you said it didn't have to be portable and I assumed you
knew the format of doubles and unsigned longs on your implementation.
 
C

Clint Olsen

No combination of value bits is allowed to be a trap representation on an
unsigned integer but combinations of padding bits are (and the only thing
guaranteed not to have padding bits is unsigned char). In your original
post, you said it didn't have to be portable and I assumed you knew the
format of doubles and unsigned longs on your implementation.

Actually, I don't know the format, but as you mentioned that isn't a
consequence for me since I don't care. The bits will be 'random' enough to
get the job done.

On page 52 (6.2.6.1) of the standard, it says the following (7):

"When a value is stored in a member of an object union type, the bytes of
the object representation that do not correspond to that member but do
correspond to other members take unspecified values, but the value of the
union object shall not thereby become a trap representation."

So, it would seem that this is also not an issue.

-Clint
 
R

Richard Bos

[ Please do not snip attribution lines of people whom you still quote. ]
I can't seem to find it in the C standard but here is text from the C++
standard:

The C++ Standard has no real bearing on what is and is not allowed in C,
since the two are different languages; C++ was derived from C, but the
type system is one of the areas where they differ, so I wouldn't trust
the C++ Standard in this regard even if it does appear to agree with C.
"In a union, at most one of the data members can be active at
any time, that is, the value of at most one of the data members can be
stored in a union at any time. [Note: one special guarantee is made in
order to simplify the use of unions: If a POD-union

For example, what is a POD-union, and how does it differ from a normal
union? In C, there is no such thing.

However, n1124 (the public draft of C99 + TC2) does contain this, in
6.2.6.1#7:

# When a value is stored in a member of an object of union type, the
# bytes of the object representation that do not correspond to that
# member but do correspond to other members take unspecified values.

(I note also that the wording of 6.2.6.1#6 and 7 is, IMO, much improved
over their counterparts in n869, but that the implications for the OP
remain the same.)

Richard
 
L

Lawrence Kirby

I was wondering if it's considered undefined behavior to use a member of a
union when it wasn't initialized with that member. Example:

The state of definedness is a bit muddy but it is certainly not portable.
typedef unsigned long hval_t;

hval_t hval_init(void)
{
union hval_init_u { double dbl; hval_t hval; };
union hval_init_u phi = { (sqrt(5) + 1) / 2 }; /* golden ratio */

return phi.hval;
}

I saw on a thread here quite some time ago that it's reasonable to use a
char * to index into a double value and copy the bytes to an unsigned long,

That needs to be unsigned char *, plain char doesn't have adequate
properties for this sort of copying.
but it seems much quicker and easier to do it this way.

I was wanting to use this as my default bit pattern for a hash function.

You need to be very careful here, hashing floating point values doesn't
usually make a lot of sense because they are approximations to start with.
Hashing is strongly associated with equality and where equality testing
doesn't make sense hashing doesn't either. And in some cases where
equality testing does make sense (e.g. as part of ordering ordering, e.g.
a sort comparison) hashing isn't applicable.

So what are you doing that requires the hashing of a floating point value?
It doesn't have to be portable. It just needs to be somewhat random, and
this seemed like a reasonable way.

The normal requirement of a hash value is the equal input values hash to
the same result (hence the relationship with equality). Is that what you
are looking for?

If so consider that it is possible for a floating point representation to
have padding bits that don't contribute to the value. It is also possible
to have things like +0 and -0 i.e. different representations of values
that compare equal. IOW two equal values may not hash to the same result.

I guess you could follow the Java bodginess by creating a floating point
equality test method that is consistent with this method of hashing.

Lawrence
 
L

Lawrence Kirby

On Thu, 07 Jul 2005 12:27:47 +0100, Lawrence Kirby wrote:

....
I guess you could follow the Java bodginess by creating a floating point
equality test method that is consistent with this method of hashing.

To be fair to Java the property that NaNs always compare unequal means
that normal comparison semantics are not viable for hashing. But this is
also creates another potential problem here.

Lawrence
 
C

Clint Olsen

That needs to be unsigned char *, plain char doesn't have adequate
properties for this sort of copying.

Oh yes, sorry. I did see it was unsigned char *.
You need to be very careful here, hashing floating point values doesn't
usually make a lot of sense because they are approximations to start
with. Hashing is strongly associated with equality and where equality
testing doesn't make sense hashing doesn't either. And in some cases
where equality testing does make sense (e.g. as part of ordering
ordering, e.g. a sort comparison) hashing isn't applicable.

So what are you doing that requires the hashing of a floating point
value?

I'm actually not interested in hashing a floating point value. This is
just my initial value in the unsigned long accumulator in the event that no
initialization is provided to the function. It's just an arbitrary value.
It just loads an unsigned long with as many bits as are applicable. I
don't have to write some elaborate function to seed the accumulator with a
value.

If you notice, the floating point calculation is the golden ratio.
Calculating the integer representation of the golden ratio is otherwise a
somewhat tedious task. Generally only one bit at a time can be calculated.

Unfortunately, there isn't a standard library function that allows you to
extract (as an unsigned long) the mantissa of a floating point value. That
would be better, but I'm willing to settle for something a bit more crude
especially if it can be coded in one line and doesn't cause my hard drive
to be formatted.
If so consider that it is possible for a floating point representation to
have padding bits that don't contribute to the value. It is also possible
to have things like +0 and -0 i.e. different representations of values
that compare equal. IOW two equal values may not hash to the same result.

Yeah, the location of the mantissa and what-not are totally out of my
control as this is platform dependent. But as I said this is not critical.

-Clint
 
M

Me

Clint said:
I can't seem to find it in the C standard but here is text from the C++
standard: "In a union, at most one of the data members can be active at
any time, that is, the value of at most one of the data members can be
stored in a union at any time. [Note: one special guarantee is made in
order to simplify the use of unions: If a POD-union contains several
POD-structs that share a common initial sequence (9.2), and if an object
of this POD-union type contains one of the POD-structs, it is permitted
to inspect the common initial sequence of any of POD-struct members; see
9.2. ]"

The C99 standard has almost exactly the same the initial common sequence
wording but not the active member wording, which is weird because it is
related to it. I'm taking the C++ wording as the correct one, so I would
say the above is undefined but works in practice due to people abusing it
so much.

Don't be afraid to use memcpy. Most compilers implement it as an
intrinsic so it should generate the most efficient code (except one
that actually is guaranteed to work).

I had considered that, but since it was short(er) to implement the bounds
checking for both the unsigned long and double in the for loop directly, I
just copied it manually. Ok, so I was too lazy to write a MIN macro :)
It's just very convenient with the union since it handles any length
differences for me.

I was thinking that perhaps writing and reading from different members
could cause a trap representation for the read value? It certainly
couldn't cause alignment problems as the union guarantees this shouldn't be
a problem.

Ok, it appears that C99 doesn't have the "active member" wording that
the C++ standard does. I did a bit more detailed searching because this
troubles me and there was some discussion to make union member access
apply to the definition of "effective type" (i.e. you don't need
"active member" wording as "effective type" would take care of it),
which makes sense because the examples you can come up for aliasing
violations apply to both cases. I don't know what ever came from that
discussion but I can't find the relevant sections updated in the TC2
draft (most likely because nobody can agree on the correct wording
yet). But I definitely would be siding with the "active member" wording
from the C++ standard on this one because it seems like the right thing
to do and I have a strong feeling that wording will be made to agree in
a future C standard once somebody comes up with an agreed upon way to
express it. For more information about this, have a look at:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_219.htm
http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_236.htm
search for "alias" and continue reading from that point:
http://www.open-std.org/JTC1/SC22/WG14/www/docs/n973.txt
http://www.open-std.org/JTC1/SC22/WG14/www/docs/n987.txt
 
C

Chris Torek

I was wondering if it's considered undefined behavior to use a member of a
union when it wasn't initialized with that member. ...

Yes, if the member you use is not "unsigned char" (or an array thereof).
Example:

typedef unsigned long hval_t;

hval_t hval_init(void)
{
union hval_init_u { double dbl; hval_t hval; };
union hval_init_u phi = { (sqrt(5) + 1) / 2 }; /* golden ratio */

return phi.hval;
}

See the thread titled "aliasing/Torek's strtod" (or something along
those lines). I ran into a problem using a version of gcc (2.x I
think but maybe 3.x) on the i386 with the strtod() function in
BSD/OS (which was a copy of the one written, I think, by David Gay
and available for anyone to use under reasonably flexible limits).
This code used a union similar to the above, except that the second
member was an array of size 2, so as to exactly overlay all eight
bytes of the (IEEE) double. In other words:

union { double x; int y[2]; } var;

The compiler "knew" that since var.x and var.y could not both
be active simultaneously, it was OK to leave var.x in the CPU's
floating-point registers and manipulate var.y in memory. Thus,
assignments to var.x did not affect var.y, and vice versa.

If your C compiler is similarly clever, it might compute the
golden ratio in a floating-point register, not store that to
memory, and then return the not-stored (i.e., junk) value out
of the memory location.
I was wanting to use this as my default bit pattern for a hash function.
It doesn't have to be portable. It just needs to be somewhat random, and
this seemed like a reasonable way.

It may be entirely random (or seem that way), in that hval_init()
may return a different value depending on what is in memory at the
time, instead of returning the same value on each call.

I might suggest the following variant:


hval_t hval_init(void) {
static int beenhere;
static hval_t phi_bits;
#define min(x,y) ((x) < (y) ? (x) : (y))

if (!beenhere) {
union { double d; unsigned char bytes[sizeof(double)]; } u;

/*
* Initialize the "default value" to a bitwise representation
* of phi, the golden ratio, as stored in a double. Use
* however many bits fit in an hval_t (discarding those
* that do not fit; we ignore endianness so we get different
* answers on different machines but consistent answers within
* any given run of this code on one machine).
*
* If hval_t has more bits than occur in a double, the extra
* bits will be all-zeros.
*/
u.d = (sqrt(5.0) + 1.0) / 2.0;
memcpy(&phi_bits, u.bytes, min(sizeof phi_bits, sizeof u.d));
}
return phi_bits;
#undef min
}

This could still potentially copy a trap representation into
phi_bits, but will actually work on real machines, including the
x86.
 
C

Chris Torek

hval_t hval_init(void) {
static int beenhere;
static hval_t phi_bits;
#define min(x,y) ((x) < (y) ? (x) : (y))

if (!beenhere) {
union { double d; unsigned char bytes[sizeof(double)]; } u; [comment snippage]
u.d = (sqrt(5.0) + 1.0) / 2.0;
memcpy(&phi_bits, u.bytes, min(sizeof phi_bits, sizeof u.d));
}
return phi_bits;
#undef min
}

Of course, inside the test for "!beenhere", one should also set
beenhere:

if (!beenhere) {
... as before ...
beenhere = 1;
}

since the point is to avoid re-computing phi() on each call.
 
C

Clint Olsen

union { double x; int y[2]; } var;

The compiler "knew" that since var.x and var.y could not both be
active simultaneously, it was OK to leave var.x in the CPU's
floating-point registers and manipulate var.y in memory. Thus,
assignments to var.x did not affect var.y, and vice versa.


Since a union is defined as an overlapping set of member objects, would you
consider this a conforming compiler? I suppose I could play around with
the optimizer settings and see if gcc tries to do this.
If your C compiler is similarly clever, it might compute the
golden ratio in a floating-point register, not store that to
memory, and then return the not-stored (i.e., junk) value out
of the memory location.

hval_t hval_init(void) {
static int beenhere;
static hval_t phi_bits;
#define min(x,y) ((x) < (y) ? (x) : (y))

if (!beenhere) {
union { double d; unsigned char bytes[sizeof(double)]; } u;

/*
* Initialize the "default value" to a bitwise representation
* of phi, the golden ratio, as stored in a double. Use
* however many bits fit in an hval_t (discarding those
* that do not fit; we ignore endianness so we get different
* answers on different machines but consistent answers within
* any given run of this code on one machine).
*
* If hval_t has more bits than occur in a double, the extra
* bits will be all-zeros.
*/
u.d = (sqrt(5.0) + 1.0) / 2.0;
memcpy(&phi_bits, u.bytes, min(sizeof phi_bits, sizeof u.d));
}
return phi_bits;
#undef min
}

So, if we're going to resort to just using memcpy, why bother using the
union at all?

hval_t phi_bits;
double d = (sqrt(5.0) + 1.0) / 2.0;

memcpy(&phi_bits, &d, min(sizeof(hval_t), sizeof(double)));
This could still potentially copy a trap representation into phi_bits,
but will actually work on real machines, including the x86.

I mentioned something about trap representations and unions in another post
in this thread. I'm curious if I'm interpreting that correctly.

Thanks,

-Clint
 
C

Chris Torek

union { double x; int y[2]; } var;

The compiler "knew" that since var.x and var.y could not both be
active simultaneously, it was OK to leave var.x in the CPU's
floating-point registers and manipulate var.y in memory. Thus,
assignments to var.x did not affect var.y, and vice versa.


Since a union is defined as an overlapping set of member objects,
would you consider this a conforming compiler?

I believe so. The C standard only guarantees what you want (and
in this case, what I wanted, if only to avoid changing the code)
for "unsigned char" union members.
So, if we're going to resort to just using memcpy, why bother using the
union at all?

hval_t phi_bits;
double d = (sqrt(5.0) + 1.0) / 2.0;

memcpy(&phi_bits, &d, min(sizeof(hval_t), sizeof(double)));

Indeed, this should work fine. Stylistically, I would prefer:

memcpy(&phi_bits, &d, min(sizeof phi_bits, sizeof d));

since I prefer to name the objects, rather than their types, when
possible.
I mentioned something about trap representations and unions in another post
in this thread. I'm curious if I'm interpreting that correctly.

I have little to add on trap representations. The Univac 1100 had
them, but did not actually trap on them: some particular signed
bit patterns represented -0, which was only "sort of equal" to +0
and would sometimes give peculiar results for some operations.
Many machines today still have trap representations for floating
point (any signalling NaN). Few if any modern machines have trap
representations for any kind of integer, though.
 
T

Tim Rentsch

Chris Torek said:
union { double x; int y[2]; } var;

The compiler "knew" that since var.x and var.y could not both be
active simultaneously, it was OK to leave var.x in the CPU's
floating-point registers and manipulate var.y in memory. Thus,
assignments to var.x did not affect var.y, and vice versa.


Since a union is defined as an overlapping set of member objects,
would you consider this a conforming compiler?

I believe so. The C standard only guarantees what you want (and
in this case, what I wanted, if only to avoid changing the code)
for "unsigned char" union members.


Apparently this question has caused some confusion, and it's
supposed to be addressed by a TC. See:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm

My reading of the wording in the TC is that storing in one
member then accessing through another should (after the TC
takes effect) get the re-interpreted value that was stored
in the previously-stored-into member.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,763
Messages
2,569,562
Members
45,038
Latest member
OrderProperKetocapsules

Latest Threads

Top