Floating point to integer casting

chad

> Do you mean how can it happen, or when will it ever make
> a difference?  The answer for how it can happen is,
> no matter what the range and precision are for (double)
> (or (long double), for that matter), the implementation
> is allowed to use greater range and precision for the
> results of operations.  So plus could be carried out
> with 1024 bits of precision, say, or with more exponent
> bits to give a greater range (or both).  Extra bits
> may be relevant because floating-point numbers might
> be in different ranges (ie, have different exponents).
>
> As to when will it ever make a difference, for this
> simple example I think it depends on rounding modes.
> Obviously for more complicated expressions, eg
>
>     a = b + c + d + e + f + g;
>
> some extra precision could make a difference due to
> carries when adding some small numbers and some bigger
> ones.  Extra range could also matter when adding
> some positive numbers and some negative ones,
> protecting against overflows in intermediate results.
> I'm sure there must be other examples, and probably
> better ones, but the ones here are just the first
> ones that popped into my head.

I guess I meant to ask how it could happen. I couldn't figure out how
to phrase the question because my grammar isn't that strong. What can
I say? I should probably have paid more attention in my high school
English classes.
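
A minimal sketch of the chained-addition example quoted above (the
values are illustrative assumptions; whether the two sums differ
depends on how much excess precision the implementation uses, which
C99's <float.h> reports as FLT_EVAL_METHOD):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* DBL_EPSILON/4 is too small to survive a single rounded
           64-bit addition to 1.0, but four of them together are not.
           An implementation that carries the chained sum in extended
           precision (FLT_EVAL_METHOD == 2) keeps the small terms;
           one that rounds every operation to double loses them. */
        double big = 1.0, tiny = DBL_EPSILON / 4.0;

        double chained = big + tiny + tiny + tiny + tiny;
        double stepped = big;      /* each assignment rounds to double */
        stepped += tiny; stepped += tiny; stepped += tiny; stepped += tiny;

        printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
        printf("chained == stepped? %s\n", chained == stepped ? "yes" : "no");
        return 0;
    }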
 
Seebs

> Okay, I'm going to take the bait here. How can the plus operation be
> computed with greater precision than double?

"long double". Or possibly an internal representation which doesn't map
onto any native type.

Think, say, of an early 68000-series chip with an FPU. You could
have a "double" type which was 64 bits, and a "long double" type which was
80 bits -- and only have hardware support for the 80-bit type. Solution?
Do absolutely all calculations in 80-bit, then truncate when you had to store
the value.

This led to a possible problem: What if a helpful optimizer didn't bother
truncating the value before using it again? Well, then, you got "wrong"
results -- and yes, in floating point, "more precision than we expected"
can be "wrong".

-s
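
A minimal sketch of that optimizer surprise, assuming an
implementation that re-evaluates b + c in an extended-precision
register (the classic x87 behaviour); the values are illustrative:

    #include <stdio.h>

    int main(void)
    {
        volatile double b = 1.0, c = 1e-17;
        double a = b + c;   /* assignment must round to 64-bit double */

        /* If b + c is recomputed in an 80-bit register, it keeps the
           1e-17 that the stored double 'a' lost, so the comparison can
           fail: "more precision than we expected" being "wrong". */
        if (a == b + c)
            puts("same");
        else
            puts("different (excess precision at work)");
        return 0;
    }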
 
Tim Rentsch

Morris Keesan said:
> (Example)
>
> Some floating point hardware works internally using 80 bits, when
> the precision of double is 64 bits, which can lead to
> inconsistencies when intermediate 80-bit results are written to
> memory as 64 bits and then loaded again, compared with keeping the
> intermediate values in registers.

> I was going to say that the expression b + c has type (double), but after
> looking in the standard for confirmation of this, I'm confused:
>
> 6.3.1.8 Usual arithmetic conversions
>
> "Unless explicitly stated otherwise, the common real type is also
> the corresponding real type of the result"
> [so the result of b + c would have type double -- MK]
Right.

> but I'm confused by paragraph 2 and its footnote, which say
>
> "The values of floating operands and of the results of floating
> expressions may be represented in greater precision and range
> than that required by the type; the types are not changed thereby. 52)"
> and "52) The cast and assignment operators are still required to perform
> their specified conversions as described in 6.3.1.4 and 6.3.1.5."
>
> What's meant by this? If "the types are not changed thereby", does this
> mean that (b + c) has type double, or not? And if the type is not changed,
> what conversion would be necessary to do the assignment to a?

It means, even though the value is represented in greater range and
precision (than (double), for this case), the type is still (double).

The conversion for assignment to 'a' makes 'a = b + c' behave
like 'a = (double)(b + c)'.  I know it seems weird that converting
an expression to the same type as the expression can change its
value, but that's the rule.
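
A minimal sketch of that rule in action, assuming an implementation
with excess precision (FLT_EVAL_METHOD == 2); both operands of the ==
below have type double, yet they can compare unequal:

    #include <stdio.h>

    int main(void)
    {
        volatile double b = 1.0, c = 1e-17;

        /* The cast discards the extra bits of the intermediate result,
           as footnote 52 requires, while the bare b + c may keep them. */
        if ((double)(b + c) == b + c)
            puts("cast changed nothing");
        else
            puts("cast to (double) changed the value");
        return 0;
    }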

> Furthermore, if the result of a floating expression can be "represented
> in greater precision and range" than that required, what does this say
> about sizeof(b + c)? What can we predict about the value of the expression
>
>     sizeof(b + c) == sizeof(double)
>
> in conforming implementations? Can a strictly conforming program rely on
> this having the value 1?

The type of (b+c) is still double, even if the result value is
represented with greater range or precision. The sizeof
comparison you wrote is indeed always 1 (assuming b and c are
doubles).
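
That guarantee can even be checked at compile time; a minimal sketch,
assuming a C11 compiler (static_assert comes from <assert.h> there):

    #include <assert.h>

    double b, c;

    /* sizeof looks only at the type of its operand (6.5.3.4p2), and
       the type of b + c is double no matter how the value happens to
       be represented, so this holds on any conforming implementation. */
    static_assert(sizeof(b + c) == sizeof(double), "b + c has type double");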

> Or is this "greater range and precision" clause merely giving
> implementations permission to represent intermediate results in ways
> that could give different results for more complicated floating
> expressions, e.g. potentially giving different results for
>
>     ((double)(b + c)) - ((double)(e * f))
> vs.
>     (b + c) - (e * f)
>
> where b, c, e, and f are all doubles?

Yes, the point is to give implementations more freedom for
intermediate results, and there is a good chance that these two
expressions will have different values, because casting to
(double) forces any extra range and/or precision of the two
intermediate values (that are operands to '-') to be discarded.
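
A minimal sketch of such a pair; the values are illustrative
assumptions, chosen so that e * f is exact in an 80-bit significand
but rounded in a 64-bit one (hex float constants are C99):

    #include <stdio.h>

    int main(void)
    {
        volatile double b = 1.0, c = 0x1p-29;   /* b + c exact in double */
        volatile double e = 1.0 + 0x1p-30, f = 1.0 + 0x1p-30;
        /* e * f = 1 + 2^-29 + 2^-60; the 2^-60 term fits in an 80-bit
           significand but is rounded away when stored as a double. */

        double with_casts = ((double)(b + c)) - ((double)(e * f));
        double without    = (b + c) - (e * f);

        /* On an FLT_EVAL_METHOD == 2 implementation this can print
           0 and -2^-60; without excess precision, both lines are 0. */
        printf("with casts:    %.17g\n", with_casts);
        printf("without casts: %.17g\n", without);
        return 0;
    }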
 
Phil Carmody

Morris Keesan said:
> [...]
> I was going to say that the expression b + c has type (double), but after
> looking in the standard for confirmation of this, I'm confused:
>
> 6.3.1.8 Usual arithmetic conversions
>
> "Unless explicitly stated otherwise, the common real type is also
> the corresponding real type of the result"
> [so the result of b + c would have type double -- MK]
Agreed.

> but I'm confused by paragraph 2 and its footnote, which say
>
> "The values of floating operands and of the results of floating
> expressions may be represented in greater precision and range
> than that required by the type; the types are not changed
> thereby. 52)"
> and "52) The cast and assignment operators are still required to perform
> their specified conversions as described in 6.3.1.4 and 6.3.1.5."
>
> What's meant by this? If "the types are not changed thereby", does this
> mean that (b + c) has type double, or not? And if the type is not changed,
> what conversion would be necessary to do the assignment to a?

It's of type double, but it may be represented by something with
greater precision than double, until it's stored in an actual double.

> Furthermore, if the result of a floating expression can be "represented
> in greater precision and range" than that required, what does this say
> about sizeof(b + c)? What can we predict about the value of the expression
>
>     sizeof(b + c) == sizeof(double)
>
> in conforming implementations? Can a strictly conforming program rely on
> this having the value 1?

I think the std. requires some clarification on that. The lack of
distinction between expressions and types leaves the above unclear.
If the description had stated that all expressions are first treated
as having their type, then you'd be mapped back to the double type,
and the confusion would disappear.

> Or is this "greater range and precision" clause merely giving
> implementations permission to represent intermediate results in ways
> that could give different results for more complicated floating
> expressions, e.g. potentially giving different results for
>
>     ((double)(b + c)) - ((double)(e * f))
> vs.
>     (b + c) - (e * f)
>
> where b, c, e, and f are all doubles?

I have several archs here where I'd expect to trivially be able to
come up with values for b, c, e, and f which would yield different
values for those two expressions: the x86-based ones, if using the
x87 FPU, because of higher-precision intermediates, and the others
(POWER, ARM) because of fused exact multiply-add instructions.
Anything which exhibits catastrophic cancellation should work.

Phil
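
A minimal sketch of the fused multiply-add case, using an explicit
fma() from C99's <math.h> to stand in for the contraction a compiler
may perform on its own (values are illustrative; link with -lm where
required):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        volatile double a = 1.0 + 0x1p-30, b = 1.0 - 0x1p-30;
        double p = a * b;   /* rounded to double: exactly 1.0 here */

        /* The fused version holds the product a*b exactly, so the
           cancellation against p exposes the product's rounding
           error (-2^-60); the separately rounded version gives 0. */
        printf("two roundings: %.17g\n", a * b - p);
        printf("fused:         %.17g\n", fma(a, b, -p));
        return 0;
    }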
 
Phil Carmody

Seebs said:
"long double". Or possibly an internal representation which doesn't map
onto any native type.

Think, say, the FPU on a 68000-series chip with FPU, early on. You could
have a "double" type which was 64 bits, and a "long double" type which was
80 bits -- and only have hardware support for the 80-bit type. Solution?
Do absolutely all calculations in 80-bit, then truncate when you had to store
the value.

This led to a possible problem: What if a helpful optimizer didn't bother
truncating the value before using it again? Well, then, you got "wrong"
results -- and yes, in floating point, "more precision than we expected"
can be "wrong".

But don't forget that "exactly the precision we expected" can be
"wrong" for even simpler reasons. That's why, if you want numeric
work done, you get someone skilled in the field, who can manage the
various wrongs appropriately.

Phil
 
Tim Rentsch

Phil Carmody said:
>> I was going to say that the expression b + c has type (double), but after
>> looking in the standard for confirmation of this, I'm confused:
>>
>> 6.3.1.8 Usual arithmetic conversions
>>
>> "Unless explicitly stated otherwise, the common real type is also
>> the corresponding real type of the result"
>> [so the result of b + c would have type double -- MK]
>
> Agreed.
>
>> but I'm confused by paragraph 2 and its footnote, which say
>>
>> "The values of floating operands and of the results of floating
>> expressions may be represented in greater precision and range
>> than that required by the type; the types are not changed
>> thereby. 52)"
>> and "52) The cast and assignment operators are still required to perform
>> their specified conversions as described in 6.3.1.4 and 6.3.1.5."
>>
>> What's meant by this? If "the types are not changed thereby", does this
>> mean that (b + c) has type double, or not? And if the type is not changed,
>> what conversion would be necessary to do the assignment to a?
>
> It's of type double, but it may be represented by something with
> greater precision than double, until it's stored in an actual double.
>
>> Furthermore, if the result of a floating expression can be "represented
>> in greater precision and range" than that required, what does this say
>> about sizeof(b + c)? What can we predict about the value of the expression
>>
>>     sizeof(b + c) == sizeof(double)
>>
>> in conforming implementations? Can a strictly conforming program rely on
>> this having the value 1?
>
> I think the std. requires some clarification on that. The lack of
> distinction between expressions and types leaves the above unclear.
> If the description had stated that all expressions are first treated
> as having their type, then you'd be mapped back to the double type,
> and the confusion would disappear.

The type of the expression (b+c) is double, as already noted:
"the types are not changed thereby." The sizeof operator works
on types: "The size is determined from the type of the operand."
(6.5.3.4p2). The Standard doesn't leave any wiggle room: the
two types are the same so their sizes are the same; the result
is well-defined and must be equal to 1.
 
Dik T. Winter

> I have several archs here where I'd expect to trivially be able to
> come up with values for b, c, e, and f which would yield different
> values for those two expressions: the x86-based ones, if using the
> x87 FPU, because of higher-precision intermediates, and the others
> (POWER, ARM) because of fused exact multiply-add instructions.
> Anything which exhibits catastrophic cancellation should work.

Note also that with a fused multiply-add the expression
a * b + c * d == c * d + a * b
is not necessarily true.
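
A minimal sketch of how that can happen, using explicit fma() calls to
simulate the two ways a contracting compiler might compile the
expression (values are illustrative assumptions):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        volatile double a = 1.0 + 0x1p-30, b = 1.0 - 0x1p-30;
        volatile double c = -(1.0 + 0x1p-29), d = 1.0 - 0x1p-29;
        /* a*b = 1 - 2^-60 and c*d = -1 + 2^-58, both inexact in
           double: each product rounds to +/-1.0 when stored alone. */

        double left  = fma(a, b, c * d);  /* a*b exact, c*d rounded */
        double right = fma(c, d, a * b);  /* c*d exact, a*b rounded */

        /* left is -2^-60, right is 2^-58: which product the compiler
           chooses to fuse decides the value of a*b + c*d. */
        printf("left  = %.17g\n", left);
        printf("right = %.17g\n", right);
        printf("equal? %s\n", left == right ? "yes" : "no");
        return 0;
    }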
 
Phil Carmody

Tim Rentsch said:
> The type of the expression (b+c) is double, as already noted:
> "the types are not changed thereby." The sizeof operator works
> on types: "The size is determined from the type of the operand."
> (6.5.3.4p2). The Standard doesn't leave any wiggle room: the
> two types are the same so their sizes are the same; the result
> is well-defined and must be equal to 1.

Yup, the standard's off the hook; that clause is unambiguous.

Phil
 
Phil Carmody

Dik T. Winter said:
> Note also that with a fused multiply-add the expression
> a * b + c * d == c * d + a * b
> is not necessarily true.

I can imagine that
a * b + c * d == a * b + c * d
is not necessarily true either.

I've seen Apple's (POWER) gcc get confused over register allocation
before, and wouldn't put it past it to change the order of evaluation
of sub-expressions if it got confused.

Phil
 
