Floating Point and Wide Registers

Walter Roberson · Aug 22, 2006

Keith Thompson said:
Your reading would forbid different rounding methods for float and
double, which I don't think is the intent.

On the contrary, I believe different rounding methods for float
and double is an intentional allowance.

"single precision" floating point formats and rounding properties
were invented multiple times, and it took years for the good
and the bad to get weeded out from the ugly. "double precision"
came along much further in the development of floating point,
when there were much firmer ideas of How It Ought To Work.
Double precision formats and properties were much more
standardized than single precision. It was not uncommon for
manufacturers to preserve their legacy single precision properties
for single precision, but to use different properties for
double precision.

If I recall correctly, IBM and DEC (especially VAX) had multiple
legacy single-precision float formats with properties that differed
noticeably from double-precision properties.

Robert Gamble · Aug 22, 2006

Douglas said:
The implementation definition could be arbitrarily complicated,
specifying variations based on context (for example).

Okay, I am in agreement now.

Robert Gamble

Robert Gamble · Aug 22, 2006

Robert said:

How would you represent 2.0 with a radix of 3 in the floating point
model?

As indicated above, 0.200000e+1. In terms of 5.2.4.2.2:

s = +1
b = 3
e = 1
f[1] = 2, all other f[k] = 0

The value given by the formula in 5.2.4.2.2p2 is then

x = +1 * 3^1 * (2 * 3^-1) == 2.0

You and Richard are, of course, correct. I don't know what I was
thinking, sorry and thanks for the clarification.

Robert Gamble

Robert Gamble · Aug 22, 2006

Douglas said:
I don't think it's guaranteed, even if the declarations were
volatile-qualified (to prevent register caching). However,
it's hard to imagine code in that case that would fail the test.

First off I'd like to thank you and everyone else who has contributed
to this thread; your patience and insights have been valuable and are
appreciated.
I accept the fact that, notwithstanding IEEE compliance, 0.1==0.1 is not
guaranteed and all of the related points that lead to such a
conclusion. What I don't understand though is how the above example
isn't guaranteed, even without the volatile qualifier. In my
understanding the value of 0.1 is stored, either exactly or rounded in
an implementation-defined way, as a double value in d1. How can
additional rounding occur when d2 is then assigned the value of d1? I
really can't think of an allowable scenario where this could be the
case. I understand that:
d1 = 0.1; d2 = 0.1;
may not result in d1 and d2 having values that compare equal but this
is because there is the potential for rounding to occur twice with the
results being different each time. Similar to my original example, I
would think that "d2 = d1 = 0.1;" would also result in values for d1
and d2 that must compare equal. Not only that, but given the apparent
guarantees of 6.3.1.5, I would think that the following is also always
true:
float f1 = 0.1;
double d1 = f1;
f1 == d1;
The Standard seems pretty clear about this, or am I misinterpreting
something here?

Robert Gamble

Dik T. Winter · Aug 22, 2006

>
> Where is that stated?

5.2.4.2.2 where the model is defined. 1.0 is a number in the model.
The actual representation may have numbers in addition to the model
numbers, but the model numbers are required. See also the definition
of FLT_EPSILON.

Dik T. Winter · Aug 22, 2006

>
> Yes, it can. 2 is exactly 0.100000e+2 if the base is 2 (or, if you want
> the exponent expressed in the base as well, 0.100000e+10), and exactly
> 0.200000e+1 if the base is anything larger.

FLT_DIG is required to be at least 6. It is further defined as
floor((p - 1) * log_10 b) (ignoring the case where b is a power of 10),
which means that (p - 1) * log_10 b >= 6, or p >= 1 + 6 / log_10 b.
We need not consider the exponent; that can be large enough.
So the largest integer that is guaranteed to be represented
exactly is:
b ** (1 + 6 / log_10 b) - 1 = b * b ** (6 * log_b 10) - 1 =
b * (b ** log_b 10) ** 6 - 1 = b * 10 ** 6 - 1 = b * 1000000 - 1
(again, ** means exponentiation).
For b = 10 the (different) formulas give a maximum of 999999.
So considering everything, all integers in the range -999999 .. +999999
can be represented exactly, regardless of the base used.

Dik T. Winter · Aug 22, 2006

> No, the choices are "the larger representable value immediately
> adjacent to the nearest representable value" and "the smaller
> representable value immediately adjacent to the nearest representable
> value"; "Larger for right operands" is not a valid choice.

There are actually *three* choices. The nearest representable value,
or one of the two values adjacent to it.

> The
> implementation needs to choose which of these behaviors to use
> and document it.

But the choice need not be consistent with respect to the number or the
context.

Dik T. Winter · Aug 22, 2006

> And just to muddy the waters a bit more, it will actually improve the
> accuracy of calculations (in the long-term average) if 1/2 LSB bit of
> random dither is thrown in.

Try to discuss that with Kahan.

Dik T. Winter · Aug 22, 2006

> For the model defined in 5.2.4.2.2, there do exist values of b, p,
> emin, and emax such that 2.0 isn't exactly representable: if e-min is
> high enough, 2.0 < DBL_MIN; if e-max were low enough, 2.0 >
> DBL_EPSILON*DBL_MAX. But that would require DBL_MIN, and either
> DBL_EPSILON or DBL_MAX, to have values inconsistent with
> 5.2.4.2.2p8-10.
>
> Any implementation where 2.0 was either too large or too small to be
> represented exactly would also be pretty unpopular, but that's a QoI
> issue.

In my opinion such an implementation would be incorrect. For float,
FLT_MIN_10_EXP is *defined* as ceil((e_min - 1) * log_10 b) and should
be at most -37. This gives an upper bound on e_min. In a similar way
we can find a lower bound on e_max. And both are good enough to
guarantee that 2.0 *must* be in the model. Note that the values in
this paragraph are *minimal* in magnitude.

kuyper · Aug 22, 2006

Dik said:
In my opinion such an implementation would be incorrect. For float,
FLT_MIN_10_EXP is *defined* as ceil((e_min - 1) * log_10 b) and should
be at most -37. This gives an upper bound on e_min. In a similar way
we can find a lower bound on e_max. And both are good enough to
guarantee that 2.0 *must* be in the model. Note that the values in
this paragraph are *minimal* in magnitude.

Of course - I already indicated that such an implementation was
non-conforming. My point was that, in addition to being non-conforming,
it would also be extremely unpopular. As a practical matter, that's far
more important.

Jun Woong · Aug 23, 2006

Dik said:
5.2.4.2.2 where the model is defined. 1.0 is a number in the model.
The actual representation may have numbers in addition to the model
numbers, but the model numbers are required. See also the definition
of FLT_EPSILON.

There is a gap between "can be represented exactly in a program" and
"required to be represented exactly in the fp number model." Writing
1.0 in the source code is necessarily involved with converting it to
an internal representation. Since an implementation is free to choose
one of its nearby values, not the exact value even when it can be
represented exactly, no one can assume that 1.0 should be represented
exactly in his/her program in every conforming implementation. It is
true that the fp number model requires 1.0 to belong to the model and
it is very likely to be represented in a practical implementation,
but there is no way to make a fp variable have 1.0 only with the fp
number model the standard provides.

Jun Woong · Aug 23, 2006

Douglas said:
I don't think it's guaranteed, even if the declarations were
volatile-qualified (to prevent register caching). However,
it's hard to imagine code in that case that would fail the test.

Exactly what part of the standard leaves it not guaranteed? It is hard
for me to imagine a case where the equality comparison does not hold.
The given code should differ from

double d1 = 0.1;
double d2 = 0.1;
d1 == d2;

Richard Bos · Aug 23, 2006

Dik T. Winter said:
FLT_DIG is required to be at least 6.

I know. Count the decimals

Richard

Jun Woong · Aug 23, 2006

Robert said:
I think this part still stands. What I really care about is that the
following never aborts for any ordered double values of d1 and d2:

if (d1 < d2) {
    if (d1 < d2)
        ;
    else
        abort();
}

This would allow for sorting arrays of floating point values, is this
guaranteed?

I think so; but I still wonder if somebody can imagine a conforming
implementation where this does not hold. As you know, if you replace
d1 and d2 with fp constants (even when they are the same constant)
the result can differ; this is never intuitive, but what the
standard says.

ena8t8si · Aug 24, 2006

Robert said:
First off I'd like to thank you and everyone else who has contributed
to this thread; your patience and insights have been valuable and are
appreciated.
I accept the fact that, notwithstanding IEEE compliance, 0.1==0.1 is not
guaranteed and all of the related points that lead to such a
conclusion. What I don't understand though is how the above example
isn't guaranteed, even without the volatile qualifier. In my
understanding the value of 0.1 is stored, either exactly or rounded in
an implementation-defined way, as a double value in d1. How can
additional rounding occur when d2 is then assigned the value of d1? I
really can't think of an allowable scenario where this could be the
case. I understand that:
d1 = 0.1; d2 = 0.1;
may not result in d1 and d2 having values that compare equal but this
is because there is the potential for rounding to occur twice with the
results being different each time. Similar to my original example, I
would think that "d2 = d1 = 0.1;" would also result in values for d1
and d2 that must compare equal.
Right.

Not only that, but given the apparent
guarantees of 6.3.1.5, I would think that the following is also always
true:
float f1 = 0.1;
double d1 = f1;
f1 == d1;
The Standard seems pretty clear about this, or am I misinterpreting
something here?

Your analysis is correct. A careful reading of the relevant
sections shows that equality must hold.

Robert Gamble · Aug 24, 2006

Jun said:
There is a gap between "can be represented exactly in a program" and
"required to be represented exactly in the fp number model." Writing
1.0 in the source code is necessarily involved with converting it to
an internal representation. Since an implementation is free to choose
one of its nearby values, not the exact value even when it can be
represented exactly, no one can assume that 1.0 should be represented
exactly in his/her program in every conforming implementation. It is
true that the fp number model requires 1.0 to belong to the model and
it is very likely to be represented in a practical implementation,
but there is no way to make a fp variable have 1.0 only with the fp
number model the standard provides.

6.3.1.4p2 states in part:
"When a value of integer type is converted to a real floating type, if
the value being converted can be represented exactly in the new type,
it is unchanged."

double d = 1; /* d is now exactly 1.0 */
d == 1; /* Always true */
(double) 1 == (double) 1; /* Always true */
d == 1.0; /* Not guaranteed */

Correct?

Robert Gamble

zebedee · Aug 24, 2006

Robert said:
I understand the pitfalls of floating point arithmetic and I understand
what is going on here. On my machine (x86) floating point arithmetic
is performed in 80-bit registers and doubles are 64-bits. In the first
example the compiler is computing the result of the multiplication in
an 80-bit register and comparing the result to the double with less
precision. The result is not unexpected because d3 lost some precision
when it was stored into a 64-bit object but the result of the
multiplication did not undergo this loss. I don't have a problem with
this, it is AFAICT Standard conforming.
The part that is unexpected, to me, is the second part where the result
of the multiplication is explicitly cast to double which, according to
my interpretation of the above-quoted Standard verse, requires that the
result is converted to the narrower type of double before the test for
equality is performed.

The only difference between your two examples is one has an "implicit
conversion" and one has an "explicit conversion".

Given the quoted clause below (emphasis mine), I don't believe you are
justified in believing either can give a result the other can't.

Neil.

6.3 Conversions
1 Several operators convert operand values from one type to another
automatically. This subclause specifies the result required from such an
implicit conversion, *as well as* those that result from a cast
operation (an explicit conversion). The list in 6.3.1.8 summarizes the
conversions performed by most ordinary operators; it is supplemented as
required by the discussion of each operator in 6.5.
2 *Conversion* of an operand value to a compatible type causes no change
to the value or the representation.

Jun Woong · Aug 24, 2006

Robert said:
Jun Woong wrote: [...]

There is a gap between "can be represented exactly in a program" and
"required to be represented exactly in the fp number model." Writing
1.0 in the source code is necessarily involved with converting it to
an internal representation. Since an implementation is free to choose
one of its nearby values, not the exact value even when it can be
represented exactly, no one can assume that 1.0 should be represented
exactly in his/her program in every conforming implementation. It is
true that the fp number model requires 1.0 to belong to the model and
it is very likely to be represented in a practical implementation,
but there is no way to make a fp variable have 1.0 only with the fp
number model the standard provides.

6.3.1.4p2 states in part:
"When a value of integer type is converted to a real floating type, if
the value being converted can be represented exactly in the new type,
it is unchanged."

double d = 1; /* d is now exactly 1.0 */
d == 1; /* Always true */
(double) 1 == (double) 1; /* Always true */
d == 1.0; /* Not guaranteed */

Correct?

Yes, I think so. My point was that there is no guarantee for an
implementation to represent 1.0 exactly even if the standard's fp number
model has 1.0 in it. I meant to exclude 1 (an integer constant) with
the phrase "with the fp number model."

Jun Woong · Aug 24, 2006

Jun said:
Robert said:

Jun Woong wrote: [...]

There is a gap between "can be represented exactly in a program" and
"required to be represented exactly in the fp number model." Writing
1.0 in the source code is necessarily involved with converting it to
an internal representation. Since an implementation is free to choose
one of its nearby values, not the exact value even when it can be
represented exactly, no one can assume that 1.0 should be represented
exactly in his/her program in every conforming implementation. It is
true that the fp number model requires 1.0 to belong to the model and
it is very likely to be represented in a practical implementation,
but there is no way to make a fp variable have 1.0 only with the fp
number model the standard provides.

6.3.1.4p2 states in part:
"When a value of integer type is converted to a real floating type, if
the value being converted can be represented exactly in the new type,
it is unchanged."

double d = 1; /* d is now exactly 1.0 */
d == 1; /* Always true */
(double) 1 == (double) 1; /* Always true */
d == 1.0; /* Not guaranteed */

Correct?

Yes, I think so. My point was that there is no guarantee for an
implementation to represent 1.0 exactly even if the standard's fp number
model has 1.0 in it. I meant to exclude 1 (an integer constant) with
the phrase "with the fp number model."

I think we missed an important fact on the fp number model. My answers
and possibly some of the others' are true only when an implementation
strictly conforms to the fp number model the standard provides. You
might think that if it does not conform to the model then it should be
a non-conforming implementation, but that's not the case. (the
following is not the only evidence, I remember a committee member on
the fp area of the standard confirmed the intent.)

DR025 for C90 says in part:

- Implementations are allowed considerable latitude in the way they
represent floating-point quantities; in particular, as noted in
Footnote 10 on page 14, the implementation need not exactly conform
to the model given in subclause 5.2.4.2.2 for ``normalized floating-
point numbers.''

and also from DR233 for C99:

- If there is no implementation representation of ZERO, but rather a
very small number. In this case, we generally thought that this was
a user problem, that they could not rely on a true ZERO having a
representation, in which case, they would need to place their own
checks for what approximations were acceptable as ZERO and print a
literal instead.

Of course, for zero the answers to DR025 and DR233 seem to conflict,

- (from DR025) There shall be at least one exact representation for
the value zero.

but the committee didn't forget to note in DR025 that:

- they[the principles some of which are quoted above] are not meant
to impose additional constraints on conforming implementations

So even if the fp number model should include 1.0, a conforming
implementation is allowed to have no way to represent it exactly,
so on such an implementation the above test for equality need not
hold.

Jun Woong · Aug 25, 2006

Robert said:

How would you represent 2.0 with a radix of 3 in the floating point
model?

As indicated above, 0.200000e+1. In terms of 5.2.4.2.2:

s = +1
b = 3
e = 1
f[1] = 2, all other f[k] = 0

The value given by the formula in 5.2.4.2.2p2 is then

x = +1 * 3^1 * (2 * 3^-1) == 2.0

For the model defined in 5.2.4.2.2, there do exist values of b, p,
emin, and emax such that 2.0 isn't exactly representable: if e-min is
high enough, 2.0 < DBL_MIN; if e-max were low enough, 2.0 >
DBL_EPSILON*DBL_MAX.

Should it be 2.0 > DBL_MAX / (FLT_RADIX-DBL_EPSILON) to be precise? In
general since we have greater e-max than p, the precision matters when
inspecting whether or not a positive integer can be represented with
the given radix.
