Floating Point and Wide Registers

Dik T. Winter · Aug 26, 2006

> Should it be 2.0 > DBL_MAX / (FLT_RADIX-DBL_EPSILON) to be precise? In
> general since we have greater e-max than p, the precision matters when
> inspecting whether or not a positive integer can be represented with
> the given radix.

If e_max >= p, the largest positive integer that is representable is
DBL_MAX. If e_max < p, it is b ** e_max - 1.

To determine whether an arbitrary integer can be represented takes quite
a bit of work. But this is all moot. The largest positive integer
such that all integers from 0 to that integer can be represented is:
b ** p
or FLT_RADIX ** xxx_MANT_DIG when e_max >= p, else it is xxx_MAX.
If e_max >= p and assuming p > 0, the first formula gives at least 2.
So if 2.0 is not representable, we have e_max < p, and the additional
requirement that 2.0 > xxx_MAX.

I ignored e_min in the analysis. But if e_min > p you will find some
pretty strange things. I do not find it in the standard, but apparently
that is allowed. And if that is the case, even 1.0 is not representable.
In fact, if e_min > p, the smallest positive integer that is
representable is b ** (e_min - p).
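The b ** p bound above can be probed with a small C sketch. The helper names contiguous_integer_limit and plus_one_is_distinct are made up for illustration, and the sketch assumes the default round-to-nearest mode. It also checks for e_max > p rather than e_max >= p, because in the model of 5.2.4.2.2 the value b ** p is 0.1 (radix b) times b ** (p + 1) and so needs exponent p + 1.

```c
#include <assert.h>
#include <float.h>

/* FLT_RADIX ** DBL_MANT_DIG, i.e. b ** p for double.
   Repeated multiplication by the radix is exact while the
   result stays in range, so no rounding happens here. */
static double contiguous_integer_limit(void)
{
    /* b ** p needs exponent p + 1 in the standard's model. */
    assert(DBL_MAX_EXP > DBL_MANT_DIG);
    double r = 1.0;
    for (int i = 0; i < DBL_MANT_DIG; i++)
        r *= FLT_RADIX;
    return r;
}

/* Returns 1 if x + 1.0, rounded to double, differs from x.
   The volatile store forces the sum to be rounded to double
   even when the implementation evaluates in a wider register. */
static int plus_one_is_distinct(double x)
{
    volatile double y = x + 1.0;
    return y != x;
}
```

Under those assumptions, plus_one_is_distinct(contiguous_integer_limit() - 1.0) should return 1 and plus_one_is_distinct(contiguous_integer_limit()) should return 0, since b ** p + 1 needs p + 1 digits.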

ena8t8si · Aug 26, 2006

Robert said:
> 6.3.1.4p2 states in part:
> "When a value of integer type is converted to a real floating type, if
> the value being converted can be represented exactly in the new type,
> it is unchanged."
>
> double d = 1; /* d is now exactly 1.0 */
> d == 1; /* Always true */
> (double) 1 == (double) 1; /* Always true */
> d == 1.0; /* Not guaranteed */
>
> Correct?

Yes.

However, for the last one there is a weaker guarantee:

double a, b, c, d;
a = 1.0, b = 1.0, c = 1.0, d = 1.0;
a==b || a==c || a==d || b==c || b==d || c==d; /* guaranteed */
if (a<b && b<c) {
b==1 && c-b==DBL_EPSILON; /* guaranteed when a<b && b<c */
}
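As I read 6.4.4.2p3, a decimal constant such as 1.0 may convert to the nearest representable value or to either of its two immediate neighbours, which gives three candidates for four variables. The following sketch builds those three candidates, assuming the standard's number model and exact subtraction for exactly representable results. The names neighbours_of_one and gap_above_one are made up for illustration.

```c
#include <assert.h>
#include <float.h>

/* The three doubles a constant written as 1.0 could become. */
struct neighbours {
    double below;   /* 1 - b ** (-p): largest double below 1    */
    double one;     /* exactly 1                                */
    double above;   /* 1 + DBL_EPSILON: smallest double above 1 */
};

static struct neighbours neighbours_of_one(void)
{
    /* Integer 1 converts to double exactly (6.3.1.4p2), so no
       floating constant whose value is in question is used here.
       volatile forces each result to be rounded to double. */
    volatile double below = 1 - DBL_EPSILON / FLT_RADIX;
    volatile double above = 1 + DBL_EPSILON;
    struct neighbours n = { below, 1, above };
    assert(n.below < n.one && n.one < n.above);
    return n;
}

/* c - b from the post, with b == 1 and c the next double up. */
static double gap_above_one(void)
{
    struct neighbours n = neighbours_of_one();
    volatile double gap = n.above - n.one;
    return gap;
}
```

If a < b < c holds for three of the variables, they must be these three values in order, so b is 1 and gap_above_one() is the c - b of the post.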

ena8t8si · Aug 26, 2006

Dik said:
> If e_max >= p, the largest positive integer that is representable is
> DBL_MAX. If e_max < p, it is b ** e_max - 1.
>
> To determine whether an arbitrary integer can be represented takes quite
> a bit of work. But this is all moot. The largest positive integer
> such that all integers from 0 to that integer can be represented is:
> b ** p
> or FLT_RADIX ** xxx_MANT_DIG when e_max >= p, else it is xxx_MAX.
> If e_max >= p and assuming p > 0, the first formula gives at least 2.
> So if 2.0 is not representable, we have e_max < p, and the additional
> requirement that 2.0 > xxx_MAX.
>
> I ignored e_min in the analysis. But if e_min > p you will find some
> pretty strange things. I do not find it in the standard, but apparently
> that is allowed. And if that is the case, even 1.0 is not representable.
> In fact, if e_min > p, the smallest positive integer that is
> representable is b ** (e_min - p).

The restrictions in 5.2.4.2.2 #10 imply that e_min <= 0.
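The reasoning, as far as I can reconstruct it: that paragraph requires xxx_MIN to be at most 1E-37, and in the standard's model xxx_MIN is b ** (e_min - 1), so e_min - 1 must be negative. A minimal sketch that checks this on the implementation at hand follows. The function name double_min_exponent is made up for illustration, and DBL_MIN_EXP is the standard's macro for e_min.

```c
#include <assert.h>
#include <float.h>

/* Returns e_min for double after checking the two facts the
   argument relies on. */
static int double_min_exponent(void)
{
    /* Required by the standard: DBL_MIN is at most 1E-37. */
    assert(DBL_MIN <= 1E-37);
    /* Consequence: b ** (e_min - 1) < 1, hence e_min <= 0. */
    assert(DBL_MIN_EXP <= 0);
    return DBL_MIN_EXP;
}
```

With e_min <= 0, the value 1 = 0.1 (radix b) times b ** 1 has an exponent inside the allowed range whenever e_max >= 1.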

Dik T. Winter · Aug 27, 2006

> Dik T. Winter wrote: ....
>
> The restrictions in 5.2.4.2.2 #10 imply that e_min <= 0.

Indeed, I did not look far enough.

ena8t8si · Aug 27, 2006

Dik said:
> Indeed, I did not look far enough.

Well, it's easy to overlook. The first time
I read through it I reached the same
conclusion you did.

Jun Woong · Aug 28, 2006

Dik said:
> If e_max >= p, the largest positive integer that is representable is
> DBL_MAX. If e_max < p, it is b ** e_max - 1.

I was talking about an integer n such that all integers from 1 to n
(inclusive) can be represented exactly with the given b, e_max, e_min and
p, which is what the following deals with.

> To determine whether an arbitrary integer can be represented takes quite
> a bit of work. But this is all moot. The largest positive integer
> such that all integers from 0 to that integer can be represented is:
> b ** p
> or FLT_RADIX ** xxx_MANT_DIG when e_max >= p,

n should be (b ** p) - 1 if considering integers from 1 to n
inclusive.

> else it is xxx_MAX.

However, if e_max < p, then xxx_MAX == (1 - b**(-p)) * b**e_max is not
an integer. It should be (b ** e_max) - 1, or
xxx_MAX / (1 - xxx_EPSILON/FLT_RADIX) - 1.

The factor (FLT_RADIX-DBL_EPSILON)**(-1) in my previous post came from
a mistake I made in handling the floor function.

Thanks.
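The identity xxx_MAX / (1 - xxx_EPSILON/FLT_RADIX) == b ** e_max can be checked for float by doing the arithmetic in double. This is only a sketch: it assumes double has at least float's precision and a strictly larger exponent range (true for IEEE 754 formats, not guaranteed by the standard), and the helper names are made up for illustration. It checks the identity only. On typical implementations float has e_max > p, so the e_max < p case itself does not arise there.

```c
#include <assert.h>
#include <float.h>

/* b ** n computed exactly by repeated multiplication by the radix. */
static double radix_power(int n)
{
    double r = 1.0;
    for (int i = 0; i < n; i++)
        r *= FLT_RADIX;
    return r;
}

/* FLT_MAX / (1 - FLT_EPSILON/FLT_RADIX), evaluated in double. */
static double float_max_rescaled(void)
{
    /* The check is only meaningful when double is wider than float. */
    assert(DBL_MANT_DIG >= FLT_MANT_DIG && DBL_MAX_EXP > FLT_MAX_EXP);
    /* FLT_EPSILON / FLT_RADIX is b ** (-p) for float. */
    volatile double q =
        (double)FLT_MAX / (1.0 - (double)FLT_EPSILON / FLT_RADIX);
    return q;
}
```

Under the stated assumptions, float_max_rescaled() should compare equal to radix_power(FLT_MAX_EXP), and subtracting 1 from that value gives the (b ** e_max) - 1 of the post.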

Jun Woong · Aug 28, 2006

ena8t8si said:
> However, for the last one there is a weaker guarantee:
>
> double a, b, c, d;
> a = 1.0, b = 1.0, c = 1.0, d = 1.0;
> a==b || a==c || a==d || b==c || b==d || c==d; /* guaranteed */

Correct, whether or not the implementation's fp number model follows
the standard's.

> if (a<b && b<c) {
> b==1 && c-b==DBL_EPSILON; /* guaranteed when a<b && b<c */
> }

I think two more assumptions are necessary here:
- the accuracy of the subtraction operation
- the implementation following the standard's fp number model

Allowing for implementations that do not follow the standard's fp
number model makes many things vague in this area. I doubt that
consideration for such poor implementations (if any) is still
necessary.

Dik T. Winter · Aug 28, 2006

> Dik T. Winter wrote:
>
>
> n should be (b ** p) - 1 if considering integers from 1 to n
> inclusive.

Why? b ** p - 1 is representable, as is b ** p. So, I think that
b ** p should be included.

Jun Woong · Aug 29, 2006

Dik T. Winter wrote:
> [...]
>
> Why? b ** p - 1 is representable, as is b ** p. So, I think that
> b ** p should be included.

Oops, you are right. I missed e_max >= p, so b ** p is also
representable.

ena8t8si · Sep 3, 2006

Jun said:
> Correct, whether or not the implementation's fp number model follows
> the standard's.
>
> I think two more assumptions are necessary here:
> - the accuracy of the subtraction operation
> - the implementation following the standard's fp number model

1. The assumption that the implementation follows the standard's
fp number model isn't necessary. My comment does assume that 1
is exactly representable, but beyond that any number model will
work.

2. The accuracy of subtraction could indeed be arbitrarily
bad. However, DBL_EPSILON is defined as the difference
between 1 and the smallest double value greater than 1.
If that is meant as the result the implementation gets when it
does the subtraction, then the second half of the guarantee is
true by definition.

> Allowing for implementations that do not follow the standard's fp
> number model makes many things vague in this area. I doubt that
> consideration for such poor implementations (if any) is still
> necessary.

And the assumptions here are even weaker than that: only that 1
is exactly representable, that the difference between 1 and
1+DBL_EPSILON is exactly representable, and that if the result of a
subtraction is exactly representable then the subtraction yields
that value.

Jun Woong · Sep 4, 2006

> Jun Woong wrote: [...]
>
> > I think two more assumptions are necessary here:
> > - the accuracy of the subtraction operation
> > - the implementation following the standard's fp number model
>
> [...]
> And the assumptions here are even weaker than that: only that 1
> is exactly representable, that the difference between 1 and
> 1+DBL_EPSILON is exactly representable, and that if the result of a
> subtraction is exactly representable then the subtraction yields
> that value.

Yes.

I didn't mean to give the precise assumptions needed to make your
argument true; what I listed are sufficient conditions for it.

One thing to note is that the fp number model and the accuracy of the
subtraction operation are separate; that is, the definition of
*_EPSILON does not require the result of x - 1 to be *_EPSILON, where
x is the successor of 1 on the representable fp number line.
