Floating Point and Wide Registers

Dik T. Winter

> Should it be 2.0 > DBL_MAX / (FLT_RADIX-DBL_EPSILON) to be precise? In
> general since we have greater e-max than p, the precision matters when
> inspecting whether or not a positive integer can be represented with
> the given radix.

If e_max >= p, the largest positive integer that is representable is
DBL_MAX. If e_max < p, it is b ** e_max - 1.

To determine whether an arbitrary integer can be represented takes quite
a bit of work. But this is all moot. The largest positive integer
such that all integers from 0 to that integer can be represented is:
b ** p
or FLT_RADIX ** xxx_MANT_DIG when e_max >= p, else it is xxx_MAX.
If e_max >= p and assuming p > 0, the first formula gives at least 2.
So if 2.0 is not representable, we have e_max < p, and the additional
requirement that 2.0 > xxx_MAX.

I ignored e-min in the analysis. But if e_min > p you will find some
pretty strange things. I do not find it in the standard, but apparently
that is allowed. And if that is the case, even 1.0 is not representable.
In fact, if e_min > p, the smallest positive integer that is
representable is b ** (e_min - p).
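
The bound above can be checked on a given implementation with a short sketch. The function names `contiguous_limit` and `step_after` are made up for illustration, and the code assumes e_max > p (true for IEEE 754 doubles) so that every intermediate power of the radix is exact.

```c
#include <float.h>

/* b ** p for double (FLT_RADIX ** DBL_MANT_DIG), built by repeated
   multiplication.  Assumes e_max > p, so each product is exact. */
double contiguous_limit(void)
{
    double limit = 1.0;
    for (int i = 0; i < DBL_MANT_DIG; i++)
        limit *= FLT_RADIX;
    return limit;
}

/* Distance from n to the double that results from computing n + 1.
   The volatile store strips any extra precision a wide register
   might carry, so the result reflects a genuine double. */
double step_after(double n)
{
    volatile double next = n + 1.0;
    return next - n;
}
```

If the analysis holds, `step_after(contiguous_limit() - 1.0)` should be exactly 1.0, while `step_after(contiguous_limit())` should be something other than 1.0, because b ** p + 1 needs p + 1 digits.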
 
ena8t8si

Robert said:
> 6.3.1.4p2 states in part:
> "When a value of integer type is converted to a real floating type, if
> the value being converted can be represented exactly in the new type,
> it is unchanged."
>
> double d = 1; /* d is now exactly 1.0 */
> d == 1; /* Always true */
> (double) 1 == (double) 1; /* Always true */
> d == 1.0; /* Not guaranteed */
>
> Correct?

Yes.

However, for the last one there is a weaker guarantee:

double a, b, c, d;
a = 1.0, b = 1.0, c = 1.0, d = 1.0;
a==b || a==c || a==d || b==c || b==d || c==d; /* guaranteed */
if (a<b && b<c) {
b==1 && c-b==DBL_EPSILON; /* guaranteed when a<b && b<c */
}
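
A compilable version of the checks above, as a rough sketch. The function names are hypothetical. The reasoning behind the weaker guarantee, as I read 6.4.4.2p3, is that a floating constant may convert to the nearest representable value or to either of its immediate neighbors, so four conversions of 1.0 can yield at most three distinct values and at least two of them must agree.

```c
#include <float.h>

/* Integer-to-double conversion of an exactly representable value
   must leave the value unchanged (6.3.1.4p2). */
int exact_int_conversion(void)
{
    double d = 1;
    return d == 1 && (double)1 == (double)1;
}

/* Four conversions of the constant 1.0, at most three possible
   results: at least one pair compares equal. */
int any_pair_equal(void)
{
    double a = 1.0, b = 1.0, c = 1.0, d = 1.0;
    return a == b || a == c || a == d || b == c || b == d || c == d;
}
```

On an implementation that converts 1.0 exactly, every pair is equal and the a<b && b<c branch is never taken.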
 
ena8t8si

Dik said:
> If e_max >= p, the largest positive integer that is representable is
> DBL_MAX. If e_max < p, it is b ** e_max - 1.
>
> To determine whether an arbitrary integer can be represented takes quite
> a bit of work. But this is all moot. The largest positive integer
> such that all integers from 0 to that integer can be represented is:
> b ** p
> or FLT_RADIX ** xxx_MANT_DIG when e_max >= p, else it is xxx_MAX.
> If e_max >= p and assuming p > 0, the first formula gives at least 2.
> So if 2.0 is not representable, we have e_max < p, and the additional
> requirement that 2.0 > xxx_MAX.
>
> I ignored e-min in the analysis. But if e_min > p you will find some
> pretty strange things. I do not find it in the standard, but apparently
> that is allowed. And if that is the case, even 1.0 is not representable.
> In fact, if e_min > p, the smallest positive integer that is
> representable is b ** (e_min - p).

The restrictions in 5.2.4.2.2 #10 imply that e_min <= 0.
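
The step from that paragraph to e_min <= 0 can be spelled out: the smallest positive normalized value in the standard's model is b ** (e_min - 1), and xxx_MIN is required to be at most 1E-37, which is less than 1, so e_min - 1 must be negative. A minimal sketch that checks both facts for double (the function name is hypothetical):

```c
#include <float.h>

/* Returns 1 when the implementation's double satisfies the 1E-37
   upper bound on DBL_MIN and, as a consequence, has e_min <= 0. */
int min_exp_is_nonpositive(void)
{
    return DBL_MIN <= 1E-37 && DBL_MIN_EXP <= 0;
}
```

With e_min <= 0, the case e_min > p cannot occur for any p > 0, so 1.0 is always representable.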
 
Jun Woong

Dik said:
> If e_max >= p, the largest positive integer that is representable is
> DBL_MAX. If e_max < p, it is b ** e_max - 1.

I was talking about an integer s.t. integers from 1 to that integer
(inclusive) can be represented exactly with given b, e_max, e_min and
p, which the following deals with.

> To determine whether an arbitrary integer can be represented takes quite
> a bit of work. But this is all moot. The largest positive integer
> such that all integers from 0 to that integer can be represented is:
> b ** p
> or FLT_RADIX ** xxx_MANT_DIG when e_max >= p,

n should be (b ** p) - 1 if considering integers from 1 to n
inclusive.

> else it is xxx_MAX.

However, if e_max < p, then xxx_MAX == (1 - b**(-p)) * b**e_max is not
an integer. It should be (b ** e_max) - 1 or
xxx_MAX / (1 - xxx_EPSILON/FLT_RADIX) - 1
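
The model formula for xxx_MAX can be compared against the macro with a small sketch. The name `model_max` is made up, and the code relies on xxx_EPSILON being b ** (1 - p), so that DBL_EPSILON / FLT_RADIX is b ** (-p). Typical implementations have e_max > p, so this only exercises the formula itself, not the e_max < p case discussed here.

```c
#include <float.h>

/* (1 - b**(-p)) * b**e_max, built without overflow: start from the
   p-digit fraction just below 1 and scale it up one radix step at a
   time.  Every intermediate value stays below b**e_max. */
double model_max(void)
{
    double value = 1.0 - DBL_EPSILON / FLT_RADIX;   /* 1 - b**(-p) */
    for (int i = 0; i < DBL_MAX_EXP; i++)
        value *= FLT_RADIX;
    return value;
}
```

On an implementation that follows the standard's model, `model_max()` should compare equal to DBL_MAX.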

The factor (FLT_RADIX-DBL_EPSILON)**(-1) in my previous post came from
my mistake made in handling the floor function.

Thanks.
 
Jun Woong

> However, for the last one there is a weaker guarantee:
>
> double a, b, c, d;
> a = 1.0, b = 1.0, c = 1.0, d = 1.0;
> a==b || a==c || a==d || b==c || b==d || c==d; /* guaranteed */

Correct whether or not the implementation's fp number model follows
the standard's.

> if (a<b && b<c) {
> b==1 && c-b==DBL_EPSILON; /* guaranteed when a<b && b<c */
> }

I think two more assumptions are necessary here, about:
- the accuracy of the subtraction operation
- the implementation's following the standard's fp number model

Allowing for implementations that do not follow the standard's fp
number model makes many things in this area vague. I doubt that
consideration of such poor implementations (if any) is still
necessary.
 
Dik T. Winter

> Dik T. Winter wrote:
>
>
> n should be (b ** p) - 1 if considering integers from 1 to n
> inclusive.

Why? b ** p - 1 is representable, as is b ** p. So, I think that
b ** p should be included.
 
Jun Woong

Dik T. Winter wrote:
[...]
> Why? b ** p - 1 is representable, as is b ** p. So, I think that
> b ** p should be included.

Oops, you are right. I missed e_max >= p, so b ** p is also
representable.
 
ena8t8si

Jun said:
> Correct whether or not the implementation's fp number model follows
> the standard's.
>
> I think two more assumptions are necessary here, about:
> - the accuracy of the subtraction operation
> - the implementation's following the standard's fp number model

1. The assumption that the implementation follows the standard's
fp number model isn't necessary. My comment does assume that 1
is exactly representable, but beyond that any number model will
work.

2. The accuracy of subtraction could indeed be arbitrarily
bad. However, DBL_EPSILON is defined as the difference
between 1 and the smallest double value greater than 1.
If that difference is meant as the result the implementation
produces when it does the subtraction, then the second half of
the guarantee is true by definition.

> Allowing for implementations that do not follow the standard's fp
> number model makes many things in this area vague. I doubt that
> consideration of such poor implementations (if any) is still
> necessary.

And the assumptions here are weaker even than that, only that 1
is exactly representable, the difference between 1 and 1+DBL_EPSILON
is exactly representable, and if the result of a subtraction is
exactly representable then the subtraction yields that value.
 
Jun Woong

> Jun Woong wrote: [...]
> > I think two more assumptions are necessary here, about:
> > - the accuracy of the subtraction operation
> > - the implementation's following the standard's fp number model
> [...]
> And the assumptions here are weaker even than that, only that 1
> is exactly representable, the difference between 1 and 1+DBL_EPSILON
> is exactly representable, and if the result of a subtraction is
> exactly representable then the subtraction yields that value.

Yes.

I didn't mean to state the precise assumptions needed to make your
argument true; what I listed are sufficient conditions for it.

One thing to note is that the fp number model and the accuracy of the
subtraction operation are separate; that is, the definition of
*_EPSILON does not restrict the result of x - 1 to be *_EPSILON where
x indicates 1's succeeding number on the representable fp number line.
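
Whether a particular implementation's subtraction does reproduce *_EPSILON can be tested directly. The following is only a sketch: the function names are hypothetical, and the search loop assumes round-to-nearest and that 1 is exactly representable. The volatile stores are there because the standard requires assignment to discard extra range and precision, which matters when intermediate results live in wide registers.

```c
#include <float.h>

/* Smallest power of FLT_RADIX that still changes 1.0 when added to it. */
double measured_epsilon(void)
{
    double eps = 1.0;
    for (;;) {
        /* Store the sum in a double object so the comparison does not
           see extra precision held in a wider register. */
        volatile double sum = 1.0 + eps / FLT_RADIX;
        if (sum <= 1.0)
            break;
        eps /= FLT_RADIX;
    }
    return eps;
}

/* x - 1, where x is the successor of 1 found by the search above,
   as computed by the implementation's own subtraction. */
double successor_gap(void)
{
    volatile double one = 1.0;
    volatile double next = one + measured_epsilon();
    return next - one;
}
```

If both functions return DBL_EPSILON, the implementation satisfies the weaker assumptions listed above; a mismatch in `successor_gap()` alone would point at inexact subtraction, not at the number model.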
 
