Can a double always represent an int exactly?

Fred Ma

I'm using the expression "int a = ceil( SomeDouble )".
The man page says that ceil returns the smallest
integer that is not less than SomeDouble, represented
as a double. However, my understanding is that a
double has nonuniform precision throughout its value
range. Will a double always be able to exactly
represent any value of type int? Could someone please
point me to an explanation of how this is ensured,
given that the details of a type's realization vary
with the platform?

Thanks.

Fred

P.S. I am not worried about overflowing the int value
range, just about the guaranteed precise representation
of int by double.
 
Rouben Rostamian

Fred Ma said:
I'm using the expression "int a = ceil( SomeDouble )".
The man page says that ceil returns the smallest
integer that is not less than SomeDouble, represented
as a double. However, my understanding is that a
double has nonuniform precision throughout its value
range. Will a double always be able to exactly
represent any value of type int? Could someone please
point me to an explanation of how this is ensured,
given that the details of a type's realization vary
with the platform?

I don't know whether the C Standard specifies anything to
this effect. But here is an implementation-specific
observation.

On a machine with 64-bit doubles which follow the IEEE
specification, the significand is 53 bits (52 stored plus one hidden
bit), therefore integers as large as 2 to the 53rd power are exactly
representable. In particular, if the machine has 32-bit ints, they
are all exactly representable as doubles.

On my machine, which has 32-bit ints and 64-bit doubles,
the following yields the exact answer:

printf("%30.15f\n", 1.0 + pow(2.0, 52.));

However the following stretches it too far and the answer
is inexact:

printf("%30.15f\n", 1.0 + pow(2.0, 53.));
 
Fred Ma

Rouben said:
I don't know whether the C Standard specifies anything to
this effect. But here is an implementation-specific
observation.

On a machine with 64-bit doubles which follow the IEEE
specification, the significand is 53 bits (52 stored plus one hidden
bit), therefore integers as large as 2 to the 53rd power are exactly
representable. In particular, if the machine has 32-bit ints, they
are all exactly representable as doubles.

On my machine, which has 32-bit ints and 64-bit doubles,
the following yields the exact answer:

printf("%30.15f\n", 1.0 + pow(2.0, 52.));

However the following stretches it too far and the answer
is inexact:

printf("%30.15f\n", 1.0 + pow(2.0, 53.));

I realize that if a double actually uses twice as many bits as
ints, the mantissa should be big enough that imprecision should
never arise. I'm just concerned about whether this can be relied
upon. My faith in what seems normal has been shaken after finding
that long has the same number of bits as int in some environments.
What if double has the same number of bits as ints in some
environments? Some of those bits will be taken up by the
exponent, and the mantissa will actually have fewer bits than an
int. Hence, it will be less precise than ints within the value
range of ints.

Fred
 
Chris Torek

Fred Ma said:
I'm using the expression "int a = ceil( SomeDouble )". The man
page says that ceil returns the smallest integer that is not less
than SomeDouble, represented as a double. However, my understanding
is that a double has nonuniform precision throughout its value range.

This is correct (well, I can imagine a weird implementation that
deliberately makes "double"s have constant precision by often
wasting a lot of space; it seems quite unlikely though).

Note that ceil() returns a double, not an int.
Will a double always be able to exactly represent any value of
type int?

This is implementation-dependent. If "double" is not very precise
but INT_MAX is very large, it is possible that not all "int"s can
be represented. This is one reason ceil() returns a double (though
a small one at best -- the main reason is so that ceil(1.6e35) can
still be 1.6e35, for instance).
Could someone please point me to an explanation of how this is ensured,
given that the details of a type's realization vary with the platform?

I am not sure what you mean by "this", especially with the PS:
P.S. I am not worried about overflowing the int value
range, just about the guaranteed precise representation
of int by double.

.... but let me suppose you are thinking of a case that actually occurs
if we substitute "float" for "double" on most of today's implementations.
Here, we get "interesting" effects near 8388608.0 and 16777216.0.
Values below 16777216.0 step by ones: 8388608.0 is followed
immediately by 8388609.0, for instance, and 16777215.0 is followed
immediately by 16777216.0. On the other hand, below (float)(1<<23)
or above (float)(1<<24), we step by 1/2 or 2 respectively. Using
nextafterf() (if you have it) and variables set to the right values,
you might printf() some results and find:

nextafterf(8388608.0, -inf) = 8388607.5
nextafterf(16777216.0, +inf) = 16777216.2

So all ceil() has to do with values that are at least 8388608.0
(in magnitude) is return those values -- they are already integers.
It is only values *below* this area that can have fractional
parts.

Of course, when we use actual "double"s on today's real (IEEE style)
implementations, the tricky point is not 2^23 but rather 2^52. The
same principle applies, though: values that meet or exceed some magic
constant (in either positive or negative direction) are always
integral, because they have multiplied away all their fraction bits
by their corresponding power of two. Since 2^23 + 2^22 + ... + 2^0
is a sum of integers, it must itself be an integer. Only if the
final terms of the sum involve negative powers of two can it contain
fractions.
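If your library has the C99 nextafterf() function, a short sketch
(assuming IEEE 754 32-bit floats) can show these step sizes directly:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* neighbours of 2^23 and 2^24; the steps are 1/2, 1, and 2 */
    printf("%.1f\n", nextafterf(8388608.0f, -INFINITY));  /* 8388607.5 */
    printf("%.1f\n", nextafterf(8388608.0f, INFINITY));   /* 8388609.0 */
    printf("%.1f\n", nextafterf(16777216.0f, INFINITY));  /* 16777218.0 */
    return 0;
}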

The other "this" you might be wondering about is: how do you
drop off the fractional bits? *That* one depends (for efficiency
reasons) on the CPU. The two easy ways are bit-twiddling, and
doing addition followed by subtraction. In both cases, we just
want to zero out any mantissa (fraction) bits that represent
negative powers of two. The bit-twiddling method does it with
the direct and obvious way: mask them out. The add-and-subtract
method uses the normalization hardware to knock them out. If
normalization is slow (e.g., done in software or with a microcode
loop), the bit-twiddling method is generally faster.
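Here is a sketch of the add-and-subtract method, assuming IEEE 754
doubles, round-to-nearest mode, and |x| below 2^52. Note that it
rounds rather than truncates, and negative values would need the sign
of the constant flipped:

#include <stdio.h>

static double drop_fraction_bits(double x)
{
    const double two52 = 4503599627370496.0;  /* 2^52 */
    /* adding 2^52 pushes the fraction bits out of the significand;
       volatile keeps the compiler from folding the pair away */
    volatile double t = x + two52;
    return t - two52;
}

int main(void)
{
    printf("%f\n", drop_fraction_bits(3.7));  /* 4.000000 (rounds up) */
    printf("%f\n", drop_fraction_bits(3.2));  /* 3.000000 */
    return 0;
}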
 
Gordon Burditt

Fred Ma said:
I'm using the expression "int a = ceil( SomeDouble )".
The man page says that ceil returns the smallest
integer that is not less than SomeDouble, represented
as a double. However, my understanding is that a
double has nonuniform precision throughout its value
range. Will a double always be able to exactly
represent any value of type int?

No. There is nothing prohibiting an implementation from choosing
int = 64-bit signed integer, and double = 64-bit IEEE double, which
has only 53 mantissa bits. Integers outside the range +/- 2**53
may be rounded.
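A sketch of that configuration's effect, assuming long long is 64 bits
and double is IEEE 754 with a 53-bit significand:

#include <stdio.h>

int main(void)
{
    long long big = (1LL << 53) + 1;   /* 9007199254740993 */
    double d = (double)big;            /* rounds to nearest: 2^53 */
    printf("%lld -> %.0f\n", big, d);  /* the low-order 1 is lost */
    return 0;
}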
Could someone please
point me to an explanation of how this is ensured,
given that the details of a type's realization vary
with the platform?

It is NOT ensured.

Gordon L. Burditt
 
Erik Trulsson

Fred Ma said:
I'm using the expression "int a = ceil( SomeDouble )".
The man page says that ceil returns the smallest
integer that is not less than SomeDouble, represented
as a double. However, my understanding is that a
double has nonuniform precision throughout its value
range.

I am not sure what you mean here, but a double is a floating-point type
and like all such has a precision of some fixed number of significant
digits. This precision does not vary, but for large exponents the
difference between one number and the next higher one can be fairly
large.
Will a double always be able to exactly
represent any value of type int?

Not necessarily. If, as is common, a double is 64 bits wide with 53
bits of precision, and (as is less common) int is also 64 bits wide
then there are some values of type int which can not be exactly
represented by a double.
 
Erik Trulsson

Fred Ma said:
I realize that if a double actually uses twice as many bits as
ints, the mantissa should be big enough that imprecision should
never arise. I'm just concerned about whether this can be relied
upon.

This can't be relied upon.
My faith in what seems normal has been shaken after finding
that long has the same number of bits as int in some environments.

Actually in most environments these days. (Most Unix variants on
32-bit systems have both int and long 32 bits wide.)
What if double has the same number of bits as ints in some
environments? Some of those bits will be taken up by the
exponent, and the mantissa will actually have fewer bits than an
int. Hence, it will be less precise than ints within the value
range of ints.

Correct, and this can indeed happen.
 
Jack Klein

Fred Ma said:
I'm using the expression "int a = ceil( SomeDouble )".
The man page says that ceil returns the smallest
integer that is not less than SomeDouble, represented
as a double. However, my understanding is that a
double has nonuniform precision throughout its value
range. Will a double always be able to exactly
represent any value of type int? Could someone please
point me to an explanation of how this is ensured,
given that the details of a type's realization vary
with the platform?

Thanks.

Fred

P.S. I am not worried about overflowing the int value
range, just about the guaranteed precise representation
of int by double.

As others have mentioned, on 64-bit platforms some integer types, and
perhaps even type int on some, have 64 bits and doubles usually have
fewer mantissa bits than this.

What I haven't seen anyone else point out, so far, is the fact that
this implementation-defined characteristic is available to your
program via the macros DECIMAL_DIG and DBL_DIG in <float.h>.
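A sketch that prints these characteristics (DECIMAL_DIG assumes a C99
<float.h>; DBL_MANT_DIG gives the significand width directly):

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_RADIX    = %d\n", FLT_RADIX);
    printf("DBL_DIG      = %d\n", DBL_DIG);
    printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);
#ifdef DECIMAL_DIG
    printf("DECIMAL_DIG  = %d\n", DECIMAL_DIG);
#endif
    return 0;
}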
 
Fred Ma

Fred said:
I'm using the expression "int a = ceil( SomeDouble )". The man page says
that ceil returns the smallest integer that is not less than SomeDouble,
represented as a double. However, my understanding is that a double has
nonuniform precision throughout its value range. Will a double always be
able to exactly represent any value of type int? Could someone please
point me to an explanation of how this is ensured, given that the details
of a type's realization vary with the platform?

Thanks.

Fred

P.S. I am not worried about overflowing the int value range, just about
the guaranteed precise representation of int by double.


Thanks, all, for your replies. They have pointed out a flaw with my own
question. Specifically, it is one thing to ask:

(1) if a double can precisely represent any int.

It is quite another to ask:

(2) if (int)ceil(SomeDouble) can precisely represent the smallest
integer that is no smaller than SomeDouble, given that SomeDouble is
in the value range of int.

The answer to #1 is clearly no if the mantissa of the double has
"significantly" fewer bits than the int. The reason for "significantly" is
approximate bookkeeping I've walked through; based on Chris's description,
I tried to sanity check this. It starts with the idea that whether a
double can represent any int depends on whether a double can increase in
value by exactly 1 throughout the value range of int. That is, when the
LSB of the mantissa is toggled, does the value of the double change by no
more than 1? For a mantissa of N bits, ignoring the IEEE hidden bit, this
condition is satisfied if scaling due to the exponent (power of 2) is
less-than-or-equal-to 2^N. I'm not talking about how the exponent is
represented in terms of bits; I'm talking about multiplying the mantissa by
2^N, however it is represented in IEEE format. Basically, the scaling is
such that there are no fractional bits. An exponent value greater than N
yields a scaling that causes the double to increment by more than 1 when
the mantissa increments. Hence, the limiting condition for the double to
have a precision of unity is when the scaling is 2^N. The maximum number
under this condition is when the mantissa is all-ones (N+1 ones including
the hidden bit), i.e. the double has value 2^(N+1)-1. (I'm ignoring the
details needed to accommodate negative numbers; this might affect the answer by a
bit or so). If all ints fall within this limit, then a double can
represent all ints.

I think the answer to #2 follows from this picture of scaling the mantissa
so that the LSB has unit value. I had to remind myself that the condition
considered in #2 is that SomeDouble is within the value range of int, so
the hazard being tested is not one of overflow. Irrespective of this
condition, however, there are two scenarios which ceil(SomeDouble) can be
split into. One is that the exponent scaling of SomeDouble leaves some
fractional bits, and the other is that it doesn't. If there are some
fractional bits, then the resolution of SomeDouble in that value range is
obviously more precise than a unity step, so integers are precisely
representable, and ceil should return the right value. If there are no
fractional bits, then SomeDouble has an integral value, and passing it
through the ceil function should result in no change, regardless of the
resolution of SomeDouble in that value range i.e. ceil should be able to
return the correct value as a double.

The unintuitive result of this (to me) is that ceil(SomeDouble) *always* returns
precisely the right answer. Whether it fits into an int is a different
issue (issue#1). I suspect this is what Chris was illustrating.
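A small sketch of both scenarios, assuming IEEE 754 doubles:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* scenario 1: fraction bits present; the neighbouring integers
       are exactly representable, so ceil() returns the right one */
    double small = 1000000.25;
    printf("ceil(%.2f) = %.2f\n", small, ceil(small));  /* 1000001.00 */

    /* scenario 2: at or above 2^52 the value is already integral,
       so ceil() must return it unchanged */
    double huge = pow(2.0, 53.0) + 2.0;  /* representable: step is 2 here */
    printf("ceil(huge) == huge: %s\n", ceil(huge) == huge ? "yes" : "no");
    return 0;
}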

Comments, confirmations, and corrections welcome.

Fred
 
Fred Ma

Jack said:
As others have mentioned, on 64-bit platforms some integer types, and
perhaps even type int on some, have 64 bits and doubles usually have
fewer mantissa bits than this.

What I haven't seen anyone else point out, so far, is the fact that
this implementation-defined characteristic is available to your
program via the macros DECIMAL_DIG and DBL_DIG in <float.h>.

Hi, Jack,

I found these definitions at Dinkum:

DECIMAL_DIG
#define DECIMAL_DIG <#if expression >= 10> [added with C99]
The macro yields the minimum number of decimal digits needed to represent all the significant digits for type long double.

FLT_DIG
#define FLT_DIG <#if expression >= 6>
The macro yields the precision in decimal digits for type float.

I guess the point is that one can infer the bit-width of the mantissa from
them. Thanks.

Fred
 
dandelion

Fred Ma said:
Will a double always be able to exactly
represent any value of type int?

Whether (strictly speaking) it will or won't, I wouldn't dare to say,
given the plethora of representations in use. What I *can* say from my
own experience is "Do not count on it".

Since the mantissa can (within its limits) represent an integer exactly, you
can simply set the exponent so the scale factor is 1, and the integer is
represented exactly. However, M_PI/M_PI seldomly equals 1.000000.
 
Michael Mair

Fred said:
Jack said:
As others have mentioned, on 64-bit platforms some integer types, and
perhaps even type int on some, have 64 bits and doubles usually have
fewer mantissa bits than this.

What I haven't seen anyone else point out, so far, is the fact that
this implementation-defined characteristic is available to your
program via the macros DECIMAL_DIG and DBL_DIG in <float.h>.

I found these definitions at Dinkum:

DECIMAL_DIG
#define DECIMAL_DIG <#if expression >= 10> [added with C99]
The macro yields the minimum number of decimal digits needed to represent all the significant digits for type long double.

FLT_DIG
#define FLT_DIG <#if expression >= 6>
The macro yields the precision in decimal digits for type float.

I guess the point is that one can infer the bit-width of the mantissa from
them. Thanks.

Umh, for the "bit width" rather use DBL_MANT_DIG, after you made
sure that FLT_RADIX is 2 (which is the base you expect).
If you want to know the highest exactly representable number (in the
"contiguous" subset, of course), you can calculate it from there or use
(assuming base 2) 2.0/DBL_EPSILON. Use a conversion to unsigned int and
back to find out whether unsigned can hold this value.
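A sketch of that check, assuming FLT_RADIX == 2; it compares against
UINT_MAX instead of doing the round-trip conversion, to avoid an
out-of-range conversion if the value doesn't fit:

#include <stdio.h>
#include <float.h>
#include <limits.h>

int main(void)
{
    /* largest value in the contiguous integer run: 2^DBL_MANT_DIG */
    double limit = 2.0 / DBL_EPSILON;
    printf("limit = %.0f\n", limit);
    if ((double)UINT_MAX <= limit)
        printf("all unsigned int values fall in the contiguous run\n");
    else
        printf("some unsigned int values are not exactly representable\n");
    return 0;
}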

Cheers
Michael
 
Fred Ma

Michael said:
Fred said:
Jack said:
As others have mentioned, on 64-bit platforms some integer types, and
perhaps even type int on some, have 64 bits and doubles usually have
fewer mantissa bits than this.

What I haven't seen anyone else point out, so far, is the fact that
this implementation-defined characteristic is available to your
program via the macros DECIMAL_DIG and DBL_DIG in <float.h>.

I found these definitions at Dinkum:

DECIMAL_DIG
#define DECIMAL_DIG <#if expression >= 10> [added with C99]
The macro yields the minimum number of decimal digits needed to represent all the significant digits for type long double.

FLT_DIG
#define FLT_DIG <#if expression >= 6>
The macro yields the precision in decimal digits for type float.

I guess the point is that one can infer the bit-width of the mantissa from
them. Thanks.

Umh, for the "bit width" rather use DBL_MANT_DIG, after you made
sure that FLT_RADIX is 2 (which is the base you expect).
If you want to know the highest exactly representable number (in the
"contiguous" subset, of course), you can calculate it from there or use
(assuming base 2) 2.0/DBL_EPSILON. Use a conversion to unsigned int and
back to find out whether unsigned can hold this value.


Thanks, Michael.

Fred
 
dandelion

Fred Ma said:
I imagine that would depend on how division is implemented.

Fred

Of course, that's why I wrote "seldomly". And which implementation would
return 1.000000, exactly? I'm curious. Try a few CPUs/FPUs and check
the results. I'll buy you a beer if you find one.

I wonder what all that 'epsilon-squared' stuff was good for back in HIO,
and why the informatics teacher kept hammering us with "Never compare two
floats for equality! Never!".

Must have been a geek, worrying about such detail.
 
Keith Thompson

dandelion said:
Of course, that's why I wrote "seldomly". And which implementation would
return 1.000000, exactly? I'm curious. Try a few CPUs/FPUs and check
the results. I'll buy you a beer if you find one.

I just tried this on a wide variety of systems; M_PI/M_PI compares
equal to 1.0 on all but one of them. (The exception was a Cray SV1.)

Here's the program I used:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double var_M_PI = M_PI;
    double ratio = M_PI / M_PI;
    double var_ratio = var_M_PI / var_M_PI;

    printf("M_PI = %g\n", M_PI);
    printf("var_M_PI = %g\n", var_M_PI);
    printf("ratio = %g\n", ratio);
    printf("ratio %s 1.0\n", ratio == 1.0 ? "==" : "!=");
    printf("var_ratio = %g\n", var_ratio);
    printf("var_ratio %s 1.0\n", var_ratio == 1.0 ? "==" : "!=");
    return 0;
}

Caveats: A moderately clever compiler could compute the value at
compilation time (I didn't check this, but I didn't use any
optimization options). And of course M_PI is non-standard.
 
Fred Ma

Keith said:
I just tried this on a wide variety of systems; M_PI/M_PI compares
equal to 1.0 on all but one of them. (The exception was a Cray SV1.)

Here's the program I used:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double var_M_PI = M_PI;
    double ratio = M_PI / M_PI;
    double var_ratio = var_M_PI / var_M_PI;

    printf("M_PI = %g\n", M_PI);
    printf("var_M_PI = %g\n", var_M_PI);
    printf("ratio = %g\n", ratio);
    printf("ratio %s 1.0\n", ratio == 1.0 ? "==" : "!=");
    printf("var_ratio = %g\n", var_ratio);
    printf("var_ratio %s 1.0\n", var_ratio == 1.0 ? "==" : "!=");
    return 0;
}

Caveats: A moderately clever compiler could compute the value at
compilation time (I didn't check this, but I didn't use any
optimization options). And of course M_PI is non-standard.


In Canada, Moosehead beer is pretty good. :)

Seriously, I wasn't implying that practical implementations of
division were necessarily sophisticated enough to recognize
equivalence of numerator and denominator. What I should have
said was that I can see such a discrepancy arising, since
division is not straightforward to implement. I'm talking about
cases that aren't optimized away at compile time.

Fred
 
Chris Torek

A few minor corrections...

This is correct (well, I can imagine a weird implementation that
deliberately makes "double"s have constant precision by often
wasting a lot of space; it seems quite unlikely though).

It occurs to me now that "precision" is not properly defined here.
When dealing with scientific notation and decimal numbers, something
like 1.23e+10 is less precise than 1.230e+10. The precision here
is determined by the number of digits in the mantissa (which is
why we have to use the "e+10" notation to suppress "unwanted"
trailing zeros).

Using this definition of precision, and keeping in mind that most
computers today use powers of 2 (binary floating point) rather than
powers of ten (decimal floating point), we actually do have "constant
precision", such as "always exactly 24 bits of mantissa" (provided
we ignore those pesky "denorms" :) ).

This is of course not what the original poster and I meant by
"precision" (as illustrated below) -- we were referring to digits
beyond the decimal point after conversion to printed form via "%f",
for instance. Note, however, that IBM "hex float" (as used on
S/360 -- floating point with a radix of 16 instead of 2) really
*does* have "precision wobble": the number of "useful" bits in the
mantissa changes as numbers change in magnitude. This gives the
numerical analysis folks headaches. IEEE floating point is rather
better behaved.

I need to fix one more typo though:
... [using] "float" ... on most of today's implementations.
Here, we get "interesting" effects near 8388608.0 and 16777216.0.
Values below 16777216.0 step by ones: 8388608.0 is followed
immediately by 8388609.0, for instance, and 16777215.0 is followed
immediately by 16777216.0. On the other hand, below (float)(1<<23)
or above (float)(1<<24), we step by 1/2 or 2 respectively. Using
nextafterf() (if you have it) and variables set to the right values,
you might printf() some results and find:

nextafterf(8388608.0, -inf) = 8388607.5
nextafterf(16777216.0, +inf) = 16777216.2

This last line should read:

nextafterf(16777216.0, +inf) = 16777218.0

(I typed this all in manually, rather than writing C code to
call nextafterf(), display the results as above, and then
cut-and-paste -- so I added 0.2 instead of 2.0 when I made
the change by hand.)
 
Fred Ma

Chris said:
A few minor corrections...

This is correct (well, I can imagine a weird implementation that
deliberately makes "double"s have constant precision by often
wasting a lot of space; it seems quite unlikely though).

It occurs to me now that "precision" is not properly defined here.
When dealing with scientific notation and decimal numbers, something
like 1.23e+10 is less precise than 1.230e+10. The precision here
is determined by the number of digits in the mantissa (which is
why we have to use the "e+10" notation to suppress "unwanted"
trailing zeros).

Using this definition of precision, and keeping in mind that most
computers today use powers of 2 (binary floating point) rather than
powers of ten (decimal floating point), we actually do have "constant
precision", such as "always exactly 24 bits of mantissa" (provided
we ignore those pesky "denorms" :) ).

This is of course not what the original poster and I meant by
"precision" (as illustrated below) -- we were referring to digits
beyond the decimal point after conversion to printed form via "%f",
for instance. Note, however, that IBM "hex float" (as used on
S/360 -- floating point with a radix of 16 instead of 2) really
*does* have "precision wobble": the number of "useful" bits in the
mantissa changes as numbers change in magnitude. This gives the
numerical analysis folks headaches. IEEE floating point is rather
better behaved.

I need to fix one more typo though:
... [using] "float" ... on most of today's implementations.
Here, we get "interesting" effects near 8388608.0 and 16777216.0.
Values below 16777216.0 step by ones: 8388608.0 is followed
immediately by 8388609.0, for instance, and 16777215.0 is followed
immediately by 16777216.0. On the other hand, below (float)(1<<23)
or above (float)(1<<24), we step by 1/2 or 2 respectively. Using
nextafterf() (if you have it) and variables set to the right values,
you might printf() some results and find:

nextafterf(8388608.0, -inf) = 8388607.5
nextafterf(16777216.0, +inf) = 16777216.2

This last line should read:

nextafterf(16777216.0, +inf) = 16777218.0

(I typed this all in manually, rather than writing C code to
call nextafterf(), display the results as above, and then
cut-and-paste -- so I added 0.2 instead of 2.0 when I made
the change by hand.)

Chris, thanks for the correction. I think I got the gist of
it from your original post. I did a blanket reply elaborating on it,
Fri. Oct. 22 Message-ID <[email protected]>. Thanks
for helping me get my brain around it, and if you have any comments
on that, I'm certainly interested.

Fred
 
Dik T. Winter

> Seriously, I wasn't implying that practical implementations of
> division were necessarily sophisticated enough to recognize
> equivalence of numerator and denominator. What I should have
> said was that I can see such a discrepancy arising, since
> division is not straightforward to implement. I'm talking about
> cases that aren't optimized away at compile time.

It is not straightforward to implement. Nevertheless, whenever the FPU
conforms to the IEEE standard, the division *must* deliver the exact
answer if the quotient is representable. So all systems using such
FPUs (and that is the majority at this moment) should deliver 1.0 when
confronted with a/a, in whatever way it is disguised. Getting division
right is not straightforward, but it is not so very difficult either.

That Keith Thompson found this was not the case on a Cray SV1 is
entirely because that system does not have an IEEE-conforming
floating-point system. (That machine does not have a divide
instruction. It calculates an approximation of the inverse of the
denominator, multiplies by the numerator, and performs one Newton
iteration. Due to some quirks it may give an inexact result. If I
remember right, the smallest integral division that is inexact is
17.0/17.0.)
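A quick sketch of that claim for an IEEE FPU: x/x should compare equal
to 1.0 for any finite nonzero x, since 1.0 is representable and IEEE
division is correctly rounded:

#include <stdio.h>

int main(void)
{
    int failures = 0;
    double x;
    for (x = 0.001; x < 1000.0; x += 0.377)
        if (x / x != 1.0)
            failures++;
    printf("%d failures\n", failures);  /* expect 0 on IEEE hardware */
    return 0;
}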
 
