Finding limits of precision for float types

Scott

Is there a reliable low-overhead approach to testing the limits of
precision for float types on an arbitrary platform?

I'm converting an integer to a real number such that 0<=n<1, and then
converting the real back to an integer. I want to determine the
largest integer (bit-wise) that can be converted in this way and still
have foo(bar(x))==x always true.
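
For concreteness, the conversion pair might look something like this
sketch; the 2^BITS scale factor is just illustrative, not my actual
code:

#include <stdio.h>

#define BITS  24                       /* assumed integer width under test */
#define SCALE ((float)(1UL << BITS))   /* 2^BITS */

/* int -> real in [0,1) */
static float bar(unsigned long x) { return (float)x / SCALE; }

/* real in [0,1) -> int */
static unsigned long foo(float n) { return (unsigned long)(n * SCALE); }

int main(void) {
    unsigned long x = 12345678UL;      /* any value below 2^BITS */
    printf("%lu -> %g -> %lu\n", x, bar(x), foo(bar(x)));
    return 0;
}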

Using gcc 2.95.3 on FreeBSD/i386 (also djgpp 2.8.0/DOS) I wrote a BFI
test program to experimentally determine that a float can represent
only 24 bits in this way - at 25 bits there are over 8 million fatal
rounding errors in the range 0-2^25.

One apparent clue is that the IEEE float type has a 23-bit mantissa,
and lets me store 24 bits. Would an IEEE double store 53 bits? But
then, not all platforms use IEEE floats, and they don't even have to
use a standard floating-point format internally to store C
floating-point types. C only guarantees a minimum range, right?

I don't have the CPU resources to test larger numbers with my naive
approach (my program would take almost 28 centuries to fully explore a
double on my machine). So, is there a more elegant way to find out?

Disclaimer: I know this issue could be considered OT, but I'm
specifically working to solve this in portable standard C.

Thanks,
-Scott
 
Sidney Cadot

Scott said:
Is there a reliable low-overhead approach to testing the limits of
precision for float types on an arbitrary platform?

Not an easy one, as far as I am aware, no.
I'm converting an integer to a real number such that 0<=n<1, and then
converting the real back to an integer. I want to determine the
largest integer (bit-wise) that can be converted in this way and still
have foo(bar(x))==x always true.

Some time ago, I proposed here that such constants be included in the
standard headers.
Using gcc 2.95.3 on FreeBSD/i386 (also djgpp 2.8.0/DOS) I wrote a BFI
test program to experimentally determine that a float can represent
only 24 bits in this way - at 25 bits there are over 8 million fatal
rounding errors in the range 0-2^25.

Yes. With a single-precision IEEE-754 float, you can represent all
integers in the range [-(2^24) .. +(2^24)] correctly; the smallest
positive integer it cannot represent is 2^24+1.
One apparent clue is that the IEEE float type has a 23-bit mantissa,
and lets me store 24 bits. Would an IEEE double store 53 bits?

Yes, all 53-bit integers (including +0 and -0), and 2^53 and -2^53 work
as well. Smallest problem value: 2^53+1 (and its negative counterpart).
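
For instance (assuming an IEEE-754 double and a compiler that
supports long long):

#include <stdio.h>

int main(void) {
    unsigned long long ok  = 9007199254740992ULL;   /* 2^53     */
    unsigned long long bad = 9007199254740993ULL;   /* 2^53 + 1 */
    /* 2^53 survives the round trip; 2^53+1 rounds back down to 2^53 */
    printf("%llu -> %llu\n", ok,  (unsigned long long)(double)ok);
    printf("%llu -> %llu\n", bad, (unsigned long long)(double)bad);
    return 0;
}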
But
then, not all platforms use IEEE floats, and they don't even have to
use a standard floating-point format internally to store C
floating-point types. C only guarantees a minimum range, right?

That's right, but the C99 standard includes an annex (Annex F) with
some things to say about this. Support for it is not very widely
available, though.
I don't have the CPU resources to test larger numbers with my naive
approach (my program would take almost 28 centuries to fully explore a
double on my machine). So, is there a more elegant way to find out?

I once wrote the same thing, doing basically a binary search, which
worked quite well. However, it depends on how many assumptions you
want to make; e.g., can you assume a binary radix?

In the case of binary, it is basically guaranteed that

(int_to_float_to_int(i) != i) || (int_to_float_to_int(i+1) != (i+1))

...if you're out of range (at least one of the numbers i, i+1 will
not be representable), and

(int_to_float_to_int(i) == i) && (int_to_float_to_int(i+1) == (i+1))

if you're in range (both i and i+1 are representable).
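
A sketch of how such a binary search might look, assuming a binary
radix and an unsigned long of at least 32 bits (a reconstruction, not
the program I originally wrote):

#include <stdio.h>

static unsigned long roundtrip(unsigned long i) {
    volatile float f = (float)i;   /* volatile forces rounding to float width */
    return (unsigned long)f;
}

/* the pair test from above: monotone in i for a binary radix */
static int pair_ok(unsigned long i) {
    return roundtrip(i) == i && roundtrip(i + 1) == i + 1;
}

int main(void) {
    unsigned long lo = 1, hi = 1UL << 30;   /* assumed search bounds */
    while (lo < hi) {                       /* find the last i with pair_ok(i) */
        unsigned long mid = lo + (hi - lo + 1) / 2;
        if (pair_ok(mid))
            lo = mid;
        else
            hi = mid - 1;
    }
    printf("every integer in [0, %lu] survives int->float->int\n", lo + 1);
    return 0;
}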
Disclaimer: I know this issue could be considered OT, but I'm
specifically working to solve this in portable standard C.

This can only be done if you presume a few things (some standard
exponent/mantissa representation, and a radix). Theoretically, an
exponent/mantissa representation is not mandated, but it's used
everywhere. Radices other than 2 do occur in practice, though.


Best regards,

Sidney
 
Tim Prince

Scott said:
Is there a reliable low-overhead approach to testing the limits of
precision for float types on an arbitrary platform?

I'm converting an integer to a real number such that 0<=n<1, and then
converting the real back to an integer. I want to determine the
largest integer (bit-wise) that can be converted in this way and still
have foo(bar(x))==x always true.
...
One apparent clue is that the IEEE float type has a 23-bit mantissa,
and lets me store 24 bits. Would an IEEE double store 53 bits? But
then, not all platforms use IEEE floats, and they don't even have to
use a standard floating-point format internally to store C
floating-point types. C only guarantees a minimum range, right?

I don't have the CPU resources to test larger numbers with my naive
approach (my program would take almost 28 centuries to fully explore a
double on my machine). So, is there a more elegant way to find out?

Disclaimer: I know this issue could be considered OT, but I'm
specifically working to solve this in portable standard C.

Thanks,
-Scott
Standard C provides the <float.h> header, which gives you the
information you need to answer these questions without actually
testing. One of the most frequently used standard C test programs to
construct <float.h> is enquire; it used to be included in the gcc
configuration. That will show you how to find your answers in a few
minutes, without assuming the prior existence of <float.h>. You
should have found both of these in the research you did before
posting.

IEEE 754 float stores 24 bits, including the suppressed (hidden) bit;
see your <float.h>. Yes, double then stores 53. The largest integral
values to which you can add 1, preserving a difference of 1, are
1/FLT_EPSILON and the like. All float values at or above
1/FLT_EPSILON are exact integers.
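
For instance, the limits can be computed directly from <float.h> with
no searching (a sketch, assuming nothing beyond the standard macros):

#include <stdio.h>
#include <float.h>

int main(void) {
    double fmax = 1.0, dmax = 1.0;
    int i;
    /* FLT_RADIX^FLT_MANT_DIG and FLT_RADIX^DBL_MANT_DIG, built by
       repeated multiplication so no pow() is needed */
    for (i = 0; i < FLT_MANT_DIG; ++i) fmax *= FLT_RADIX;
    for (i = 0; i < DBL_MANT_DIG; ++i) dmax *= FLT_RADIX;
    printf("float:  %.0f\n", fmax);
    printf("double: %.0f\n", dmax);
    return 0;
}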
 
Tim Prince

IEEE 754 float stores 24 bits, including the suppressed (hidden) bit;
see your <float.h>. Yes, double then stores 53. The largest integral
values to which you can add 1, preserving a difference of 1, are
1/FLT_EPSILON and the like. All float values at or above
1/FLT_EPSILON are exact integers.
Sorry, make that the largest integral value from which 1 can be subtracted
exactly.
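
A quick check of that arithmetic, assuming an IEEE-754 float (the
volatile stores guard against excess-precision evaluation on
x87-style hardware):

#include <stdio.h>
#include <float.h>

int main(void) {
    volatile float limit  = 2.0f / FLT_EPSILON;   /* 2^24 on IEEE systems  */
    volatile float plus1  = limit + 1.0f;         /* rounds back to 2^24   */
    volatile float minus1 = limit - 1.0f;         /* 2^24 - 1, still exact */
    printf("1/FLT_EPSILON = %.0f\n", 1.0 / FLT_EPSILON);
    printf("adding 1 to %.0f is lost: %s\n", (double)limit,
           plus1 == limit ? "yes" : "no");
    printf("subtracting 1 from %.0f is exact: %s\n", (double)limit,
           minus1 == limit - 1.0 ? "yes" : "no");
    return 0;
}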
 
E. Robert Tisdale

Scott said:
Is there a reliable low-overhead approach to testing the limits of
precision for float types on an arbitrary platform?

I'm converting an integer to a real number such that 0<=n<1, and then
converting the real back to an integer. I want to determine the
largest integer (bit-wise) that can be converted in this way and still
have foo(bar(x))==x always true.

Using gcc 2.95.3 on FreeBSD/i386 (also djgpp 2.8.0/DOS) I wrote a BFI
test program to experimentally determine that a float can represent
only 24 bits in this way - at 25 bits there are over 8 million fatal
rounding errors in the range 0-2^25.

One apparent clue is that the IEEE float type has a 23-bit mantissa,
and lets me store 24 bits. Would an IEEE double store 53 bits? But
then, not all platforms use IEEE floats, and they don't even have to
use a standard floating-point format internally to store C
floating-point types. C only guarantees a minimum range, right?

I don't have the CPU resources to test larger numbers with my naive
approach (my program would take almost 28 centuries to fully explore a
double on my machine). So, is there a more elegant way to find out?
> cat main.c
#include <stdio.h>
#include <limits.h>

int main(int argc, char* argv[]) {
    /* brute-force scan: report the first integer that fails the
       int -> float -> int round trip */
    for (unsigned int j = 0; j < UINT_MAX; ++j) {
        float x = j;
        if ((unsigned int)x != j) {
            fprintf(stderr, "%u = j\n", j);
            break;
        }
    }
    return 0;
}
> gcc -Wall -std=c99 -pedantic -o main main.c
> time ./main
16777217 = j
1.365u 0.001s 0:01.36 100.0% 0+0k 0+0io 74pf+0w

The largest integer value in your sequence
that you can store in a float is:

FLT_RADIX^FLT_MANT_DIG = 2^24 = 16777216


The largest integer value in your sequence
that you can store in a double is:

FLT_RADIX^DBL_MANT_DIG = 2^53 = 9007199254740992
 
Scott

In the case of binary, it is basically guaranteed that

(int_to_float_to_int(i) != i) || (int_to_float_to_int(i+1) != (i+1))

Is this test valid for arbitrary radices if we replace (i+1) with
(i+radix-1)?
This can only be done if you presume a few things (some standard
exponent/mantissa representation, and a radix). Theoretically, an
exponent/mantissa representation is not mandated, but it's used
everywhere. Radices other than 2 do occur in practice, though.

Probably a safe assumption, considering the nature of my program.

Thanks,
-Scott
 
Scott

You should have found both of these in the research you did
before posting.

Yes, I should have been an authority as soon as I finished K&R, but I
guess I've been getting there the hard way. Thank you for pointing
out float.h and enquire.c; now that I look more closely at them, they
do indeed provide all the information I need.
IEEE 754 float stores 24 bits, including the suppressed (hidden) bit;
see your <float.h>. Yes, double then stores 53. The largest integral
values to which you can add 1, preserving a difference of 1, are
1/FLT_EPSILON and the like. All float values at or above
1/FLT_EPSILON are exact integers.

Assuming that float.h does exist seems best. It looks like FLT_RADIX
and FLT_MANT_DIG are sufficient. A trivial test program can decide
from these which FP type will provide adequate precision.
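
Such a test might look like the following sketch, where NEEDED_BITS
is a placeholder for whatever width my program actually requires
(assumes FLT_RADIX == 2):

#include <stdio.h>
#include <float.h>

#define NEEDED_BITS 32   /* placeholder requirement */

int main(void) {
    /* *_MANT_DIG counts significand digits in base FLT_RADIX,
       so with radix 2 it is exactly the usable bit count */
    if (FLT_MANT_DIG >= NEEDED_BITS)
        puts("float is enough");
    else if (DBL_MANT_DIG >= NEEDED_BITS)
        puts("double is enough");
    else if (LDBL_MANT_DIG >= NEEDED_BITS)
        puts("long double is enough");
    else
        puts("no floating type preserves that many bits");
    return 0;
}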

In retrospect it's obvious that the precision of a float cannot exceed
the precision of its mantissa, give or take a hidden bit. It's all so
simple, once you figure out what you're doing.

Thanks again,
-Scott
 
