float? double?


Joe Wright

Dik said:
When precision is a problem, it is much better to analyse *why* it is a
problem. That is much better than going to double, and if that does not
cut it, going to long double.

Right you are. The lowly float with 24-bit mantissa is precise to one in
sixteen million. How close do you need to be? :)
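
For the curious, a quick way to see where that sixteen-million figure
comes from (assuming an IEEE 754 float, where FLT_MANT_DIG is 24):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Rounding a real number to the nearest float introduces a relative
       error of at most 1 part in 2^24 = 16,777,216.  FLT_EPSILON is the
       gap between 1.0 and the next float, i.e. 2^(1 - FLT_MANT_DIG). */
    printf("FLT_MANT_DIG   = %d\n", FLT_MANT_DIG);
    printf("2^FLT_MANT_DIG = %.0f\n", (double)(1L << FLT_MANT_DIG));
    printf("FLT_EPSILON    = %e\n", FLT_EPSILON);
    return 0;
}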
 

CBFalconer

Joe said:
Right you are. The lowly float with 24-bit mantissa is precise to
one in sixteen million. How close do you need to be? :)

You can often get away with much worse. 25 years ago I had a
system with 24-bit floats, which yielded 4.8 digits of precision, but
was fast (for its day) and rounded properly. This was quite
adequate to do least-squares fits to 3rd-order polynomials, which
involve some not-too-well-behaved matrix inversions.

The proper rounding is critical. Early on I checked some
operations with a Basic implementation, and got poorer results
because the Basic truncated rather than rounded, even though it
had 8 more bits of precision! I think that was one of Microsoft's.

--
"The power of the Executive to cast a man into prison without
formulating any charge known to the law, and particularly to
deny him the judgement of his peers, is in the highest degree
odious and is the foundation of all totalitarian government
whether Nazi or Communist." -- W. Churchill, Nov 21, 1943
 

Keith Thompson

Joe Wright said:
Right you are. The lowly float with 24-bit mantissa is precise to one
in sixteen million. How close do you need to be? :)

It depends on the application (and on the quality of the
implementation).
 

Dik T. Winter

....
> I think the name "double" probably comes from Fortran, where "DOUBLE
> PRECISION" is exactly twice the size of "FLOAT". (I'm not much of a
> Fortran person, so I could easily be mistaken.)

And originally also exactly twice the precision. A double precision
number was implemented as two single precision numbers. In many cases
handled in software.
 

pete

jacob said:
double means DOUBLE PRECISION. Note the word precision here.

When you want more precision, use double precision, and if that
doesn't cut it use long double.

Precision means the number of significant digits you get
for the calculations. float gives you 6 digits of precision
(assuming IEEE 754), double gives you 15, and long double
more than that: the Intel/AMD implementation gives you 18.

If precision is not important (you are only interested
in a rough approximation) float is great.

float is a low-ranking type
which is subject to the default argument promotions.
double is the more natural type.
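
A small sketch to print what your own implementation promises (the
figures above assume IEEE 754 and the Intel/AMD 80-bit long double):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Decimal digits of precision guaranteed by each type. */
    printf("FLT_DIG  = %d\n", FLT_DIG);   /* typically 6  */
    printf("DBL_DIG  = %d\n", DBL_DIG);   /* typically 15 */
    printf("LDBL_DIG = %d\n", LDBL_DIG);  /* 18 with the x87 80-bit format */
    return 0;
}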
 

Walter Roberson

Joe Wright said:
The 24-bit mantissa of the float demands 8 decimal digits for its
representation. The 53-bit double mantissa demands 16 decimal digits.

I had occasion, some time ago, to express float and double as text and
then from text back to float and double. Exactly.
Given a double, text is..
char buf[30]; /* more than enough */
sprintf(buf, "%.16e", dbl);
For a float..
sprintf(buf, "%.8e", flt);
Now use of atof() or strtod() will take the text back to floating point.
Exactly.

Hmmm -- it is not obvious to me that exact conversion will happen in
that case, Joe. 8 or 16 decimal digits gets you to the point at which
you can precisely pin down the last decimal digit displayed, but there
may have been up to around 3 additional bits worth of information
stored without being able to select the precise decimal digit for
output.

For example, the system might know that the bottom 3 bits are 011, but
be unable to decide whether to output a 4 (.375 rounded up, through
(.45 minus epsilon) rounded down), or a 5 (.45 rounded up, through
(.5 minus epsilon) rounded down). The alternative is to print out
more digits than are really present, in order to get enough
information to fill the bottom bits.
 

Malcolm

Keith Thompson said:
It depends on the application (and on the quality of the
implementation).

There's hardly any application where an accuracy of 1 in 16 million is not
acceptable. For instance, if you are machining space shuttle parts, it is
unlikely they go to a tolerance finer than about 1 in 10,000.

The real problem is that errors can propagate. If you multiply by a million,
suddenly you only have an accuracy of 1 in 16.
 

Joe Wright

Walter said:
Joe Wright said:
The 24-bit mantissa of the float demands 8 decimal digits for its
representation. The 53-bit double mantissa demands 16 decimal digits.

I had occasion, some time ago, to express float and double as text and
then from text back to float and double. Exactly.
Given a double, text is..
char buf[30]; /* more than enough */
sprintf(buf, "%.16e", dbl);
For a float..
sprintf(buf, "%.8e", flt);
Now use of atof() or strtod() will take the text back to floating point.
Exactly.

Hmmm -- it is not obvious to me that exact conversion will happen in
that case, Joe. 8 or 16 decimal digits gets you to the point at which
you can precisely pin down the last decimal digit displayed, but there
may have been up to around 3 additional bits worth of information
stored without being able to select the precise decimal digit for
output.

For example, the system might know that the bottom 3 bits are 011, but
be unable to decide whether to output a 4 (.375 rounded up, through
(.45 minus epsilon) rounded down), or a 5 (.45 rounded up, through
(.5 minus epsilon) rounded down). The alternative is to print out
more digits than are really present, in order to get enough
information to fill the bottom bits.

I suggest you are wrong, Walter. What three extra bits are you talking
about? My point is that a given float printed with sprintf(buff, "%.8e", f)
will produce a string that, when presented to atof() or strtod(), will
produce the original float value exactly.

Same for sprintf(buff, "%.16e", d) for double.
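
For what it's worth, a minimal version of this round-trip test might
look like the following (assuming IEEE 754 float and double; strtof is
C99, strtod will do as well):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    float  f = 0.1f;            /* any values will do */
    double d = 0.1;
    float  f2;
    double d2;
    char   buf[64];

    /* %.8e prints 9 significant digits, %.16e prints 17. */
    sprintf(buf, "%.8e", f);
    f2 = strtof(buf, NULL);
    printf("float : %s  round-trip %s\n", buf, (f == f2) ? "ok" : "FAILED");

    sprintf(buf, "%.16e", d);
    d2 = strtod(buf, NULL);
    printf("double: %s  round-trip %s\n", buf, (d == d2) ? "ok" : "FAILED");

    return 0;
}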
 

Tim Prince

Joe said:
Walter said:
Joe Wright said:
The 24-bit mantissa of the float demands 8 decimal digits for its
representation. The 53-bit double mantissa demands 16 decimal digits.

I had occasion, some time ago, to express float and double as text
and then from text back to float and double. Exactly.
Given a double, text is..
char buf[30]; /* more than enough */
sprintf(buf, "%.16e", dbl);
For a float..
sprintf(buf, "%.8e", flt);
Now use of atof() or strtod() will take the text back to floating
point. Exactly.

Hmmm -- it is not obvious to me that exact conversion will happen in
that case, Joe. 8 or 16 decimal digits gets you to the point at which
you can precisely pin down the last decimal digit displayed, but there
may have been up to around 3 additional bits worth of information
stored without being able to select the precise decimal digit for output.

For example, the system might know that the bottom 3 bits are 011, but
be unable to decide whether to output a 4 (.375 rounded up, through
(.45 minus epsilon) rounded down), or a 5 (.45 rounded up, through
(.5 minus epsilon) rounded down). The alternative is to print out
more digits than are really present, in order to get enough
information to fill the bottom bits.

I suggest you are wrong, Walter. What three extra bits are you talking
about? My point is that a given float printed with sprintf(buff, "%.8e", f)
will produce a string that, when presented to atof() or strtod(), will
produce the original float value exactly.

Same for sprintf(buff, "%.16e", d) for double.

According to the IEEE 754 standard, %.9e format for the float data type,
and %.17e for double, are required to avoid losing accuracy, and this
can be supported only within well-defined ranges. Standard C doesn't
assure you that IEEE 754 is followed, but it cannot improve on it.
 

Gordon Burditt

There's hardly any application where an accuracy of 1 in 16 million is not
acceptable.

Two common exceptions to this are currency and time.

Accountants expect down-to-the-penny (or whatever the smallest unit
of currency is) accuracy no matter what. And governments spend
trillions of dollars a year.

If your time base is in the year 1AD, and you subtract two current-day
times (stored in floats) to get an interval, you can get rounding
error in excess of an hour. Even for POSIX time (epoch 1 Jan 1970),
you still have rounding errors in excess of 1 minute.

For instance, if you are machining space shuttle parts, it is
unlikely they go to a tolerance finer than about 1 in 10,000.
The real problem is that errors can propagate. If you multiply by a million,
suddenly you only have an accuracy of 1 in 16.

If you had a precision of 1 in 16 million, and you multiply by a
million (an exact number), you still have 1 in 16 million. You
lose precision when you SUBTRACT nearly-equal numbers. If you
subtract two POSIX times about 1.1 billion seconds past the epoch,
but store these in floats before the subtraction, your result for
the difference is only accurate to within a minute. This stinks if
the real difference is supposed to be 5 seconds.
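
A small illustration, using two made-up timestamps about 1.1 billion
seconds past the epoch (assuming IEEE 754 single precision):

#include <stdio.h>

int main(void)
{
    double t1 = 1100000000.0;       /* two times really 5 seconds apart */
    double t2 = 1100000005.0;

    /* Near 1.1e9 the spacing between adjacent floats is 128 seconds,
       so both values round to the same float and the gap is lost. */
    float f1 = (float)t1;
    float f2 = (float)t2;

    printf("double difference: %.1f seconds\n", t2 - t1);  /* 5.0 */
    printf("float  difference: %.1f seconds\n", f2 - f1);  /* 0.0 */
    return 0;
}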
 

Gordon Burditt

Hmmm -- it is not obvious to me that exact conversion will happen in
that case, Joe. 8 or 16 decimal digits gets you to the point at which
you can precisely pin down the last decimal digit displayed, but there
may have been up to around 3 additional bits worth of information
stored without being able to select the precise decimal digit for
output.
For example, the system might know that the bottom 3 bits are 011, but
be unable to decide whether to output a 4 (.375 rounded up, through
(.45 minus epsilon) rounded down), or a 5 (.45 rounded up, through
(.5 minus epsilon) rounded down). The alternative is to print out
more digits than are really present, in order to get enough
information to fill the bottom bits.

He *is* printing more digits than are guaranteed to exist.
A float is guaranteed to have 6 significant decimal digits. For IEEE
floats, this number is about 6.9 digits. But he's printing 8 digits,
which is (I suspect, I haven't tested this) necessary to ensure that
every representable value has a unique representation.
 

Dik T. Winter

> I suggest you are wrong, Walter. What three extra bits are you talking
> about? My point is that a given float printed with sprintf(buff, "%.8e", f)
> will produce a string that, when presented to atof() or strtod(), will
> produce the original float value exactly.
(And "%.16e" for double.)

That is right for IEEE. To get round-trip exactness when reading in
and printing back again, the maximum number of decimal digits allowed is
floor((p - 1) log_10 b), where p is the number of base b digits.
Round-trip exactness the other way around requires
ceil(p log_10 b + 1) decimal digits.

For IEEE that means FLT_DIG=6 and DBL_DIG=15. For correct conversion
in all cases the other way around you need 9 digits for float and
17 digits for double. In a "%.e" format you have to subtract 1 from the
required number (because there is always a digit printed in front),
so 8 and 16 are good enough for IEEE.

That a total of 8 digits is not enough for float can be shown with the
pair a = 1073741824.0 and b = 1073741760.0: both print as 1.0737418e+09
with 8 significant digits.
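
A quick sketch of that counterexample (both values are exactly
representable in IEEE single precision):

#include <stdio.h>

int main(void)
{
    float a = 1073741824.0f;    /* 2^30      */
    float b = 1073741760.0f;    /* 2^30 - 64 */

    /* 8 significant digits cannot tell them apart; 9 can. */
    printf("%%.7e: a = %.7e  b = %.7e\n", a, b);
    printf("%%.8e: a = %.8e  b = %.8e\n", a, b);
    return 0;
}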
 

Dik T. Winter

> According to the IEEE 754 standards, %.9e format for float data type,
> and %.17e for double, are required to avoid losing accuracy, and this
> can be supported only within well defined ranges.

Are you sure? 9 digits and 17 digits are sufficient, but %.9e gives 10
digits and %.17e gives 18 digits. I think you committed the same error
I did at first, not counting the digit before the decimal point.
 

Dik T. Winter

>
> Two common exceptions to this are currency and time.
>
> Accountants expect down-to-the-penny (or whatever the smallest unit
> of currency is) accuracy no matter what. And governments spend
> trillions of dollars a year.

Right. One of the reasons to use fixed point for this, and not floating
point. With fixed point you can get the rounding as it should be.
> If your time base is in the year 1AD, and you subtract two current-day
> times (stored in floats) to get an interval, you can get rounding
> error in excess of an hour. Even for POSIX time (epoch 1 Jan 1970),
> you still have rounding errors in excess of 1 minute.

Again, a good reason not to use floating point for this, but fixed point.
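
For instance, a sketch of the fixed-point idea for currency: amounts
held as an integer number of cents, with a made-up 8.25% tax rate,
rounded half up:

#include <stdio.h>

int main(void)
{
    long long price_cents = 1999;   /* $19.99 */

    /* 8.25% tax: scale by 825/10000 in integer arithmetic, adding
       half the divisor first so the result is rounded, not truncated. */
    long long tax_cents = (price_cents * 825 + 5000) / 10000;

    printf("price: $%lld.%02lld\n", price_cents / 100, price_cents % 100);
    printf("tax:   $%lld.%02lld\n", tax_cents / 100, tax_cents % 100);
    return 0;
}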
 

Dik T. Winter

> He *is* printing more digits than are guaranteed to exist.

Hrm. What do you mean by "guaranteed to exist"?
> A float is guaranteed to have 6 significant decimal digits.

No. It is guaranteed that a decimal number with 6 significant decimal
digits, when read in and printed out again with the same precision,
will yield the original number.
> For IEEE
> floats, this number is about 6.9 digits. But he's printing 8 digits,
> which is (I suspect, I haven't tested this) necessary to ensure that
> every representable value has a unique representation.

No, he is printing 9 digits (do not forget the leading digit on %.e
formats), and these are indeed necessary.
 

Joe Wright

Gordon said:
Two common exceptions to this are currency and time.

Accountants expect down-to-the-penny (or whatever the smallest unit
of currency is) accuracy no matter what. And governments spend
trillions of dollars a year.

If your time base is in the year 1AD, and you subtract two current-day
times (stored in floats) to get an interval, you can get rounding
error in excess of an hour. Even for POSIX time (epoch 1 Jan 1970),
you still have rounding errors in excess of 1 minute.


If you had a precision of 1 in 16 million, and you multiply by a
million (an exact number), you still have 1 in 16 million. You
lose precision when you SUBTRACT nearly-equal numbers. If you
subtract two POSIX times about 1.1 billion seconds past the epoch,
but store these in floats before the subtraction, your result for
the difference is only accurate to within a minute. This stinks if
the real difference is supposed to be 5 seconds.

You're making all this up, aren't you? Posix time today is somewhere
around 1,156,103,121 seconds since the Epoch. We are therefore a little
over halfway to the end of Posix time in early 2038. Total Posix
seconds are 2^31 or 2,147,483,648 seconds. I would not expect to treat
such a number with a lowly float with only a 24-bit mantissa. I do point
out that double has a 53-bit mantissa and is very much up to the task.

Aside: Why was time_t defined as a 32-bit signed integer? What was
supposed to happen when time_t assumes LONG_MAX + 1? Why was there no
time to be considered before the Epoch? Arrogance of young men, I assume.

The double type would have been a much better choice for time_t.

We must try to remain clear about the difference between
accuracy and precision. It is the representation that offers the
precision; it is our (the programmer's) calculations that may provide
accuracy.
 

Ian Collins

Joe said:
Aside: Why was time_t defined as a 32-bit signed integer? What was
supposed to happen when time_t assumes LONG_MAX + 1? Why was there no
time to be considered before the Epoch? Arrogance of young men, I assume.

For the same reason we had the year 2K bug?

The double type would have been a much better choice for time_t.

Probably not on the hardware available in the '70s.
 

Chris Torek

Aside: Why was time_t defined as a 32-bit signed integer?

Why was int32_t defined as a 16-bit integer?

(For that matter, why *do* cats paint?)

Seriously, though:

Joe said:
The double type would have been a much better choice for time_t.

The original Unix time was a 16-bit type.

The original Unix epoch was moved several times (three, I think).

Then they got sick of that, and finally went to 32-bit (and 24-bit,
for disk block numbers; anyone remember "l3tol()"?) integers, and
eventually added "long" to the C language. After that came "unsigned
long", and now we have "long long" and "unsigned long long" and
there is no reason[%] not to make time_t a 64-bit type, as it is on
some systems.

[% Well, "backwards compatibility", especially with all those
binary file formats. Some people planned ahead, and some did not.
Some code will break, and some will not.]
 

Dik T. Winter

> Aside: Why was time_t defined as a 32-bit signed integer?

What operating system?
> What was
> supposed to happen when time_t assumes LONG_MAX + 1? Why was there no
> time to be considered before the Epoch? Arrogance of young men, I assume.

Why were only two digits used to specify the year?
> The double type would have been a much better choice for time_t.

Not at all. time_t should be an integral type, otherwise you can get
problems with rounding.
 

Keith Thompson

Chris Torek said:
Why was int32_t defined as a 16-bit integer?

(For that matter, why *do* cats paint?)

Seriously, though:


The original Unix time was a 16-bit type.

The original Unix epoch was moved several times (three, I think).

Um, are you sure about that? 16 bits with 1-second resolution only
covers about 18 hours. Even 1-minute resolution only covers about a
month and a half.

1970 was very early in the history of Unix. I wouldn't think there'd
have been much time to shift the epoch.
 
