Basics on real floating types

Dann Corbit · Nov 11, 2009

Keith Thompson said:

Dann Corbit said:

With memory at low, low prices it seems hard to come up with sensible
reasons to prefer float over double for the vast majority of
applications.

[...]

However much memory you've got, there will still be times (probably
rarely) when being able to store twice as many numbers is worthwhile.

Very much so. However, the conversion between the two types often
only needs to happen on ingress/egress. *Every* variable in the program
can be considered a 'temporary intermediate' to be acted on at
double precision.
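
A minimal sketch of that idea, assuming the data is stored as float and only the working accumulator is double. The function names are made up for illustration, and the float version is there only for comparison:

```c
#include <assert.h>
#include <stddef.h>

/* Data stays in float (half the storage); arithmetic is done in double. */
double sum_in_double(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];            /* x[i] is converted to double before the add */
    return acc;
}

/* Same loop with a float accumulator, for comparison only. */
float sum_in_float(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];            /* every add rounds to float precision */
    return acc;
}
```

On a typical IEEE 754 implementation, feeding both functions a million copies of 0.1f should show the float accumulator drifting visibly while the double one does not, even though the stored data is identical.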

There is also a built-in paradox.

In the situations where float helps the most, it is the worst choice.
For example, suppose that your matrix is too large to fit into RAM using
double. So you switch to float. But a large matrix using float
precision is far more likely to suffer numerically than a smaller one
(where the space savings of float are not needed).

The number of operations in matrix multiply is O(N^3), so if N is {for
instance} 100, then the number of operations will be proportional to
1,000,000. If N is 1000, then the number of operations will be
proportional to 1,000,000,000. As you can see, in the places where
space becomes the most scarce, the operation count goes through the
roof. So the accumulated rounding error becomes a more and more serious
problem. Of course, there are cases where there is simply no choice but
to use float because the double matrix will not fit into RAM, and it is
not sparse. In those cases, it is really important to analyze the
stability of the calculation so that loss of precision does not render
the answer worthless {and you don't even know it}. (IOW, it's time to
put Dik Winter on the payroll).
;-)
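
To make the growth in rounding error concrete, here is a rough sketch that multiplies two N x N matrices twice, once with a float accumulator and once with a double accumulator, and reports the worst relative difference. The generator, the seed, and the function names are all assumptions chosen just to get repeatable positive inputs. It is not a stability analysis, only a way to watch the error move as N grows.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Small deterministic generator (illustrative only): values in [0, 1). */
static float next_value(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;   /* linear congruential step */
    return (float)(*state >> 8) / 16777216.0f;  /* top 24 bits scaled by 2^-24 */
}

/* Worst relative difference between a float-accumulated and a
   double-accumulated n x n product. Returns -1.0 if allocation fails. */
double matmul_float_error(size_t n)
{
    assert(n > 0);
    float *a = malloc(n * n * sizeof *a);
    float *b = malloc(n * n * sizeof *b);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }

    uint32_t state = 12345u;                    /* arbitrary fixed seed */
    for (size_t i = 0; i < n * n; i++) {
        a[i] = next_value(&state);
        b[i] = next_value(&state);
    }

    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float  sf = 0.0f;                   /* float accumulator */
            double sd = 0.0;                    /* double reference */
            for (size_t k = 0; k < n; k++) {
                sf += a[i * n + k] * b[k * n + j];
                sd += (double)a[i * n + k] * (double)b[k * n + j];
            }
            if (sd > 0.0) {
                double rel = ((double)sf - sd) / sd;
                if (rel < 0.0)
                    rel = -rel;
                if (rel > worst)
                    worst = rel;
            }
        }
    }

    free(a);
    free(b);
    return worst;
}
```

Calling `matmul_float_error(8)` and then `matmul_float_error(256)` should show the worst-case error rising with N. The inputs here are all positive, so there is no cancellation; real data with mixed signs can behave considerably worse.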

Francis Moreau · Nov 12, 2009

Dann Corbit said:
This is almost always true. It is also almost always true that the
double answer will be more accurate. So do you want a less reliable
answer faster or a more reliable answer more slowly? It's not always
simple.

Funny, this is actually the goal of my post...

On a 64 bit system with a 64 bit compiler and operating system there is
no real penalty for 64 bit operations (other than increased size of
applications due to doubling of pointer width). But operations on 64
bit integers are ultra-fast.

Again, I work on systems embedding 32-bit CPUs, where speed matters but
reliable results matter even more.

This is probably why the C language is so widespread on these systems, BTW.
If I worked on a 64-bit system with large memory (> 2GB, so my computer
for example), I would probably blindly use 'double' or even 'long
double' always (well actually it's a lie, I wouldn't use C actually).

This is a complicated subject, called "Numerical Analysis".

There are a number of packages that will calculate this for you using range
(interval) arithmetic:
http://portal.acm.org/citation.cfm?id=138377

Of course, there is a large speed penalty for using range arithmetic.

For things like solving a linear system you can examine the condition
number of a matrix. There are some problems (for instance) that have an
exact answer, and yet are incredibly difficult to calculate. An example
is solving a linear system where the matrix is a Hilbert matrix:
http://en.wikipedia.org/wiki/Hilbert_matrix
The condition number of a Hilbert matrix goes "coo coo for cocoa puffs"
as the matrix gets large.
Even though there is an exact solution, even a moderately small matrix
will give absurdly wrong answers unless extended or arbitrary precision
is used for the calculation.
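
A small sketch of the Hilbert matrix problem in plain double, assuming a textbook Gaussian elimination with partial pivoting (the function names are invented for this example). The right-hand side is the row sums of H, so the solution should be all ones, up to the rounding already present in the stored entries and sums. There is no singularity check, so treat it as a demonstration only.

```c
#include <assert.h>
#include <stdlib.h>

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Solve H x = b for the n x n Hilbert matrix H[i][j] = 1/(i+j+1), where
   b[i] is the sum of row i. Returns max |x[i] - 1|, or -1.0 if
   allocation fails. */
double hilbert_solve_error(size_t n)
{
    assert(n > 0);
    double *h = malloc(n * n * sizeof *h);
    double *b = malloc(n * sizeof *b);
    if (h == NULL || b == NULL) {
        free(h);
        free(b);
        return -1.0;
    }

    for (size_t i = 0; i < n; i++) {
        b[i] = 0.0;
        for (size_t j = 0; j < n; j++) {
            h[i * n + j] = 1.0 / (double)(i + j + 1);
            b[i] += h[i * n + j];
        }
    }

    /* Forward elimination with partial pivoting. */
    for (size_t k = 0; k < n; k++) {
        size_t p = k;
        for (size_t i = k + 1; i < n; i++)
            if (abs_d(h[i * n + k]) > abs_d(h[p * n + k]))
                p = i;
        if (p != k) {
            for (size_t j = 0; j < n; j++) {
                double t = h[k * n + j];
                h[k * n + j] = h[p * n + j];
                h[p * n + j] = t;
            }
            double t = b[k];
            b[k] = b[p];
            b[p] = t;
        }
        for (size_t i = k + 1; i < n; i++) {
            double f = h[i * n + k] / h[k * n + k];
            for (size_t j = k; j < n; j++)
                h[i * n + j] -= f * h[k * n + j];
            b[i] -= f * b[k];
        }
    }

    /* Back substitution; b is overwritten with the computed solution. */
    for (size_t k = n; k-- > 0; ) {
        for (size_t j = k + 1; j < n; j++)
            b[k] -= h[k * n + j] * b[j];
        b[k] /= h[k * n + k];
    }

    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double e = abs_d(b[i] - 1.0);
        if (e > worst)
            worst = e;
    }

    free(h);
    free(b);
    return worst;
}
```

Comparing `hilbert_solve_error(4)` with `hilbert_solve_error(10)` should show the error growing by many orders of magnitude, even though the matrix is still tiny and nothing about the algorithm changed.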

Ok thanks.

Dik T. Winter · Nov 13, 2009

> For things like solving a linear system you can examine the condition
> number of a matrix. There are some problems (for instance) that have an
> exact answer, and yet are incredibly difficult to calculate. An example
> is solving a linear system where the matrix is a Hilbert matrix:
> http://en.wikipedia.org/wiki/Hilbert_matrix
> The condition number of a Hilbert matrix goes "coo coo for cocoa puffs"
> as the matrix gets large.
> Even though there is an exact solution, even a moderately small matrix
> will give absurdly wrong answers unless extended or arbitrary precision
> is used for the calculation.

You actually need exact rational arithmetic (also for the coefficients),
otherwise it can go spectacularly wrong for higher orders, whatever the
precision.
