Basics on real floating types

Dann Corbit

Keith Thompson said:
Dann Corbit said:
With memory at low, low prices it seems hard to come up with sensible
reasons to prefer float over double for the vast majority of
applications.
[...]

However much memory you've got, there will still be times (probably
rarely) when being able to store twice as many numbers is worthwhile.

Very much so. However, often float is only needed at the interface,
on ingress/egress. *Every* variable inside the program
can be considered a 'temporary intermediate' to be acted on at
double precision.
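
To illustrate the "float at the edges, double in the middle" idea, here is a minimal sketch in C. The function names `sum_in_float` and `sum_in_double` are made up for this example; both read the same float data, and only the type of the running total differs.

```c
#include <stddef.h>

/* Sum float data using a float running total. */
float sum_in_float(const float *x, size_t n)
{
    float acc = 0.0f;
    size_t i;

    for (i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Same float data, but the intermediate (the running total) is a double. */
double sum_in_double(const float *x, size_t n)
{
    double acc = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        acc += x[i];   /* x[i] is converted to double before the add */
    return acc;
}
```

With the three values 1.0e8f, 1.0f, -1.0e8f, the double accumulator returns exactly 1.0. On a typical IEEE 754 implementation that evaluates float expressions in float precision, the float accumulator loses the 1.0f entirely and returns 0, because adjacent floats near 1.0e8 are 8 apart.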

There is also a built-in paradox.

In the situations where float helps the most, it is the worst choice.
For example, suppose that your matrix is too large to fit into RAM using
double. So you switch to float. But a large matrix using float
precision is far more likely to suffer numerically than a smaller one
(where the float speed is not needed).

The number of operations in matrix multiply is O(N^3), so if N is {for
instance} 100, then the number of operations will be proportional to
1,000,000. If N is 1000, then the number of operations will be
proportional to 1,000,000,000. As you can see, in the places where
space becomes the most scarce, the operation count goes through the
roof. So the accumulated rounding error becomes a more and more serious
problem. Of course, there are cases where there is simply no choice but
to use float because the double matrix will not fit into RAM, and it is
not sparse. In those cases, it is really important to analyze the
stability of the calculation so that loss of precision does not render
the answer worthless {and you don't even know it}. (IOW, it's time to
put Dik Winter on the payroll).
;-)
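
A rough way to see the accumulation effect described above is to multiply the same float matrices twice, once with a float accumulator and once with a double accumulator, and look at the largest difference. This is only a sketch: `matmul_accumulator_gap` is a hypothetical name, the matrices are assumed to be square and stored row-major, and the result is a measurement for one input, not a substitute for a real stability analysis.

```c
#include <stddef.h>

/* Absolute value, written out so the example needs no <math.h>. */
static double gap_abs(double v) { return v < 0.0 ? -v : v; }

/*
 * Multiply two n x n row-major float matrices, computing every entry
 * of the product twice: once accumulating in float, once in double.
 * Returns the largest absolute difference between the two versions.
 * The product itself is not stored; only the gap is reported.
 */
double matmul_accumulator_gap(const float *a, const float *b, size_t n)
{
    double worst = 0.0;
    size_t i, j, k;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            float  acc_f = 0.0f;
            double acc_d = 0.0;
            double diff;

            for (k = 0; k < n; k++) {
                acc_f += a[i * n + k] * b[k * n + j];
                acc_d += (double)a[i * n + k] * (double)b[k * n + j];
            }
            diff = gap_abs((double)acc_f - acc_d);
            if (diff > worst)
                worst = diff;
        }
    }
    return worst;
}
```

Each entry takes n multiply-adds and there are n*n entries, which is where the O(N^3) count comes from. Calling this with increasing n on your own data gives a feel for how quickly the float result drifts away from the double one.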
 
Francis Moreau

Dann Corbit said:
This is almost always true. It is also almost always true that the
double answer will be more accurate. So do you want a less reliable
answer faster or a more reliable answer more slowly? It's not always
simple.

Funny, this is actually the goal of my post...

On a 64 bit system with a 64 bit compiler and operating system there is
no real penalty for 64 bit operations (other than increased size of
applications due to doubling of pointer width). But operations on 64
bit integers are ultra-fast.

Again, I work on systems embedding 32-bit CPUs, where speed matters but
a reliable result matters even more.

This is probably why the C language is so widespread on these systems, BTW. If I
worked on a 64-bit system with large memory (> 2GB, so my computer
for example), I would probably blindly use 'double' or even 'long
double' always (well, actually that's a lie, I wouldn't use C at all).

This is a complicated subject, called "Numerical Analysis".

There are a number of packages that will calculate this for you using range
arithmetic:
http://portal.acm.org/citation.cfm?id=138377

Of course, there is a large speed penalty for using range arithmetic.

For things like solving a linear system you can examine the condition
number of a matrix. There are some problems (for instance) that have an
exact answer, and yet are incredibly difficult to calculate. An example
is solving a linear system where the matrix is a Hilbert matrix:
http://en.wikipedia.org/wiki/Hilbert_matrix
The condition number of a Hilbert matrix goes "coo coo for cocoa puffs"
as the matrix gets large.
Even though there is an exact solution, even a moderately small matrix
will give absurdly wrong answers unless extended or arbitrary precision
is used for the calculation.
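
The Hilbert matrix effect is easy to try. The sketch below is an illustration, not code from this thread: it builds the n x n Hilbert matrix, chooses the right-hand side so that the exact answer is all ones, solves with plain Gaussian elimination and partial pivoting in double, and reports how far the computed answer is from 1. The name `hilbert_solve_error` and the size limit `HILBERT_MAX_N` are assumptions made for the example.

```c
#define HILBERT_MAX_N 16

/* Absolute value, written out so the example needs no <math.h>. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/*
 * Build H[i][j] = 1/(i+j+1) (0-based), set b = H * (1,1,...,1),
 * solve H x = b in double and return max |x[i] - 1|.
 * Returns -1.0 if n is out of range or a pivot is exactly zero.
 */
double hilbert_solve_error(int n)
{
    double a[HILBERT_MAX_N][HILBERT_MAX_N + 1]; /* augmented [H | b] */
    double x[HILBERT_MAX_N];
    double worst = 0.0;
    int i, j, k;

    if (n < 1 || n > HILBERT_MAX_N)
        return -1.0;

    for (i = 0; i < n; i++) {
        double row_sum = 0.0;
        for (j = 0; j < n; j++) {
            a[i][j] = 1.0 / (double)(i + j + 1);
            row_sum += a[i][j];
        }
        a[i][n] = row_sum;
    }

    /* Forward elimination with partial pivoting. */
    for (k = 0; k < n; k++) {
        int p = k;
        for (i = k + 1; i < n; i++)
            if (abs_d(a[i][k]) > abs_d(a[p][k]))
                p = i;
        if (a[p][k] == 0.0)
            return -1.0;
        if (p != k) {
            for (j = k; j <= n; j++) {
                double t = a[k][j];
                a[k][j] = a[p][j];
                a[p][j] = t;
            }
        }
        for (i = k + 1; i < n; i++) {
            double m = a[i][k] / a[k][k];
            for (j = k; j <= n; j++)
                a[i][j] -= m * a[k][j];
        }
    }

    /* Back substitution. */
    for (i = n - 1; i >= 0; i--) {
        double s = a[i][n];
        for (j = i + 1; j < n; j++)
            s -= a[i][j] * x[j];
        x[i] = s / a[i][i];
    }

    for (i = 0; i < n; i++)
        if (abs_d(x[i] - 1.0) > worst)
            worst = abs_d(x[i] - 1.0);
    return worst;
}
```

Printing `hilbert_solve_error(n)` for n from 2 up to 16 shows the error climbing steeply as n grows, even though every one of these systems has the exact solution (1, 1, ..., 1). Note that the entries 1/3, 1/5 and so on are already rounded when they are stored, so the matrix being solved is not quite the true Hilbert matrix to begin with.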

Ok thanks.
 
Dik T. Winter

> For things like solving a linear system you can examine the condition
> number of a matrix. There are some problems (for instance) that have an
> exact answer, and yet are incredibly difficult to calculate. An example
> is solving a linear system where the matrix is a Hilbert matrix:
> http://en.wikipedia.org/wiki/Hilbert_matrix
> The condition number of a Hilbert matrix goes "coo coo for cocoa puffs"
> as the matrix gets large.
> Even though there is an exact solution, even a moderately small matrix
> will give absurdly wrong answers unless extended or arbitrary precision
> is used for the calculation.

You actually need exact rational arithmetic (also for the coefficients),
otherwise it can go spectacularly wrong for higher orders, whatever the
precision.
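
To make the "exact rational arithmetic" idea concrete, here is a toy sketch in C. Everything in it is an assumption for illustration: the `frac` type and its helpers are hypothetical, the size is fixed at 4 x 4, and `long long` numerators and denominators are not checked for overflow, which limits it to very small orders. For higher orders you would want an arbitrary-precision rational library instead. The coefficients 1/(i+j+1) are stored as exact fractions, so nothing is rounded anywhere.

```c
/* A reduced fraction num/den with den > 0. */
typedef struct { long long num, den; } frac;

static long long gcd_ll(long long a, long long b)
{
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0) { long long t = a % b; a = b; b = t; }
    return a;
}

/* Normalise sign and reduce to lowest terms. */
static frac frac_make(long long num, long long den)
{
    frac r;
    long long g;

    if (den < 0) { num = -num; den = -den; }
    g = gcd_ll(num, den);
    if (g == 0) g = 1;
    r.num = num / g;
    r.den = den / g;
    return r;
}

static frac frac_add(frac a, frac b)
{ return frac_make(a.num * b.den + b.num * a.den, a.den * b.den); }

static frac frac_sub(frac a, frac b)
{ return frac_make(a.num * b.den - b.num * a.den, a.den * b.den); }

static frac frac_mul(frac a, frac b)
{ return frac_make(a.num * b.num, a.den * b.den); }

/* Caller must ensure b is nonzero. */
static frac frac_div(frac a, frac b)
{ return frac_make(a.num * b.den, a.den * b.num); }

#define RAT_N 4

/*
 * Solve H x = H * (1,...,1) for the 4 x 4 Hilbert matrix in exact
 * fractions. No pivoting is done: the Hilbert matrix is symmetric
 * positive definite, so in exact arithmetic no pivot is zero.
 * Returns 1 if every component of x is exactly 1/1, else 0.
 */
int hilbert4_exact_solution_is_ones(void)
{
    frac a[RAT_N][RAT_N + 1]; /* augmented [H | b] */
    frac x[RAT_N];
    int i, j, k;

    for (i = 0; i < RAT_N; i++) {
        frac sum = frac_make(0, 1);
        for (j = 0; j < RAT_N; j++) {
            a[i][j] = frac_make(1, i + j + 1);
            sum = frac_add(sum, a[i][j]);
        }
        a[i][RAT_N] = sum;
    }

    for (k = 0; k < RAT_N; k++) {
        for (i = k + 1; i < RAT_N; i++) {
            frac m = frac_div(a[i][k], a[k][k]);
            for (j = k; j <= RAT_N; j++)
                a[i][j] = frac_sub(a[i][j], frac_mul(m, a[k][j]));
        }
    }

    for (i = RAT_N - 1; i >= 0; i--) {
        frac s = a[i][RAT_N];
        for (j = i + 1; j < RAT_N; j++)
            s = frac_sub(s, frac_mul(a[i][j], x[j]));
        x[i] = frac_div(s, a[i][i]);
    }

    for (i = 0; i < RAT_N; i++)
        if (x[i].num != 1 || x[i].den != 1)
            return 0;
    return 1;
}
```

The price is that numerators and denominators grow quickly during elimination, which is why fixed-width integers only get you through the smallest cases.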