float and double precision

carsonmyers

so, I hear that the float and double intrinsic types don't have very
good precision, due to the way in which they store the data--they are
meant to be stored as 1.xxxxx, right? Not 1000.xxx or 0.xxx, anyways--
if something like,

float x=1000;
float y=1000.43;
std::cout<< y-x<< std::endl;

will result in 0.429xxxx or whatever,
why can a calculator do it?
Is there some other type that can be used to store precise numbers of
any size?
 
Victor Bazarov

so, I hear that the float and double intrinsic types don't have very
good precision, due to the way in which they store the data--they are
meant to be stored as 1.xxxxx, right?

Actually, it's 0.1xxxxx in binary, usually. IOW, the mantissa value is
always in the range [0.5, 1).
Not 1000.xxx or 0.xxx, anyways--
if something like,

float x=1000;
float y=1000.43;
std::cout<< y-x<< std::endl;

will result in 0.429xxxx or whatever,

It could.
why can a calculator do it?

Because it probably uses more digits of precision than 'float'...
Is there some other type that can be used to store precise numbers of
any size?

Yes, look on the web for "arbitrary precision floating point library".
Or you could use rationals (if your algorithm allows that). Or go for
some kind of mathematical formula for the number. You're still going to
be SOL with numbers like Pi or e (which aren't from a formula, really).

The built-in FP types are limited; there are only three. In addition to
the two you've named there is the 'long double', which is allowed to be
implemented as 'double'. Sucks, don' it?
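
For instance, a quick sketch with the numbers above (the exact digits
printed depend on the compiler and platform, but 'float' keeps fewer of
them than 'double'):

#include <iostream>
#include <iomanip>

int main()
{
    float  xf = 1000, yf = 1000.43f;
    double xd = 1000, yd = 1000.43;

    // ask for more digits than the default six to expose the rounding
    std::cout << std::setprecision(17);
    std::cout << "float:  " << yf - xf << '\n';  // e.g. 0.42999267578125
    std::cout << "double: " << yd - xd << '\n';  // closer, but still not 0.43
    return 0;
}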

V
 
osmium

so, I hear that the float and double intrinsic types don't have very
good precision, due to the way in which they store the data--they are
meant to be stored as 1.xxxxx, right? Not 1000.xxx or 0.xxx, anyways--
if something like,

float x=1000;
float y=1000.43;
std::cout<< y-x<< std::endl;

will result in 0.429xxxx or whatever,
why can a calculator do it?
Is there some other type that can be used to store precise numbers of
any size?

If calculators expressed numbers in binary form, they would have the general
problem you allude to. I think they use a 4-bit group to represent a single
decimal digit, and the exponent is similarly represented as a separate, but
associated, 2-decimal-digit datum. IOW, it is not what a computer person
would recognize as a floating point form. It is closer to what is called
binary coded decimal, but it is not that either. But you can get a better
idea of what I am talking about by looking up BCD, on Wikipedia say. Using
4 bits to represent only 10 digits is wasteful, but so be it.
 
Juha Nieminen

Victor said:
Yes, look on the web for "arbitrary precision floating point library".

Not that it will be of any help with regard to rounding errors as a
result of calculations/conversions. A value like 0.1 cannot be
represented accurately with base-2 floating point numbers even if you
use a gigabyte of RAM to store it (though you will get pretty darn
close).
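
A quick way to see it (the digits past the first couple depend on the
platform, but the value will never come out as exactly 0.1):

#include <iostream>
#include <iomanip>

int main()
{
    double d = 0.1;
    // ask for more digits than the default six significant figures
    std::cout << std::setprecision(20) << d << '\n';
    // typically prints 0.10000000000000000555
    return 0;
}

(With the default six digits, std::cout rounds the displayed value back
to 0.1, which is why a plain print still looks exact.)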
 
Carson Myers

Would it be practical to use two int values, do you think?
Like in a class--one for the fractional part, and one for the whole-
number part--
that way I suppose it would be possible to avoid rounding errors, since
the fractional part wouldn't really be treated as a fractional part,
but rather as a regular integer--you'd just have to worry about
handling the math and the behavior of the <1 portion of it yourself,
which would (I can imagine) be slow.
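
Something like this rough sketch, maybe (the Fixed name and the
two-decimal-place fraction are only for illustration, and it ignores
negatives, multiplication and so on):

#include <iostream>
#include <iomanip>

// One int for the whole part, one for the fraction in hundredths,
// so 1000.43 is stored as the pair (1000, 43).
struct Fixed {
    int whole;
    int hundredths;
};

Fixed subtract(Fixed a, Fixed b)
{
    int h = a.hundredths - b.hundredths;
    int w = a.whole - b.whole;
    if (h < 0) { h += 100; --w; }   // borrow from the whole part
    Fixed r = { w, h };
    return r;
}

int main()
{
    Fixed x = { 1000,  0 };   // 1000.00
    Fixed y = { 1000, 43 };   // 1000.43
    Fixed d = subtract(y, x);
    std::cout << d.whole << '.'
              << std::setw(2) << std::setfill('0') << d.hundredths
              << '\n';              // prints 0.43 exactly
    return 0;
}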

However I don't really understand how 0.1 could not be represented...
I've read about how it's a computer science problem; I don't fully
understand it, but I can vaguely grasp the concept (haven't read very
much)--but my compiler will still output 0.1 for float and double--if
I compiled "float x=0.1; std::cout<<x<<std::endl;" on another compiler
or system, might it show something else? Unbelievable...
 
Tim Love

However I don't really understand how 0.1 could not be represented...
I've read about how it's a computer science problem; I don't fully
understand it, but I can vaguely grasp the concept (haven't read very
much)--
http://www.eason.com/library/math/floatingmath.pdf is often mentioned
(What Every Computer Scientist Should Know About Floating-Point Arithmetic)
though http://www.mathworks.com/support/tech-notes/1100/1108.html might be
an easier read.
It's worth knowing about even if you're not a Computer Scientist -
spreadsheets exhibit the same trouble. You can't assume that 117/9 and
11.7/.9 are equal, or that adding a, b, and c (in that order) will give
you the same answer as adding c, b, and a (in that order). Tough life,
which is why programmers are paid so much.
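
For example (with ordinary IEEE doubles, both of these comparisons
typically come out false):

#include <iostream>

int main()
{
    double a = 117.0 / 9.0;   // exactly 13
    double b = 11.7 / 0.9;    // a hair under 13 after rounding
    std::cout << std::boolalpha << (a == b) << '\n';      // usually false

    // the order of addition matters too
    double x = 0.1, y = 0.2, z = 0.3;
    std::cout << ((x + y) + z == x + (y + z)) << '\n';    // usually false
    return 0;
}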
 
James Kanze

Actually, it's 0.1xxxxx in binary, usually. IOW, the mantissa
value is always in the range [0.5, 1).
Not 1000.xxx or 0.xxx, anyways--
if something like,
float x=1000;
float y=1000.43;
std::cout<< y-x<< std::endl;
will result in 0.429xxxx or whatever,
It could.

It can't, in general. Try something like 1.0/3.0.
Because it probably uses more digits of precision than
'float'...

Or because it uses decimal arithmetic. Which is not only
slower (usually), but has the disadvantage of variable
precision.

If you're doing bookkeeping, or working in some other context
where the rounding rules are determined by a legal specification
based on decimal arithmetic, then you need a decimal class which
does decimal arithmetic. Typically, however, such applications
aren't "numbers crunchers", so you can afford the extra runtime.
Yes, look on the web for "arbitrary precision floating point
library".

I'd be interested in seeing one capable of storing the exact
value of pi, or even the exact value of sqrt(2.0). Some numbers
require infinite precision in any base.

In practice, even simple division is a problem. You can only
store 1/n precisely in a finite number of bits if the base being
used is n or a multiple of n. In order to guarantee exactness,
you'd have to use some sort of rational representation.

Note that 10 is a multiple of 2, so with enough bits, you can
store any decimal representation. But this just begs the
question: numbers don't always come from literals or input
strings; they are also the result of expressions like a/b or
sqrt(c).
Or you could use rationals (if your algorithm allows that).
Or go for some kind of mathematical formula for the number.
You're still going to be SOL with numbers like Pi or e (which
aren't from a formula, really).

They can be expressed as the results of an equation.
The built-in FP types are limited, there are only three. In
addition to the two you've named there is the 'long double',
which is allowed to be implemented as 'double'. Sucks, don'
it?

Although neither made it into the final draft, there were
proposals on the table for decimal arithmetic and a rational
class. For that matter, I think the decimal arithmetic is being
adopted in the form of a technical report or something like
that.
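
A minimal sketch of the rational idea (illustrative names only, with no
sign handling or overflow checks): results of division stay exact as
long as the values remain rational and fit in the integers.

#include <iostream>

long long gcd(long long a, long long b)
{
    while (b != 0) { long long t = a % b; a = b; b = t; }
    return a;
}

struct Rational {
    long long num, den;
};

Rational reduce(Rational r)
{
    long long g = gcd(r.num, r.den);
    Rational s = { r.num / g, r.den / g };
    return s;
}

Rational add(Rational a, Rational b)
{
    Rational s = { a.num * b.den + b.num * a.den, a.den * b.den };
    return reduce(s);
}

int main()
{
    Rational third = { 1, 3 };                      // exactly 1/3
    Rational sum = add(add(third, third), third);   // exactly 1, no rounding
    std::cout << sum.num << '/' << sum.den << '\n'; // prints 1/1
    return 0;
}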
 
James Kanze

If calculators expressed numbers in binary form, they would
have the general problem you allude to. I think they use a
4-bit group to represent a single decimal digit, and the
exponent is similarly represented as a separate, but
associated, 2-decimal-digit datum. IOW, it is not what a
computer person would recognize as a floating point form.

It sounds like classical floating point to me. I don't know of
any modern machines which use decimal (although IBM mainframes
use base 16, and Unisys mainframes base 8), but they've
certainly existed in the past (e.g. IBM 1401). The C++ standard
references the C standard for this---see §5.2.4.2.2 in the C
standard.
It is closer to what is called binary coded decimal, but it is
not that either.

It corresponds exactly to BCD: the number is broken down into
four bit blocks, each of which may take on a value of 0 to 9.
But you can get a better idea of what I am talking about by
looking up BCD, on Wikipedia say. Using 4 bits to represent
only 10 digits is wasteful, but so be it.

On most machines, it is also slower. More importantly, it means
that the actual precision varies somewhat according to the
stored value; i.e. it has the same problems as IBM's base 16
format. (I'm not competent enough in numerical processing to
judge myself, but I know that some specialists complain loudly
about this.)
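
Roughly, in code (an illustrative sketch of the packing only, one
decimal digit per 4-bit group):

#include <iostream>

// Pack a non-negative integer into BCD: each decimal digit goes into
// its own nibble, so 100043 becomes the bit pattern 0x100043.
unsigned long to_bcd(unsigned long n)
{
    unsigned long bcd = 0;
    for (int shift = 0; n != 0; shift += 4) {
        bcd |= (n % 10) << shift;   // one digit per nibble
        n /= 10;
    }
    return bcd;
}

int main()
{
    std::cout << std::hex << to_bcd(100043) << '\n';   // prints 100043
    return 0;
}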
 
Martin Eisenberg

Actually they fudge it for your viewing pleasure by displaying
results rounded to fewer digits than they carry internally. To be
fair, there is a solid reason for that arrangement -- it stands a
reasonable chance of keeping rounding errors, inevitably accumulating
over a chain of operations, out of the numbers that engineers end up
copying to their notebooks.
In practice, even simple division is a problem. You can only
store 1/n precisely in a finite number of bits if the base being
used is n or a multiple of n. In order to guarantee exactness,
you'd have to use some sort of rational representation.

Note that 10 is a multiple of 2, so with enough bits, you can
store any decimal representation.

You've used your own previous statement in the wrong direction.
Consider that 0.1 (dec) is periodic in binary.


Martin
 
Victor Bazarov

Carson said:
would it be practical to use two int values, do you think?

That's what the algorithms based on rational numbers do. No way to
precisely represent Pi or e or the square root of 2 on those, however.
Like in a class--one for the fractional part, and one for the whole-
number part--
that way I suppose it would be possible to avoid rounding errors, since
the fractional part wouldn't really be treated as a fractional part,
but rather as a regular integer--you'd just have to worry about
handling the math and the behavior of the <1 portion of it yourself,
which would (I can imagine) be slow.

However I don't really understand how 0.1 could not be represented...

Try calculating the binary representation of it. It's a good exercise.
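
Done as long division, for instance (a small sketch; twenty digits are
enough to see the repeating pattern):

#include <iostream>

int main()
{
    // 0.1 is 1/10; each binary digit comes from doubling the remainder
    // and checking whether it reaches the denominator.  The remainders
    // cycle (2, 4, 8, 6, 2, 4, ...), so the digits repeat forever and
    // no finite number of bits is ever exact.
    int num = 1, den = 10;
    std::cout << "0.";
    for (int i = 0; i < 20; ++i) {
        num *= 2;
        std::cout << num / den;   // next binary digit
        num %= den;
    }
    std::cout << "...\n";         // prints 0.00011001100110011001...
    return 0;
}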

V
 
