Float precision (0.5678f becomes 0.56779998540878)

Mad Butch · Sep 17, 2003

void Test()
{
float fValue = 0.5678f; // Value is 0.567800
double dValue = (double)fValue; // Value is 0.56779998540878
}

Is there anyway I can round a float at 6 positions behind the decimal?
That way the value would be 0.567800 again?

Gianni Mariani · Sep 17, 2003

Mad said:
void Test()
{
float fValue = 0.5678f; // Value is 0.567800
double dValue = (double)fValue; // Value is 0.56779998540878
}

Is there anyway I can round a float at 6 positions behind the decimal?
That way the value would be 0.567800 again?

try this:

#include <iostream>
#include <cmath>

int main()
{
float fValue = 0.5678f;
double dValue = (double)fValue;

std::cout.precision(15);
std::cout.width(20);

std::cout << dValue << std::endl;

dValue = std::floor( 1e6l * fValue + 0.5l ) / 1e6l;

std::cout.precision(15);
std::cout.width(20);

std::cout << dValue << std::endl;
}

Howard · Sep 17, 2003

Gianni Mariani said:
try this:

#include <iostream>
#include <cmath>

int main()
{
float fValue = 0.5678f;
double dValue = (double)fValue;

std::cout.precision(15);
std::cout.width(20);

std::cout << dValue << std::endl;

dValue = std::floor( 1e6l * fValue + 0.5l ) / 1e6l;

std::cout.precision(15);
std::cout.width(20);

std::cout << dValue << std::endl;
}

Without trying that code out, it appears that it *writes out* a result that
looks like what the OP was asking for.

However, that is not what I got from the OP's question. There was no "cout"
in his code. I *think* what was being asked was how to actually get the
value stored in memory to have the same value as a double that it had as a
float. And the answer to that is, in general, you can't!

Floating-point values are stored in binary, not decimal, and there (in most
cases) no way to get a specific floating-point value accurate to some number
of *decimal* places. It can only be accurate to some number of *binary*
digits, because it's binary, not decimal.

One should, in general, never rely on a floating-point number to be exactly
anything (except perhaps a whole number, such as zero).

-Howard

Kevin Goodsell · Sep 17, 2003

Howard said:
One should, in general, never rely on a floating-point number to be exactly
anything (except perhaps a whole number, such as zero).

There's usually a pretty broad range of contiguous integers that can be
exactly represented, but as far as I know there's no guarantee of this
in the standard.

As a specific example, the IEEE format that is used on a number of
systems (including Intel x86 systems) has doubles that can represent
every integer from something like -9,007,199,254,740,992 to
+9,007,199,254,740,992 (I worked that out myself, so I may have made a
mistake). Outside that range, I don't think you can find two consecutive
integers that can be precisely represented.

-Kevin

Jack Klein · Sep 18, 2003

There's usually a pretty broad range of contiguous integers that can be
exactly represented, but as far as I know there's no guarantee of this
in the standard.

Sure there is. The C++ standard specifically inherits <float.h> from
ISO C, with either that name or <climits>, and makes the following
normative: ISO C subclause 7.1.5, 5.2.4.2.2, 5.2.4.2.1.

DBL_DIG, FLT_DIG, and LDBL_DIG are required to be available and have
the same meaning they have in C, namely the maximum number of decimal
digits a whole number can contain and be guaranteed to be converted to
the floating point type and back again with the original value.

These might be available in <limits> as well, I haven't looked it up.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++ ftp://snurse-l.org/pub/acllc-c++/faq

Kevin Goodsell · Sep 18, 2003

Jack said:
Sure there is. The C++ standard specifically inherits <float.h> from
ISO C, with either that name or <climits>, and makes the following
normative: ISO C subclause 7.1.5, 5.2.4.2.2, 5.2.4.2.1.

DBL_DIG, FLT_DIG, and LDBL_DIG are required to be available and have
the same meaning they have in C, namely the maximum number of decimal
digits a whole number can contain and be guaranteed to be converted to
the floating point type and back again with the original value.

These might be available in <limits> as well, I haven't looked it up.

Oh, OK. Most of those macros relating to floating point types are a
mystery to me. Thanks for the correction.

-Kevin

Karl Heinz Buchegger · Sep 18, 2003

Kevin said:
There's usually a pretty broad range of contiguous integers that can be
exactly represented, but as far as I know there's no guarantee of this
in the standard.

As a specific example, the IEEE format that is used on a number of
systems (including Intel x86 systems) has doubles that can represent
every integer from something like -9,007,199,254,740,992 to
+9,007,199,254,740,992 (I worked that out myself, so I may have made a
mistake). Outside that range, I don't think you can find two consecutive
integers that can be precisely represented.

I doubt that.
The only difference between 0.3 and 30 as a floating point number
is the exponent.

0.3 is stored (of course in binary) as 0.3 E 0
30.0 is stored as 0.3 E 2

so if the same mantissa is used and we know that 0.3
cannot be represented exactly by a binary floating point
number, then 30 also cannot be represented exactly.
Of course the real thing is different, since the exponent
usually is not base 10, but base 2, but the principle is the
same.

Rob Williscroft · Sep 18, 2003

Karl Heinz Buchegger wrote in

I doubt that.
The only difference between 0.3 and 30 as a floating point number
is the exponent.

0.3 is stored (of course in binary) as 0.3 E 0

Note that 0.3 can't be stored (exactly) in binary, 3 can and so
can 0.5, 0.25, 0.125 ... etc. I mean binary as base 2 not just
as a collection of bits (which is probably what you meant).

30.0 is stored as 0.3 E 2

so if the same mantissa is used and we know that 0.3
cannot be represented exactly by a binary floating point
number, then 30 also cannot be represented exactly.

0.3 can't be represented exactly as an integer multiplied by a
power of 2, 30 can. My point here is that your argument is
backwards, it should be a floating point value is an integer
multiplied by some other integer (usually 2 but possibly 10)
raised to the power of yet another integer. I.e. you can
inferer things about a non-integer by making reasoned arguments
about integers not the other way around.

Of course the real thing is different, since the exponent
usually is not base 10, but base 2, but the principle is the
same.

I get your point but jack said: "Outside that range, I don't think you
can find two consecutive integers that can be precisely represented."

You missed "...consecutive ..." I think.

Rob.

Karl Heinz Buchegger · Sep 18, 2003

Rob said:
0.3 can't be represented exactly as an integer multiplied by a
power of 2, 30 can. My point here is that your argument is
backwards, it should be a floating point value is an integer
multiplied by some other integer (usually 2 but possibly 10)
raised to the power of yet another integer. I.e. you can
inferer things about a non-integer by making reasoned arguments
about integers not the other way around.

I'm not sure I understand what you are trying to say to me.
My point is (ist's hard to express all of this for me, since
I am not a native english speaker, so be patient)

the mantissa can be seen as the sum of fractions.
Thus you need to find the coefficients for

sizeof( double ) * 8 (assuming 8 bits per byte)
+----
\ 1
\ bit * -----
/ i 2
/ i
+----
i = 0

such that this sum equals the requested floating point number.

eg.
1 1 1 1 1
0.3 = 0 * - + 1 * - + 0 * - + 0 * -- + 1 * -- + ....
2 4 8 16 32

thus the bit sequence for 0.3 starts with 01001....
I don't know, if this repeated summing finally ends up at 0.3
(within the limited bits of double), but if it does not
(replace 0.3 with some other number if necc.), then all
2-multiples (0.6, 1.2, 2.4, 4.8, 9.6, ... ) of that number will
also be not representable.

Ohh. Now I see where I have gone wrong and what you mean.
I started with the some number and tried to end up at an
integer with that dubbling. I tried to base my doubt on the fact
that not all numbers in the range 0 .. 1) are representable by
a sum_of_fractions.
But in fact the opposite is happening.
I should have started with an integer and by halfing that (and
incementing the exponent) I will always end up with a number
in the range 0 .. 1( where such a sum_of_fractions exists.

mantissa exponent
3.0 0
1.5 1
0.75 2

0.75 = 0.5 + 0.25

1 1 2
( - + - ) * 2 = 3.0
2 4

Kevin was right and I was wrong.
Thanks for making me think, although
I should have done that before replying.

Rob Williscroft · Sep 18, 2003

Rob Williscroft wrote in 195.129.110.130:

I get your point but jack said: "Outside that range, I don't think you
can find two consecutive integers that can be precisely represented."

My apologies to Kevin and Jack for some reason I thought Karl was
responding to Jack and not Kevin and attributed Kevin's statments
to Jack.

Rob.

Kevin Goodsell · Sep 18, 2003

Karl said:
I doubt that.
The only difference between 0.3 and 30 as a floating point number
is the exponent.

0.3 is stored (of course in binary) as 0.3 E 0
30.0 is stored as 0.3 E 2

so if the same mantissa is used and we know that 0.3
cannot be represented exactly by a binary floating point
number, then 30 also cannot be represented exactly.
Of course the real thing is different, since the exponent
usually is not base 10, but base 2, but the principle is the
same.

It looks like you've worked it out (I didn't totally follow the
discussion), but let me explain my logic, and show how integers are
represented in IEEE doubles.

First, the IEEE double is 64 bits. 1 sign bit, 11 exponent bits, and 52
mantissa bits. The mantissa actually has one more 'implied' bit - it's
value is always 1, and it is logically placed to the left of the other
52 bits. Now, integer value 1 is represented like this:

1|.0000[...]0000

Where the 1 is not actually stored, but is implied (represented here by
using a vertical bar to separate it from the physical bits). All the
physical mantissa bits are zero ([...] is used so I don't have to type
all 52 bits), and the binary point (represented with a dot) is placed
between the implied 1 bit and the 'real' mantissa bits. The exponent
isn't shown, but it is implied by the position of the binary point (you
run out of mantissa long before you run out of exponent, so there's no
need to watch for exponent overflow in this example).

Moving on, the integer 2 is represented this way:

1|0.0000[...]0000

Only the exponent (position of the binary point) changes. 3 looks like this:

1|1.0000[...]0000

4-9:

1|00.000[...]0000
1|01.000[...]0000
1|10.000[...]0000
1|11.000[...]0000
1|000.00[...]0000
1|001.00[...]0000

If you continue this for a very, very long time, you arrive at this:

1|1111[...]1111.

All the mantissa bits are set to 1. This isn't quite the end, though.
You can add one more to get this:

1|0000[...]0000|0.

Now there's an implied 0 on the right side of the mantissa. Actually,
there's a lot of implied zeros over there, they just haven't come into
play until now. This should give the value I listed as the upper limit
(pow(2, 53)).

At this point, you can't add 1 and get anything different. It would
require flipping the implied 0 bit, which you can't do. You can add 2
and get this, though:

1|0000[...]0001|0.

There are obviously many more integers that can be precisely
represented, but none of them are contiguous - given two contiguous
integers, one must have it's least significant bit set to 1 (in other
words, must be odd), but all integers above this point use an implied 0
bit for their least significant bit. As you go higher, more implied 0
bits are used, making the distance from one exact integer to the next
even greater. For a while, you have only even integers. Then, only
integers that are divisible by 4, then 8, 16, etc. (until you run out of
exponent, a very long time later).

-Kevin

Java MemoryLayout/ValueLayout Questions.	2	Feb 5, 2023
std::stringstream returning back values	8	Jan 9, 2008
Java OpenJDK Floating Point Dare	3	Jan 17, 2023
Truncating decimal places in a float (possibly OT)	2	Jul 18, 2008
Problem with a friend declaration when migrating project	1	Aug 23, 2012
User Input	4	Jan 5, 2008
float and double precision	9	Mar 26, 2009
Howto identify a string value vs. a numeric value in std::string	9	Sep 18, 2007

Float precision (0.5678f becomes 0.56779998540878)

Mad Butch

Gianni Mariani

Howard

Kevin Goodsell

Jack Klein

Kevin Goodsell

Karl Heinz Buchegger

Rob Williscroft

Karl Heinz Buchegger

Rob Williscroft

Kevin Goodsell

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads