convert 32bit numbers to 64bit (or float to double)

Sebastian Gibb · Jun 18, 2010

Hello,

A long time ago I had to use a third-party C application to generate some
numbers. This application saves the numbers as 32-bit (float) values in a
file.
I had to use R (www.r-project.org) to read the files. R imports the
values as 64-bit doubles (R only knows doubles), and spurious digits
appear after the decimal point.
Here is a C++ example that does nearly the same thing:
#include <iostream>
#include <iomanip>
#include <limits>

using namespace std;

int main() {
    float myFloat = 1234.56;
    double myDouble = myFloat;

    cout << setprecision(numeric_limits<float>::digits10) << "myFloat: "
         << myFloat << endl;
    cout << setprecision(numeric_limits<double>::digits10) << "myDouble: "
         << myDouble << endl;

    return 0;
}

output:
myFloat: 1234.56
myDouble: 1234.56005859375

As you can see, spurious digits appear from the fifth position after the
decimal point onward.
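A small sketch of where the extra digits come from, not part of the original program: the rounding happens when 1234.56 is stored in the float, and the later widening to double only exposes the value the float already holds. Assuming IEEE 754 single precision, the nearest float to 1234.56 is 10113516 * 2^-13, which is exactly 1234.56005859375. The helper names `widen`, `survivesRoundTrip` and `checkWidening` are made up for illustration.

```cpp
#include <cassert>

// Hypothetical helper: widen a float to a double (no rounding happens here).
double widen(float value) {
    return static_cast<double>(value);
}

// Hypothetical helper: true if narrowing the widened value gives back the same float.
bool survivesRoundTrip(float value) {
    return static_cast<float>(widen(value)) == value;
}

// Small self-check, assuming float is IEEE 754 single precision.
void checkWidening() {
    const float stored = 1234.56f;                 // rounded to 24 significant bits here
    assert(widen(stored) == 10113516.0 / 8192.0);  // exactly 1234.56005859375
    assert(widen(stored) != 1234.56);              // differs from the double literal
    assert(survivesRoundTrip(stored));             // widening lost nothing
}
```

Calling `checkWidening()` should pass silently on a platform that matches the stated assumption.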
Now I am able to replace the third-party C application with my own R
application. (The algorithm uses double values, too.)
For compatibility reasons I want R to produce the same values as the old C
application. Until now I have used a C binding for R to do the following:

double precision32(double value) {
    float x = value;
    return (double)x;
}

I want to know what happens when I write "double myDouble=myFloat", and how
I can simulate something like that using only double values.
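As a rough illustration of what the widening does at the bit level, here is a sketch that rebuilds the double from the float's bits with integer operations only. It assumes the IEEE 754 binary32 and binary64 layouts (1+8+23 and 1+11+52 bits) and a normal, finite input; zero, subnormals, infinities and NaN are deliberately left out. The names `widenByBits` and `checkWidenByBits` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: build the double that a normal float widens to.
// Assumes IEEE 754 layouts; no handling of zero, subnormal, infinity or NaN.
double widenByBits(float value) {
    std::uint32_t single = 0;
    std::memcpy(&single, &value, sizeof single);

    const std::uint64_t sign     = single >> 31;            // 1 bit
    const std::uint64_t exponent = (single >> 23) & 0xFFu;  // 8 bits, bias 127
    const std::uint64_t fraction = single & 0x7FFFFFu;      // 23 bits

    const std::uint64_t wide = (sign << 63)
                             | ((exponent + (1023 - 127)) << 52)  // re-bias to 11 bits
                             | (fraction << 29);                  // pad with 29 zero bits
    double result = 0.0;
    std::memcpy(&result, &wide, sizeof result);
    return result;
}

// Compare against the built-in conversion for a few normal values.
void checkWidenByBits() {
    assert(widenByBits(1234.56f) == static_cast<double>(1234.56f));
    assert(widenByBits(-0.1f) == static_cast<double>(-0.1f));
}
```

The fraction is only padded with zeros, so no new information appears in the double; the digits that look spurious in decimal are already present in the float.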

Kind regards,

Sebastian

Victor Bazarov · Jun 18, 2010

[...]
I want to know what happens when I write "double myDouble=myFloat", and how
I can simulate something like that using only double values.

I think you will benefit from studying this article:

http://docs.sun.com/source/806-3568/ncg_goldberg.html

When you comprehend everything it presents, review your code and your
approaches, and if you still have some questions, come and ask them.

V

Sebastian Gibb · Jun 26, 2010

Victor said:
[...]
I want to know what happens when I write "double myDouble=myFloat", and how
I can simulate something like that using only double values.

I think you will benefit from studying this article:

http://docs.sun.com/source/806-3568/ncg_goldberg.html

When you comprehend everything it presents, review your code and your
approaches, and if you still have some questions, come and ask them.

V

Hello,

After reading the article "What Every Computer Scientist Should Know About
Floating-Point Arithmetic" by Mr. Goldberg I updated my code.
I don't understand why it works only partially.
I tested some floating-point numbers, and only half of them are converted
correctly:
18.4 -> correct
999.4813232421875 -> correct
1/3 -> not working
0.1 -> not working

What am I doing wrong?
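One way to look at the four test values, offered as an observation and not as a diagnosis confirmed in the thread: the cast rounds to the nearest float, which can lie below or above the original double. Assuming IEEE 754 single precision and the default round-to-nearest mode, 18.4 rounds down and 999.4813232421875 is exactly representable, whereas 1/3 and 0.1 round up. A conversion that always discards the extra bits would agree with the cast only in the first two cases. The helper names below are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: -1 if the nearest float lies below the double,
// 0 if the double is exactly representable as a float, +1 if it lies above.
int floatRoundingDirection(double value) {
    const double viaFloat = static_cast<double>(static_cast<float>(value));
    if (viaFloat < value) return -1;
    if (viaFloat > value) return 1;
    return 0;
}

// The four values from the post, assuming IEEE 754 round-to-nearest.
void checkRoundingDirections() {
    assert(floatRoundingDirection(18.4) == -1);
    assert(floatRoundingDirection(999.4813232421875) == 0);
    assert(floatRoundingDirection(1.0 / 3.0) == 1);
    assert(floatRoundingDirection(0.1) == 1);
}
```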

Kind regards,

Sebastian

You can find my code at: http://pastebin.com/yczrW8br

Sebastian Gibb · Jun 26, 2010

Hello,

Shorten it to the minimum and post it here. Most of us don't click on
links. At least I don't.

Sorry, I thought nobody would read long code without syntax highlighting.
#include <cmath>
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std;

// some constants from IEEE 754
const int nBitsSingleMantissa = 23;
const int nBitsSingleExzess = 8;
const int nBitsDoubleMantissa = 52;
const int nBitsDoubleExzess = 11;

// old method used by another C++ application
// it is my reference method
double convertWithCast(double value) {
    float x = value;
    return (double)x;
}

// try to simulate the same behaviour without using floats
struct IEEEBinary {
    int signedBit;
    vector<int> exzess;
    vector<int> mantissa;
};

// return a copy of x with its elements in reverse order
vector<int> swapVectorOrder(const vector<int>& x) {
    vector<int> y;
    for (int i=x.size()-1; i >= 0; --i) {
        y.push_back(x[i]);
    }
    return y;
}

// exponent bias ("Exzess") for an exponent field of nEBits bits
double calcExzess(int nEBits) {
    return pow(2, nEBits-1)-1;
}

IEEEBinary double2binary(double x, int nMBits, int nEBits) {
    // calculate mantissa
    // before point
    int pre = floor(abs(x));
    vector<int> preMantissa;

    while (pre != 0) {
        preMantissa.push_back(pre % 2);
        pre = floor(pre/2.0);
    }

    if (preMantissa.size() > 1)
        preMantissa = swapVectorOrder(preMantissa);

    // after point
    double post = x - floor(x);
    vector<int> postMantissa;
    for (unsigned int i=0; i<2*nMBits; ++i) {
        post = post * 2;
        int pre = floor(post);
        postMantissa.push_back(pre);
        post -= pre;
    }

    vector<int> mantissa = preMantissa;
    mantissa.insert(mantissa.end(), postMantissa.begin(), postMantissa.end());

    // normalize
    vector<int>::iterator it;

    for (it = mantissa.begin(); it != mantissa.end(); ++it) {
        if (*it == 1)
            break;
    }
    // save size for exzess calc
    unsigned int sMantissa = mantissa.size();
    // remove leading zeros and first 1
    it = mantissa.erase(mantissa.begin(), (it+1));
    // save new size for exzess calc
    unsigned int sMantissa2 = mantissa.size();

    // round
    if (mantissa.at(nMBits+1) == 1) {
        mantissa.at(nMBits) = 1;
    }

    // cut
    mantissa.erase(it+nMBits, mantissa.end());
    //mantissa.erase(mantissa.end());

    // exzess
    int ex = calcExzess(nEBits) + preMantissa.size() - (sMantissa-sMantissa2);

    vector<int> exzess;

    while (ex != 0) {
        exzess.push_back(ex % 2);
        ex = floor(ex/2.0);
    }

    // append zeros to exzess
    if (exzess.size() < nEBits) {
        for (unsigned int i=exzess.size(); i<nEBits; ++i)
            exzess.push_back(0);
    }

    exzess = swapVectorOrder(exzess);

    // signed bit
    int signedBit = 0;

    if (x < 0) {
        signedBit = 1;
    }

    // build binary struct
    IEEEBinary bin;
    bin.signedBit = signedBit;
    bin.mantissa = mantissa;
    bin.exzess = exzess;

    return bin;
}

double binary2double(const IEEEBinary& binary) {
    int exzess = 0;

    for (unsigned int i = 0; i < binary.exzess.size(); ++i)
        exzess += binary.exzess[i]*pow(2, binary.exzess.size()-(i+1));

    exzess -= calcExzess(binary.exzess.size());

    double value = pow(2, exzess);

    for (unsigned int i = 0; i < binary.mantissa.size(); ++i) {
        value += binary.mantissa[i]*pow(2, exzess-(int)(i+1));
    }

    if (binary.signedBit == 1)
        value *= (-1);

    return value;
}

// wrapper function
double convertWithoutCast(double value) {
    return binary2double(double2binary(value, nBitsSingleMantissa,
                                       nBitsSingleExzess));
}

int main() {
    vector<double> testValues;
    testValues.push_back(1.0/3.0);
    testValues.push_back(18.4);
    testValues.push_back(0.1);
    testValues.push_back(999.4813232421875);

    for (vector<double>::iterator it=testValues.begin();
         it != testValues.end(); ++it) {
        double oldConv = convertWithCast(*it);
        double newConv = convertWithoutCast(*it);

        if (oldConv != newConv) {
            cout << setprecision(22) << *it << ": " << oldConv << " != "
                 << newConv << endl;
        }
    }

    return 0;
}

// the output:
0.3333333333333333148296: 0.3333333432674407958984 != 0.3333333134651184082031
0.1000000000000000055511: 0.1000000014901161193848 != 0.09999999403953552246094

// it works for
18.4 and 999.4813232421875

I think I am doing something wrong, because the old method with a typical
C-style cast returns a different value than my new method without the cast.
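A possible reading of the mismatch, stated as an assumption and not confirmed anywhere in the thread: the rounding step tests `mantissa.at(nMBits+1)` and writes `mantissa.at(nMBits)`, and both positions are removed by the cut that follows, so the result appears to be truncated instead of rounded. The two failing outputs are indeed smaller than the cast results. Below is a minimal sketch of rounding a double to a 24-bit significand using double arithmetic only. It relies on the C++11 functions `std::isfinite` and `std::nearbyint`, assumes the default round-to-nearest mode, and does not handle results outside float's normal range (overflow and subnormals). The names `roundToSinglePrecision` and `matchesCast` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: round a double to the nearest value with a 24-bit
// significand (1 implicit + 23 stored bits), using doubles only.
// Assumes round-to-nearest mode; overflow and subnormals are not handled.
double roundToSinglePrecision(double value) {
    if (value == 0.0 || !std::isfinite(value)) {
        return value;
    }
    int exponent = 0;
    // value == fraction * 2^exponent with 0.5 <= |fraction| < 1
    const double fraction = std::frexp(value, &exponent);
    // Scale so that exactly 24 significant bits sit left of the binary point.
    const double scaled = std::ldexp(fraction, 24);
    // Round to an integer; ties go to even in the default rounding mode.
    const double rounded = std::nearbyint(scaled);
    return std::ldexp(rounded, exponent - 24);
}

// True if the sketch agrees with the reference cast for this value.
bool matchesCast(double value) {
    return roundToSinglePrecision(value)
        == static_cast<double>(static_cast<float>(value));
}

void checkAgainstCast() {
    assert(matchesCast(1.0 / 3.0));
    assert(matchesCast(18.4));
    assert(matchesCast(0.1));
    assert(matchesCast(999.4813232421875));
}
```

The scaling by powers of two is exact, so the only rounding in this sketch happens in `std::nearbyint`, which is the single step that differs from plain truncation.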

Kind regards,

Sebastian

Sebastian Gibb · Jun 27, 2010

Hello,

Victor said:
Apparently it either contains hardware-specific code (which I don't see
right away) or contains a logical error (for which, while on vacation, I
really don't care to search) - when I took your code and tried debugging
it with VC10, I got first of all some errors I needed to correct (mostly
the use of an ambiguous 'pow'), and second of all, a debugging assertion
failed in one of the functions, the iterator was out of bounds.

I use g++ 4.4.1 and get no warnings caused by 'pow'. (g++ -Wall ...)

Your code is overly complex, I believe. And it doesn't seem to contain
any test cases. Consider writing test cases, like expecting a zeroed
mantissa with a power of 2, and a particular mantissa. When you split
your number into the mantissa and "exzess" (exponent), you really need
to make sure your splitting code works right before relying on it for
your "conversion".

Thank you for your advice. I will add some test cases and hope to find the
logical error.
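Test cases of the kind suggested above might look like the following sketch. It does not test the posted `double2binary`; it uses `std::frexp` as an independent way to split a positive, finite, normal value into mantissa and exponent, so the expected values can be cross-checked. The helper names `storedMantissa`, `unbiasedExponent` and `runSplitTests` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: the fraction bits stored after the implicit leading 1,
// truncated to nBits bits, for a positive, finite, normal value.
long storedMantissa(double value, int nBits) {
    int exponent = 0;
    const double fraction = std::frexp(value, &exponent);  // in [0.5, 1)
    const double afterLeadingOne = fraction * 2.0 - 1.0;   // in [0, 1)
    return static_cast<long>(std::floor(std::ldexp(afterLeadingOne, nBits)));
}

// Hypothetical helper: the unbiased exponent e in value == 1.m * 2^e.
int unbiasedExponent(double value) {
    int exponent = 0;
    std::frexp(value, &exponent);
    return exponent - 1;
}

void runSplitTests() {
    assert(storedMantissa(8.0, 23) == 0);           // power of two: zeroed mantissa
    assert(unbiasedExponent(8.0) == 3);             // 8 == 1.0 * 2^3
    assert(storedMantissa(1.5, 23) == (1L << 22));  // binary 1.1: only the top bit set
    assert(unbiasedExponent(0.75) == -1);           // 0.75 == 1.5 * 2^-1
}
```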

Kind regards,

Sebastian

Jorgen Grahn · Jun 27, 2010

Hello,

I use g++ 4.4.1 and get no warnings caused by 'pow'. (g++ -Wall ...)

Note that g++ -Wall does NOT mean "enable all warnings". See your
documentation for details.

That said, all I got from gcc 4.3.2 was a few signedness warnings, no
matter which flags I added.
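The signedness warnings are presumably about comparisons between a signed and an unsigned integer, such as `i<2*nMBits` with an `unsigned int` counter and an `int` bound; that is a guess, since the warning text is not quoted. With g++, `-Wsign-compare` is part of `-Wall` for C++, while flags such as `-Wextra` and `-Wconversion` have to be requested separately. A sketch of one way to avoid that kind of comparison, using hypothetical helpers `zeroBits` and `checkZeroBits`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper with the same loop shape as the listing:
// a counter compared against twice a bit count.
std::vector<int> zeroBits(int nBits) {
    // Convert the signed count once, so the loop compares two unsigned values.
    const std::size_t count =
        nBits > 0 ? 2 * static_cast<std::size_t>(nBits) : 0;
    std::vector<int> bits;
    for (std::size_t i = 0; i < count; ++i) {
        bits.push_back(0);
    }
    return bits;
}

void checkZeroBits() {
    assert(zeroBits(23).size() == 46);  // 2 * 23 entries
    assert(zeroBits(-1).empty());       // negative counts yield nothing
}
```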

/Jorgen
