Semi OT: Binary representation of floating point numbers

Dilip

Hello Wise men

This question is purely academic as I am just trying to understand this
stuff a little better. Therefore please feel free to direct me to other
suitable NGs if necessary.

Recently I started a thread (at comp.lang.c) on converting a single
precision floating point number to double precision after running into
a problem at work. Little did I realize it would spawn close to 100
replies. Needless to say I learnt a lot in the discussion.

Considering I will probably run into many such problems in the future I
set out to study the IEEE 754 representation of these numbers. I have
understood it to an extent but some gaps remain. I would be extremely
grateful if one of you more knowledgeable types can set me straight.

What I understand so far: (for single precision)

IEEE says the following is the bit pattern of such numbers:

<1 bit for sign><8 bits for exponent><23 bits for significand>

Mathematically a normalized binary floating point number is represented
as:

(-1)^s * 1.f * 2^(e-127)

where s is the sign bit
f is the 23 bits of significand **fraction**
and e is the biased exponent which can range from 0 to 255

So given a number that is represented like this:

1.00000000000000000000001 * 2^18

which has 22 zeros and then a final 1 (23 fraction bits) after the
binary point.

we turn that into:

1000000000000000000.00001

To convert that into decimal:

1 * 2^18 + 0 + ... + 0 + 1 * 1/2^5 which is 262144 + 0.03125
which is
262,144.03125

In other words the bit pattern for such a number will look like this:

0 10010001 00000000000000000000001

(sign bit, exponent and significand fraction) [exponent part is
10010001 because e-127 = 18 and hence e = 145 in decimal]

which in hex is 0x48800001

So far so good?

Now my question is: how do I go in the reverse direction?

Given a number 59.889999 how do I retrace back to what its binary
representation looks like?

I seem to be missing a step in between.

thanks!
 

Joe Wright

Dilip said:
[...]
Now my question is: how do I go in the reverse direction?

Given a number 59.889999 how do I retrace back to what its binary
representation looks like?

Our usual IEEE float consists of a 24-bit mantissa, an 8-bit exponent
and a one-bit sign. Yes, I know that's 33 bits but bear with me a while.
The mantissa is assumed to be normalized which means the msb of its
value is shifted into the msb of the mantissa (b23). Well, if b23 is
always 1 there is no reason to inspect it. We'll assume it's 1 and
assign the low order exponent bit to b23. Because of this trick we get
24 mantissa bits and 8 exponent bits into 31 bits. The sign is still its
own bit.

The exponent field is unsigned. Because this convention counts the
hidden bit as part of a mantissa of the form .1f, the bias works out to
126 rather than the usual 127.

Regard the value 59.

b31 (sign) is positive
01000010 01101100 00000000 00000000 The binary of the float itself.
Exp = 132 (6) b30..b23 is the exponent. 132 - 126 = 6
00000110
Man = .11101100 00000000 00000000 Mantissa with b23 forced 1.
Now shifting the mantissa left 6 (the exponent) we get..
111011.00 which is 59 or..
5.90000000e+01

Note your 59.89

As float precise to 24 bits.
01000010 01101111 10001111 01011100
Exp = 132 (6)
00000110
Man = .11101111 10001111 01011100
5.98899994e+01

Converted to double, still precise to 24 bits. Conversion does not add
precision.
01000000 01001101 11110001 11101011 10000000 00000000 00000000 00000000
Exp = 1028 (6)
000 00000110
Man = .11101 11110001 11101011 10000000 00000000 00000000 00000000
5.9889999389648438e+01

Now with the x*=1000, x/=1000 trick. Added precision is an illusion?
01000000 01001101 11110001 11101011 10000101 00011110 10111000 01010010
Exp = 1028 (6)
000 00000110
Man = .11101 11110001 11101011 10000101 00011110 10111000 01010010
5.9890000000000001e+01
 

Joe Wright

Joe said:
[...]
I forgot to note that for the double, the sign is 1 bit, the exponent
is 11 bits and the mantissa is 53 bits. In the same convention as above,
the exponent bias is 1022.
 

Julian V. Noble

Dilip said:
[...]

Have a look at "What every computer scientist should know about floating
point arithmetic" by D. Goldberg. It's available on the web in various
formats. You can download goldberg.pdf from my page

http://galileo.phys.virginia.edu/classes/551.jvn.fall01/
 

Dik T. Winter

> Now my question is given how do I go in the reverse direction?
>
> Given a number 59.889999 how do I retrace back to what its binary
> representation looks like?

First convert it to binary. You will find that it does not have a finite
binary expansion, but it does repeat if you go far enough. When
converting it to float you initially need only 24 + a few extra bits.
The following 29 will do nicely:
111011.11100011110101101111100
If you truncate that to 24 significant bits (a 23-bit fraction) you get:
111011.111000111101011011
which is too small; the next larger representable number is:
111011.111000111101011100
which is too large, but the second is closer to the original than the
first, so the second is what the conversion produces.
 

christian.bau

Dilip said:
Mathematically a normalized binary floating point number is represented
as:

(-1)^s * 1.f * 2^(e-127)

where s is the sign bit
f is the 23 bits of significand **fraction**
and e is the biased exponent which can range from 0 to 255
Given a number 59.889999 how do I retrace back to what its binary
representation looks like?

To do this by hand: First you want (-1)^s, 1.f and (e-127). With the
example 59.889999:

Clearly (-1)^s = 1, so s = 0.
Since 2^5 <= 59.889999 < 2^6, e-127 equals 5, or e = 132.
1.f * 2^5 = 59.889999, or 1.f = 59.889999 / 2^5.
1.f is a multiple of 2^-23, so 2^23 * (1.f) is an integer.

2^23 * 1.f is the integer nearest to (59.889999 / 2^5) * 2^23 which is
about 15699803.898, so 2^23 * 1.f equals 15,699,804. Write this as a 24
bit binary number; the first binary bit must be a 1.

You need to be careful if you start with a value just below a power of
two, like 63.999999: The value 2^23 * 1.f that you calculate is always
less than 2^24, but rounding to the nearest integer could round it to
2^24 which would make 1.f = 2. In that case you have to choose e one
higher, and 1.f = 1.
 

David T. Ashley

Dilip said:
Given a number 59.889999 how do I retrace back to what its binary
representation looks like?

I'm assuming you can calculate the binary representation of the integer
"59".

Since you know where the radix point is, you also know the exponent and the
number of bits you have after the radix point to express the fractional
part.

Assume you have 10 bits after the radix point ...

Then you just solve:

889999 / 1000000 = x / 2^10

for x, round to the nearest integer, and convert x to binary.
 

Chris Torek

Considering I will probably run into many such problems in the future I
set out to study the IEEE 754 representation of these numbers. I have
understood it to an extent but some gaps remain. ...

You have the basics down, it seems. Others have already covered
the basics on conversions from decimal representations.

My article at <http://web.torek.net/torek/c/numbers.html> is
intentionally somewhat lightweight but may also be helpful. In
particular, note the special cases for zero, "sub-normal" or
"denormalized" numbers, Infinities, and NaNs.
 

Ernie Wright

Dilip said:
Now my question is given how do I go in the reverse direction?

Given a number 59.889999 how do I retrace back to what its binary
representation looks like?

Despite the impression you might get from some of the answers, it's
actually fairly easy to do this, and it doesn't require doing any binary
arithmetic, or in fact anything you can't do pretty quickly with an
ordinary calculator.

Think of a floating-point number f as

f = (-1)^s * (1 + m/2^b) * 2^e

where s is a sign bit, m is (part of) the mantissa or significand, e is
the exponent, and b is the number of bits of precision. For 32-bit IEEE
floats, b is 23. We start out by finding s, m, and e and worry about
how they fit into the binary representation later.

1. Find the sign. Trivial, but by peeling this off first, the rest of
the work can be done using the absolute value of f.

s = SIGN(f) 1 if negative, otherwise 0
f = |f|

2. Find the base-2 log of f.

Most calculators don't have a button for this, but it's easy to do with
the log buttons they do have.

log2(f) = log(f) / log(2) = 5.9042432031146...

3. Round this DOWN to find the exponent e.

e = 5

This is the largest power of 2 that's less than f.

4. To find the significand, divide f by 2^e.

59.889999 / 32 = 1.87156246875

At this point you have (the absolute value of) f in the form of a
significand between 1.0 and 2.0, multiplied by a power of 2.

1.87156246875 * 2^5

Now you have to represent each component as an integer and pack them
into 32 bits.

5. The sign is the single bit s.

6. The exponent is 8 bits equal to e + 127.

7. The significand integer m is 23 bits:

m = round(( significand - 1 ) * 2^23 )
= round(( 1.87156246875 - 1 ) * 2^23 )
= round( 7311195.897856 )
= 7311196

If, after rounding, m == 2^23, set m = 0 and increment e.

At this point, you can check that you've got the right s, e, and m by
plugging them back into the formula.

1 * (1 + 7311196 / 8388608) * 32 = 59.8899993896484375

8. Finally, pack these three integers into 32 bits.

s << 31 | (e + 127) << 23 | m

On a calculator, multiply and add instead of bit-shift and bitwise-or.

s * 2^31 + ((e + 127) * 2^23) + m

The procedure for doubles is nearly identical. Use b = 52 and build the
bit pattern using

s << 63 | (e + 1023) << 52 | m

- Ernie http://home.comcast.net/~erniew
 
