Semi OT: Binary representation of floating point numbers

Discussion in 'C Programming' started by Dilip, Dec 27, 2006.

  1. Dilip

    Dilip Guest

    Hello Wise men

    This question is purely academic as I am just trying to understand this
    stuff a little better. Therefore please free to instruct me to other
    suitabe NGs if necessary.

    Recently I started a thread (at comp.lang.c) on converting a single
    precision floating point number to double precision after running into
    a problem at work. Little did I realize it would spawn close to 100
    replies. Needless to say I learnt a lot in the discussion.

    Considering I will probably run into many such problems in the future I
    set out to study the IEEE 754 representation of these numbers. I have
    understood it to an extent but some gaps remain. I would be extremely
    grateful if one of you more knowledgeable types can set me straight.

    What I understand so far: (for single precision)

    IEEE says the following is the bit pattern of such numbers:

    <1 bit for sign><8 bits for exponent><23 bits for significand>

    Mathematically a normalized binary floating point number is represented
    as:

    (-1)^s * 1.f * 2^(e-127)

    where s is the sign bit
    f is the 23 bits of significand **fraction**
    and e is the biased exponent which can range from 0 to 255

    So given a number that is represented like this:

    1.00000000000000000000001 * 2^18

    which has 23 zeros following the binary decimal point.

    we turn that into:

    1000000000000000000.00001

    To convert that into decimal:

    1 * 2^18 +...... +..... +..... + 1 * 1/2^5 which is 262144 + 0.03125
    which is basically
    262,144.03125

    In other words the bit pattern for such a number will look like this:

    0 10010001 00000000000000000000001

    (sign bit, exponent and significand fraction) [exponent part is
    10010001 because e-127 = 18 and hence e = 145 in decimal]

    which in hex is 0x48800001

    So far so good?

    Now my question is given how do I go in the reverse direction?

    Given a number 59.889999 how do I retrace back to what its binary
    representation looks like?

    I seem to be missing a step in between.

    thanks!
     
    Dilip, Dec 27, 2006
    #1
    1. Advertising

  2. Dilip

    Joe Wright Guest

    Dilip wrote:
    > Hello Wise men
    >
    > This question is purely academic as I am just trying to understand this
    > stuff a little better. Therefore please free to instruct me to other
    > suitabe NGs if necessary.
    >
    > Recently I started a thread (at comp.lang.c) on converting a single
    > precision floating point number to double precision after running into
    > a problem at work. Little did I realize it would spawn close to 100
    > replies. Needless to say I learnt a lot in the discussion.
    >
    > Considering I will probably run into many such problems in the future I
    > set out to study the IEEE 754 representation of these numbers. I have
    > understood it to an extent but some gaps remain. I would be extremely
    > grateful if one of you more knowledgeable types can set me straight.
    >
    > What I understand so far: (for single precision)
    >
    > IEEE says the following is the bit pattern of such numbers:
    >
    > <1 bit for sign><8 bits for exponent><23 bits for significand>
    >
    > Mathematically a normalized binary floating point number is represented
    > as:
    >
    > (-1)^s * 1.f * 2^(e-127)
    >
    > where s is the sign bit
    > f is the 23 bits of significand **fraction**
    > and e is the biased exponent which can range from 0 to 255
    >
    > So given a number that is represented like this:
    >
    > 1.00000000000000000000001 * 2^18
    >
    > which has 23 zeros following the binary decimal point.
    >
    > we turn that into:
    >
    > 1000000000000000000.00001
    >
    > To convert that into decimal:
    >
    > 1 * 2^18 +...... +..... +..... + 1 * 1/2^5 which is 262144 + 0.03125
    > which is basically
    > 262,144.03125
    >
    > In other words the bit pattern for such a number will look like this:
    >
    > 0 10010001 00000000000000000000001
    >
    > (sign bit, exponent and significand fraction) [exponent part is
    > 10010001 because e-127 = 18 and hence e = 145 in decimal]
    >
    > which in hex is 0x48800001
    >
    > So far so good?
    >
    > Now my question is given how do I go in the reverse direction?
    >
    > Given a number 59.889999 how do I retrace back to what its binary
    > representation looks like?
    >
    > I seem to be missing a step in between.
    >
    > thanks!
    >


    Our usual IEEE float consists of a 24-bit mantissa, an 8-bit exponent
    and a one-bit sign. Yes, I know that's 33 bits but bear with me a while.
    The mantissa is assumed to be normalized which means the msb of its
    value is shifted into the msb of the mantissa (b23). Well, if b23 is
    always 1 there is no reason to inspect it. We'll assume it's 1 and
    assign the low order exponent bit to b23. Because of this trick we get
    24 mantissa bits and 8 exponent bits into 31 bits. The sign is still its
    own bit.

    The exponent is said to be unsigned but for sake of argument it is
    biased to a value of 126.

    Regard the value 59.

    b31 (sign) is positive
    01000010 01101100 00000000 00000000 The binary of the float itself.
    Exp = 132 (6) b30..b23 is the exponent. 132 - 126 = 6
    00000110
    Man = .11101100 00000000 00000000 Mantissa with b23 forced 1.
    Now shifting the mantissa left 6 (the exponent) we get..
    111011.00 which is 59 or..
    5.90000000e+01

    Note your 59.89

    As float precise to 24 bits.
    01000010 01101111 10001111 01011100
    Exp = 132 (6)
    00000110
    Man = .11101111 10001111 01011100
    5.98899994e+01

    Converted to double, still precise to 24 bits. Conversion does not add
    precision.
    01000000 01001101 11110001 11101011 10000000 00000000 00000000 00000000
    Exp = 1028 (6)
    000 00000110
    Man = .11101 11110001 11101011 10000000 00000000 00000000 00000000
    5.9889999389648438e+01

    Now with the x*=1000, x/=1000 trick. Added precision is an illusion?
    01000000 01001101 11110001 11101011 10000101 00011110 10111000 01010010
    Exp = 1028 (6)
    000 00000110
    Man = .11101 11110001 11101011 10000101 00011110 10111000 01010010
    5.9890000000000001e+01

    --
    Joe Wright
    "Everything should be made as simple as possible, but not simpler."
    --- Albert Einstein ---
     
    Joe Wright, Dec 28, 2006
    #2
    1. Advertising

  3. Dilip

    Joe Wright Guest

    Joe Wright wrote:
    > Dilip wrote:
    >> Hello Wise men
    >>
    >> This question is purely academic as I am just trying to understand this
    >> stuff a little better. Therefore please free to instruct me to other
    >> suitabe NGs if necessary.
    >>
    >> Recently I started a thread (at comp.lang.c) on converting a single
    >> precision floating point number to double precision after running into
    >> a problem at work. Little did I realize it would spawn close to 100
    >> replies. Needless to say I learnt a lot in the discussion.
    >>
    >> Considering I will probably run into many such problems in the future I
    >> set out to study the IEEE 754 representation of these numbers. I have
    >> understood it to an extent but some gaps remain. I would be extremely
    >> grateful if one of you more knowledgeable types can set me straight.
    >>
    >> What I understand so far: (for single precision)
    >>
    >> IEEE says the following is the bit pattern of such numbers:
    >>
    >> <1 bit for sign><8 bits for exponent><23 bits for significand>
    >>
    >> Mathematically a normalized binary floating point number is represented
    >> as:
    >>
    >> (-1)^s * 1.f * 2^(e-127)
    >>
    >> where s is the sign bit
    >> f is the 23 bits of significand **fraction**
    >> and e is the biased exponent which can range from 0 to 255
    >>
    >> So given a number that is represented like this:
    >>
    >> 1.00000000000000000000001 * 2^18
    >>
    >> which has 23 zeros following the binary decimal point.
    >>
    >> we turn that into:
    >>
    >> 1000000000000000000.00001
    >>
    >> To convert that into decimal:
    >>
    >> 1 * 2^18 +...... +..... +..... + 1 * 1/2^5 which is 262144 + 0.03125
    >> which is basically
    >> 262,144.03125
    >>
    >> In other words the bit pattern for such a number will look like this:
    >>
    >> 0 10010001 00000000000000000000001
    >>
    >> (sign bit, exponent and significand fraction) [exponent part is
    >> 10010001 because e-127 = 18 and hence e = 145 in decimal]
    >>
    >> which in hex is 0x48800001
    >>
    >> So far so good?
    >>
    >> Now my question is given how do I go in the reverse direction?
    >>
    >> Given a number 59.889999 how do I retrace back to what its binary
    >> representation looks like?
    >>
    >> I seem to be missing a step in between.
    >>
    >> thanks!
    >>

    >
    > Our usual IEEE float consists of a 24-bit mantissa, an 8-bit exponent
    > and a one-bit sign. Yes, I know that's 33 bits but bear with me a while.
    > The mantissa is assumed to be normalized which means the msb of its
    > value is shifted into the msb of the mantissa (b23). Well, if b23 is
    > always 1 there is no reason to inspect it. We'll assume it's 1 and
    > assign the low order exponent bit to b23. Because of this trick we get
    > 24 mantissa bits and 8 exponent bits into 31 bits. The sign is still its
    > own bit.
    >
    > The exponent is said to be unsigned but for sake of argument it is
    > biased to a value of 126.
    >
    > Regard the value 59.
    >
    > b31 (sign) is positive
    > 01000010 01101100 00000000 00000000 The binary of the float itself.
    > Exp = 132 (6) b30..b23 is the exponent. 132 - 126 = 6
    > 00000110
    > Man = .11101100 00000000 00000000 Mantissa with b23 forced 1.
    > Now shifting the mantissa left 6 (the exponent) we get..
    > 111011.00 which is 59 or..
    > 5.90000000e+01
    >
    > Note your 59.89
    >
    > As float precise to 24 bits.
    > 01000010 01101111 10001111 01011100
    > Exp = 132 (6)
    > 00000110
    > Man = .11101111 10001111 01011100
    > 5.98899994e+01
    >
    > Converted to double, still precise to 24 bits. Conversion does not add
    > precision.
    > 01000000 01001101 11110001 11101011 10000000 00000000 00000000 00000000
    > Exp = 1028 (6)
    > 000 00000110
    > Man = .11101 11110001 11101011 10000000 00000000 00000000 00000000
    > 5.9889999389648438e+01
    >
    > Now with the x*=1000, x/=1000 trick. Added precision is an illusion?
    > 01000000 01001101 11110001 11101011 10000101 00011110 10111000 01010010
    > Exp = 1028 (6)
    > 000 00000110
    > Man = .11101 11110001 11101011 10000101 00011110 10111000 01010010
    > 5.9890000000000001e+01
    >

    I forgot to note for the double, sign is 1-bit, exponent is 11-bits and
    mantissa is 53-bits. Exponent bias is 1022.

    --
    Joe Wright
    "Everything should be made as simple as possible, but not simpler."
    --- Albert Einstein ---
     
    Joe Wright, Dec 28, 2006
    #3
  4. Dilip wrote:
    > Hello Wise men
    >
    > This question is purely academic as I am just trying to understand this
    > stuff a little better. Therefore please free to instruct me to other
    > suitabe NGs if necessary.
    >
    > Recently I started a thread (at comp.lang.c) on converting a single
    > precision floating point number to double precision after running into
    > a problem at work. Little did I realize it would spawn close to 100
    > replies. Needless to say I learnt a lot in the discussion.
    >
    > Considering I will probably run into many such problems in the future I
    > set out to study the IEEE 754 representation of these numbers. I have
    > understood it to an extent but some gaps remain. I would be extremely
    > grateful if one of you more knowledgeable types can set me straight.
    >
    > What I understand so far: (for single precision)
    >
    > IEEE says the following is the bit pattern of such numbers:
    >
    > <1 bit for sign><8 bits for exponent><23 bits for significand>
    >
    > Mathematically a normalized binary floating point number is represented
    > as:
    >
    > (-1)^s * 1.f * 2^(e-127)
    >
    > where s is the sign bit
    > f is the 23 bits of significand **fraction**
    > and e is the biased exponent which can range from 0 to 255
    >
    > So given a number that is represented like this:
    >
    > 1.00000000000000000000001 * 2^18
    >
    > which has 23 zeros following the binary decimal point.
    >
    > we turn that into:
    >
    > 1000000000000000000.00001
    >
    > To convert that into decimal:
    >
    > 1 * 2^18 +...... +..... +..... + 1 * 1/2^5 which is 262144 + 0.03125
    > which is basically
    > 262,144.03125
    >
    > In other words the bit pattern for such a number will look like this:
    >
    > 0 10010001 00000000000000000000001
    >
    > (sign bit, exponent and significand fraction) [exponent part is
    > 10010001 because e-127 = 18 and hence e = 145 in decimal]
    >
    > which in hex is 0x48800001
    >
    > So far so good?
    >
    > Now my question is given how do I go in the reverse direction?
    >
    > Given a number 59.889999 how do I retrace back to what its binary
    > representation looks like?
    >
    > I seem to be missing a step in between.
    >
    > thanks!
    >


    Have a look at "What every computer scientist should know about floating
    point arithmetic" by D. Goldberg. It's available on the web in various
    formats. You can download goldberg.pdf from my page

    http://galileo.phys.virginia.edu/classes/551.jvn.fall01/


    --
    Julian V. Noble
    Professor Emeritus of Physics
    University of Virginia
     
    Julian V. Noble, Dec 28, 2006
    #4
  5. In article <> "Dilip" <> writes:
    ....
    > Now my question is given how do I go in the reverse direction?
    >
    > Given a number 59.889999 how do I retrace back to what its binary
    > representation looks like?


    First convert it to binary. You will find that it does not have a finite
    binary expansion but it has a repeating expansion when you go far enough.
    But when converting it to float you need initially only 24 + some extra
    bits. The following 29 will do nicely:
    111011.11100011110101101111100
    If you truncate that to 23 bits you get:
    111011.111000111101011011
    that is to small, the next larger representable number is:
    111011.111000111101011100
    which is too large, but the second is closer to the original
    than the first.
    --
    dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
    home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
     
    Dik T. Winter, Dec 28, 2006
    #5
  6. Dilip wrote:

    > Mathematically a normalized binary floating point number is represented
    > as:
    >
    > (-1)^s * 1.f * 2^(e-127)
    >
    > where s is the sign bit
    > f is the 23 bits of significand **fraction**
    > and e is the biased exponent which can range from 0 to 255


    > Given a number 59.889999 how do I retrace back to what its binary
    > representation looks like?


    To do this by hand: First you want (-1)^s, 1.f and (e-127). With the
    example 59.889999:

    Clearly (-1)^s = 1, so s = 0.
    Since 2^5 <= 59.889999 < 2^6, e-127 equals 5, or e = 132.
    1.f * 2^5 = 59.889999, or 1.f = 59.889999 / 2^5.
    1.f is a multiple of 2^-23, so 2^23 * (1.f) is an integer.

    2^23 * 1.f is the integer nearest to (59.889999 / 2^5) * 2^23 which is
    about 15699803.898, so 2^23 * 1.f equals 15,699,804. Write this as a 24
    bit binary number; the first binary bit must be a 1.

    You need to be careful if you start with a value just below a power of
    two, like 63.999999: The value 2^23 * 1.f that you calculate is always
    less than 2^24, but rounding to the nearest integer could round it to
    2^24 which would make 1.f = 2. In that case you have to choose e one
    higher, and 1.f = 1.
     
    christian.bau, Dec 28, 2006
    #6
  7. "Dilip" <> wrote in message
    news:...
    >
    > Given a number 59.889999 how do I retrace back to what its binary
    > representation looks like?


    I'm assuming you can calculate the binary representation of the integer
    "59".

    Since you know where the radix point is, you also know the exponent and the
    number of bits you have after the radix point to express the fractional
    part.

    Assume you have 10 bits after the radix point ...

    Then you just solve:

    889999 / 1000000 = x / 2^10.

    and convert x to binary.
     
    David T. Ashley, Dec 28, 2006
    #7
  8. Dilip

    Chris Torek Guest

    In article <>
    Dilip <> wrote:
    >Considering I will probably run into many such problems in the future I
    >set out to study the IEEE 754 representation of these numbers. I have
    >understood it to an extent but some gaps remain. ...


    You have the basics down, it seems. Others have already covered
    the basics on conversions from decimal representations.

    My article at <http://web.torek.net/torek/c/numbers.html> is
    intentionally somewhat lightweight but may also be helpful. In
    particular, note the special cases for zero, "sub-normal" or
    "denormalized" numbers, Infinities, and NaNs.
    --
    In-Real-Life: Chris Torek, Wind River Systems
    Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
    email: forget about it http://web.torek.net/torek/index.html
    Reading email is like searching for food in the garbage, thanks to spammers.
     
    Chris Torek, Dec 28, 2006
    #8
  9. Dilip

    Ernie Wright Guest

    Dilip wrote:

    > Now my question is given how do I go in the reverse direction?
    >
    > Given a number 59.889999 how do I retrace back to what its binary
    > representation looks like?


    Despite the impression you might get from some of the answers, it's
    actually fairly easy to do this, and it doesn't require doing any binary
    arithmetic, or in fact anything you can't do pretty quickly with an
    ordinary calculator.

    Think of a floating-point number f as

    f = -1^s * (1 + m/2^b) * 2^e

    where s is a sign bit, m is (part of) the mantissa or significand, e is
    the exponent, and b is the number of bits of precision. For 32-bit IEEE
    floats, b is 23. We start out by finding s, m, and e and worry about
    how they fit into the binary representation later.

    1. Find the sign. Trivial, but by peeling this off first, the rest of
    the work can be done using the absolute value of f.

    s = SIGN(f) 1 if negative, otherwise 0
    f = |f|

    2. Find the base-2 log of f.

    Most calculators don't have a button for this, but it's easy to do with
    the log buttons they do have.

    log2(f) = log(f) / log(2) = 5.9042432031146...

    3. Round this DOWN to find the exponent e.

    e = 5

    This is the largest power of 2 that's less than f.

    4. To find the significand, divide f by 2^e.

    59.889999 / 32 = 1.87156246875

    At this point you have (the absolute value of) f in the form of a
    significand between 1.0 and 2.0, multiplied by a power of 2.

    1.87156246875 * 2^5

    Now you have to represent each component as an integer and pack them
    into 32 bits.

    5. The sign is the single bit s.

    6. The exponent is 8 bits equal to e + 127.

    7. The significand integer m is 23 bits:

    m = round(( significand - 1 ) * 2^23 )
    = round(( 1.87156246875 - 1 ) * 2^23 )
    = round( 7311195.897856 )
    = 7311196

    If, after rounding, m == 2^23, set m = 0 and increment e.

    At this point, you can check that you've got the right s, e, and m by
    plugging them back into the formula.

    1 * (1 + 7311196 / 8388608) * 32 = 59.8899993896484375

    8. Finally, pack these three integers into 32 bits.

    s << 31 | (e + 127) << 23 | m

    On a calculator, multiply and add instead of bit-shift and bitwise-or.

    s * 2^31 + ((e + 127) * 2^23) + m

    The procedure for doubles is nearly identical. Use b = 52 and build the
    bit pattern using

    s << 63 | (e + 1023) << 52 | m

    - Ernie http://home.comcast.net/~erniew
     
    Ernie Wright, Dec 28, 2006
    #9
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Motaz Saad
    Replies:
    7
    Views:
    6,554
  2. Replies:
    3
    Views:
    431
    Patricia Shanahan
    Jun 23, 2006
  3. Replies:
    10
    Views:
    2,874
    Torsten Bronger
    Dec 15, 2005
  4. suresh
    Replies:
    1
    Views:
    615
    Ulrich Eckhardt
    Jan 28, 2011
  5. Saraswati lakki
    Replies:
    0
    Views:
    1,417
    Saraswati lakki
    Jan 6, 2012
Loading...

Share This Page