What are the minimum numbers of bits required to represent C standard float and double components?


James Harris

I'm trying to make sense of the standard C data sizes for floating
point numbers. I guess the standards were written to accommodate some
particular floating point engines that were popular at one time but I
can only find references to the number of decimals etc. Basically, if
I wanted to specify C-sized reals in a language that only accepted bit-
widths, e.g.

float(exponent 8, mantissa 24)

I'm looking for what numbers would be needed for the exponent and
mantissa sizes to accurately mirror the C standard minimum widths. Not
sure my log calcs are correct......

AFAIK the sizes for real numbers must be at least

float: range 10^+/-37, precision 6 digits
double: range 10^+/-37, precision 10 digits

I think this means the number of bits used would be

float: 8 bits for exponent, 20 bits for mantissa
double: 8 bits for exponent, 33 bits for mantissa

These are much smaller than (and so could be accommodated by) the
IEEE 754 formats, which have

ieee single precision: 8 bits for exponent, 24 bits for mantissa
ieee double precision: 11 bits for exponent, 53 bits for mantissa

In all cases the mantissa bits include the sign. Are my figures
correct for the number of bits needed for a minimal C representation,
above? The double of 33 bits, especially, looks wrong.

Does the C standard specify /at least/ 10 digits of precision for
doubles or is it /about/ 10 digits? Or should it be at least /9/
digits and the mantissa 32 bits (making 40 in all with the exponent)?
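
A quick way to see what a given implementation actually uses is to print
the binary characteristics from <float.h>. A minimal C sketch follows; the
output reflects whatever your compiler provides, not the standard's
minimums, and note that FLT_MANT_DIG/DBL_MANT_DIG count significand
digits in the native radix, including any implied leading bit:

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Radix, significand width and binary exponent range of this
       implementation's float and double. */
    printf("FLT_RADIX    = %d\n", FLT_RADIX);
    printf("FLT_MANT_DIG = %d, FLT_MIN_EXP = %d, FLT_MAX_EXP = %d\n",
           FLT_MANT_DIG, FLT_MIN_EXP, FLT_MAX_EXP);
    printf("DBL_MANT_DIG = %d, DBL_MIN_EXP = %d, DBL_MAX_EXP = %d\n",
           DBL_MANT_DIG, DBL_MIN_EXP, DBL_MAX_EXP);
    return 0;
}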
 

Ian Collins

James said:
I'm trying to make sense of the standard C data sizes for floating
point numbers. I guess the standards were written to accommodate some
particular floating point engines that were popular at one time but I
can only find references to the number of decimals etc. Basically, if
I wanted to specify C-sized reals in a language that only accepted bit-
widths, e.g.

float(exponent 8, mantissa 24)

I'm looking for what numbers would be needed for the exponent and
mantissa sizes to accurately mirror the C standard minimum widths. Not
sure my log calcs are correct......

See the earlier thread "Can a double be 32 bits" and section 5.2.4.2.2
of the standard for more detail.
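
For reference, the limits that 5.2.4.2.2 constrains are the decimal ones.
A small sketch that prints them next to the standard's minimum magnitudes
(the comment gives the C99 figures; an implementation may exceed them):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* C99 5.2.4.2.2 minimums: FLT_DIG >= 6, DBL_DIG >= 10, and
       decimal exponent ranges of at least +/-37 for float and double. */
    printf("FLT_DIG = %d, FLT_MIN_10_EXP = %d, FLT_MAX_10_EXP = %d\n",
           FLT_DIG, FLT_MIN_10_EXP, FLT_MAX_10_EXP);
    printf("DBL_DIG = %d, DBL_MIN_10_EXP = %d, DBL_MAX_10_EXP = %d\n",
           DBL_DIG, DBL_MIN_10_EXP, DBL_MAX_10_EXP);
    return 0;
}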
 

Thad Smith

James said:
I'm trying to make sense of the standard C data sizes for floating
point numbers. I guess the standards were written to accommodate some
particular floating point engines that were popular at one time but I
can only find references to the number of decimals etc. Basically, if
I wanted to specify C-sized reals in a language that only accepted bit-
widths, e.g.

float(exponent 8, mantissa 24)

I'm looking for what numbers would be needed for the exponent and
mantissa sizes to accurately mirror the C standard minimum widths. Not
sure my log calcs are correct......

AFAIK the sizes for real numbers must be at least

float: range 10^+/-37, precision 6 digits
double: range 10^+/-37, precision 10 digits

I think this means the number of bits used would be

float: 8 bits for exponent, 20 bits for mantissa
double: 8 bits for exponent, 33 bits for mantissa

You need more mantissa bits. In order to convert a fp number to 6 digits
and back, you need to resolve 1 part in 1e7, which requires 23.25 mantissa
bits. You barely get that with IEEE, which has an implied leading 1 bit.
You need 36.54 bits to get 10 digit precision.
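
The bit counts quoted here are just base-2 logarithms of the decimal
resolutions being discussed; a small sketch of that arithmetic, assuming
C99's log2() from <math.h> (link with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Bits needed to resolve 1 part in 10^d for the figures in this
       thread: d = 6..11 gives roughly 19.93, 23.25, ..., 33.22, 36.54. */
    int d;
    for (d = 6; d <= 11; d++)
        printf("log2(1e%d) = %5.2f bits\n", d, log2(pow(10.0, d)));
    return 0;
}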
 

Bartc

Thad said:
You need more mantissa bits. In order to convert a fp number to 6
digits and back, you need to resolve 1 part in 1e7, which requires
23.25 mantissa bits. You barely get that with IEEE, which has an
implied leading 1 bit. You need 36.54 bits to get 10 digit precision.

I don't get that. 6 digit precision is one part in a million. Single
precision reals (sorry, floats) usually give about 1 part in 8 million
precision, or rather more than six.

I think the OP could do with a sign bit too. That gives him 29 bits, with 3
bits spare in a 32-bit word to step the precision up from one in a million
to one in eight million. The implied '1' bit I don't think contributes to
the precision.
 

Thad Smith

Bartc said:
I don't get that. 6 digit precision is one part in a million. Single
precision reals (sorry, floats) usually give about 1 part in 8 million
precision, or rather more than six.

You are right. I have been confused on this issue more than once. :-(
Resolving 6 digits means resolving 1 part in 1e6, which takes about 20
bits, or up to 21 value bits depending on the exponent. That would be
either a 22-bit mantissa including the sign, or 21 bits with a sign and
an implied leading 1 bit.

I think the OP could do with a sign bit too. That gives him 29 bits, with 3
bits spare in a 32-bit word to step the precision up from one in a million
to one in eight million. The implied '1' bit I don't think contributes to
the precision.

It depends how you look at it. Without the implied 1 bit, you have one
less bit of precision for a given number of value bits. In that sense it
contributes to the precision.
 

James Harris

Ian said:
See the earlier thread "Can a double be 32 bits" and section 5.2.4.2.2
of the standard for more detail.

Thanks but I don't see how they help at all. The standard gives
figures in decimal (as in my original post). And the 32-bit thread
says doubles need 10 digits where the standard seems to say 15.

Was there something specific you were referring to in those
references? As mentioned, the query is about the minimum-width /binary/
fields, i.e. the minimum numbers of bits for the exponent and mantissa
of C-standard floats and doubles.

I know the radix may affect this but thought not to add that
complication - at least not yet.
 

James Harris

Thad said:
You are right. I have been confused on this issue more than once. :-(
Resolving 6 digits means resolving 1 part in 1e6, which takes about 20
bits, or up to 21 value bits depending on the exponent. That would be
either a 22-bit mantissa including the sign, or 21 bits with a sign and
an implied leading 1 bit.


It depends how you look at it. Without the implied 1 bit, you have one
less bit of precision for a given number of value bits. In that sense it
contributes to the precision.

Since it is always 1, should we say it saves storage space (one fewer
bit has to be stored) but does not affect the precision?

Is it true, then, that (I'll change from the previous inclusive sign
bit to specifying the sign bit separately as is more familiar) a
standard C implementation requires at least

float : 1 sign bit, 8??? exponent bits, 20 stored mantissa bits
double: 1 sign bit, 8??? exponent bits, 34 stored mantissa bits

and that the implementation can use one more 'virtual' mantissa bit if
it wishes, to enhance the range but not the precision?

The rationale for the mantissa sizes:

log10(2 ** 19) = 5.7195699176156429
log10(2 ** 20) = 6.0205999132796242 <-- needed for float

log10(2 ** 33) = 9.9339898569113796
log10(2 ** 34) = 10.235019852575361 <-- needed for double
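
Those logs can be folded into a rough self-check that turns a proposed
(exponent, mantissa) layout back into decimal digits and range. This is
only a sketch using the simple digits = p * log10(2) estimate from this
thread (the standard's *_DIG definition is slightly stricter), and the
helper name and layouts are illustrative:

#include <stdio.h>
#include <math.h>

/* Rough estimate: p mantissa bits give about p*log10(2) decimal digits,
   and an e-bit binary exponent spans roughly +/- 2^(e-1)*log10(2)
   decimal orders of magnitude. */
static void layout(const char *name, int exp_bits, int mant_bits)
{
    double digits  = mant_bits * log10(2.0);
    double decades = pow(2.0, exp_bits - 1) * log10(2.0);
    printf("%s: exp %d, mant %d -> ~%.2f digits, range ~10^+/-%.0f\n",
           name, exp_bits, mant_bits, digits, decades);
}

int main(void)
{
    /* The minimal layouts proposed above, sign bit kept separate. */
    layout("float ", 8, 20);
    layout("double", 8, 34);
    return 0;
}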
 
