What are the minimum numbers of bits required to represent C standard float and double components?


James Harris

I'm trying to make sense of the standard C data sizes for floating
point numbers. I guess the standards were written to accommodate some
particular floating point engines that were popular at one time but I
can only find references to the number of decimals etc. Basically, if
I wanted to specify C-sized reals in a language that only accepted bit-
widths, e.g.

float(exponent 8, mantissa 24)

I'm looking for what numbers would be needed for the exponent and
mantissa sizes to accurately mirror the C standard minimum widths. Not
sure my log calcs are correct......

AFAIK the sizes for real numbers must be at least

float: range 10^+/-37, precision 6 digits
double: range 10^+/-37, precision 10 digits

I think this means the number of bits used would be

float: 8 bits for exponent, 20 bits for mantissa
double: 8 bits for exponent, 33 bits for mantissa

These are much smaller than (and so could be accommodated by) the
IEEE 754 formats, which have

ieee single precision: 8 bits for exponent, 24 bits for mantissa
ieee double precision: 11 bits for exponent, 53 bits for mantissa

In all cases the mantissa bits include the sign. Are my figures
correct for the number of bits needed for a minimal C representation,
above? The double of 33 bits, especially, looks wrong.

Does the C standard specify /at least/ 10 digits of precision for
doubles or is it /about/ 10 digits? Or should it be at least /9/
digits and the mantissa 32 bits (making 40 in all with the exponent)?
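
A quick way to see what a given implementation actually uses is to print
the binary characteristics from <float.h>. A minimal C sketch follows; the
output reflects whatever your compiler provides, not the standard's
minimums, and note that FLT_MANT_DIG/DBL_MANT_DIG count significand
digits in the native radix, including any implied leading bit:

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Radix, significand width and binary exponent range of this
       implementation's float and double. */
    printf("FLT_RADIX    = %d\n", FLT_RADIX);
    printf("FLT_MANT_DIG = %d, FLT_MIN_EXP = %d, FLT_MAX_EXP = %d\n",
           FLT_MANT_DIG, FLT_MIN_EXP, FLT_MAX_EXP);
    printf("DBL_MANT_DIG = %d, DBL_MIN_EXP = %d, DBL_MAX_EXP = %d\n",
           DBL_MANT_DIG, DBL_MIN_EXP, DBL_MAX_EXP);
    return 0;
}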
 

Ian Collins

James said:
I'm trying to make sense of the standard C data sizes for floating
point numbers. I guess the standards were written to accommodate some
particular floating point engines that were popular at one time but I
can only find references to the number of decimals etc. Basically, if
I wanted to specify C-sized reals in a language that only accepted bit-
widths, e.g.

float(exponent 8, mantissa 24)

I'm looking for what numbers would be needed for the exponent and
mantissa sizes to accurately mirror the C standard minimum widths. Not
sure my log calcs are correct......

See the earlier thread "Can a double be 32 bits" and section 5.2.4.2.2
of the standard for more detail.
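
For reference, the limits that 5.2.4.2.2 constrains are the decimal ones.
A small sketch that prints them next to the standard's minimum magnitudes
(the comment gives the C99 figures; an implementation may exceed them):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* C99 5.2.4.2.2 minimums: FLT_DIG >= 6, DBL_DIG >= 10, and
       decimal exponent ranges of at least +/-37 for float and double. */
    printf("FLT_DIG = %d, FLT_MIN_10_EXP = %d, FLT_MAX_10_EXP = %d\n",
           FLT_DIG, FLT_MIN_10_EXP, FLT_MAX_10_EXP);
    printf("DBL_DIG = %d, DBL_MIN_10_EXP = %d, DBL_MAX_10_EXP = %d\n",
           DBL_DIG, DBL_MIN_10_EXP, DBL_MAX_10_EXP);
    return 0;
}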
 

Thad Smith

James said:
I'm trying to make sense of the standard C data sizes for floating
point numbers. I guess the standards were written to accommodate some
particular floating point engines that were popular at one time but I
can only find references to the number of decimals etc. Basically, if
I wanted to specify C-sized reals in a language that only accepted bit-
widths, e.g.

float(exponent 8, mantissa 24)

I'm looking for what numbers would be needed for the exponent and
mantissa sizes to accurately mirror the C standard minimum widths. Not
sure my log calcs are correct......

AFAIK the sizes for real numbers must be at least

float: range 10^+/-37, precision 6 digits
double: range 10^+/-37, precision 10 digits

I think this means the number of bits used would be

float: 8 bits for exponent, 20 bits for mantissa
double: 8 bits for exponent, 33 bits for mantissa

You need more mantissa bits. In order to convert a fp number to 6 digits
and back, you need to resolve 1 part in 1e7, which requires 23.25 mantissa
bits. You barely get that with IEEE, which has an implied leading 1 bit.
You need 36.54 bits to get 10 digit precision.
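
The bit counts quoted here are just base-2 logarithms of the decimal
resolutions being discussed; a small sketch of that arithmetic, assuming
C99's log2() from <math.h> (link with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Bits needed to resolve 1 part in 10^d for the figures in this
       thread: d = 6..11 gives roughly 19.93, 23.25, ..., 33.22, 36.54. */
    int d;
    for (d = 6; d <= 11; d++)
        printf("log2(1e%d) = %5.2f bits\n", d, log2(pow(10.0, d)));
    return 0;
}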
 

Bartc

Thad said:
You need more mantissa bits. In order to convert a fp number to 6
digits and back, you need to resolve 1 part in 1e7, which requires
23.25 mantissa bits. You barely get that with IEEE, which has an
implied leading 1 bit. You need 36.54 bits to get 10 digit precision.

I don't get that. 6 digit precision is one part in a million. Single
precision reals (sorry, floats) usually give about 1 part in 8 million
precision, or rather more than six.

I think the OP could do with a sign bit too. That gives him 29 bits, with 3
bits spare in a 32-bit word to step the precision up from one in a million
to one in eight million. The implied '1' bit I don't think contributes to
the precision.
 

Thad Smith

Bartc said:
I don't get that. 6 digit precision is one part in a million. Single
precision reals (sorry, floats) usually give about 1 part in 8 million
precision, or rather more than six.

You are right. I have been confused on this issue more than once. :-(
Resolving 6 digits means resolving 1 part in 1e6, which takes about 20
bits, or up to 21 value bits depending on the exponent. That would be
either a 22-bit mantissa including the sign, or 21 bits with a sign and
an implied leading 1 bit.

I think the OP could do with a sign bit too. That gives him 29 bits, with 3
bits spare in a 32-bit word to step the precision up from one in a million
to one in eight million. The implied '1' bit I don't think contributes to
the precision.

It depends how you look at it. Without the implied 1 bit, you have one
less bit of precision for a given number of value bits. In that sense it
contributes to the precision.
 

James Harris

Ian said:
See the earlier thread "Can a double be 32 bits" and section 5.2.4.2.2
of the standard for more detail.

Thanks but I don't see how they help at all. The standard gives
figures in decimal (as in my original post). And the 32-bit thread
says doubles need 10 digits where the standard seems to say 15.

Was there something specific you were referring to in those
references? As mentioned, the query is about the minimum-width /binary/
fields, i.e. the minimum numbers of bits for the exponent and mantissa
of C-standard floats and doubles.

I know the radix may affect this but thought not to add that
complication - at least not yet.
 

James Harris

Thad said:
You are right. I have been confused on this issue more than once. :-(
Resolving 6 digits means resolving 1 part in 1e6, which takes about 20
bits, or up to 21 value bits depending on the exponent. That would be
either a 22-bit mantissa including the sign, or 21 bits with a sign and
an implied leading 1 bit.


It depends how you look at it. Without the implied 1 bit, you have one
less bit of precision for a given number of value bits. In that sense it
contributes to the precision.

Since it is always 1, should we say it saves storage space (one fewer
bit has to be stored) but does not affect the precision?

Is it true, then, that (I'll change from the previous inclusive sign
bit to specifying the sign bit separately as is more familiar) a
standard C implementation requires at least

float : 1 sign bit, 8??? exponent bits, 20 stored mantissa bits
double: 1 sign bit, 8??? exponent bits, 34 stored mantissa bits

and that the implementation can use one more 'virtual' mantissa bit if
it wishes, to enhance the range but not the precision?

The rationale for the mantissa sizes:

log10(2 ** 19) = 5.7195699176156429
log10(2 ** 20) = 6.0205999132796242 <-- needed for float

log10(2 ** 33) = 9.9339898569113796
log10(2 ** 34) = 10.235019852575361 <-- needed for double
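
Those logs can be folded into a rough self-check that turns a proposed
(exponent, mantissa) layout back into decimal digits and range. This is
only a sketch using the simple digits = p * log10(2) estimate from this
thread (the standard's *_DIG definition is slightly stricter), and the
helper name and layouts are illustrative:

#include <stdio.h>
#include <math.h>

/* Rough estimate: p mantissa bits give about p*log10(2) decimal digits,
   and an e-bit binary exponent spans roughly +/- 2^(e-1)*log10(2)
   decimal orders of magnitude. */
static void layout(const char *name, int exp_bits, int mant_bits)
{
    double digits  = mant_bits * log10(2.0);
    double decades = pow(2.0, exp_bits - 1) * log10(2.0);
    printf("%s: exp %d, mant %d -> ~%.2f digits, range ~10^+/-%.0f\n",
           name, exp_bits, mant_bits, digits, decades);
}

int main(void)
{
    /* The minimal layouts proposed above, sign bit kept separate. */
    layout("float ", 8, 20);
    layout("double", 8, 34);
    return 0;
}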
 
