Derivation of the C Standard's Formula for FLT_DIG, DBL_DIG, LDBL_DIG

John Reye · Jul 7, 2012

Hi,

have you seen the C Standards formula for FLT_DIG, DBL_DIG, LDBL_DIG?

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=49 (pdf-
page 49)

If you have a capable browser, it can be displayed like this:

âŽ§ p log10 b if b is a power of 10
q = âŽ¨
âŽ© âŽ£( p âˆ’ 1) log10 bâŽ¦ otherwise

How does one derive this formula? (And: is it correct, in the sense of
really defining the maximum for q ?)

I'm particularly interested in the "otherwise" case, which states q <=
floor((p-1)*log10(b))

In particular: cannot q be somewhat larger, as in q <=
floor(p*log10(b)) = LIM ??? I believe that also results in no rounding
loss; and an exact representation (of the original decimal number)
with p radix-b digits.

(((In fact... q might even be able to be larger still (than LIM), for
cases where there is a non-exact representation of the original number
with p radix-b digits; but rounding back to base-10, would still yield
the original number.
But pondering on it... I think this case is not possible, since we are
seeking the maximum of q... and any non-exact representation, would
only makes sense if it provides "more digits", i.e. more resolution...
but this contradicts the search for the maximum q. Hence I currently
suspect LIM to be the maximum for q.
)))

Thanks.
J.

John Reye · Jul 7, 2012

Hi,

have you seen the C Standards formula for FLT_DIG, DBL_DIG, LDBL_DIG?

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=49(pdf-
page 49)

If you have a capable browser, it can be displayed like this:

Â Â Â âŽ§ Â Â p log10 b Â Â Â Â Â Â Â if b is a power of 10
q = Â âŽ¨
Â Â Â âŽ© Â Â âŽ£( p âˆ’ 1) log10 bâŽ¦ Â Â Â otherwise

How does one derive this formula? (And: is it correct, in the sense of
really defining the maximum for q ?)

I'm particularly interested in the "otherwise" case, which states q <=
floor((p-1)*log10(b))

In particular: cannot q be somewhat larger, as in q <=
floor(p*log10(b)) = LIM ??? I believe that also results in no rounding
loss; and an exact representation (of the original decimal number)
with p radix-b digits.

Yip, I'm pretty sure I'm correct here.

Here's a proof by example.
Consider the following two cases:
a) b = 100
b) b = 101

In a):
Here b = 100; then:
q <= p*log10(b) = p*2

In b):
Here b = 101; then:
According to the C standard:
q <= floor((p-1)*log10(b)) = (p-1)*2 = p*2 - 2

Note the -2 !!

This shows how ludicrous the formula in the C standard actually is!
If you increase the resolution of the p digits, by increasing b by
1... we end up getting 2 less "base-10-digits" (i.e. 2 less q)!!
This is simply false.

The correct max of q (as far as I can see) is given by
q <= floor(p*log10(b)) = p*2
Note: no ridiculous -2 here!

Ben Bacarisse · Jul 7, 2012

John Reye said:
Yip, I'm pretty sure I'm correct here.

Here's a proof by example.
Consider the following two cases:
a) b = 100
b) b = 101

In a):
Here b = 100; then:
q <= p*log10(b) = p*2

In b):
Here b = 101; then:
According to the C standard:
q <= floor((p-1)*log10(b)) = (p-1)*2 = p*2 - 2

Note the -2 !!

This shows how ludicrous the formula in the C standard actually is!

For it to be wrong (let alone ludicrous) you'd have to show that all
numbers with q > floor((p-1)*log10(b)) decimal digits can be "rounded
into a floating-point number [...] and back again without change [...]".

For example pick, say, p=2. q is 2 meaning that there are 3-digit
decimal numbers that can't be converted to and from 2-digit base 101
floating point representation. Consider 100 -- how do you represent 100
as a 2-digit base 101 number so that it converts back to 100?

If you increase the resolution of the p digits, by increasing b by
1... we end up getting 2 less "base-10-digits" (i.e. 2 less q)!!
This is simply false.

That needs a proof, not a blank statement.

The correct max of q (as far as I can see) is given by
q <= floor(p*log10(b)) = p*2
Note: no ridiculous -2 here!

It's not true that floor(p*log10(b)) = p*2. It's not true even in the
specific case when b = 101.

John Reye · Jul 7, 2012

For it to be wrong (let alone ludicrous) you'd have to show that all
numbers with q > floor((p-1)*log10(b)) decimal digits can be "rounded
into a floating-point number [...] and back again without change [...]".

Wrong. I'm claiming it for all positive integer numbers, q where q <=
floor(p*log10(b))

For example pick, say, p=2. q is 2 meaning that there are 3-digit
decimal numbers that can't be converted to and from 2-digit base 101
floating point representation. Consider 100 -- how do you represent 100
as a 2-digit base 101 number so that it converts back to 100?

What about 0G, where G is the single base-101 digit representing 100?
Simple!!

That needs a proof, not a blank statement.

Well I have a proof. Maby if I find a nice math2ascii (or
math2unicode) converter, I'll show it.

It's not true that floor(p*log10(b)) = p*2. It's not true even in the
specific case when b = 101.

I think the previous post is clear enough, for showing a simple
counter-example.
But yes, if you must have it, I can add that floor(p*log10(101)) =
p*2,
for p <= 100

John Reye · Jul 7, 2012

On Jul 7 said:
On Jul 7 said:

numbers with q > floor((p-1)*log10(b)) decimal digits can be "rounded
into a floating-point number [...] and back again without change [...]"..

Click to expand...

Wrong. I'm claiming it for all positive integer numbers, q
where q <= floor(p*log10(b))

I'm no longer sure of this statement, since I've found something that
I forgot in my proof-attempt.

I'll work at it some more and then post the results as good as I get
them...
Maby someone can then fill in the missing details, and maby we can
proof the C Standard's formula to be true.

John Reye · Jul 7, 2012

OK here's my attempt at a proof, that
q <= p*log10(b)

I hope I've considered everything. Maby I made some careless error
that invalidates it. Please explain it to me, if you find one!
Otherwise... here we go:

The decimal number is:

q
====
\ -i
X = > ( G * 10 )
/ i
====
i=1

The float number representation will have the form:

p
====
\ -k
Y = > ( F * b )
/ k
====
k=1

To round z to the n'th digit after the comma, for a number having base
r, use:

Round(n, r, z)

=
n n
floor(z * r + 0.5) / (r )

So what now?

We round X to the form Y.
Like this:
Y = Round(p, b, X)

And then we seek the maximum q, such that when rounding back, the
following holds true:

Round(q, 10, Round(p, b, X)) = X

|
\
= Y

Anyway, lets look at the rounding of X to the form Y:

Y = Round(p, b, X)
=

q
====
\ -i p p
floor( > ( G * 10 ) * b + 0.5) / (b ))
/ i
====
i=1

Now have a look at the following SUBEXPRESSION from above

-i p
( G * 10 ) * b
i

If (for the subexpression) all G digits ...
i

....end up being left of the comma, then we can basically ignore
"adding 0.5 and the floor operation" above.

So lets do that:
Let's get the rightmost digit
G
q
....to end up being left of the comma (in base-b representation)!
(we're considering the subexpression here!)

-q p
This is when 10 * b >= 1 .

THIS IS THE MAIN CONDITION!!

If this holds true, then X can be represented exactly as Y, and vice
versa.
(Question to experts: is this guaranteed to be true?)

Now simply solve the MAIN CONDITION, for q:

q <= p*log10(b)

One hmmm, what if the G - digits when converted to base-b digits
(SUBEXPRESSION), end up having more than p digits!???
That's something that I have not considered previously.

The maximum possible value of SUBEXPRESSION, in base-b format is
defined by p base-b digits: b^p - 1
So what happens if we insert the maximum base-10 format, which is when
G is always 9:
Our subexpression becomes:
-i p
( 9 * 10 ) * b

If we sum it over the q digits, then we get
(10^q - 1) / 10^q * b^p

We must have b^p - 1 >= (10^q - 1) / 10^q * b^p
in order for X to be representable as Y and not shoot out over Y's
range!

OK, now solve for q and (woha!) we're safe:
q <= p*log10(b)

So in both cases I get
q <= p*log10(b)

If this holds true, then X can be exactly expressed as Y, and it can
then obviously be exactly represented as (rounded to) X again.

Please tell me if there's something important that I've missed.
Thanks for your help.
J.

Tim Rentsch · Jul 7, 2012

John Reye said:
Hi,

have you seen the C Standards formula for FLT_DIG, DBL_DIG, LDBL_DIG?

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=49
(pdf- page 49)

If you have a capable browser, it can be displayed like this: [snip]

Possibly true, but many people (myself included) don't read news
postings in a browser. Just fyi.

How does one derive this formula? (And: is it correct, in the
sense of really defining the maximum for q ?)

I don't know how to derive the formula, nor do I know how to go
about deriving it. Despite that, I believe the formula is
correct.

I'm particularly interested in the "otherwise" case, which
states q <= floor((p-1)*log10(b))

In particular: cannot q be somewhat larger, as in q <=
floor(p*log10(b)) = LIM ??? I believe that also results in no
rounding loss; and an exact representation (of the original
decimal number) with p radix-b digits. [snip elaboration]

There may be particular cases where q can be larger than the
Standard's formula dictates, but certainly it cannot be in all
cases. For example, consider the floating value 97e-4. Naively,
we might guess seven bits would be enough for conversion of all
two decimal digit FP numbers (because 10**2 < 2**7). That would
correspond to q (==2) = floor (7 * log10 2). However, seven bits
is not enough. Calculating with 0.0097 = 0.97 * 10**-2, we get:

0.97 * 10**-2 = 0.000 000 100 111 101 ... (base 2)
= 0.100 111 101 ... (base 2) * 2**-6

Rounding the right hand side to 7 bits (after the binary
point) gives

= 0.100 111 1 (base 2) * 2**-6

which is approximately equal to 0.009643, or

0.9643 * 10**-2

which when rounded to two decimal places is 0.0096, and
not 0.0097. So here we have an example where the
formula q = floor(p * log10(b)) does not work for p=7, but
the formula q = floor( (p-1) * log10(b) ) does work for
p=8; and indeed, using eight bits does suffice to convert
0.0097 to binary and back again (rounding in both cases)
and get 0.0097 back again.

Ben Bacarisse · Jul 7, 2012

John Reye said:
For it to be wrong (let alone ludicrous) you'd have to show that all
numbers with q > floor((p-1)*log10(b)) decimal digits can be "rounded
into a floating-point number [...] and back again without change [...]".

Click to expand...

Wrong. I'm claiming it for all positive integer numbers, q where q <=
floor(p*log10(b))

Yes you are but that's not my point. My point was, though very poorly
stated, that for the formula to be wrong you have to show that there is
*some* q > floor((p-1)*log10(b)) for which all q-digit numbers can be
round-trip converted. Can you show such a thing?

What about 0G, where G is the single base-101 digit representing 100?
Simple!!

True. And it's obviously true for all integers between 0 and 9999. If
there is a problem it will be with fractional numbers or large numbers.

Well I have a proof. Maby if I find a nice math2ascii (or
math2unicode) converter, I'll show it.

I think the previous post is clear enough, for showing a simple
counter-example.

Well, that was my point. I did not see a counter example just a
statement that the bound in the standard is ludicrous.

To make my search for a counter example easier, lets consider 2-digit
base 32 floating point numbers. (In base 32, the difference between the
standard's formula and yours is exaggerated when p=2.)

Your formula says that all floating point numbers with three decimal
digits can be converted to a 2-digit base 32 floating point number (and
back) without change. I don't think this is true for 6.24 x 10^-2
(0.0624). The closest base 32 representation is

0.(2)(0) = 2/32 = 0.0625

and the next smallest is

0.(1)(31) = 1/32 + 31/1024 = 0.061523

To take a non-fractional example, 203 x 10^1 would be represented as

(1)(31) x 32^1 = 1*32*32 + 31*32 = 2016

The round-trip conversion will yield 202 x 10^1. The next largest is
2*32*32 = 2048 which rounds to 205 x 10^1.

<snip>

John Reye · Jul 7, 2012

Thanks Tim and Ben for your good replies!

I've found a big flaw in my "proof" above.
I never consider the exponent!

But it is vital, particularly when b is not a power of 10.

The reason is: the exponent, changes the bit-pattern of the machine
representation!!

Example (from Tim):
0.97 * 10**-2 = 0.000 000 100 111 101 ... (base 2)
= 0.100 111 101 ... (base 2) * 2**-6

But
0.97 * 10**0 = 0.111 110 000 ... (base 2)

The important thing to note is that the binary bit pattern in the
mantissa is completely different!

I'm hoping someone can come up with a nice exact proof for the limit
on q.
(Maby the flawed proof can be a starting point, or inspiration).

Thanks.

John Reye · Jul 8, 2012

Hi there!

Here is a detailed analysis of the problem.
I end up with something I term "THE MAIN RESULT" below.
A good discrete mathematician can probably use this to determine the
exact tightest highest bound for q.

The bound given in the C standard is not the tighest max. bound.
I believe I've shown this in the 2nd post on this thread (dated Sat, 7
Jul 2012 06:50:02 -0700 (PDT)).

Anyway... below I also go ahead and proof the bound found in the C
standard.
Here we go:

The decimal number is:

X = Xmantissa * 10^ex

q
====
\ -i
X = > ( G * 10 ) * 10^ex
/ i
====
i=1

The constraint on Xmantissa and then X is:

10^(-1) <= Xmantissa < 1

Thus since X = Xmantissa * 10^ex, we get

10^(ex-1) <= X < 10^e

We want to represent X as a floating point number with radix (base) b.
We'll call that representation Y.

Y will have this form:

Y = Ymantissa * b^ey

p
====
\ -k
Y = > ( F * b ) * b^ey
/ k
====
k=1

The constraint on Y is:

b^(-1) <= Ymantissa < 1

Thus since Y = Ymantissa * b^ey, we get

b^(ey-1) <= Y < b^ey

How do we get X into the form Y, with the constraints on the mantissa.
Well, we have to scale X, by multiplication by b^n, such that we get
the standard form for Ymantissa, which corresponds to its constraint
of
b^(-1) <= Ymantissa < 1

Thus
b^(-1) <= X * b^n < 1
Solving for n:
-1 + log_b(X^(-1)) <= n < log_b(X^(-1))

Since n must be an integer, we set:
n = ceil(-1 + log_b(X^(-1))

Thus X in terms of Y becomes:
Yexact = (X * b^n) * b^(-n)
---------
Ymantissa

But this is an exact Y, and we want a true floating point Y, with it's
p digits (base b) or precision.

We'll use this rounding scheme:
To round z to the m'th digit after the comma, for a number having base
r, use:

Round(m, r, z)

=
m m
floor(z * r + 0.5) / (r )

For rounding X to Y we'll do:
Y = Round(p, b, X*b^n)*b^(-n)
-----

The line shows our Ymantissa = X*b^n, which needs to undergo rounding,
to the correct number of digits, p.
For simplicity, let's call
Round(p, b, X*b^n) = YmantissaRound

For rounding back to form X, we need to scale Y according the the
constraint of Xmantissa, and then round to the q digits.
We have to scale Y by multiplication of 10^something, such that
10^(-1) <= Xmantissa < 1
We know that we'll scale by 10^(-ex) since this will be consistant,
with the very first equation of this proof.
Thus
XmantissaBack = (Y * 10^(-ex))
so that (compare with 1st equation of proof):
Xback = (Y * 10^(-ex)) * 10^ex

So the rounding of X to Y and back to Xback, will be:
Xback = Round(q, 10, XmantissaBack) * 10^ex
or
Xback = Round(q, 10, (Y * 10^(-ex))) * 10^ex
or
Xback = Round(q, 10, (YmantissaRound * b^(-n) * 10^(-ex))) * 10^ex
or
Xback = Round(q, 10, (Round(p, b, X*b^n) * b^(-n) * 10^(-ex))) * 10^ex

Ultimately, we seek the maximum q such that the above line, when
evaluated yields:
Xback == X

Thus we seek the maximum q, such that:

Round(q, 10, (Round(p, b, X*b^n) * b^(-n) * 10^(-ex))) * 10^ex == X

where
n = ceil(-1 + log_b(X^(-1))
over all possible variations in X (according to our format of X) and
over all possible variations in ex (according the the limits of our
exponent ex_MIN <= ex <= ex_MAX)

i.e.
**********THE MAIN RESULT******************************
We seek the maximum q, such that:

(floor(((floor(X*b^n * b^p + 0.5) / (b^p)) * b^(-n) * 10^(-ex)) * 10^q
+ 0.5) / (10^q)) * 10^ex == X

where
n = ceil(-1 + log_b(X^(-1))
over all possible variations in X (according to our format of X) and
over all possible variations in ex (according the the limits of our
exponent ex_MIN <= ex <= ex_MAX)

I suspect that the exact solution to this problem, is something for
the good discrete mathematician.

How can one simplify it?
a)
I've found some interesting constaints.

b)
I've found a way to find the bound for q, as given by the C standard.
However, this is not the best possible (i.e. max) bound. (If you
mangage to work out a tighter larger bound, please share the details,
if you like

)

Anyway... here goes:
a)

X is the sum of coefficients. Lets just consider the least significant
coefficient of X which is:

-q ex
G * 10 * 10 = G_q * 10^(ex-q)
q

Putting it into the formula:

(floor(((floor(G_q*10^(ex-q)*b^n * b^p + 0.5) / (b^p)) * b^(-n) * 10^(-
ex)) * 10^q + 0.5) / (10^q)) * 10^ex == G_q*10^(ex-q)

Then set G_q to it's lowest value, which is 1. (Reason: we seek the
limit, where the floor cuts off important stuff. For this, we need to
consider the lowest positive value.)

This then simplfies to:

(floor(((floor(10^(ex-q)*b^n * b^p + 0.5) / (b^p)) * b^(-n) * 10^(-
ex)) * 10^q + 0.5)) == 1

Call this the same as:
(floor(A + 0.5)) == 1

Then we have
0.5 <= A < 1.5

Then
0.5 * b^(n+p) * 10^(ex-q) <= floor(b^(n+p) * 10^(ex-q) + 0.5) < 1.5 *
b^(n+p) * 10^(ex-q)

Call this:
0.5 * S <= floor(S + 0.5) < 1.5 * S

The middle floor must yield an integer, so the following constraint
(considering the outer bounds) must hold:
ceil(0.5 * S) <= S + 0.5 <= floor(1.5 * S)

b)
Here's how to prove the C standard's bound on q.

Consider the following SUBEXPRESSION, from THE MAIN RESULT (see above)
(floor(X*b^n * b^p + 0.5) / (b^p))

The least significant coefficient of X is:
G_q*10^(ex-q)

Set G_q to 1 (the lowest possible value)
Then the contribution of this least coefficient to the subexpression
is:

(floor(10^(ex-q)*b^n * b^p + 0.5) / (b^p))

For the contribution to be visible and rounded correctly (and not
chopped off by floor), we require:

10^(ex-q)*b^n * b^p >= 1
Call this the CONTRIBUTION_REQUIREMENT.

To handle n, we'll need to consider the 2 mantissa constraints, which
were introduced almost right at the top:

-1 + log_b(X^(-1)) <= n < log_b(X^(-1))
10^(ex-1) <= X < 10^e

We can use the 2nd constraint to determine the range of
log_b(X^(-1)) which appears in the 1st constraint.

Applying ()^-1 and then log_b to the 2nd constraint yields:
log_b(10^(-ex+1)) >= log_b(X^(-1)) > log_b(10^(-ex))
which - after reformatting - is:
log_b(10^(-ex)) < log_b(X^(-1)) <= log_b(10^(-ex+1))

This allows us to get a range for the 1st constraint, i.e. a range for
n:

-1 + log_b(10^(-ex)) < -1 + log_b(X^(-1)) <= n < log_b(X^(-1)) <=
log_b(10^(-ex+1))

Now we'll apply this bound to the CONTRIBUTION_REQUIREMENT. We use the
lower bound (which is the "worst case" for the
CONTRIBUTION_REQUIREMENT) which will yield the value for q, which
*guarantees* the CONTRIBUTION_REQUIREMENT

Thus
10^(ex-q)*b^n * b^p >= 1
becomes
10^(ex-q)*b^(-1 + log_b(10^(-ex))) * b^p >= 1
or
10^(ex-q)*b^(-1) * b^(log_b(10^(-ex))) * b^p >= 1
or
10^(ex-q)*b^(-1) * 10^(-ex) * b^p >= 1
or
10^(-q)*b^(-1+p) >= 1
or
q <= log10(b^(-1+p)) = (p-1)*log10(b)

So:
q <= (p-1)*log10(b)

which proves the bound found in the C standard:
q <= floor((p-1)*log10(b))

Regards,
J.

Derivation of the C Standard's Formula for FLT_DIG, DBL_DIG, LDBL_DIG	0	Jul 7, 2012
The CMWC4827 RNG, an improvement on MWC4691.	12	Oct 22, 2010
Errata for The C Programming Language, Second Edition, by Brian Kernighanand Dennis Ritchie	4	May 16, 2009
On the development of C	211	Mar 9, 2009
In the Matter of Herb Schildt: a Detailed Analysis of "C: TheComplete Nonsense"	109	Apr 3, 2010
making the keys of a hash from parameters collected from a form	7	Nov 28, 2007
[ANN] Multidimensional Array - MDArray 0.5.5	0	Nov 19, 2013
The devolution of English language and slothful c.l.p behaviors exposed!	50	Jan 24, 2012

Derivation of the C Standard's Formula for FLT_DIG, DBL_DIG, LDBL_DIG

John Reye

John Reye

Ben Bacarisse

John Reye

John Reye

John Reye

Tim Rentsch

Ben Bacarisse

John Reye

John Reye

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads