Hi there!
Here is a detailed analysis of the problem.
I end up with something I term "THE MAIN RESULT" below.
A good discrete mathematician can probably use this to determine the
exact tightest highest bound for q.
The bound given in the C standard is not the tighest max. bound.
I believe I've shown this in the 2nd post on this thread (dated Sat, 7
Jul 2012 06:50:02 -0700 (PDT)).
Anyway... below I also go ahead and proof the bound found in the C
standard.
Here we go:
The decimal number is:
X = Xmantissa * 10^ex
q
====
\ -i
X = > ( G * 10 ) * 10^ex
/ i
====
i=1
The constraint on Xmantissa and then X is:
10^(-1) <= Xmantissa < 1
Thus since X = Xmantissa * 10^ex, we get
10^(ex-1) <= X < 10^e
We want to represent X as a floating point number with radix (base) b.
We'll call that representation Y.
Y will have this form:
Y = Ymantissa * b^ey
p
====
\ -k
Y = > ( F * b ) * b^ey
/ k
====
k=1
The constraint on Y is:
b^(-1) <= Ymantissa < 1
Thus since Y = Ymantissa * b^ey, we get
b^(ey-1) <= Y < b^ey
How do we get X into the form Y, with the constraints on the mantissa.
Well, we have to scale X, by multiplication by b^n, such that we get
the standard form for Ymantissa, which corresponds to its constraint
of
b^(-1) <= Ymantissa < 1
Thus
b^(-1) <= X * b^n < 1
Solving for n:
-1 + log_b(X^(-1)) <= n < log_b(X^(-1))
Since n must be an integer, we set:
n = ceil(-1 + log_b(X^(-1))
Thus X in terms of Y becomes:
Yexact = (X * b^n) * b^(-n)
---------
Ymantissa
But this is an exact Y, and we want a true floating point Y, with it's
p digits (base b) or precision.
We'll use this rounding scheme:
To round z to the m'th digit after the comma, for a number having base
r, use:
Round(m, r, z)
=
m m
floor(z * r + 0.5) / (r )
For rounding X to Y we'll do:
Y = Round(p, b, X*b^n)*b^(-n)
-----
The line shows our Ymantissa = X*b^n, which needs to undergo rounding,
to the correct number of digits, p.
For simplicity, let's call
Round(p, b, X*b^n) = YmantissaRound
For rounding back to form X, we need to scale Y according the the
constraint of Xmantissa, and then round to the q digits.
We have to scale Y by multiplication of 10^something, such that
10^(-1) <= Xmantissa < 1
We know that we'll scale by 10^(-ex) since this will be consistant,
with the very first equation of this proof.
Thus
XmantissaBack = (Y * 10^(-ex))
so that (compare with 1st equation of proof):
Xback = (Y * 10^(-ex)) * 10^ex
So the rounding of X to Y and back to Xback, will be:
Xback = Round(q, 10, XmantissaBack) * 10^ex
or
Xback = Round(q, 10, (Y * 10^(-ex))) * 10^ex
or
Xback = Round(q, 10, (YmantissaRound * b^(-n) * 10^(-ex))) * 10^ex
or
Xback = Round(q, 10, (Round(p, b, X*b^n) * b^(-n) * 10^(-ex))) * 10^ex
Ultimately, we seek the maximum q such that the above line, when
evaluated yields:
Xback == X
Thus we seek the maximum q, such that:
Round(q, 10, (Round(p, b, X*b^n) * b^(-n) * 10^(-ex))) * 10^ex == X
where
n = ceil(-1 + log_b(X^(-1))
over all possible variations in X (according to our format of X) and
over all possible variations in ex (according the the limits of our
exponent ex_MIN <= ex <= ex_MAX)
i.e.
**********THE MAIN RESULT******************************
We seek the maximum q, such that:
(floor(((floor(X*b^n * b^p + 0.5) / (b^p)) * b^(-n) * 10^(-ex)) * 10^q
+ 0.5) / (10^q)) * 10^ex == X
where
n = ceil(-1 + log_b(X^(-1))
over all possible variations in X (according to our format of X) and
over all possible variations in ex (according the the limits of our
exponent ex_MIN <= ex <= ex_MAX)
I suspect that the exact solution to this problem, is something for
the good discrete mathematician.
How can one simplify it?
a)
I've found some interesting constaints.
b)
I've found a way to find the bound for q, as given by the C standard.
However, this is not the best possible (i.e. max) bound. (If you
mangage to work out a tighter larger bound, please share the details,
if you like

)
Anyway... here goes:
a)
X is the sum of coefficients. Lets just consider the least significant
coefficient of X which is:
-q ex
G * 10 * 10 = G_q * 10^(ex-q)
q
Putting it into the formula:
(floor(((floor(G_q*10^(ex-q)*b^n * b^p + 0.5) / (b^p)) * b^(-n) * 10^(-
ex)) * 10^q + 0.5) / (10^q)) * 10^ex == G_q*10^(ex-q)
Then set G_q to it's lowest value, which is 1. (Reason: we seek the
limit, where the floor cuts off important stuff. For this, we need to
consider the lowest positive value.)
This then simplfies to:
(floor(((floor(10^(ex-q)*b^n * b^p + 0.5) / (b^p)) * b^(-n) * 10^(-
ex)) * 10^q + 0.5)) == 1
Call this the same as:
(floor(A + 0.5)) == 1
Then we have
0.5 <= A < 1.5
Then
0.5 * b^(n+p) * 10^(ex-q) <= floor(b^(n+p) * 10^(ex-q) + 0.5) < 1.5 *
b^(n+p) * 10^(ex-q)
Call this:
0.5 * S <= floor(S + 0.5) < 1.5 * S
The middle floor must yield an integer, so the following constraint
(considering the outer bounds) must hold:
ceil(0.5 * S) <= S + 0.5 <= floor(1.5 * S)
b)
Here's how to prove the C standard's bound on q.
Consider the following SUBEXPRESSION, from THE MAIN RESULT (see above)
(floor(X*b^n * b^p + 0.5) / (b^p))
The least significant coefficient of X is:
G_q*10^(ex-q)
Set G_q to 1 (the lowest possible value)
Then the contribution of this least coefficient to the subexpression
is:
(floor(10^(ex-q)*b^n * b^p + 0.5) / (b^p))
For the contribution to be visible and rounded correctly (and not
chopped off by floor), we require:
10^(ex-q)*b^n * b^p >= 1
Call this the CONTRIBUTION_REQUIREMENT.
To handle n, we'll need to consider the 2 mantissa constraints, which
were introduced almost right at the top:
-1 + log_b(X^(-1)) <= n < log_b(X^(-1))
10^(ex-1) <= X < 10^e
We can use the 2nd constraint to determine the range of
log_b(X^(-1)) which appears in the 1st constraint.
Applying ()^-1 and then log_b to the 2nd constraint yields:
log_b(10^(-ex+1)) >= log_b(X^(-1)) > log_b(10^(-ex))
which - after reformatting - is:
log_b(10^(-ex)) < log_b(X^(-1)) <= log_b(10^(-ex+1))
This allows us to get a range for the 1st constraint, i.e. a range for
n:
-1 + log_b(10^(-ex)) < -1 + log_b(X^(-1)) <= n < log_b(X^(-1)) <=
log_b(10^(-ex+1))
Now we'll apply this bound to the CONTRIBUTION_REQUIREMENT. We use the
lower bound (which is the "worst case" for the
CONTRIBUTION_REQUIREMENT) which will yield the value for q, which
*guarantees* the CONTRIBUTION_REQUIREMENT
Thus
10^(ex-q)*b^n * b^p >= 1
becomes
10^(ex-q)*b^(-1 + log_b(10^(-ex))) * b^p >= 1
or
10^(ex-q)*b^(-1) * b^(log_b(10^(-ex))) * b^p >= 1
or
10^(ex-q)*b^(-1) * 10^(-ex) * b^p >= 1
or
10^(-q)*b^(-1+p) >= 1
or
q <= log10(b^(-1+p)) = (p-1)*log10(b)
So:
q <= (p-1)*log10(b)
which proves the bound found in the C standard:
q <= floor((p-1)*log10(b))
Regards,
J.