A problem with bigdecimal.c, and proposed solution (long)

Discussion in 'Ruby' started by Javier Goizueta, Oct 21, 2004.

  1. This is about a problem in ruby/ext/bigdecimal.c. that makes
    BigDecimal#to_f use only 13 significant digits sometimes and
    That may also affect the precision of BigDecimal#sqrt.
    The problem can only be noticed in mswin32 Ruby.

    To see the problem in action take a look at the result of:

    format "%.16f",BigDecimal('1.234567890123456').to_f

    The function VpVtoD() that converts BigDecimal values to double
    has one problem which is hidden by the internal high precision
    of most compilers' floating point, but that reveals itself
    when compiled with Microsoft Visual C/C++.
    Note that mswin32, which uses that compiler, is the most popular Ruby
    implementation on Windows, because the "One-Click Installer" uses it,
    so this is an important issue.

    In big_decimal.c, DBLE_FIG is computed as:

    v = 1.0;
    DBLE_FIG = 0;
    while(v + 1.0 > 1.0) {
    ++DBLE_FIG;
    v /= 10;
    }

    In mswin32, DBLE_FIG becomes 16, which corresponds correctly
    to (one more than) the real decimal precision of a Float
    in IEEE double precision.
    Other compilers on the same hardware (such as GCC, Borland or Digital
    Mars),
    use the internal precision of IEEE extended double numbers (80 bits)
    as much as possible, and DBLE_FIG has a value of 20.

    Note: you can avoid the internal high precision of most compilers
    storing in variables the intermediate values of an expression.

    The problem with VpVtoD is that the number of "figures" (internal
    base-BASE digits, with BASE=4) to be converted is computed as:

    fig =(DBLE_FIG + BASE_FIG - 1) / BASE_FIG;

    Which does not take into account that the first "figure", m->frac[0],
    may contain less than BASE_FIG decimal digits.

    For example, BigDecimal('1.234567890123456') produces an object
    with { 1, 2345, 6789, 123, 4560 } in m->frac. In this case,
    the first "figure" holds just one significative decimal digit.
    Then, as only 4 figures are used, only 13 decimal significative
    digits are converted.
    Note that 4 "figures" can hold up to 16=DBL_FIG decimal digits,
    but in this case they hold only 13.
    With other compilers we have DBLE_FIG=20, so fig=5, and at least
    17 decimal digits are converted, which is ok.

    The result of this is, for example, that

    format "%.16f",BigDecimal('1.234567890123456').to_f

    produces "1.2345678901229999" in mswin32
    (would produce "1.2345678901230000" if it wasn't for the accumulated
    errors when converting decimal fractions to binary floating point).

    In general, this problem causes, under mswin32 versions of Ruby,
    incorrect values for BigDecimal#to_f and BigDecimal#sqrt.

    VpSqrt() uses VpVtoD() so it is affected by the problem, but
    it also uses DBLE_FIG directly, and may have similar problems,
    but I haven't look at it.

    What I propose as a solution is:

    1. Instead of the rather ambiguous DBLE_FIG, define a DOUBLE_DIGITS
    value, as the number of decimal digits that a Float can hold
    (exactly for integral values), which is one less than BASE_FIG when
    correctly computed without in extra internal precision;
    BASE_FIG would then be the number of decimal digits necessary
    for the round-trip conversion double->BigDecimal->double when
    proper rounding is used in the conversions.
    For round-trip conversion with truncation, DOUBLE_DIGITS+1 (=DBLE_FIG+2)
    digits are needed.
    DOUBLE_DIGITS is 15 for IEEE double, and corresponds
    to DBL_DIG in <float.h> so it doesn't need to be computed.
    Anyway, it can be computed (in Ruby, assuming Float uses double) as:

    (BITS*Math.log(2)/Math.log(10)).floor

    BITS corresponds to DBL_MANT_DIG in <float.h> and can be computed as:

    x = 1.0
    bits = 0
    begin
    bits += 1
    x /= 2
    end while 1!=x+1
    bits

    Or also as: 2-Math.frexp(Float::EPSILON)[1]
    In any case, care should be taken so that DOUBLE_DIGITS depends only on
    the underlying double type properties, and not on compiler handling of
    floating point operations.

    2. In VpVtoD(), compute fig to assure that FLOAT_DIGITS+2 are
    always converted, which preserves round-trip conversions,
    for example as:

    fig = 1 + (DOUBLE_DIGITS + BASE_FIG) / BASE_FIG;

    3. To avoid breaking existing code, BigDecimal#double_fig could be
    defined
    to return DOUBLE_DIGITS+2, assuming that double_fig was interpreted as
    enough decimal digits to represent unambiguosly a Float value.

    Here is a patch for bigdecimal.c Rev.1.43 that does these things
    I also have replaced (DBLE_FIG + BASE_FIG - 1) / BASE_FIG in VpSqrt,
    but have I'm not really sure what I'm doing there. I haven't
    done extensive testing on this, but seems to work.

    --------------------------------------------------------------BEGIN OF
    PATCH
    --- Rev.1.43/bigdecimal.c 2004-10-19 12:24:40.000000000 +0200
    +++ bigdecimal.c 2004-10-21 17:10:33.000000000 +0200
    @@ -1370,7 +1370,7 @@
    /* The value of BASE**2 + BASE must be represented */
    /* within one U_LONG. */
    static U_LONG HALF_BASE = 5000L;/* =BASE/2 */
    -static S_LONG DBLE_FIG = 8; /* figure of double */
    +static S_LONG DOUBLE_DIGITS = DBL_DIG; /* figure of double */
    static U_LONG BASE1 = 1000L; /* =BASE/10 */

    static Real *VpConstOne; /* constant 1.0 */
    @@ -1515,7 +1515,7 @@
    VP_EXPORT U_LONG
    VpDblFig(void)
    {
    - return DBLE_FIG;
    + return DOUBLE_DIGITS+2;
    }

    VP_EXPORT U_LONG
    @@ -1760,7 +1760,7 @@
    * by one U_LONG word(LONG) in the computer used.
    *
    * [Returns]
    - * DBLE_FIG ... OK
    + * DOUBLE_DIGITS ... OK
    */
    VP_EXPORT U_LONG
    VpInit(U_LONG BaseVal)
    @@ -1799,15 +1799,6 @@
    gnAlloc = 0;
    #endif /* _DEBUG */

    - /* Determine # of digits available in one 'double'. */
    -
    - v = 1.0;
    - DBLE_FIG = 0;
    - while(v + 1.0 > 1.0) {
    - ++DBLE_FIG;
    - v /= 10;
    - }
    -
    #ifdef _DEBUG
    if(gfDebug) {
    printf("VpInit: BaseVal = %lu\n", BaseVal);
    @@ -1819,7 +1810,7 @@
    }
    #endif /* _DEBUG */

    - return DBLE_FIG;
    + return DOUBLE_DIGITS;
    }

    VP_EXPORT Real *
    @@ -3401,7 +3392,7 @@
    * [Output]
    * *d ... fraction part of m(d = 0.xxxxxxx). where # of 'x's is fig.
    * *e ... U_LONG,exponent of m.
    - * DBLE_FIG ... Number of digits in a double variable.
    + * DOUBLE_DIGITS ... Number of digits in a double variable.
    *
    * m -> d*10**e, 0<d<BASE
    * [Returns]
    @@ -3448,7 +3439,7 @@
    goto Exit;
    }
    /* Normal number */
    - fig =(DBLE_FIG + BASE_FIG - 1) / BASE_FIG;
    + fig = 1 + (DOUBLE_DIGITS + BASE_FIG) / BASE_FIG;
    ind_m = 0;
    mm = Min(fig,(m->Prec));
    *d = 0.0;
    @@ -3465,7 +3456,7 @@
    if(gfDebug) {
    VPrint(stdout, " VpVtoD: m=%\n", m);
    printf(" d=%e * 10 **%ld\n", *d, *e);
    - printf(" DBLE_FIG = %ld\n", DBLE_FIG);
    + printf(" DOUBLE_DIGITS = %ld\n", DOUBLE_DIGITS);
    }
    #endif /*_DEBUG */
    return f;
    @@ -3663,7 +3654,7 @@
    }
    VpDtoV(y, sqrt(val)); /* y <- sqrt(val) */
    y->exponent += n;
    - n = (DBLE_FIG + BASE_FIG - 1) / BASE_FIG;
    + n = (DOUBLE_DIGITS + BASE_FIG) / BASE_FIG;
    y->MaxPrec = (U_LONG)Min(n , y_prec);
    f->MaxPrec = y->MaxPrec + 1;
    n = y_prec*((S_LONG)BASE_FIG);
    --------------------------------------------------------------END OF
    PATCH

    With these changes VpVtoD() works better, but it's still not perfect;
    in particular, x.to_s.to_f is better than x.to_f, even
    if we truncate x.to_s at DOUBLE_DIGITS+2 significative digits, which
    in theory is what VpVtoD does.

    For example:

    (BigDecimal('1')/3).to_f != 1.0/3

    but

    (BigDecimal('1')/3).to_s.to_f == 1.0/3

    or, using no more digits than VpVtoD() does:

    s,d,b,e = (BigDecimal('1')/3).split
    n = [d.size, BigDecimal.double_fig].min
    (d[0...n]+"E#{e-n}").to_f == 1.0/3 # this is true!

    Which proves that VpVtoD could do better. (note that these examples
    are meant for the patched version of bigdecimal)

    - - - - - -

    Last minute note:
    I've realized that in the unpatched version of bigdecimal.c:

    (BigDecimal('1')/3).to_f == 1.0/3

    That means that adding more than necessary "figures" degrades
    the result. This could be solved by being more careful when
    computing fig:

    fig =(DOUBLE_DIGITS+K + 4-n + BASE_FIG - 1) / BASE_FIG;

    where n is the number of significant digits in m->frac[0], i.e.
    one more than the (floor of the) base-BASE logarithm of m->frac[0].
    For BASE=4 n would be (d0<10 ? 1 : (d0<100 ? 2 : (d0<1000 ? 3 : 4)))
    with d0=m->frac[0].

    With K=1 (convert 16 digits for IEEE) this satifies:

    (BigDecimal('1')/3).to_f == 1.0/3

    But not with K=2, which is theoretically what we should take
    (17 digits for IEEE) to have enough precision in all cases.


    Anyway, I think this should be look at carefully by someone else,
    including
    the VpSqrt code. It would be nice if someone writes a more precise
    VpDtoV.

    - - - - - -

    Finally I'd like to propose the inclusion of some Float-related
    constants in the Ruby core.
    I sometimes have used this:

    class Float
    x,bits = 1.0,0
    begin
    bits += 1
    x /= 2
    end while 1!=x+1

    # Number of binary digits of precision in a Float
    BITS = bits

    # Number of decimal digits of precision in a Float
    DIGITS = (BITS*Math.log(2)/Math.log(10)).floor

    # Decimal precision required to represent a Float
    DIGITS2 = DIGITS+2

    EPSILON = Math.ldexp(0.5,2-BITS) unless const_defined?:)EPSILON)
    end

    These constants (EPSILON has already been defined) could
    be defined in the core using <float.h> constants DBL_MANT_DIG and
    DBL_DIG.
    The values for IEEE double precision are:
    Float::BITS = 53
    Float::DIGITS = 15
    Float::DIGITS2 = 17
    Float::EPSILON = 2.2204460492503131E-16

    - - - - - - - -
    Regards,

    --Javier
    Javier Goizueta, Oct 21, 2004
    #1
    1. Advertising

  2. > Last minute note:
    > I've realized that in the unpatched version of bigdecimal.c:
    >
    > (BigDecimal('1')/3).to_f == 1.0/3
    >
    > That means that adding more than necessary "figures" degrades
    > the result. This could be solved by being more careful when
    > computing fig:
    >
    > fig =(DOUBLE_DIGITS+K + 4-n + BASE_FIG - 1) / BASE_FIG;


    I've done a small change in the while loop of that improves the
    accuracy: accumulate div by multiplying, not dividing:

    while(ind_m < mm) {
    div *=(flt_t)((S_INT)BASE);
    *d += ((flt_t) ((S_INT)m->frac[ind_m++])) / div;
    }

    Using this change and the previous one properly applied, the VpVtoD
    function is working quite well now; Here's a patch from Rev.1.43 with
    all the changes:


    --------------------------------------------------------------BEGIN OF
    PATCH
    --- Rev.1.43/bigdecimal.c 2004-10-19 12:24:40.000000000 +0200
    +++ bigdecimal.c 2004-10-21 20:21:58.000000000 +0200
    @@ -1370,7 +1370,7 @@
    /* The value of BASE**2 + BASE must be represented */
    /* within one U_LONG. */
    static U_LONG HALF_BASE = 5000L;/* =BASE/2 */
    -static S_LONG DBLE_FIG = 8; /* figure of double */
    +static S_LONG DOUBLE_DIGITS = DBL_DIG; /* figure of double */
    static U_LONG BASE1 = 1000L; /* =BASE/10 */

    static Real *VpConstOne; /* constant 1.0 */
    @@ -1515,7 +1515,7 @@
    VP_EXPORT U_LONG
    VpDblFig(void)
    {
    - return DBLE_FIG;
    + return DOUBLE_DIGITS+2;
    }

    VP_EXPORT U_LONG
    @@ -1760,7 +1760,7 @@
    * by one U_LONG word(LONG) in the computer used.
    *
    * [Returns]
    - * DBLE_FIG ... OK
    + * DOUBLE_DIGITS ... OK
    */
    VP_EXPORT U_LONG
    VpInit(U_LONG BaseVal)
    @@ -1799,15 +1799,6 @@
    gnAlloc = 0;
    #endif /* _DEBUG */

    - /* Determine # of digits available in one 'double'. */
    -
    - v = 1.0;
    - DBLE_FIG = 0;
    - while(v + 1.0 > 1.0) {
    - ++DBLE_FIG;
    - v /= 10;
    - }
    -
    #ifdef _DEBUG
    if(gfDebug) {
    printf("VpInit: BaseVal = %lu\n", BaseVal);
    @@ -1819,7 +1810,7 @@
    }
    #endif /* _DEBUG */

    - return DBLE_FIG;
    + return DOUBLE_DIGITS;
    }

    VP_EXPORT Real *
    @@ -3401,7 +3392,7 @@
    * [Output]
    * *d ... fraction part of m(d = 0.xxxxxxx). where # of 'x's is fig.
    * *e ... U_LONG,exponent of m.
    - * DBLE_FIG ... Number of digits in a double variable.
    + * DOUBLE_DIGITS ... Number of digits in a double variable.
    *
    * m -> d*10**e, 0<d<BASE
    * [Returns]
    @@ -3413,7 +3404,8 @@
    VP_EXPORT int
    VpVtoD(double *d, S_LONG *e, Real *m)
    {
    - U_LONG ind_m, mm, fig;
    + int n;
    + U_LONG ind_m, mm, fig, d0;
    double div;
    int f = 1;

    @@ -3448,14 +3440,20 @@
    goto Exit;
    }
    /* Normal number */
    - fig =(DBLE_FIG + BASE_FIG - 1) / BASE_FIG;
    + d0 = m->frac[0];
    + n = 1;
    + while (d0>=10) {
    + n++;
    + d0 /= 10;
    + }
    + fig = (DOUBLE_DIGITS+2 + 4-n + BASE_FIG - 1) / BASE_FIG;
    ind_m = 0;
    mm = Min(fig,(m->Prec));
    *d = 0.0;
    div = 1.;
    while(ind_m < mm) {
    - div /=(double)((S_INT)BASE);
    - *d = *d +((double) ((S_INT)m->frac[ind_m++])) * div;
    + div *=(double)((S_INT)BASE);
    + *d += ((double) ((S_INT)m->frac[ind_m++])) / div;
    }
    *e = m->exponent * ((S_INT)BASE_FIG);
    *d *= VpGetSign(m);
    @@ -3465,7 +3463,7 @@
    if(gfDebug) {
    VPrint(stdout, " VpVtoD: m=%\n", m);
    printf(" d=%e * 10 **%ld\n", *d, *e);
    - printf(" DBLE_FIG = %ld\n", DBLE_FIG);
    + printf(" DOUBLE_DIGITS = %ld\n", DOUBLE_DIGITS);
    }
    #endif /*_DEBUG */
    return f;
    @@ -3663,7 +3661,7 @@
    }
    VpDtoV(y, sqrt(val)); /* y <- sqrt(val) */
    y->exponent += n;
    - n = (DBLE_FIG + BASE_FIG - 1) / BASE_FIG;
    + n = (DOUBLE_DIGITS + BASE_FIG) / BASE_FIG;
    y->MaxPrec = (U_LONG)Min(n , y_prec);
    f->MaxPrec = y->MaxPrec + 1;
    n = y_prec*((S_LONG)BASE_FIG);
    --------------------------------------------------------------END OF
    PATCH


    Note: in my previous post,

    format "%.16f",BigDecimal('1.234567890123456').to_f

    meant to be

    format "%.15f",BigDecimal('1.234567890123456').to_f

    --Javier
    Javier Goizueta, Oct 21, 2004
    #2
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. George Marsaglia

    Assigning unsigned long to unsigned long long

    George Marsaglia, Jul 8, 2003, in forum: C Programming
    Replies:
    1
    Views:
    645
    Eric Sosman
    Jul 8, 2003
  2. Replies:
    3
    Views:
    1,207
  3. Mathieu Dutour

    long long and long

    Mathieu Dutour, Jul 17, 2007, in forum: C Programming
    Replies:
    4
    Views:
    447
    santosh
    Jul 24, 2007
  4. Bart C

    Use of Long and Long Long

    Bart C, Jan 9, 2008, in forum: C Programming
    Replies:
    27
    Views:
    767
    Peter Nilsson
    Jan 15, 2008
  5. Stanimir Stamenkov
    Replies:
    4
    Views:
    2,551
    Stanimir Stamenkov
    Jul 18, 2008
Loading...

Share This Page