Floating-point bit hacking: iterated nextafter() without loop?


Daniel Vallstrom

I am having trouble with floating point addition because of the limited
accuracy of floating types. Consider e.g. x0 += f(n) where x0 and f are
of some floating type. Sometimes x0 is much larger than f(n) so that
x0 + f(n) == x0. For example, x0 could be 2**300*(1.1234...) while f(n)
is 2**100*(1.4256...). If x0 + f(n) == x0 I still sometimes want the
addition of f(n) to x0 to have some effect on x0. A good solution seems
to be to update x0 to the nth value greater than x0. Something
equivalent to this:

for ( ; n != 0; n-- )
{
x0 = nextafter( x0, DBL_MAX );
}

The problem with the above loop is that it may take too long because n
could be quite large. Moreover, very many such loops would be executed.

Disappointingly, gcc doesn't seem to be able to optimize over nextafter
at all.

However, with only some modest bit hacking and without assuming much
about the floating point implementation it is possible to calculate
the nth greater value fast. At the end of this post is a program
showing such a solution. Of course it is not guaranteed to work as
intended but is there any non-odd platform where it will fail? If
there are any places where the solution could fail, what would the
consequences be? Is there a better solution to the problem than the
one below?


Daniel Vallstrom



// Tests consecutive nextafter calls and a way to compute the same thing fast
// without iterated nextafter calls.
// Daniel Vallstrom, 041014.
// Compile with e.g.: gcc -std=c99 -pedantic -Wall -O3 nextafterNTest.c -lm

/* The program should print something like:
n: 0x1000000 (16777216)
x0: 0x1.001p+8
xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
The xn loop took 0.8 cpu seconds.
m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022
e: exp2( floor( log2( x0 ) ) ): 0x1p+8
z: e | m: 0x1.0000001p+8
z1: z - e: 0x1p-20
z2: x0 + z1: 0x1.0010001p+8 (==xn)
u: n unrolled nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8 (==xn)
The u loops took 0.7 cpu seconds.
*/

#include <stdio.h>
#include <float.h>
#include <math.h>
#include <assert.h>
#include <time.h>


#define nextafter2M( x, y ) (x) = nextafter((x),(y)); (x) = nextafter((x),(y))
#define nextafter4M( x, y ) nextafter2M( (x), (y) ); nextafter2M( (x), (y) )
#define nextafter8M( x, y ) nextafter4M( (x), (y) ); nextafter4M( (x), (y) )
#define nextafter16M( x, y ) nextafter8M( (x), (y) ); nextafter8M( (x), (y) )
#define nextafter32M( x, y ) nextafter16M( (x), (y) ); nextafter16M( (x), (y) )
#define nextafter64M( x, y ) nextafter32M( (x), (y) ); nextafter32M( (x), (y) )


int main( void )
{
clock_t clockAtLoopStart; // Used to time the start of loops.
clock_t clockAtLoopEnd; // Used to time the end of loops.

// The number of nextafter iterations to use.
unsigned int n = ( 1u << 24 );
printf( "\nn: %#x (%u)\n", n, n );

// The base double to apply nextafter to.
double x0 = 0x1.001p8;
printf( "x0: %a\n", x0 );

// The nth double greater than x0.
double xn = x0;
clockAtLoopStart = clock();
for ( unsigned int k = 0; k != n; k++ )
{
xn = nextafter( xn, DBL_MAX );
}
clockAtLoopEnd = clock();
printf( "xn: n nextafter( x0, DBL_MAX ) iterations: %a\n", xn );
// xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
printf( " The xn loop took %.1f cpu seconds.\n",
(double) ( clockAtLoopEnd - clockAtLoopStart ) / CLOCKS_PER_SEC );
// The xn loop took 0.8 cpu seconds.


// The goal is to compute xn fast. At least gcc -O3 does *not* optimize
// iterated nextafter calls (the above loop). If the loop (n) is long (or
// many loops have to be done) it takes too long to run.

// Calculate a useful mantissa:
double m = nextafter( 0.0, DBL_MAX );
m = n*m;
printf( "m: n * nextafter( 0.0, DBL_MAX ): %a\n", m );
// m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022

// Calculate a useful exponent part.
double e = exp2( floor( log2( x0 ) ) );
printf( "e: exp2( floor( log2( x0 ) ) ): %a\n", e );
// e: exp2( floor( log2( x0 ) ) ): 0x1p+8

// Combine e and m.
double z = e;
for ( unsigned char * zPtr = (unsigned char *) &z, * zEnd = zPtr + sizeof(z),
* mPtr = (unsigned char *) &m;
zPtr != zEnd; zPtr++, mPtr++ )
{
*zPtr |= *mPtr;
}
printf( "z: e | m: %a\n", z );
// z: e | m: 0x1.0000001p+8

// Remove the faulty first mantissa bit in z, the one coming from e.
double z1 = z - e;
printf( "z1: z - e: %a\n", z1 );
// z1: z - e: 0x1p-20

// Finally we are ready to add the value to x0.
double z2 = x0 + z1;
printf( "z2: x0 + z1: %a (==xn)\n", z2 );
// z2: x0 + z1: 0x1.0010001p+8 (==xn)

assert( z2 == xn );


// Test if unrolling the nextafter calls helps. (It doesn't (besides
// the speed-up from the unrolling itself).)

double u = x0;

clockAtLoopStart = clock();
for ( unsigned int k = n/64; k != 0; k-- )
{
nextafter64M( u, DBL_MAX );
}

for ( unsigned int k = n%64; k != 0; k-- )
{
u = nextafter( u, DBL_MAX );
}
clockAtLoopEnd = clock();

printf( "u: n unrolled nextafter( x0, DBL_MAX ) iterations: %a (==xn)\n",
u );
// u: n unrolled nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8 (==xn)
assert( u == xn );
printf( " The u loops took %.1f cpu seconds.\n",
(double) ( clockAtLoopEnd - clockAtLoopStart ) / CLOCKS_PER_SEC );
// The u loops took 0.7 cpu seconds.


return 0;
}
 

Daniel Vallstrom

Ack! No replies! Bump. Or should I take the fact that there have
been no replies to mean that the program supplied in the first
post does work in practice?

I am having trouble with floating point addition because of the limited
accuracy of floating types. Consider e.g. x0 += f(n) where x0 and f are
of some floating type. Sometimes x0 is much larger than f(n) so that
x0 + f(n) == x0. For example, x0 could be 2**300*(1.1234...) while f(n)
is 2**100*(1.4256...). If x0 + f(n) == x0 I still sometimes want the
addition of f(n) to x0 to have some effect on x0. A good solution seems
to be to update x0 to the nth value greater than x0.

For the background to the problem, I forgot to say that f(n)>0 and
increases with n, i.e. n0<n1 -> f(n0)<f(n1). Also, the various
floating values involved aren't god-given but rather some heuristic
measures. Hence, when you want to add f(n) to x0, it sometimes makes
sense to increment x0 n times instead of doing nothing when the
accuracy of the floating type is too small (i.e. when x0 + f(n) == x0).
Something equivalent to this:

for ( ; n != 0; n-- )
{
x0 = nextafter( x0, DBL_MAX );
}

The problem with the above loop is that it may take too long because n
could be quite large. Moreover, very many such loops would be executed.

To see that the approach involving iterated nextafter() calls is
infeasible, consider that values of n could on average be around
2**24, like in the program in the first post. Each such n-iterated
nextafter() loop will take about a second to compute. Furthermore,
thousands or millions such loops could very well have to be executed.

[snip suggested solution to the problem]

Any input is appreciated. Regards,


Daniel Vallstrom
 

Dik T. Winter

> Ack! No replies! Bump. Or should I take the fact that there have
> been no replies to mean that the program supplied in the first
> post does work in practice?

Perhaps everybody is stumped? You will not find much floating-point
expertise in this newsgroup.
>
> For the background to the problem, I forgot to say that f(n)>0 and
> increases with n, i.e. n0<n1 -> f(n0)<f(n1). Also, the various
> floating values involved aren't god-given but rather some heuristic
> measures. Hence, when you want to add f(n) to x0, it sometimes makes
> sense to increment x0 n times instead of doing nothing when the
> accuracy of the floating type is too small (i.e. when x0 + f(n) == x0).
>
>
> To see that the approach involving iterated nextafter() calls is
> infeasible, consider that values of n could on average be around
> 2**24, like in the program in the first post. Each such n-iterated
> nextafter() loop will take about a second to compute. Furthermore,
> thousands or millions such loops could very well have to be executed.

My suggestion for the loop would be to break the floating-point number up
into its constituent parts and work with them. In that case you add n
(suitably scaled) to the mantissa and recombine. Be careful with overflow
in the mantissa when you add... See frexp and ldexp (at least on my system).
 

CBFalconer

Accumulated ugly effects of top-posting fixed, I hope.

Dik T. Winter said:
Perhaps everybody is stumped? You will not find much floating-point
expertise in this newsgroup.

I would attempt to keep the accumulated corrections separate.
Ideally such a value would be accumulated starting with the
smallest correction, but we can't necessarily have everything.

double x0, xnow, delta, n;

for (n = 1, x0 = XSTART, delta = 0; n <= NLAST; n++) {
delta += f(n);
xnow = x0 + delta;
}

which should at least mitigate the problem. FP arithmetic is
inherently an approximation, so we can never have everything. At
best we can define the error bounds. In this case the bounds will
be a function of the relative magnitudes of delta, f(n), and x0.
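A runnable version of this separate-accumulator idea, with hypothetical stand-ins for XSTART, NLAST, and f() (any large base value and small positive terms show the effect):

```c
/* The small corrections are summed into delta first, so they are not
 * rounded away one by one against the large x0.  XSTART, NLAST, and
 * f() are made-up placeholders for the pseudocode above. */
#define XSTART 1e16   /* large base: adding 1.0 to it directly is lost */
#define NLAST  4

static double f(int n) { (void) n; return 1.0; }

static double accumulate(void)
{
    double x0 = XSTART, delta = 0.0, xnow = x0;
    for (int n = 1; n <= NLAST; n++) {
        delta += f(n);      /* small terms add exactly among themselves */
        xnow = x0 + delta;  /* one rounded addition instead of NLAST */
    }
    return xnow;
}
```

With these placeholder values the naive running sum never moves off 1e16, while the delta accumulator preserves all four corrections.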
 

Chris Barts

Daniel said:
Ack! No replies! Bump.

Bump? This is Usenet. Good newsreaders don't bump, they just add
responses to the original thread, which stays put, visually speaking.
Or should I take the fact that there have
been no replies to mean that the program supplied in the first
post does work in practice?

Take it to mean that the comp.lang.c folks don't know enough about
floating-point math at the bit level to help you.

I'd suggest some other newsgroups, but I don't know enough to help you
there, either. comp.programming is probably a good catchall for this
sort of thing. Or you could find some web discussion groups.
 

Daniel Vallstrom

Dik T. Winter said:
My suggestion for the loop would be to break the floating-point number up
into its constituent parts and work with them. In that case you add n
(suitably scaled) to the mantissa and recombine. Be careful with overflow
in the mantissa when you add... See frexp and ldexp (at least on my system).

Thanks! That got me thinking. I'll add an unpolished new
solution at the end of this post. (I ought to have found
it directly, but I stopped looking after failing to find
standard library functions that could do the bit hacking
(mantissa-exponent combining) I wanted.)

I guess that when the floating point exponent base (FLT_RADIX)
isn't 2, the code won't work, at least not without some fixing.

Daniel Vallstrom


// Tests consecutive nextafter calls and a way to compute the same thing fast
// without iterated nextafter calls.
// Daniel Vallstrom, 041018.
// Compile with e.g.: gcc -std=c99 -pedantic -Wall -O3 nextafterNTest.c -lm

/* The program should print something like:
n: 0x1000000 (16777216)
x0: 0x1.001p+8
xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
The xn loop took 0.8 cpu seconds.
m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022
e: exp2( floor( log2( x0 ) ) ): 0x1p+8
z: e | m: 0x1.0000001p+8
z1: z - e: 0x1p-20
z2: x0 + z1: 0x1.0010001p+8 (==xn)
x0Exp: frexp( x0, &x0Exp ): 9
mnorm: DBL_MIN + m: 0x1.0000001p-1022
zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): 0x1.0000001p+8 (==z)
zb1: zb - e: 0x1p-20
zb2: x0 + zb1: 0x1.0010001p+8 (==xn)
*/

#include <stdio.h>
#include <float.h>
#include <math.h>
#include <assert.h>
#include <time.h>


int main( void )
{
#if FLT_RADIX != 2
#error Case not supported.
#endif

clock_t clockAtLoopStart; // Used to time the start of loops.
clock_t clockAtLoopEnd; // Used to time the end of loops.

// The number of nextafter iterations to use. Can be anything.
unsigned int n = ( 1u << 24 );
printf( "\nn: %#x (%u)\n", n, n );

// The base double to apply nextafter to. Can be anything.
double x0 = 0x1.001p8;
printf( "x0: %a\n", x0 );

// The nth double greater than x0.
double xn = x0;
clockAtLoopStart = clock();
for ( unsigned int k = 0; k != n; k++ )
{
xn = nextafter( xn, DBL_MAX );
}
clockAtLoopEnd = clock();
printf( "xn: n nextafter( x0, DBL_MAX ) iterations: %a\n", xn );
// xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
printf( " The xn loop took %.1f cpu seconds.\n",
(double) ( clockAtLoopEnd - clockAtLoopStart ) / CLOCKS_PER_SEC );
// The xn loop took 0.8 cpu seconds.


// The goal is to compute xn fast. At least gcc -O3 does *not* optimize
// iterated nextafter calls (the above loop). If the loop (n) is long (or
// many loops have to be done) it takes too long to run.

// Calculate a useful mantissa:
double m = nextafter( 0.0, DBL_MAX );
m = n*m;
printf( "m: n * nextafter( 0.0, DBL_MAX ): %a\n", m );
// m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022

// Calculate a useful exponent part.
double e = exp2( floor( log2( x0 ) ) );
printf( "e: exp2( floor( log2( x0 ) ) ): %a\n", e );
// e: exp2( floor( log2( x0 ) ) ): 0x1p+8

// Combine e and m.
double z = e;
for ( unsigned char * zPtr = (unsigned char *) &z, * zEnd = zPtr + sizeof(z),
* mPtr = (unsigned char *) &m;
zPtr != zEnd; zPtr++, mPtr++ )
{
*zPtr |= *mPtr;
}
printf( "z: e | m: %a\n", z );
// z: e | m: 0x1.0000001p+8

// Remove the faulty first mantissa bit in z, the one coming from e.
double z1 = z - e;
printf( "z1: z - e: %a\n", z1 );
// z1: z - e: 0x1p-20

// Finally we are ready to add the value to x0.
double z2 = x0 + z1;
printf( "z2: x0 + z1: %a (==xn)\n", z2 );
// z2: x0 + z1: 0x1.0010001p+8 (==xn)

assert( z2 == xn );


// Here is an approach using frexp and ldexp instead of bit-oring.

int x0Exp;
frexp( x0, &x0Exp );
printf( "x0Exp: frexp( x0, &x0Exp ): %d\n", x0Exp );
// x0Exp: frexp( x0, &x0Exp ): 9

// Normalize the mantissa.
double mnorm = DBL_MIN + m;
printf( "mnorm: DBL_MIN + m: %a\n", mnorm );
// mnorm: DBL_MIN + m: 0x1.0000001p-1022

double zb = ldexp( mnorm, x0Exp-DBL_MIN_EXP );
printf( "zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): %a (==z)\n", zb );
// zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): 0x1.0000001p+8 (==z)

assert( zb == z );
// The rest is just like above.

// Remove the faulty first mantissa bit in zb.
double zb1 = zb - e;
printf( "zb1: zb - e: %a\n", zb1 );
// zb1: zb - e: 0x1p-20

// We are ready to add the value to x0.
double zb2 = x0 + zb1;
printf( "zb2: x0 + zb1: %a (==xn)\n", zb2 );
// zb2: x0 + zb1: 0x1.0010001p+8 (==xn)

assert( zb2 == xn );


return 0;
}
 

Eric Sosman

Daniel said:
I am having trouble with floating point addition because of the limited
accuracy of floating types. Consider e.g. x0 += f(n) where x0 and f are
of some floating type. Sometimes x0 is much larger than f(n) so that
x0 + f(n) == x0. For example, x0 could be 2**300*(1.1234...) while f(n)
is 2**100*(1.4256...). If x0 + f(n) == x0 I still sometimes want the
addition of f(n) to x0 to have some effect on x0. A good solution seems
to be to update x0 to the nth value greater than x0. Something
equivalent to this:

for ( ; n != 0; n-- )
{
x0 = nextafter( x0, DBL_MAX );
}
[...]

Look up "Kahan's summation formula."
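For reference, a minimal sketch of Kahan's compensated summation; it relies on strict IEEE rounding, so options like gcc's -ffast-math can optimize the compensation term away:

```c
/* Kahan (compensated) summation: the running compensation c recovers
 * the low-order bits that a plain big-plus-small addition discards. */
static double kahan_sum(const double *a, int len)
{
    double sum = 0.0, c = 0.0;      /* c accumulates lost low bits */
    for (int i = 0; i < len; i++) {
        double y = a[i] - c;        /* corrected next term */
        double t = sum + y;         /* low bits of y are lost here */
        c = (t - sum) - y;          /* algebraically recover them */
        sum = t;
    }
    return sum;
}
```

This fits the original x0 += f(n) problem directly: the small f(n) terms survive in the compensation instead of vanishing against a large sum.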
 

Daniel Vallstrom

Dik T. Winter said:
[snip question asking how to compute the following loop in a fast manner]

for ( ; n != 0; n-- )
{
x0 = nextafter( x0, DBL_MAX );
}
My suggestion for the loop would be to break the floating-point number up
into its constituent parts and work with them. In that case you add n
(suitably scaled) to the mantissa and recombine. Be careful with overflow
in the mantissa when you add... See frexp and ldexp (at least on my system).

Thanks again. Here is a simple solution:

int x0Exp;
frexp( x0, &x0Exp );
x0 += ldexp( n * nextafter( 0.0, DBL_MAX ), x0Exp - DBL_MIN_EXP );

I'm still miffed that I didn't find the above solution directly
(even though hindsight is always 20/20). I'll post a full program with
the solution at the end.

Daniel Vallstrom


// Tests consecutive nextafter calls and ways to compute the same thing fast
// without iterated nextafter calls.
// Daniel Vallstrom, 041019.
// Compile with e.g.: gcc -std=c99 -pedantic -Wall -O3 nextafterNTest.c -lm

/* The program should print something like:
n: 0x1000000 (16777216)
x0: 0x1.001p+8
xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
The xn loop took 0.8 cpu seconds.
m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022
e: exp2( floor( log2( x0 ) ) ): 0x1p+8
z: e | m: 0x1.0000001p+8
z1: z - e: 0x1p-20
z2: x0 + z1: 0x1.0010001p+8 (==xn)
x0Exp: frexp( x0, &x0Exp ): 9
mnorm: DBL_MIN + m: 0x1.0000001p-1022
zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): 0x1.0000001p+8 (==z)
zb1: zb - e: 0x1p-20 (==z1)
zb2: x0 + zb1: 0x1.0010001p+8 (==xn)
zc1: ldexp( m, x0Exp-DBL_MIN_EXP ): 0x1p-20 (==z1)
zc2: x0 + zc1: 0x1.0010001p+8 (==xn)
*/

#include <stdio.h>
#include <float.h>
#include <math.h>
#include <assert.h>
#include <time.h>


int main( void )
{
#if FLT_RADIX != 2
#error Case not supported.
#endif

clock_t clockAtLoopStart; // Used to time the start of loops.
clock_t clockAtLoopEnd; // Used to time the end of loops.

// The number of nextafter iterations to use. Can be anything.
unsigned int n = ( 1u << 24 );
printf( "\nn: %#x (%u)\n", n, n );

// The base double to apply nextafter to. Can be anything.
double x0 = 0x1.001p8;
printf( "x0: %a\n", x0 );

// The nth double greater than x0.
double xn = x0;
clockAtLoopStart = clock();
for ( unsigned int k = 0; k != n; k++ )
{
xn = nextafter( xn, DBL_MAX );
}
clockAtLoopEnd = clock();
printf( "xn: n nextafter( x0, DBL_MAX ) iterations: %a\n", xn );
// xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
printf( " The xn loop took %.1f cpu seconds.\n",
(double) ( clockAtLoopEnd - clockAtLoopStart ) / CLOCKS_PER_SEC );
// The xn loop took 0.8 cpu seconds.


// The goal is to compute xn fast. At least gcc -O3 does *not* optimize
// iterated nextafter calls (the above loop). If the loop (n) is long (or
// many loops have to be done) it takes too long to run.

// Calculate a useful mantissa:
double m = nextafter( 0.0, DBL_MAX );
m = n*m;
printf( "m: n * nextafter( 0.0, DBL_MAX ): %a\n", m );
// m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022

// Calculate a useful exponent part.
double e = exp2( floor( log2( x0 ) ) );
printf( "e: exp2( floor( log2( x0 ) ) ): %a\n", e );
// e: exp2( floor( log2( x0 ) ) ): 0x1p+8

// Combine e and m.
double z = e;
for ( unsigned char * zPtr = (unsigned char *) &z, * zEnd = zPtr + sizeof(z),
* mPtr = (unsigned char *) &m;
zPtr != zEnd; zPtr++, mPtr++ )
{
*zPtr |= *mPtr;
}
printf( "z: e | m: %a\n", z );
// z: e | m: 0x1.0000001p+8

// Remove the faulty first mantissa bit in z, the one coming from e.
double z1 = z - e;
printf( "z1: z - e: %a\n", z1 );
// z1: z - e: 0x1p-20

// Finally we are ready to add the value to x0.
double z2 = x0 + z1;
printf( "z2: x0 + z1: %a (==xn)\n", z2 );
// z2: x0 + z1: 0x1.0010001p+8 (==xn)

assert( z2 == xn );


// Here is an approach using frexp and ldexp instead of bit-oring.

int x0Exp;
frexp( x0, &x0Exp );
printf( "x0Exp: frexp( x0, &x0Exp ): %d\n", x0Exp );
// x0Exp: frexp( x0, &x0Exp ): 9

// Normalize the mantissa.
double mnorm = DBL_MIN + m;
printf( "mnorm: DBL_MIN + m: %a\n", mnorm );
// mnorm: DBL_MIN + m: 0x1.0000001p-1022

// Increase the exponent of mnorm to match the exponent of x0.
double zb = ldexp( mnorm, x0Exp-DBL_MIN_EXP );
printf( "zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): %a (==z)\n", zb );
// zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): 0x1.0000001p+8 (==z)

assert( zb == z );
// The rest is just like above.

// Remove the faulty first mantissa bit in zb.
assert( e == ldexp( 1.0, x0Exp-1 ) );
double zb1 = zb - e;
printf( "zb1: zb - e: %a (==z1)\n", zb1 );
// zb1: zb - e: 0x1p-20
assert( zb1 == z1 );

// We are ready to add the value to x0.
double zb2 = x0 + zb1;
printf( "zb2: x0 + zb1: %a (==xn)\n", zb2 );
// zb2: x0 + zb1: 0x1.0010001p+8 (==xn)
assert( zb2 == xn );


// Here is a better solution.

double zc1 = ldexp( m, x0Exp-DBL_MIN_EXP );
printf( "zc1: ldexp( m, x0Exp-DBL_MIN_EXP ): %a (==z1)\n", zc1 );
// zc1: ldexp( m, x0Exp-DBL_MIN_EXP ): 0x1p-20 (==z1)
assert( zc1 == z1 );

// Add the value to x0.
double zc2 = x0 + zc1;
printf( "zc2: x0 + zc1: %a (==xn)\n", zc2 );
// zc2: x0 + zc1: 0x1.0010001p+8 (==xn)
assert( zc2 == xn );


return 0;
}
 
