assert( x > 0.0 && 1 && x == 0.0 ) holding


Christian Bau

Tim Rentsch said:
It's true that the result here is denormalized in a sense (it's as
normalized as it's possible for this value to be, given how floating
point numbers are represented). But the larger problem is more
pervasive than just numbers close to zero, or denormalized numbers.

What is the right way to handle the larger problem?

You could have a look at the Java language specification, where this
problem is handled. The first Java implementations required that all
double operations actually be done in double precision and not with
higher precision or a higher exponent range. A very reasonable demand
which unfortunately made correct implementations on x86 10 times slower
than incorrect ones in extreme cases. So now Java has "strict fp" and
"non-strict fp". In the "non-strict" case, an extended exponent range is
allowed in absolutely well-defined situations.
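
C99's rough counterpart to Java's strict/non-strict split is the
FLT_EVAL_METHOD macro in <float.h>, which reports whether an
implementation evaluates floating expressions in a wider format
(0: each type's own range and precision; 1: float and double
evaluate as double; 2: everything evaluates as long double, the
classic x87 setup; -1: indeterminable). A quick way to check what
a given compiler does:

#include <stdio.h>
#include <float.h>

int main( void )
{
    printf( "FLT_EVAL_METHOD = %d\n", (int) FLT_EVAL_METHOD );
    return 0;
}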
 

Daniel Vallstrom

Tim Rentsch said:
(e-mail address removed) (Daniel Vallstrom) writes:

[lots snipped]

[large snip]
Furthermore, this suggests another means to accomplish the forced
access that doesn't suffer the restriction of needing an L-value
(which the address-casting approach has) or need another local
variable:

static inline int
double_GE( volatile double a, volatile double b ){
return a >= b;
}

. . .

assert( double_GE( y, f() ) );

Using this approach on the test program caused the "can't possibly
succeed" assertion to fail, as desired. Incidentally, leaving
off the 'volatile' specifiers on the parameters left the assertion
in the "can't possibly succeed, yet it does" state.

Nice. This is a clean solution. However, I haven't incorporated
it yet and am still using the old, trusted solution with the
explicit extra volatile variable. Truthfully, even though I believe
that your cleaner solution will work, I feel somewhat reluctant to
convert to it since sadly I have lost trust in gcc. After all, the
simple double cast is also supposed to work. Reading e.g. Dan Pop
elsethread doesn't help to bring back trust in gcc either.
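
For concreteness, the anomaly in the thread's subject line can be
reproduced with a self-contained sketch along the following lines
(my example, not from the thread; whether it actually fires depends
on the compiler and optimization level, since it relies on x87 code
generation keeping one copy of x in an 80-bit register while spilling
another to memory as a 64-bit double):

#include <stdio.h>

static inline int
double_GE( volatile double a, volatile double b ){
    return a >= b;
}

int main( void )
{
    double tiny = 0x1p-1022;  /* smallest positive normal double */
    double x = tiny * tiny;   /* 0.0 as a 64-bit double, but
                                 0x1p-2044 in an 80-bit register */

    /* One comparison may see the register value, the other the
       stored 64-bit value, so both conjuncts can hold at once: */
    if ( x > 0.0 && x == 0.0 )
        printf( "anomaly reproduced\n" );

    /* Routing both operands through double_GE's volatile parameters
       pins them to 64 bits, so this prints 0 reliably: */
    printf( "%d\n", double_GE( x, 0x1p-1074 ) );

    return 0;
}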

[snip]
It seems right that the standard mandates that casting to double
should make the assertion do what you'd expect, and it therefore
seems right that the program not doing that means gcc has a bug.
Personally I think this specification is mildly insane; if x is a
double (and assuming x has a floating point value rather than NaN),
the expression 'x == (double)x' should ALWAYS be true.
Agreed.

But even if
we accept that using a '(double)' cast will correct the 64/80 bit
problem, this still seems like the wrong way to solve the problem,
for the same reason that using a 'volatile' forced access seems
wrong - see below.


[snip]

My next question is, what is it that you are really
trying to guarantee?

Together with what you write at the bottom of your reply it seems
that you are asking for a full-blown example containing the problem?
I'll try to cut out the relevant parts of the program with the problem
and patch it up to some smaller program.

[snip]
Rewriting the assertion - whether using (double) or volatile or some
other mechanism - and doing nothing else isn't the right way to solve
the problem. Here's my reasoning.

Why are you writing the assertion in the first place? Presumably it's
to guarantee that some other piece of code that depends on that
assertion being true is going to execute at some point in the (perhaps
near) future.

Yes and no. I'm an assert junkie and like to take a Hoare/DBC
[see e.g. wikipedia?] like approach where I assert key properties
to hold inside a function implementing an algorithm but also at
the end, when a function is finished, assert that all properties
that should hold also do so. So I'm not really worrying about some
future code but am just asserting that 1) the function is correct
and 2) that all the properties promised by the function hold. In
this sense I'm not protecting code as such but rather properties
code can use.

In the program I'll try to post at the end, the main property
to hold at the end of the function is that an array was sorted
"modulo double casts" (i.e. after removing excessive precision).
That's all you can aim for. The property is not that the array
is "absolutely sorted also for 80 bit double and you can do
whatever you want". For example, say that element i is >= element
j according to the sorting function, and that i's floating value
is x and j's is y. Then it's asserted that (double)x >= (double)y
but it might still be the case that y > x (using extra precision).
(Actually, this is precisely the case in the program at the end.)

With this approach the simple volatile solution is sufficient,
AFAICS.

For example, if

assert( y >= f() );

perhaps we are going to form the expression 'y - f()' and depend on
that value being non-negative.

Whatever it is that we're depending on should be exactly expressed by
what is in the assertion (or at least, should be implied by what is in
the assertion). If we write the assertion one way, and the later code
a different way, that won't be true - the assertion could succeed, and
the later code fail, or vice versa. So, whatever it is we do, the
assertion should be written in the very same form as the code that
the assertion is supposed to protect. Thus, if we have

static inline double
double_minus( volatile double a, volatile double b ){
return a - b;
}

/* make sure y - f() is non-negative */
assert( double_minus( y, f() ) >= 0 );

then the later code should use

difference = double_minus( y, f() );

and not

difference = y - f();

You see what I mean?

I guess so but it doesn't really apply to the DBC approach.
It seems better to try to specify more general properties to hold
and try to make the properties as strong as possible. I am
having trouble seeing a real situation where there would
be a difference. Can you give a more concrete example?

That's related to my question about
reformulating the assertion expression, so that the assertion
expression and the subsequent code that depends on it are guaranteed
to be in sync, whatever it is that the subsequent code really needs to
guarantee. Because the subsequent code may have just the same
problems that the code in the assertion expression has.

Maybe the program example I'll post at the end will clear up what
I'm trying to do.

[snip]
I've been in touch with the person who is now chairing the IEEE-754
revision committee, and he's going to take up this whole question with
the committee. He's asked for a code sample that illustrates the
'y >= f()' problem; if you get that to me I'll forward it to him.
If it's too large to post please feel free to send it to me in email.

Alright, here is a sorting program. (I'll add the background to the
program in parentheses. The program originates from a constraint
solver. You want to sort variables according to some strength
heuristic in order to branch on the strongest variables first.) The
program sorts an array a=[2,4,6...] according to a certain strength
measure. Each element (variable) p in 'a' has two strength measures
associated with it, str[p] and str[p+1] (the strength of p and its
negation -p respectively). The full strength measure of p (or rather
the pair (p,-p)) is then str[p]*str[p+1] (because the proof
derivations/search simplifications possible are proportional to the
product).

Without the volatile qualifiers, assertions fail on my x86. The output
I get with str defined as below is:

4 0x8p-1077 0x0.0000000000001p-1022
6 0x8p-2151 0x0p+0
8 0x8p-2151 0x0p+0
10 0x8p-2151 0x0p+0
2 0x8p-1079 0x0p+0

You can see that the array is not sorted w.r.t. 80-bit precision.

As said, I just cut out the program and tried to patch it up. It
seems to work but I haven't read through it yet. That will have
to wait until tomorrow though. This post is too long as it is
already. Let me know if something didn't make sense :)


Daniel Vallstrom



/* Implementation of quicksort with floating point comparisons and
assertions somewhat akin to a Hoare or DBC style.
Daniel Vallstrom, 041102.
Compile with e.g.: gcc -std=c99 -pedantic -Wall -O fpqsort.c -lm
*/


#include <stdio.h>
#include <assert.h>
#include <math.h>
#include <stdlib.h>


// The higher AssertionLevel is, the more expensive assertions are made.
#define AssertionLevel 5

#define assert3(p) assert(p)

// The main floating point type.
typedef double Strength;


// Swaps 'a' and 'b' using temp as temporary storage.
#define swapM( a, b, temp ) { (temp) = (a); (a) = (b); (b) = (temp); }


// Sorts a, b and c optimally using temp as temporary storage. f o ('+'s)
// will be applied to the elements to be sorted and then they are compared
// using <. The largest element will be placed first.
#define sort3FPM( a, b, c, temp, f, s ) \
{ \
    if ( (f)((s)+(a)) < (f)((s)+(b)) ) \
    { \
        if ( (f)((s)+(a)) < (f)((s)+(c)) ) \
        { \
            if ( (f)((s)+(b)) < (f)((s)+(c)) ) \
            { \
                swapM( (a), (c), (temp) ); \
            } \
            else \
            { \
                (temp) = (a); \
                (a) = (b); \
                (b) = (c); \
                (c) = (temp); \
            } \
        } \
        else \
        { \
            swapM( (a), (b), (temp) ); \
        } \
    } \
    else \
    { \
        if ( (f)((s)+(b)) < (f)((s)+(c)) ) \
        { \
            if ( (f)((s)+(a)) < (f)((s)+(c)) ) \
            { \
                (temp) = (a); \
                (a) = (c); \
                (c) = (b); \
                (b) = (temp); \
            } \
            else \
            { \
                swapM( (b), (c), (temp) ); \
            } \
        } \
    } \
}



// Sorts the array a. f o ('+'s) will be applied to the elements to be
// sorted and then they are compared using <. The largest element will be
// placed first. s is an array of strength for each literal. n elements
// are sorted.
// For now we'll do a recursive quicksort.
static void srt( unsigned int * a, Strength (*f)( Strength * x ),
                 Strength * s, unsigned int n )
{
    unsigned int tmp;

    switch ( n )
    {
    case 0:
    case 1:

        break;

    case 2:

        if ( f( s+a[0] ) < f( s+a[1] ) )
        {
            swapM( a[0], a[1], tmp );
        }

        break;

    case 3:

        sort3FPM( a[0], a[1], a[2], tmp, f, s );

        break;

    default:

    {
        // Pointer to the right of the pivot. *c and all values to the
        // right of c are <= p, the pivot value.
        unsigned int * c = a + n - 1;

        {
            // Set up the pivot using the first, middle and last positions.

            // Pivot pointer.
            unsigned int * pPtr = a + n/2;

            sort3FPM( *a, *pPtr, *c, tmp, f, s );

            // The pivot value.
            Strength p = f(s+*pPtr);


            // Sort in halves w.r.t. the pivot.

            // Pointer to the left of the pivot. *b and all values to the
            // left of b are >= p.
            unsigned int * b = a;

            while ( 1 )
            {
                // Search for a value < p to the left.
                do
                {
                    b++;
                }
                while ( f(s+*b) >= p && b != c );

                if ( b == c )
                {
#ifndef NDEBUG
                    // Because of a bug in gcc, a (Strength) cast doesn't
                    // ensure that excess precision is stripped. Hence
                    // the use of the intermediate volatile variables.
                    volatile Strength xvol = (Strength)f( s + c[-1] );
                    volatile Strength pvol = (Strength)p;
                    assert( xvol >= pvol );
                    xvol = f( s + *c );
                    assert( xvol <= pvol );
#endif

                    break;
                }

                // Search for a value > p to the right.
                do
                {
                    c--;
                }
                while ( f(s+*c) <= p && c != b );

                if ( c == b )
                {
#ifndef NDEBUG
                    volatile Strength xvol = (Strength)f( s + c[-1] );
                    volatile Strength pvol = (Strength)p;
                    assert( xvol >= pvol );
                    xvol = f( s + *c );
                    assert( xvol <= pvol );
#endif

                    break;
                }

                swapM( *c, *b, tmp );
            }
        }

        assert( c-a >= 1 );
        assert( n - (c-a) >= 1 );

        srt( a, f, s, c-a );
        srt( c, f, s, n - (c-a) );
    }

        break;
    }
}


// Sorts the array a. f o ('+'s) will be applied to the elements to be
// sorted and then they are compared using <. The largest element will be
// placed first. s is an array of strength for each literal. n elements
// are sorted.
static void sort( unsigned int * a, Strength (*f)( Strength * x ),
                  Strength * s, unsigned int n )
{
    srt( a, f, s, n );

    // Assert that a is sorted.
#if AssertionLevel >= 3 && !defined(NDEBUG)
    {
        unsigned int * aEnd = a + n;

        if ( a != aEnd )
        {
            // The sum of all variables in a. Used to ensure with high
            // probability that each variable occurs exactly once.
            unsigned int sum = *a;

#if 1
            printf( "\n %lu %La %a\n", (unsigned long int) *a,
                    (long double) f(s+a[0]), f(s+a[0]) ); // tmp print
#endif
            for ( a++; a != aEnd; a++ )
            {
                volatile Strength xvol = (Strength)f(s+a[-1]);
                volatile Strength yvol = (Strength)f(s+a[0]);
                assert3( xvol >= yvol );
#if 1
                printf( " %lu %La %a\n", (unsigned long int) *a,
                        (long double) f(s+a[0]), yvol ); // tmp print
#endif
                sum += *a;
            }

            assert( 2*sum == n * ( 2*n + 2 ) );
        }
    }
#endif
}


// Returns a measure of the combined strength of p and -p given the individual
// strength of p and -p. varStrPtr[0] is the strength of the unnegated variable
// (p), varStrPtr[1] is the strength of the negated variable (-p).
static Strength litPairStrengthF( Strength * varStrPtr )
{
    return varStrPtr[0] * varStrPtr[1];
}


int main( void )
{
    // The number of elements to be sorted.
    unsigned int n = 5;

    // The array to be sorted. Contains the numbers 2, 4, 6, 8, ...
    // a[i] is considered >= a[j] if litPairStrengthF( str + a[i] ) >=
    // litPairStrengthF( str + a[j] ), where str is defined below.
    unsigned int * a = malloc( sizeof *a * n );

    Strength * str = malloc( sizeof *str * ( 2*n + 2 ) );

    if ( a == NULL || str == NULL )
    {
        printf( "Error: not enough memory.\n" );
        return EXIT_FAILURE;
    }

    // Init a.
    for ( unsigned int k = 2, * b = a, * aEnd = a + n; b != aEnd; b++, k+=2 )
    {
        *b = k;
    }

    // Init str. This is arbitrary.
    for ( Strength * s = str, * strEnd = str + 2*n + 2; s != strEnd; s++ )
    {
        *s = nextafter( 0.0, 1.0 );
    }
    str[2] = nextafter( 0.0, 1.0 );
    str[3] = 0x1.0p-2;
    str[4] = nextafter( 0.0, 1.0 );
    str[5] = 0x1.0p0;

    sort( a, litPairStrengthF, str, n );

    return 0;
}
 

Daniel Vallstrom

Mark McIntyre said:
Really? Go tell that to Goldberg et al.

No need. This was the closest quote from Goldberg on the subject
I could be bothered to find (Goldberg, What Every Computer
Scientist Should Know About Floating Point Arithmetic (1991), p42):

"Incidentally, some people think that the solution to such anomalies
is never to compare floating-point numbers for equality, but instead
to consider them equal if they are within some error bound E. This is
hardly a cure-all because it raises as many questions as it answers.
What should the value of E be? If x < 0 and y > 0 are within E, should
they really be considered to be equal, even though they have different
signs? Furthermore, the relation defined by this rule,
a ~ b ⇔ |a - b| < E, is not an equivalence relation because a ~ b and
b ~ c does not imply that a ~ c."
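
Goldberg's non-transitivity point is easy to see with concrete
numbers; this little demonstration is mine, not his:

#include <stdio.h>
#include <math.h>

/* Approximate equality with an absolute bound E, as in the quote. */
static int approx_eq( double a, double b, double E )
{
    return fabs( a - b ) < E;
}

int main( void )
{
    double E = 1.0, a = 0.0, b = 0.9, c = 1.8;

    printf( "a~b: %d  b~c: %d  a~c: %d\n",
            approx_eq( a, b, E ),    /* 1 */
            approx_eq( b, c, E ),    /* 1 */
            approx_eq( a, c, E ) );  /* 0: '~' is not transitive */

    return 0;
}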


[snap]


Daniel Vallstrom
 

Villy Kruse

"Incidentally, some people think that the solution to such anomalies
is never to compare floating-point numbers for equality, but instead
to consider them equal if they are within some error bound E. This is
hardly a cure-all because it raises as many questions as it answers.
What should the value of E be? If x < 0 and y > 0 are within E, should
they really be considered to be equal, even though they have different
signs? Furthermore, the relation defined by this rule,
a ~ b ⇔ |a - b| < E, is not an equivalence relation because a ~ b and
b ~ c does not imply that a ~ c."

Then both x and y are within epsilon of zero so both should be practically
equal to zero and thus equal to each other. These problems are handled
in the mathematical discipline Numerical Analysis, and it isn't really
trivial.


Villy
 

Daniel Vallstrom

[snips]
My next question is, what is it that you are really
trying to guarantee?
Why are you writing the assertion in the first place? Presumably it's
to guarantee that some other piece of code that depends on that
assertion being true is going to execute at some point in the (perhaps
near) future.

To reiterate a bit more clearly, I'm guaranteeing what the assertion
says (e.g. (double)x>=(double)y), nothing more, nothing less.


[snip]

Regarding the program, I forgot to say exactly what values the elements
in the original floating point array (str) can take. The answer is *any*
positive value including 0 but not infinity. They can also be of some
special negative value.

[snip]
// Swaps 'a' and 'b' using temp as temporary storage.
#define swapM( a, b, temp ) { (temp) = (a); (a) = (b); (b) = (temp); }

Hmm, did I write that? :) With proper use of curly braces it shouldn't
matter but generally it's of course better to write swap macros like this:

#define swapM( a, b, temp ) ( (temp) = (a), (a) = (b), (b) = (temp) )

or

#define swapM( a, b, temp ) \
do { (temp) = (a); (a) = (b); (b) = (temp); } while(0)
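
For completeness, the reason the do/while(0) form is the standard
idiom (a well-known pitfall, shown here with made-up values) is that
the brace-block version breaks when the macro is used as the body of
an if that has an else:

#include <stdio.h>

#define swapM( a, b, temp ) \
    do { (temp) = (a); (a) = (b); (b) = (temp); } while(0)

int main( void )
{
    int x = 1, y = 2, t;

    /* With the brace-block version of swapM, the ';' after the macro
       call would terminate the if statement and leave the 'else'
       dangling -- a syntax error. The do/while(0) form consumes
       exactly one ';' and parses as a single statement. */
    if ( x < y )
        swapM( x, y, t );
    else
        t = 0;

    printf( "%d %d\n", x, y );  /* prints: 2 1 */
    return 0;
}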


Comments on the program in the parent post are welcome.


Daniel Vallstrom
 

Tim Rentsch

Tim Rentsch said:
My next question is, what is it that you are really
trying to guarantee?

Together with what you write at the bottom of your reply it seems
that you are asking for a full-blown example containing the problem?
I'll try to cut out the relevant parts of the program with the problem
and patch it up to some smaller program.

Yes that's what I was asking. The posted program was very helpful,
just what I was looking for.

I'm due for a followup on this. I've gone about three rounds of
emails with the IEEE-754 folks. The answer is pretty simple once
it's boiled down.


What should happen: what SHOULD happen is that the program should
work, without any need of casting or any 'volatile' specifiers.
The reason for this is how C99 'double's are handled for assignment
and function return values. In particular,

y = ... some expression ...;

while( f(xyz) <= y ) ...

In both of these cases the values for y and f() should be 64 bit
values, because: assignment should cause a value to be converted to
64 bit double; and, a function return result should be converted to
64 bit double. Those rules are how C99 programs are supposed to work.
If these rules were followed, the program would (I'm pretty sure)
simply work, asserts included, with no casting or 'volatile' tricks
required.
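
Spelled out with comments (my annotation of the fragment above, not
code from the thread):

#include <stdio.h>

static double f( void )
{
    double a = 0.1, b = 0.2;
    return a + b;        /* C99: the return value must be converted
                            to a 64-bit double */
}

int main( void )
{
    double y = f();      /* C99: the assignment must store a value
                            rounded to a 64-bit double */

    /* If both rules are honored, the comparison sees two identical
       64-bit values and this always prints 1: */
    printf( "%d\n", f() <= y );
    return 0;
}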


What does happen: what DOES happen is that these conversions aren't
done according to specification. Part of the explanation for why this
is so is the x86 architecture - it doesn't [*] have a way to force
conversion from 80 bit to 64 bit except by storing in memory and then
loading again. Since this extra storing and loading causes horrible
performance, it normally isn't done (and standard be d***ed :). You
might fault gcc for taking this attitude, but it looks like plenty of
other compilers do exactly the same thing, which is to say, they have
a switch to make the "right" semantics happen but otherwise don't
necessarily follow the "right" semantics exactly.

[*] at least, not in all models.
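
For gcc in particular, the switches being alluded to include
-ffloat-store, which forces floating-point variables out to memory
instead of keeping them in registers (though it only covers named
variables, not anonymous intermediate results), and, on x86 machines
with SSE2, -mfpmath=sse, which does double arithmetic in 64-bit SSE
registers and so sidesteps x87 excess precision altogether. For
example, for the posted program:

gcc -std=c99 -pedantic -Wall -O -msse2 -mfpmath=sse fpqsort.c -lm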


What is a good way to fix it: at certain places we need to force
conversion to 64 bit doubles. This is best done (IMO) by function
encapsulation:

static inline double
double64( double d ){
return * (volatile double *) &d;
}

and then using 'double64()' as needed in the application. For
the program posted, changing 'litPairStrengthF' so it reads

return double64( varStrPtr[0] * varStrPtr[1] );

seems to fix both the sorting behavior and the asserts -- that is,
the sorting order has changed (and looks right), and the asserts no
longer need 'volatile' to succeed.

Note that this change corresponds to making this function behave
according to the C99 semantics described by the IEEE-754 group.


Disclaimer: I speak for myself, not the IEEE-754 committee. I
believe my comments here are consistent with what I've heard from
them, but these are my comments not theirs.


(I have some other program comments but I think it's better to
do those off-line rather than in the NG. If you'd like those
Daniel please send me an email. Did you get the earlier email
I sent you?)
 

Tim Rentsch

Christian Bau said:
You could have a look at the Java language specification, where this
problem is handled. The first Java implementations required that all
double operations actually be done in double precision and not with
higher precision or a higher exponent range. A very reasonable demand
which unfortunately made correct implementations on x86 10 times slower
than incorrect ones in extreme cases. So now Java has "strict fp" and
"non-strict fp". In the "non-strict" case, an extended exponent range is
allowed in absolutely well-defined situations.

You've misunderstood the problem I was asking about. I know about
strict fp in Java. It's perfectly easy (if not always especially
convenient) to get strict fp semantics in C. The question is about
something altogether different, viz., how to get good results even
in the presence of non-deterministically differing precisions at
different points in program execution.
 

Daniel Vallstrom

Tim said:
What is a good way to fix it: at certain places we need to force
conversion to 64 bit doubles. This is best done (IMO) by function
encapsulation:

static inline double
double64( double d ){
return * (volatile double *) &d;
}

and then using 'double64()' as needed in the application.

Agreed, this is the cleanest solution. The thing one wants to
be able to do is to double cast variables of type double, for
example (double)x. Since that isn't working, something like
double64(x) --- or perhaps forceDoubleCast(x) --- seems like
the next best thing.
For
the program posted, changing 'litPairStrengthF' so it reads

return double64( varStrPtr[0] * varStrPtr[1] );

seems to fix both the sorting behavior and the asserts -- that is,
the sorting order has changed (and looks right), and the asserts no
longer need 'volatile' to succeed.

One could perhaps do that but it would be different from
the original program. The straightforward way would be to
replace all the constructions involving volatile in the
original program with a double64() construction. After all,
all the volatile constructions in the original program are
nothing but more elaborate ways of doing a double64 call
(or a double cast if that would have worked).
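
Concretely, that rewrite would shrink each of the #ifndef NDEBUG
blocks in srt() to something like this (my sketch, assuming the
double64 function defined upthread):

#ifndef NDEBUG
assert( double64( f( s + c[-1] ) ) >= double64( p ) );
assert( double64( f( s + *c ) ) <= double64( p ) );
#endif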

Casting already in the 'litPairStrengthF' function might
not be a good idea for a couple of reasons:

(1) The function could be supplied by an external user and
you don't want her to have to bother with subtle double64
calls. (You could add a double64 wrapper call to all the
litPairStrengthF calls though, but see (2).)

(2) The function is or could be used in other, more time
critical, code. Doing the cast inside the function could
as you say slow down the real program too much. At the
very least it would slow down the sorting. You only
want to force the casts when it is necessary.

(Of course, you had no way of knowing about this, not (1) at least.)

With your double64 call in litPairStrengthF, the sorting order
changes only by accident. One can't guarantee that the values
get sorted also w.r.t. 80-bit precision (in this way). (But I don't care
about that.) E.g. with the initialization below (in addition to
the str init loop), the values still don't get 80-bit sorted even
with your double64 addition in 'litPairStrengthF':

str[7] = 0x1.0p-2;
str[5] = 0x1.0p0;
(I have some other program comments but I think it's better to
do those off-line rather than in the NG. If you'd like those
Daniel please send me an email. Did you get the earlier email
I sent you?)

Feel free to post them here or, if you think that's more appropriate,
email me. I got your mail.


Daniel Vallstrom
 
