Float comparison

Phil Carmody · May 12, 2009

Dik T. Winter said:
Ok, I think that's going to have to be good enough for me--I'm running
out of time to keep up. I think it's a step closer to Chuck's
position--maybe not enough for him, though.

Click to expand...

Not enough at all. Consider the following piece of code (where for
simplicity I assume a > b > 0.0, otherwise it is a bit more
cumbersome) [detypoed]
void add(double a, double b, double *c, double *cc) {
*c = a + b;
*cc = a - *c + b;
}
(and assume also that the value of *c used in the calculation of *cc is the
value actually stored, for some compilers you may need some flags for that.)

Assuming something about the arithmetic on the machine (which IEEE machines
do satisfy), after that operation, a + b == *c + *cc exactly, while *cc is much
less than *c.

It's only a minor point, but I find it more instructive to view
the above as:

void add(double a, double b, double *c, double *cc) {
*c = a + b;
double tmp = *c - a;
*cc = b - tmp;
}

I desperately wanted to add some comments to that to explain
what the final two steps (which you merged), but decided that
if my claim of more instructive were to be supportable, then
it should be so without comments!

This is a basic building block to (approximately) double the maxmimum precision
on a processor using build-in arithmetic only. Assuming that the 'c' stored
is an approximation and not assuming that the 'c' used in the calculation of
'cc' is not the value actually stored in 'c' voids the whole argument.

I may note that this technique (due to T.J. Dekker in the sixties) was used
when IBM implemented double precision on the PC/RT.

For this technique to work it is absolutely *not* necessary to know what the
value stored in 'c' actually is, it is only necessary to know that what did
get in also gets out, and some bounds on what did get in (that is: the
properly rounded result of the addition, called "faithful" in the original
article).

(The other basic block is a routine of the format:
void mul(double a, double b, double *c, double *cc)
where after the operation a * b == c + cc exactly. It uses similar basic
operations but is a bit concoluted.)

There are interesting techniques in full-precision fused multiply-add
architectures where you can create exact 104-bit products from 2 52-bit
mupliplicands. If you don't have such an architecture then it is indeed
more complicated, but with one it's trivial. (Power being my arch of
choice in this regard.)

Phil

CBFalconer · May 12, 2009

Keith said:
.... snip ...

I think you're saying that, for example, given:
double x;
x = 1.0 / 3.0;
you cannot tell, just by examining the contents of x, that the real
value one-third was stored in x.

Well of course you can't, because the real value one-third *wasn't*
stored in x. The value stored in x was some value that's exactly
representable as a double, some value that's merely close to
one-third. And you can tell exactly what that floating-point value
was simply by examining the contents of x.

Concretely, on my system, the value stored in x is exactly
0.333333333333333314829616256247390992939472198486328125
and that's exactly what I see when I examine the contents of x later
in the same program.

You are looking at the value of the fp object, not at what that
value represents. That is where the loose use of the word 'value'
is fouling you, and others, up.

Without looking at the programming that did the storage, you only
know the objects value, and not what that value represents. I
maintain that that value equally well represents any value in the
'range' of the object. That range is normally calculated as
x*(1-EPSILON) through x*(1+EPSILON). That will change with fp
systems, but is typical.

Note that it wasn't the assignment that generated the FP
approximation, it was the "/" operator, which operates on
floating-point values, not on real values. The real value one-third
never occurred in the program, not even in the abstract semantics.
The real value one-third can't even be expressed in C.

Your claim that
it is highly unlikely that you can tell the exact value stored in
that object
is directly contradicted by C99 6.2.4p2:
An object exists, has a constant address, and retains its
last-stored value throughout its lifetime.
Should I believe you, or should I believe the C standard? Or is there
some way I haven't seen that the two are consistent?

Now think about the multiple meanings of 'value'. I am using the
one that expresses what you know about the fp object. The raw,
'exact' value only serves to identify the 'range' covered.

I am NOT discussing the usefulness of utilizing that 'range'. That
will vary with the purpose of the programming.

Phil Carmody · May 12, 2009

Oooh, ooh, I know that - it's 2/1 ;-)

Yours,
Sir Cantor the diagonal

CBFalconer · May 12, 2009

Keith said:
.... snip ...

Again, it's a subset relationship.

Here I am discussing only the above quotation. If you mean, by
'floating point number', the exact value extractable from the fp
object (using the formulat expressed elsewhere), there is only one
such value for any fp object. The closest larger (or smaller) fp
object has a different extractable exact value. Between those two
values, if looked on as reals, there are an infinity of other real
values. In fact the same thing applies between the lowest point in
the lowest fp objects 'range', and the equivalent point in the
higher fp objects 'range'.

To make it clear, create a fp value in object fpa. That covers a
range from the value in fpa, call that fpav, which range is
fpav*(1-EPSILON) through fpav*(1+epsilon). Note that 1-EPSILON and
1+EPSILON are calculable and different from 1.0 (see the standard).

Calculate fpav*(1+EPSILON) and store it in object fpb, giving an
object with the fpvalue fpbv.

Now, ignoring equality in the end points, I maintain that you will
find that:

(fpbv(1-EPSILON) < fpav(1+EPSILON)) == 1; (in that fp system)

....

CBFalconer · May 12, 2009

Dik T. Winter said:
.... snip ...

Not enough at all. Consider the following piece of code (where
for simplicity I assume a > b > 0.0, otherwise it is a bit more
cumbersome):
void add(double a, double b, double *c, double *cc) {
c = a + b;
cc = a - c + b;
}
(and assume also that the value of c used in the calculation of
cc is the value actually stored, for some compilers you may need
some flags for that.)

Assuming something about the arithmetic on the machine (which
IEEE machines do satisfy), after that operation, a + b == c + cc
exactly, while cc is much less than c.

I don't think that is valid. C is specifically allowed to
reorganize the order of computations, so there is no guarantee that
cc is not calculated as *c - *c, or zero. Similarly cc = (a+b)-*c;
This certainly applies to integers, I don't really know if it is
extended to floats.

....

CBFalconer · May 12, 2009

user923005 said:
.... snip ...

You can subtract two numbers of very nearly equal magnitude and
get total loss of precision.

Exactly. This is where the 'range' (as I call it) of the original
values becomes important. For example:

double a, b, c;

a = /* some computation */;
b = /* some other computation */

now assume a and b have been set to the same fp value, say x. a
represents x*(1-EPSILON) through x*(1+EPSILON). So does b. The
actual value we TRIED to store in a was x*(1+EPSILON), and that in
b was x*(1-EPSILON). So the (fairly) accurate value of c=a-b is
2*EPSILOM. In fact the fp system will make it zero. I have
ignored end effects.

If you keep track of the 'ranges' of the values of a and b you can
easily detect when such errors sneak into your computations.
Things may be worse, but they won't be better.

....

Keith Thompson · May 12, 2009

CBFalconer said:
First, I am only dealing with one flavor of floating point at a
time. No need for the complications, which don't affect anything
except the sizes.

Ok, that's fine; I was doing the same thing.

Second, there certainly is a range. If we can't tell the
difference, from the fp value, between that actual value, and a
value just less that x*(1+EPSILON), neglecting the equivalent
identity between values less than x, that value x is covering a
'range'. We are NOT analysing the software that initialized x, we
are analyzing the value stored in x.

There are lots of ranges. You can define as many ranges as you like.

What you're talking about, I think, is the set/range of real numbers
for which x, a floating-point number, is the nearest representable
value. Whether the endpoints of this range are included is ambiguous.
For that matter, you could define other ranges using
truncate-towards-zero, truncate-towards-negative-infinity, and so
forth.

But so what? None of these ranges are *the value of the
floating-point number*. That value is a unique real number (which is
within all of these ranges).

And again, you are making arguments without reference to the standard.
If all this stuff you're saying were valid, surely you'd be able to
cite the exact wording in the standard that explicitly supports it.

The same 'range' I have been talking about all through this. For a
fp value x, it is usually the values from x*(1-EPSILON) through
x*(1+EPSILON). This is fundamental.

And that's were we disagree. It is not fundamental. The C standard
defines a model of floating-point numbers without reference to this
"range" you keep talking about -- and it works.

But the point is you don't know anything about what was stored, or
how it was computed, etc. You are just looking at the fp object
and examining its value.

*sigh*

Yes, you do know exactly what was stored. By looking at the fp object
and examinining its value, you can see what was stored.

There is quite simply *no difference* between "what was stored" and
the result of "examining its value". "An object exists, has a
constant address, and retains its last-stored value throughout its
lifetime", remember?

Does a floating-point object *not* retain its last-stored value
throughout its lifetime? Seriously, what is your answer to that
question?

I said you have omitted the ranges. Without those I showed some
problems.

None that I saw.

[...]

No they can't, if you want to compare the values that were stored.
Nonsense.

Lets store the value A in a, B in b. Neither A nor B fit the PRN
'value' exactly. So a has PRN value ap, and b has PRN value bp.

To be clear, a and b are PRN objects, and A and B are real values,
yes? And those real values are not exactly representable as PRNs
(there are no PRN values whose corresponding real values exactly match
A and B). Therefore it is simply not possible to store A in a, or B
in b. Only a PRN value can be stored in a PRN object.

What you can do, of course, is store a *close approximation* of a
given real value in a PRN object. Thus we can store the PRN value ap
in the PRN object a, and the PRN value bp in in the PRN object b.
Each of these PRN values has a unique corresponding real value; call
them apr and bpr if you like.

Nothing says anything about the relationships between these values
that don't fit the PRN value. A may be less than B, yet ap may be
bp.

Yes, but the real values A and B are *never* stored in PRN, so their
relationship is irrelevant. The "<" operator in (a < b) operates only
on the stored PRN values ap and bp, and is defined in terms of their
corresponding real values apr and bpr.

I'm having a little trouble myself following all the letters flying
around here, so let's make it a bit more concrete. Assume for the
moment that the PRN format uses a decimal base and has 2-digit
precision, one before the decimal point and one after. So there are
exactly 199 distinct PRN values: -9.9, -9.8, -9.7, ..., -0.1, 0.0,
+0.1, ..., 9.7, 9.8, 9.9. (This is an extreme simplification, but
I think we can make it without loss of generality.) Assume that PRN
is a built-in type in our hypothetical C-like language, and that
ordinary floating-point constants represent PRN values, in a very
similar way to how floating-point constants represent floating-point
values in C.

Now consider the following:

PRN a = 1.29;
PRN b = 1.31;

Neither 1.29 nor 1.31 can be stored in a PRN object. So what happens?
The exact value 1.3 is stored in both a and b. The real values 1.29
and 1.31 *are never stored anywhere*, and are irrelevant to any
further operations.

Now (a < b) yields 0, (a == b) yields 1, and (a > b) yields 0.

What's so hard about that?

This is what is regularized by the C standards reference to
EPSILON, and the value where it is applied.

Then show me where the C standard mentions EPSILON, directly or
indirectly, in its definition of how the "<" operator works.
(Hint: It doesn't.)

CBFalconer · May 12, 2009

Dik T. Winter said:
But using the representation makes things better if you want to
double the actual precision. See my other article on how to add
two floating-point numbers to a second pair where the outcome is
exact.

I looked at your previous article, and saw no such article.
Frankly, I don't believe it, but I have been wrong before. I am
NOT allowing for the existence of different precision floating
systems, such as float and double.

CBFalconer · May 12, 2009

Dik T. Winter said:
I have been thinking about this and still do not know how CBF
came to that nonsense. I can only think that CBF does not know
what 'countable' means. It only means that in *some* order there
is a first, a second and so on, and that for every natural number
there is an element in that order, and that for every element in
that order there is a natural number that denotes it.

I have already admitted that I made an error. I am glad to hear
that you have never made a mistake.

Phil Carmody · May 12, 2009

Keith Thompson said:
Flash Gordon said:

Keith Thompson wrote: [snip]

I disagree. The floating-point numbers are a subset of the reals.
The subset obviously does not share all the characteristics of the
full set.

Click to expand...

I agree that they are a subset (apart from the exceptions that we agree on).

5) With floating point numbers you sometimes have (a+b)+c != a+(b+c)

"+" on FP numbers is a different operation than "+" on reals.

Click to expand...

It is because the operators behave differently that I think it is
better to not think of them as being reals, but rather as being a set
of their own which just happens to be a subset of the reals. Just as a
lot of the time one thinks of integers as integers rather than simply
being a subset of reals. I think that it is because they behave
differently to real but CBF thinks of them as reals that he insists
they must be ranges of reals.

My minds gone blank on the word I'm looking for. There is a word in
maths for the set and the operators on it (i.e. like a class in most
OO languages), but I can't for the life of me remember what it
is. However, that is what we are dealing with in C when we want to
know anything more than the current value of a variable.

Click to expand...

The word you're looking for is "group" or "field".

It's been a while since I've studied this stuff, but a group is
basically a set (could be of anything) with an "additive" operation
that has certain properties (it's closed over the set, there's an
additive identity (0), each element has an additive inverse (-x,
such that -x + x = 0), and the operation is transitive ((x+y)+z) =
(x+(y+z))).

For operations, it's "associative". (The same word used in the
operator precedence summaries, and there it means which of the
two bracketing possibilities above are implicit when an expression
'a op b op c' is encountered.)

If the operation happens to be commutative, it's an "Abelian group".

A "field" is a group with some added properties: I think the
"additive" operation has to be commutative, and there's also a
"multiplicative" operation with certain other required properties.

The field elements without the zero must be a group.
A simpler structure with two operations is the 'ring'. Rings don't
require the multiplicative operation to be closed or have inverses,
unlike fields. I simplify greatly, of course, but most details are
not important in this context.

Right. The set of floating-point numbers is a subset of the set of
real numbers, but the group of FP numbers under FP addition is not a
subgroup of the group of real numbers under real addition. In fact, the
FP numbers under FP addition don't form a group at all, since the set
isn't closed under addition. (I suppose you might be able to turn it
into a group by defining the result of "+" for cases whose behavior is
undefined in C, but I don't think it would be worth the effort.)

FP numbers aren't a group as you need an inverse for a group.
(tiny + huge) + -huge is not tiny, so there can be no additive
inverse for huge.

Phil

Flash Gordon · May 12, 2009

Keith said:
Flash Gordon said:

Keith Thompson wrote: [snip]

I disagree. The floating-point numbers are a subset of the reals.
The subset obviously does not share all the characteristics of the
full set.

Click to expand...

I agree that they are a subset (apart from the exceptions that we agree on).

5) With floating point numbers you sometimes have (a+b)+c != a+(b+c)
"+" on FP numbers is a different operation than "+" on reals.

Click to expand...

It is because the operators behave differently that I think it is
better to not think of them as being reals, but rather as being a set
of their own which just happens to be a subset of the reals. Just as a
lot of the time one thinks of integers as integers rather than simply
being a subset of reals. I think that it is because they behave
differently to real but CBF thinks of them as reals that he insists
they must be ranges of reals.

My minds gone blank on the word I'm looking for. There is a word in
maths for the set and the operators on it (i.e. like a class in most
OO languages), but I can't for the life of me remember what it
is. However, that is what we are dealing with in C when we want to
know anything more than the current value of a variable.

Click to expand...

The word you're looking for is "group" or "field".

Yes, that sounds about right.

It's been a while since I've studied this stuff,

Same here.

Right. The set of floating-point numbers is a subset of the set of
real numbers, but the group of FP numbers under FP addition is not a
subgroup of the group of real numbers under real addition. In fact, the
FP numbers under FP addition don't form a group at all, since the set
isn't closed under addition. (I suppose you might be able to turn it
into a group by defining the result of "+" for cases whose behavior is
undefined in C, but I don't think it would be worth the effort.)

I agree it isn't necessarily worth the effort. Mine you, if overflow
(positive or negative) give rise to either a quiet NaN or infinity then
it is closed over addition, and I think this is the case if
__STDC_IEC_559__ is defined (in a C99 implementation) if I read Appendix
F of n1256 correctly. However, a bit of digging tells me that to be a
group the operator also needs to be associative, which addition of
floating point numbers is not. This means that even if you ignore the
limited range floating point types would still not be a group under
addition (unlike integer types which would be if not for limited range,
and unsigned int/long/long-long which are groups under addition).

It is interesting (to me at least) that integers (mathematical ones, not
C ones) are (I think) a subgroup under addition of reals even though C
floating point types are not (different ranges mean you hit infinity at
different points as an easy example).

Getting back on topic, I think the fact that a C floating point type is
not a group (unlike reals under addition) is relevant to the point under
discussion. It means you don't even need to go near numbers (let alone
start talking about about ranges) for the fun to start!

Boy, it's all (or at least some of it) coming back!

Keith Thompson · May 12, 2009

Phil Carmody said:
For operations, it's "associative". (The same word used in the
operator precedence summaries, and there it means which of the
two bracketing possibilities above are implicit when an expression
'a op b op c' is encountered.)

Associative, yes, that's right!

[...]

FP numbers aren't a group as you need an inverse for a group.
(tiny + huge) + -huge is not tiny, so there can be no additive
inverse for huge.

I would have thought -x would be the additive inverse for x, for any
value of x. I think the only requirement for an additive inverse is
that x + -x = 0 (0 being the additive identity). (The Wikipedia
article on additive inverses seems to agree with me.)

But as I said, the set is not closed under addition anyway.

Keith Thompson · May 12, 2009

CBFalconer said:
Here I am discussing only the above quotation. If you mean, by
'floating point number', the exact value extractable from the fp
object (using the formulat expressed elsewhere), there is only one
such value for any fp object. The closest larger (or smaller) fp
object has a different extractable exact value.
Certainly.

Between those two
values, if looked on as reals, there are an infinity of other real
values.

Was that intended as a correction? Your statement is true, but it's a
very different statement from the one Flash made. Flash's statement
is also true.

Between two consecutive floating-point numbers, there are infinitely
many real values. Between two consecutive floating-point numbers,
there are no other floating-point numbers (by definition of the word
"consecutive").

[SNIP}

Keith Thompson · May 13, 2009

William Pursell said:
Actually, it is.

Actually, if we're talking about floating-point numbers as defined by
the C standard, it isn't. Floating-point overflow invokes undefined
behavior.

The set *could* be closed under addition for a particular
implementation, where every addition yields a real value, an infinity,
or a NaN.

Barring traps, any bit pattern
plus any other bit pattern is another bit pattern, and that
bit pattern is a valid float. One problem is
that if a + b == c and c == NaN or c == +/- infinity,
then c + (-b) != a. Unless c is a trap representation,
which I don't believe is possible, the set is in fact
closed under the addition operator.

Yielding a trap representation is certainly one possible consequence
of undefined behavior.

Also, NaN does
not have an inverse.

Agreed. NaNs might be signed, but even if they are, Nan + -NaN yields
a NaN, not 0.0.

More importantly, there exist
a lot of elements that behave poorly under addition.
For example, if the set of floats were a group, then
it should never be true that ( a + b == b && a != 0),
but that clearly happens often. (a = 1.0, b == large number,
for example.)

True. I don't remember whether that particular property violates the
requirements for a group.

Dik T. Winter · May 13, 2009

>
> There are interesting techniques in full-precision fused multiply-add
> architectures where you can create exact 104-bit products from 2 52-bit
> mupliplicands.

I know such architectures, but the technique dates from the sixties when
such architectures were (as far as I know) non-existing.

Dik T. Winter · May 13, 2009

> "Dik T. Winter" wrote:

[correcting an error in the original]

....
> I don't think that is valid. C is specifically allowed to
> reorganize the order of computations,

*If* it is known that that does not change the result. So it is explicitly
*not* allowed for floating-point expressions.

> so there is no guarantee that
> cc is not calculated as *c - *c, or zero. Similarly cc = (a+b)-*c;
> This certainly applies to integers, I don't really know if it is
> extended to floats.

It is not.

Ben Bacarisse · May 13, 2009

See later...

True. I don't remember whether that particular property violates the
requirements for a group.

It does. Using maths rather than C notation, if a + b = b we can
deduce that a + b + -b = b + -b i.e. that a = 0. This is what Phil
Carmody was saying above. Because addition is associative, and x + -x
= 0 (as you say) Phil's expression show that -huge is not an additive
inverse of huge.

Willem · May 13, 2009

Joe Wright wrote:
) William Pursell wrote:
)> For example, if the set of floats were a group, then
)> it should never be true that ( a + b == b && a != 0),
)> but that clearly happens often. (a = 1.0, b == large number,
)> for example.)
)>
) How can (a + b == b && a != 0) ever happen?

I suggest you read up on how floating point numbers work.

For example: Do you believe that if I take an int, cast it to float,
and then back to int again, that you will always get the same nmber back ?

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Keith Thompson · May 13, 2009

Joe Wright said:
How can (a + b == b && a != 0) ever happen?

Rounding error.

#include <stdio.h>
#include <float.h>
int main(void) {
double a = DBL_MIN;
double b = 1.0;
if (a + b == b && a != 0) {
puts("Like this.");
}
return 0;
}

Ike Naar · May 13, 2009

To make it clear, create a fp value in object fpa. That covers a
range from the value in fpa, call that fpav, which range is
fpav*(1-EPSILON) through fpav*(1+epsilon). Note that 1-EPSILON and
1+EPSILON are calculable and different from 1.0 (see the standard).

Just out of curiosity I tried that on my desktop (x86/OpenBSD/gcc+glibc).
Floatingpoint type used is float.

fpav = 1.5
fpav is exactly representable as a float, with bit pattern 0x3fc00000

FLT_EPSILON = 0.00000011920928955078125

The range for fpav, the way you define it,
is fpav*(1-FLT_EPSILON) through fpav*(1+FLT_EPSILON)
is 1.5 * 0.99999988079071044921875 through 1.5 * 1.00000011920928955078125
is 1.499999821186065673828125 through 1.500000178813934326171875

This "range" has an unexpected feature:

prev(fpav), the previous representable float smaller than fpav,
has bit pattern 0x3fbfffff and value 1.49999988079071044921875

next(fpav), the next representable float larger than fpav,
has bit pattern 0x3fc00001 and value 1.50000011920928955078125

That means that prev(fpav) and next(fpav) are well inside fpav's range!
fpav's range, as defined by you, contains three representable floats:
prev(fpav), fpav itself, and next(fpav).

| <---------------------- CBF's "range" ----------------------> |

fpav*(1-eps) prev(fpav) fpav next(fpav) fpav*(1+eps)
----+---------------+---------------+---------------+---------------+----
1.49999982 1.49999988 1.50000000 1.50000012 1.50000018
[no rep] [0x3fbfffff] [0x3fc00000] [0x3fc00001] [no rep]

Looking at these results, your "range" definition seems very peculiar.
What exactly is the benefit of defining a range this way?

Need Helping adding Square root code to an existing calculator. (Absolute begginer?)	0	Jan 12, 2025
How to alter the program so that when user types z or Z or 0, the program sets both a and b to zero?	0	Oct 10, 2022
Where is my mistake? Why is s equal to minus infinity at some loop iterations?	0	Oct 9, 2022
Comparison of Integer and Pointer (that's supposed to be an Integer). Where did I go wrong?	0	Nov 19, 2022
Structures and chained lists questions :	1	Feb 12, 2011
Rich Text Format (RTF) Document Builder in C++: Code and Features	0	Sep 28, 2025
Runtime Error with __gcd? (floating point exception)	1	Nov 27, 2024
Secure Keyboard v2.0 Modern C++ Virtual Keyboard for Windows (Glassmorphism UI, Clipboard Auto-Clear)	0	Mar 26, 2026

Float comparison

Phil Carmody

CBFalconer

Phil Carmody

CBFalconer

CBFalconer

CBFalconer

Keith Thompson

CBFalconer

CBFalconer

Phil Carmody

Flash Gordon

Keith Thompson

Keith Thompson

Keith Thompson

Dik T. Winter

Dik T. Winter

Ben Bacarisse

Willem

Keith Thompson

Ike Naar

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads