Float comparison

CBFalconer · May 20, 2009

Flash said:
Keith Thompson wrote:
.... snip ...

Of equal importance is the simple question "why?"

This, by the way, is why I wanted actual numbers. It makes it far easier
to show certain problems.

We need the fractional things to make things clear:

xmax = 1.0 + 1/16
ymin = (1.0 + 1/8)*(1-1/16) /* < xmax */

Now if you note the bit representation, the smallest bit in all
those numbers represents 1/8. We have to add (or subtract) at
least 1/16 for the fp-system to notice it, and then it is only in
the rounding operation. You are not going to be able to form the
fp-value foo, since it requires bits more than 1 place beyond the
actual values.

This is one reason I keep mentioning the 'usual fp-systems'. I
think all meet this requirement, but I am not certain. I don't
like using the type 'double' above.

Does that satisfy you?

CBFalconer · May 20, 2009

Ben said:
.... snip ...

How can you expect people to follow what you mean with this sort
of confusion? Here, in case it helps, are the ranges for a few
numbers around 1 using the formulae you have often quoted (but I
did not believe because of the overlap) using both your odd
EPSILON (1/16) and the real one (1/8):

range of f using EPSILON=1/16 range of f using EPSILON=1/8
f f*(1-1/16) f*(1+1/61) f*(1-1/8) f*(1+1/8)
-----------------------------------------------------------------
7/8 105/128 119/128 49/64 63/64
8/8 120/128 136/128 65/64 73/64
9/8 135/128 153/128 63/64 81/64
10/8 150/128 170/128 70/64 90/64
11/8 165/128 187/128 77/64 99/64

Notice that you get overlapping ranges with both. These numbers
are the exact rationals that the formula produces.

Are either of these the ranges you have in mind? What, finally,
is EPSILON and does it have any relationship to the ones in the C
standard?

Bear in mind that all of these numbers have a least significant
significand bit of 1/8. When doing arithmetic, anything fed in and
combined will have all bits less than 1/16 stripped. The 1/16 bit
by itself controls all rounding. This is partially the effect of
going through the store/retrieve operation.

So write all your numbers as rationals consisting of the sum of
some number of 16ths and something smaller. The smaller will get
ignored by the fp-system.

Believe it or not, I am becoming less confused as this goes on.

EPSILON has the definition of the C standard. It is for 1.0 and
increasing.

CBFalconer · May 20, 2009

Dik T. Winter said:
I thought I did read "xmax is the smallest such real that can be
*expressed* in the fp-system". Did you write something different?

CBFalconer · May 20, 2009

Dik T. Winter said:
.... snip ...

I thought I did read "xmax is the smallest such real that can be
*expressed* in the fp-system". Did you write something different?

I don't think so, if you replace expressed with represented. It
has to be a generatable number in the fp-system so it can be
stuffed in.

CBFalconer · May 20, 2009

Dik T. Winter said:
(e-mail address removed) writes:
.... snip ...

Currently the 'usual rounding' is the IEEE arithmetic round to
even rule. Or do you have something different in mind?

The fp-systems I am concerned with use positive values and a sign
bit to negate. I expect the single bit that is dropped in
truncating the normalized result to significand length will control
the rounding. I think.

Flash Gordon · May 20, 2009

Ike said:
Or think of a cross compiler.

There is no reason why a cross compiler cannot do this optimisation if a
"normal" compiler can. After all, the person writing it has to be aware
of how the target implements floating point anyway.

CBFalconer · May 20, 2009

Richard said:
.... snip ...

I'm afraid that indicated a tragic but obvious misunderstanding
of how floating point operations work. They _always_ work by
approximation, and the approximation is only exact when you're
lucky, not the other way 'round.

You seem to feel that C programmers are not allowed to use
arithmetic. I will say no more.

CBFalconer · May 20, 2009

Dik T. Winter said:
At what point is (in your opinion) the value one third (if it
ever did exist on the computer) change to an approximation of
that?

My opinion has to include a picture of the processor, which is what
I am trying to avoid in all of this. If you insist, I would expect
the change to occur when the divide routine runs out of bits to
generate (from some counter) and rounds the result on the basis of
the next bit computed and discarded. Before that the divide
algorithm is in process, and there is no definitive point for
approximation involved. This is only an opinion.

CBFalconer · May 20, 2009

Keith said:
.... snip ...

The I implore you to stop using the term "epsilon" or "EPSILON".
Feel free to use DBL_EPSILON if and only if you're referring to the
constant defined in 5.2.4.2.2p11. You have muddied the meaning of
"EPSILON" so thoroughly that I'm no longer willing to figure out
what you mean by it.

I use it because I am considering only one fp-process. I am using
EPSILON to refer to the definition in the C standard, and epsilon
to refer to the real value in effect for any arbitrary
fp-object-value.

This is another area that I have resolved in the process of this
loooong argument.

gwowen · May 21, 2009

(e-mail address removed) wrote:
When you subtract equal fp-object-values you get zero. The range
of that is large, compared to its value. You can figure out the
least possible error from the ranges on the input values to the
subtraction.

You told me the range of x and y.
What is the range represented of z = x * y?

Is it 1 +/- epsilon/2?
Or is it = 1 +/- epsilon + O(epsilon**2), the result from
interval arithmetic?

What is the range represented of z = x ** n?

Is it 1 +/- epsilon/2?
Or is it = 1 +/- n*epsilon + O(epsilon**2), the result from
interval arithmetic?

If it is the former, the ranges represented by the floats have lost
all
the information about the range of their inputs, so what good comes
from thinking of them as ranges?

If it is the latter, then two floats that compare equal represent
different ranges. What the hell use is that?

Keith Thompson · May 21, 2009

CBFalconer said:
Not so. As an elementary demo, I am using double to prepare a
'real value', and float to demonstrate.

#include <stdio.h>
#include <float.h>

/* Show that a value is altered by storing in a float */
void demo(void) {
volatile double d; /* volatile to ensure store/loads happen */
volatile float f;

d = 1.0/3;
f = 0; /* just to ensure d is actually stored */
f = d;

It's the implicit conversion from double to float that changes the
value. The value stored in f by this assignment is a value of type
float, and that value is unaltered by the fact that it's stored in an
object.

It's true that the conversion is performed by the assignment, as
described by 6.5.16.1p2, so you've raised a point that I had
neglected. But it's not a particularly relevant point.

If the program were modified so it only uses objects of type double,
there would be no conversion on the assignment, and the value stored
would simply be the result of evaluating the RHS of the assignment.

You've been implying, I think, with your use of the term
"fp-object-value", that it's not possible to understand what a
floating-point value is unless it's stored in an object. That's
wrong.

printf("DBL_DIG=%d\t d=%.*e\n", DBL_DIG, DBL_DIG+2, d);
printf("FLT_DIG=%d\t f=%.*e\n", FLT_DIG, FLT_DIG+2, (double)f);
}

[snip]

gwowen · May 21, 2009

Lebesgue measure is a concept that accurately

captures the view you are trying to represent
with the naive term "vanishingly small", and
one which is encountered very early in one's
study of mathematics.

Where "very early" means "at the beginning of a
serious undergraduate study, and probably not
even then. (Many undergraduate courses are happy
to deal with the Riemann integral only, leaving
Lebesgue measure as an option for the interested).

Besides, the term "almost everywhere" would be preferable
as it conveys the exact mathematical meaning as well as
being comprehensible to the non-mathematician.

There is nothing wrong
with being mathematically naive, but it is
a bit surprising to see that you are content
to wallow in ignorance. I misjudged you.

That's just gratuitously rude.
I presume you don't speak like that to people in real life,
so why do it on the internet?

Keith Thompson · May 21, 2009

CBFalconer said:
Keith Thompson wrote: [...]

Then the division is evaluated. The division operation takes two
double operands, with values 1.0 and 3.0, and yields a floating-point
result; on my system, that result is exactly
0.333333333333333314829616256247390992939472198486328125 .

Click to expand...

Which is NOT 1.0/3.0.

It's not 1.0/3.0 if you take 1.0/3.0 to be a real expression. It *is*
1.0/3.0 if you take 1.0/3.0 to be a C expression of type double. (On
my system; the exact value may vary on other systems.)

1/3 (integer division), 1.0/3.0 (FP division), and 1.0/3.0
(mathematical real division, not supported in C) are three different
expressions with three different values.

You can supply the two integers as integers
in a structure, for example. You can use that to build a whole
rational arithmetic system, but we won't bother.

Good, because it's irrelevant. There are numerous ways to represent
numbers in C. The way we're talking about here is the built-in
floating-point facility, not some other facility that might be
implemented in C.

Similarly, the fact that I can implement a symbolic algebra system in
which pi is represented as, say, { is_pi => 1 } isn't relevant to the
question of whether C floating-point can represent transcendental
numbers.

We're talking about what C floating-point can represent, not what C
can represent.

The point is that
the exact value has been specified, and the fp-processing has
altered it. My 'range' process on the fp-object-value ties it
down.

No, the exact value one-third was never specified. It could be
specified using mathematical real division, but C doesn't provide
mathematical real division. It provides floating-point division,
which yields floating-point results that differ from the corresponding
(unsupported) real division.

BTW, there is no point in printing digits from your double value
past the length specified by DBL_DIG (see float.h). See my earlier
answer to you and the example showing how the fp-system alters
values.

Yes, there is a point. The 55 digits denote the exact stored
floating-point value. Fewer digits are adequate for most purposes,
but they would not denote the exact stored value.

Note that the failure of decimal FP constants to specify exact values
is exactly why hexadecimal FP constants were introduced in C99.

Keith Thompson · May 21, 2009

Whoops! I meant to declare these as tinyfloat, not double:

tinyfloat x = 1.0;
tinyfloat foo = 1.05859375; // halfway between ymin and xmax
tinyfloat y = 1.125;

We need the fractional things to make things clear:

Again, these should have been tinyfloat, not double.

xmax = 1.0 + 1/16
ymin = (1.0 + 1/8)*(1-1/16) /* < xmax */

Now if you note the bit representation, the smallest bit in all
those numbers represents 1/8. We have to add (or subtract) at
least 1/16 for the fp-system to notice it, and then it is only in
the rounding operation. You are not going to be able to form the
fp-value foo, since it requires bits more than 1 place beyond the
actual values.

This is one reason I keep mentioning the 'usual fp-systems'. I
think all meet this requirement, but I am not certain. I don't
like using the type 'double' above.

Does that satisfy you?

Certainly not. I asked what would be the third line of output; you
didn't answer.

You said "You are not going to be able to form the fp-value foo". But
the C code I presented is perfectly valid, and it will produce *some*
output. By using the FP constant 1.05859375 to initialize foo, *some*
value will be stored in foo. By passing foo to printf, *something*
will be printed.

What is the output? (If it's ambiguous, feel free to say so and
present one or more possibilities.)

Keith Thompson · May 21, 2009

CBFalconer said:
I use it because I am considering only one fp-process. I am using
EPSILON to refer to the definition in the C standard, and epsilon
to refer to the real value in effect for any arbitrary
fp-object-value.

This is another area that I have resolved in the process of this
loooong argument.

So EPSILON is your abbreviation for FLT_EPSILON, DBL_EPSILON, or
LDBL_EPSILON. And epsilon is a value that varies for different values
of x; it seems to be something like (nextafter(x, +INFINITY) - x) --
so it should really be thought of as a function of x.

Is that correct? *Please* give a straight answer to that question.

If I now understand you correctly, EPSILON and epsilon are two *very*
different things. If you had been deliberately trying to confuse the
issue, you could hardly have done a better job.

Richard Bos · May 21, 2009

CBFalconer said:
Which is NOT 1.0/3.0.

Yes, that is in fact exactly what it _is_ on Keith's system.

You can supply the two integers as integers

There are no integers in the division of 1.0 by 3.0.

Richard

Richard Bos · May 21, 2009

CBFalconer said:
You seem to feel that C programmers are not allowed to use arithmetic.

No, but I do feel that _competent_ C programmers know the difference,
and make the distinction, between real arithmetic and floating-point
arithmetic.

I will say no more.

That would be a good thing, but I suspect it won't turn out to be true.

Richard

Richard Bos · May 21, 2009

CBFalconer said:
I have already admitted that I made an error. I am glad to hear
that you have never made a mistake.

It is quite possible that Dik has, during his career, made as many
errors concerning numeric programming as you have made in this one
thread. If so, that is because he has written hundreds of times more
articles about it as you ever will.

Chuck, in this case, you are not only teaching your granny to suck eggs,
you're telling her to hard-boil them first.

Richard

Ben Bacarisse · May 21, 2009

CBFalconer said:
Bear in mind that all of these numbers have a least significant
significand bit of 1/8. When doing arithmetic, anything fed in and
combined will have all bits less than 1/16 stripped. The 1/16 bit
by itself controls all rounding. This is partially the effect of
going through the store/retrieve operation.

So, in short, neither of these are the ranges you have in mind. It
would be simpler if you just told us the ranges. Several people are
trying to get this answer. You have not even answered the question
about EPSILON. You say it is 1/16 and, later, "has the definition of
the C standard" which puts it at 1/8. So, please, what is EPSILON and
what are the ranges of these numbers?

So write all your numbers as rationals consisting of the sum of
some number of 16ths and something smaller. The smaller will get
ignored by the fp-system.

On the off-chance that your second statement is correct (EPSILON is as
per the C standard) and that the rounding is to be the normal
rounding, these are what I get:

f range of f
------------------
7/8 6/8 8/8
8/8 8/8 9/8
9/8 8/8 10/8
10/8 9/8 11/8
11/8 10/8 12/8

Are these, by any chance, the actual ranges that are so fundamental to
what a floating point number represents?

Believe it or not, I am becoming less confused as this goes on.

I am not encouraged to hear you started off confused

EPSILON has the definition of the C standard.

So it's 1/8 yes; not 1/16 as you used in your calculations?

It is for 1.0 and increasing.

What?

Flash Gordon · May 21, 2009

CBFalconer said:
I don't find the decimals easier.

They make it easier to see the gap. However...

Given tinyfloat is the type described above and the line of C
tinyfloat z = 1.0 + 1.0/16.0 - (1.0/8.0 * 1.0/16.0) /2.0;
What value is stored in z and why? This should be the same as the number
you would get if you implemented it.

To save you calculating it, that is half way between your xmax and ymin.
It is easily expressible as a decimal number, the one Kieth used, i.e.
tinyfloat z = 1.05859375;
Or using real arithmetic as (xmax+ymin)/2.

YES. Remember that xmax and ymin are values TO BE INSERTED into an
fp-object in order to generate the nextafter (or nextbefore)
fp-object-value. That is their only purpose. They are not
fp-object-values. If we average them, for example, considering
that xmax > ymin, we get a value < xmax and > ymin.

So what happens if you try and store that value (the average of xmax and
ymin) in an fp-object?

But the only purpose of xmax was to generate something that forced
a y. The only purpose of ymin was to generate something that
forced an x. These numbers were sufficiently different from x (and
y) to force a one bit difference in the significand. They were NOT
the exact fp-object-values formed.

You can still express the average value of the two in the C source code,
so you still need to consider what happens in that case.

Need Helping adding Square root code to an existing calculator. (Absolute begginer?)	0	Jan 12, 2025
How to alter the program so that when user types z or Z or 0, the program sets both a and b to zero?	0	Oct 10, 2022
Where is my mistake? Why is s equal to minus infinity at some loop iterations?	0	Oct 9, 2022
Comparison of Integer and Pointer (that's supposed to be an Integer). Where did I go wrong?	0	Nov 19, 2022
Structures and chained lists questions :	1	Feb 12, 2011
Rich Text Format (RTF) Document Builder in C++: Code and Features	0	Sep 28, 2025
Runtime Error with __gcd? (floating point exception)	1	Nov 27, 2024
Secure Keyboard v2.0 Modern C++ Virtual Keyboard for Windows (Glassmorphism UI, Clipboard Auto-Clear)	0	Mar 26, 2026

Float comparison

CBFalconer

CBFalconer

CBFalconer

CBFalconer

CBFalconer

Flash Gordon

CBFalconer

CBFalconer

CBFalconer

gwowen

Keith Thompson

gwowen

Keith Thompson

Keith Thompson

Keith Thompson

Richard Bos

Richard Bos

Richard Bos

Ben Bacarisse

Flash Gordon

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads