Float comparison

Flash Gordon · May 21, 2009

CBFalconer said:
Flash Gordon wrote:
.... snip ...

I did that in another message in the last 2 days to Keith.

I had not seen that at the time I wrote the above. However, you failed
to answer the subsequent question that Keith posed (which is the same as
I would have asked, and now have re-asked in slightly different words).

You often fail to read the entire thread before responding (you have
stated this), and so post advice which has either already been covered
or already been shown to be wrong. So why comment if someone else does
it occasionally?

squeamz · May 21, 2009

... snip ...

You seem to feel that C programmers are not allowed to use
arithmetic. I will say no more.

The crazy ship continues to sail...

Antoninus Twink · May 21, 2009

People do not seem to appreciate my attempt to inject a touch of
humor/humour.

Most people are unlikely to be in the mood for a laugh when they're in
the middle of plowing through your latest absurd misunderstandings in
this great big car wreck of a thread.

Antoninus Twink · May 21, 2009

I can't use Saddam's methodology.

Well, this thread seems to be a mighty powerful weapon of mass
distraction.

Antoninus Twink · May 21, 2009

To acknowledge complete ignorance of basic ideas and to announce a
refusal to ever even look into it is somewhat sad, I suppose.

Completely characteristic of Heathfield, though - he's closed-minded in
every conceivable respect.

Kenny McCormack · May 21, 2009

Most people are unlikely to be in the mood for a laugh when they're in
the middle of plowing through your latest absurd misunderstandings in
this great big car wreck of a thread.

Heh heh. You know, in the other thread, where I pointed out that
neither Kiki nor Chuckie have any sense of humor, I almost wrote that
neither of them could even spell the word. But I didn't. I thought it
would be a bit trite to use that tired old cliche.

But here, Chuckie has gone out of his way to demonstrate exactly that.

CBFalconer · May 21, 2009

You snipped the demo, which I am re-inserting below. Note the
first sentence.

--------------- reinsert ----------

Not so. As an elementary demo, I am using double to prepare
a 'real value', and float to demonstrate.

#include <stdio.h>
#include <float.h>

/* Show that a value is altered by storing in a float */
void demo(void) {
    volatile double d; /* volatile to ensure store/loads happen */
    volatile float f;

    d = 1.0/3;
    f = 0; /* just to ensure d is actually stored */
    f = d;
    printf("DBL_DIG=%d\t d=%.*e\n", DBL_DIG, DBL_DIG+2, d);
    printf("FLT_DIG=%d\t f=%.*e\n", FLT_DIG, FLT_DIG+2, (double)f);
}

/* ----------------- */

int main(void) {
    demo();
    return 0;
}

[1] c:\c\junk>a
DBL_DIG=15 d=3.33333333333333315e-01
FLT_DIG=6 f=3.33333343e-01

---------- end reinsert ----------

It's the implicit conversion from double to float that changes the
value. The value stored in f by this assignment is a value of type
float, and that value is unaltered by the fact that it's stored in an
object.

The use of double is simply to simulate a real value. I am not
about to write and publish a rational arithmetic system for this
discussion. It has demonstrated that the double held a value.
Placing that precise number in a float altered its value. You can
ignore the use of double, and imagine that that type is real. The
values wouldn't be the same, in fact I would have a difficulty
printing them as a single number. But it would show the process.

If I remove the '+2' from the field size specification, the numbers
will look like what most people are used to.

It's true that the conversion is performed by the assignment, as
described by 6.5.16.1p2, so you've raised a point that I had
neglected. But it's not a particularly relevant point.

If the program were modified so it only uses objects of type double,
there would be no conversion on the assignment, and the value stored
would simply be the result of evaluating the RHS of the assignment.

You've been implying, I think, with your use of the term
"fp-object-value", that it's not possible to understand what a
floating-point value is unless it's stored in an object. That's
wrong.

No. That's true. Just look at the output of the demo. I inserted
3.33333333333333315e-01 into the float. I found 3.33333343e-01
there. The float can't hold 3.33333333333333315e-01. If you look
at the binary construction of the float, the reason becomes
obvious. It has nothing to do with processing. It has everything
to do with storing in a float. Bits have been chopped off.

CBFalconer · May 21, 2009

Joe said:
CBFalconer wrote:
.... snip ...

[1] c:\c\junk>a
DBL_DIG=15 d=3.33333333333333315e-01
FLT_DIG=6 f=3.33333343e-01

.... snip binary display ...

That double is truncated when stored as float should be news to us?

No. However people appear to be doing so.

CBFalconer · May 21, 2009

You told me the range of x and y.
What is the range represented of z = x * y?

Is it 1 +/- epsilon/2?
Or is it = 1 +/- epsilon + O(epsilon**2), the result from
interval arithmetic?

For all those things, just substitute extreme range values for the
input, apply the function (such as multiplication) and see the
extremes of the answer. But don't forget to put that calculated
extreme into the fp-value, so that it becomes a fp-object value.

CBFalconer · May 21, 2009

Keith said:
CBFalconer said:

Keith Thompson wrote: [...]

Then the division is evaluated. The division operation takes two
double operands, with values 1.0 and 3.0, and yields a floating-point
result; on my system, that result is exactly
0.333333333333333314829616256247390992939472198486328125 .

Which is NOT 1.0/3.0.

It's not 1.0/3.0 if you take 1.0/3.0 to be a real expression. It
*is* 1.0/3.0 if you take 1.0/3.0 to be a C expression of type
double. (On my system; the exact value may vary on other systems.)

1/3 (integer division), 1.0/3.0 (FP division), and 1.0/3.0
(mathematical real division, not supported in C) are three
different expressions with three different values.

You can supply the two integers as integers
in a structure, for example. You can use that to build a whole
rational arithmetic system, but we won't bother.

Good, because it's irrelevant. There are numerous ways to
represent numbers in C. The way we're talking about here is the
built-in floating-point facility, not some other facility that
might be implemented in C.

No, it isn't. If you want to talk only about what is going on in
the fp-processor, ensure you write all values in terms it
understands. That means hex displays of the content of the
fp-object. No exceptions.

Any other version requires conversions and approximations. Then
you have to keep track of the errors.

CBFalconer · May 21, 2009

Richard said:
Yes, that is in fact exactly what it _is_ on Keith's system.

There are no integers in the division of 1.0 by 3.0.

Oh? Are you denying that 1.0 is an integer? Or that 3.0 is an
integer?

CBFalconer · May 21, 2009

Richard said:
CBFalconer said:

Yes, you did.

No, it wasn't.

Yes, there is.

Oh, I am capable of understanding the distinction, but I have to
confess I didn't realise you were making it. If it is your claim
that there is exactly one real number that can be specified to a C
program, perhaps you would care to save everyone a lot of time by
identifying which specific real number it is. Then we can point out
some others, which will vastly expand your experience of
programming. If, on the other hand, it is not your claim that there
is exactly one real number that can be specified to a C program,
how many do you think can be so specified? Is it, by any chance,
exactly the same number of reals as there are distinct
floating-point values? Why, I do believe it is!

Maybe we should record this imbecilic reply as a demonstration of
the intelligence level required to use c.l.c. And then maybe not,
it might encourage similar foolishness.

CBFalconer · May 21, 2009

Keith said:
.... snip ...

tinyfloat x = 1.0;
tinyfloat foo = 1.05859375; // halfway between ymin and xmax
tinyfloat y = 1.125;

Certainly not. I asked what would be the third line of output; you
didn't answer.

You said "You are not going to be able to form the fp-value foo". But
the C code I presented is perfectly valid, and it will produce *some*
output. By using the FP constant 1.05859375 to initialize foo, *some*
value will be stored in foo. By passing foo to printf, *something*
will be printed.

What is the output? (If it's ambiguous, feel free to say so and
present one or more possibilities.)

Look at what the foo becomes when all bits meaning something less
than 1/16 are dropped, and it is then rounded on the basis of the
LS bit remaining. It becomes precisely equal to ymin, which when
stored creates an x value. So it prints as 1.0. Isn't that
obvious?

CBFalconer · May 21, 2009

Keith said:
As I recall, somebody else introduced the names xmin, xmax, ymin,
and ymax. You've redefined them in some way that I still haven't
figured out; they were never intended for the sole purpose of
generating the nextafter "fp-object-value". You've associated
ymin with x, and xmax with y.

I don't recall that, and as a concession to my failing short-term
memory I am going to stick with my definitions.

They are values that, when stored in an fp-object, will produce the
adjacent (to x, or y) fp-object-value. They are the smallest (or
largest) such value. That is their only purpose.

CBFalconer · May 21, 2009

Flash said:
CBFalconer said:

Keith said:

[...]
Alright, this has already gone over the calculation of xmax and
then generating y. Summarized:

x = 1.0
xmax = 1.0 + 1/16 (= x + EPSILON = x*(1+EPSILON) = x*(1+1/16))
y = 1.0 + 1/8

now we calculate ymin by using y*(1-EPSILON). Substitute the value
of y:

ymin = (1.0+1/8)*(1-1/16)

This is NOT the same as xmax. It differs by 1/8 * 1/16.

I hope this answers your question.

And here we have a result that I have difficulty believing is
what you intended.

Let's put all this in decimal notation, so we can compare
things more easily:

x = 1.0
xmax = 1.0625
ymin = 1.0546875
y = 1.125

I don't find the decimals easier.

They make it easier to see the gap. However...

Given tinyfloat is the type described above and the line of C
tinyfloat z = 1.0 + 1.0/16.0 - (1.0/8.0 * 1.0/16.0) /2.0;
What value is stored in z and why? This should be the same as the number
you would get if you implemented it.

To save you calculating it, that is half way between your xmax and ymin.
It is easily expressible as a decimal number, the one Keith used, i.e.
tinyfloat z = 1.05859375;
Or using real arithmetic as (xmax+ymin)/2.

But it can't be used in the tinyfloat system. For values in the
range from 1.0 to less than 2.0 the least significant bit
represents 1/8. The next bit, which will control rounding,
represents 1/16. All other bits represent smaller quantities, and
are discarded. Remember, this is a digital system.

So your z is handled as 1 + 0. i.e. as x. We can't tell the
difference between:

1000000000
and 1000011111
^_ because this bit controls rounding.

CBFalconer · May 21, 2009

Ben said:
.... snip ...

So, in short, neither of these are the ranges you have in mind.
It would be simpler if you just told us the ranges. Several
people are trying to get this answer. You have not even answered
the question about EPSILON. You say it is 1/16 and, later, "has
the definition of the C standard" which puts it at 1/8. So,
please, what is EPSILON and what are the ranges of these numbers?

EPSILON is the smallest amount that can be added to 1.0 and form a
fp-object-value larger than 1.0. In other words all smaller values
are ignored. For the tinyfloat system, this is 1/16, which will
get rounded up to 1/8 in processing.

This is slightly more rigid than the actual C standard wording, but
the standard phrasing is compatible.

CBFalconer · May 21, 2009

Keith said:
CBFalconer said:

Not so. As an elementary demo, I am using double to prepare a
'real value', and float to demonstrate.

#include <stdio.h>
#include <float.h>

/* Show that a value is altered by storing in a float */
void demo(void) {
    volatile double d; /* volatile to ensure store/loads happen */
    volatile float f;

    d = 1.0/3;
    f = 0; /* just to ensure d is actually stored */
    f = d;
    printf("DBL_DIG=%d\t d=%.*e\n", DBL_DIG, DBL_DIG+2, d);
    printf("FLT_DIG=%d\t f=%.*e\n", FLT_DIG, FLT_DIG+2, (double)f);
}

/* ----------------- */

int main(void) {
    demo();
    return 0;
}

[1] c:\c\junk>a
DBL_DIG=15 d=3.33333333333333315e-01
FLT_DIG=6 f=3.33333343e-01

It's the implicit conversion from double to float that changes the
value. The value stored in f by this assignment is a value of type
float, and that value is unaltered by the fact that it's stored in an
object.
.... snip ...

You've been implying, I think, with your use of the term
"fp-object-value", that it's not possible to understand what a
floating-point value is unless it's stored in an object. That's
wrong.

CBFalconer · May 21, 2009

Flash said:
I had not seen that at the time I wrote the above. However, you
failed to answer the subsequent question that Keith posed (which
is the same as I would have asked, and now have re-asked in
slightly different words).

You often fail to read the entire thread before responding (you
have stated this), and so post advice which has either already
been covered or already been shown to be wrong. So why comment
if someone else does it occasionally?

I wasn't 'commenting'. I was aiming you at another reply, that I
thought covered the matter. Or so I thought.

CBFalconer · May 21, 2009

Keith said:
.... snip ...

So EPSILON is your abbreviation for FLT_EPSILON, DBL_EPSILON, or
LDBL_EPSILON. And epsilon is a value that varies for different
values of x; it seems to be something like
(nextafter(x, +INFINITY) - x) -- so it should really be thought
of as a function of x.

Is that correct? *Please* give a straight answer to that
question.

Almost. EPSILON is the smallest amount that, when added to 1.0 in
the fp-system, results in the next higher fp-value. epsilons are
the equivalent real values, calculated as x*EPSILON for any x. The
rounding systems of the usual fp-systems will make epsilon ==
EPSILON for x >=1.0 and < 2.0. AFTER THAT epsilon has been put into
the fp-system.

If I now understand you correctly, EPSILON and epsilon are two
*very* different things. If you had been deliberately trying
to confuse the issue, you could hardly have done a better job.

Believe me, I haven't been trying to confuse things. I have been
trying to straighten out when I am talking about reals,
fp-object-values, EPSILON, epsilon, etc.

Dik T. Winter · May 21, 2009

> "Dik T. Winter" wrote: ....
>
> The fp-systems I am concerned with use positive values and a sign
> bit to negate. I expect the single bit that is dropped in
> truncating the normalized result to significand length will control
> the rounding. I think.

You may think so, but it is false on current processors. The rounding rules
are as follows:
(1) if the mathematical result is more than halfway between two successive
representable numbers, the result will be the larger of those.
(2) if the mathematical result is less than halfway between two successive
representable numbers, the result will be the smaller of those.
(3) if the mathematical result is exactly halfway between two successive
representable numbers, the result is the number with a least
significant bit of 0.
Note the "exactly halfway". This means that *all* bits in the mathematical
result count. So, in binary, using a four bit mantissa the rounding is as
follows (first column mathematical result, second column the fp result):

1101.1001 1110
1101.0111 1101
1101.1000 1110
1110.1000 1110
