Float comparison

Keith Thompson · May 18, 2009

CBFalconer said:
I disagree. Example:

double x = 1.0;
double y = 3.0;
double z;

z := x/y;

You mean "=", not ":=".

We handed the fp system a description of the real 1/3. Now we are
worrying about what we get back.

No, x/y is a C expression that yields a value of type double. The
real value 1/3 is not a value of type double.

Yes. At the point where something holds 1.0, something else holds
3.0, and the system has been told to divide.

The division is performed on operands of type double, and yields a
result of type double. It cannot yield the real value 1/3.

Let's expand it a bit:

double divide(double num, double den) {
    double result = num / den;
    return result;
}

...

double x = divide(1.0, 3.0);

This shouldn't make any difference, but it should make it clearer that
the value assigned to x is a double value.

[...]

Guest · May 18, 2009

"adequate support"...

This whole thing is a tricky area. The standard does not keep up
with the trickery.

So the standard is not definitive in this area? I think this
contradicts things you said earlier. It seems you quote the
standard when it suits you and resort to "maths and physics"
when it doesn't.

This normally doesn't matter, until you are
worrying about exactly what the fp system can do for you. My
argument is primarily based on maths and physics, and the standard
when it agrees. It doesn't disagree, but it is failing to separate
various views.

you mean it doesn't support your view?

I have straightened out various ideas I had at the opening of
this. For example, when we started I didn't know (or care) whether
those values that defined the edges of ranges belonged to the
range.

Doesn't that leave holes in the real number range?

I now KNOW that they do not belong.

I wasn't being snide with you. I simply assumed you hadn't been
reading everything. I have the (probably erroneous) idea that
everything I write is crystal clear and now obvious to all.

!

Keith Thompson and Dik T Winter seemed to have problems
for starters

As I have said before, I make a lousy teacher.

I think you don't have a clear idea in your mind about
what you are trying to explain.

You consider floating point numbers to represent a range
of numbers in the reals. Probably bounded by the values
+/-epsilon. But you haven't thought through the
consequences. It seems neither the C standard nor the IEEE
standard agrees with you.

Richard Bos · May 18, 2009

CBFalconer said:
But I did. I don't care about the 1.0/3.0 expression. I care
about the fact that the system attempted to store the real number
1/3 in the fp object.

No, it didn't. The system never even attempted to _compute_ the real
number one-third, let alone store it anywhere.

Richard

Dik T. Winter · May 18, 2009

> Keith Thompson wrote: ....

> >> [#3] Floating types may include values that are not
> >> normalized floating-point numbers,
> >> ...

> >
> > Again, that's talking about denormalized numbers, NaNs, and
> > infinities, not about these ranges of yours. Note carefully the word
> > "may". A conforming implementation could have floating-point types
> > that don't have any of these things, so that paragraph needn't apply.
> > If your range model were valid, it would apply equally to such a
> > system. You can't reasonably use paragraph 3 to support your model.

>
> That simply extends the number of examples given. It does not
> restrict them.

Right, and it does *not* support your opinion. It states that the type
*may* include values that are not normalized floating-point numbers. If
you want to have your ranges included there, it is clear that they are
values that are not normalized floating-point numbers, that is, the type
can include them *in addition* to normalized floating-point numbers (which
are required).

Dik T. Winter · May 18, 2009

> This is a quickie, but floats are NOT closed. Reals are closed.
> For most implementations floats have specific changes each time the
> value is doubled (or halved). If you form:
>
> c = a + b;
> d = c - a;
>
> and examine b and d, they will normally be different.
> Precondition: a is greater than 2 * b, or b is greater than 2 * a.

This does *not* tell you that floats are not closed. Floats are closed.
If I take two floats (a and b), the operator + and calculate a + b, the
result is a float. That means the operation is closed.

What you are writing above is that they do not form a group, but that is
something completely different.

Ben Bacarisse · May 18, 2009

CBFalconer said:
Ben said:

... snip ...

[OK, the thread has moved on, but since I asked the question I
feel I should acknowledge the answer and comment on it, but feel
free to leave this unanswered if you think it will be neater to
follow only the other subthread.]

For most normal implementations, the range is x*(1-EPSILON) to
x*(1+EPSILON), with special considerations when x is a power of 2.

You need to say what EPSILON is and what special considerations
are needed. Without knowing these things no one can say if your
model is simple or complex, helpful or unhelpful.

I have said before that EPSILON is the one appropriate to the fp
system. I.e. DBL_EPSILON for doubles, FLT_EPSILON for floats,
LDBL_EPSILON for long doubles.

Then I think these ranges overlap. I thought you did not want them to
be overlapping.

nonsense. Subnormals don't appear until you reach extremely small
values.

It's not nonsense. I did not want any confusion over nextafter(0,
INFINITY) in case you answered my question about zero. Unless we
define a modified nextafter, this is likely to be a subnormal. It
will be on IEC 60559 compliant systems.

No, I didn't say that. xmax is a value that, when stored in a
fp-object, will produce a value larger than x when read back from
that fp-object. If that value is y, it can produce a ymin which,
when stored in a fp-object, will produce a value smaller than y.
That will be x. Note that ymin is NOT equal to xmax. In fact,
ymin < xmax (using real arithmetic).

There are a number of possible values for this xmax. Do you intend
that these ranges overlap and if so by how much?

If they are not intended to overlap then you can define xmax as the
smallest real that "converts" to a float > x and ymin to be the
largest that converts to x (x and y being consecutive floats). It
looks like ymin < xmax always but because the reals are closed under
taking limits, this definition will (for most real FP systems) result
in ymin == xmax. I.e. there will always be a larger ymin and a
smaller xmax until, in the limit, a unique real, b == xmax == ymin,
defines the boundary between the ranges of x and y. I suspect this is
not what you mean because you were adamant that ymin < xmax.

It would help me if you would clear this up. Did you really mean that
I can pick *any* real that converts to y as xmax or did you mean that
xmax is smallest such real? Of course, you might have intended a
third, as yet unspecified, way to pick xmax.

<snip>

Dik T. Winter · May 18, 2009

> Keith Thompson wrote: ....
>
> True. It isn't. However, it was stored. The fp-object didn't
> accept it unchanged.

Wrong. It is not the storing that changed it from 1.0/3.0 to whatever was
stored, it was the calculation that made the change. Or do you assert that
(1.0 / 3.0) * 3.0 == 1.0 should always be true (as nothing is stored ever)?

Dik T. Winter · May 18, 2009

> No, nextafter is a function that returns an FP number adjacent to a
> given one, where "adjacent" means that it's the next representable
> value in the specified direction. (I'm not sure why the direction is
> specified via a second FP number rather than more directly; perhaps
> the function as specified is more useful for certain calculations.)

You should ask IEEE for that, it is part of the IEEE standard. But I think
it is because it is a convenient way to use a single instruction to find
the next number in the direction of infinity, in the direction of -infinity
and in the direction of 0. All three can be useful at times.

Dik T. Winter · May 18, 2009

> EPSILON defines the minimum increment to x which requires a
> different fp-object value. I.e. any smaller increment to x will be
> ignored by the hardware. ....
> Because of the difference in EPSILON I mentioned above. The use of
> EPSILON or nextafter are simply two different ways of referring to
> the same general phenomenon.

Note that your EPSILON is *not* equal to nextafter(x, 2.0) - x, even when
x is in the range (1.0, 2.0). In many cases (exactly one half of the cases),
(nextafter(x, 2.0) - x)/2.0 is the minimum increment to x which gives a
different fp-object, and is not ignored by the hardware.

Dik T. Winter · May 18, 2009

> About 60 years ago I had a professor who hammered at all us
> 'students' with these concepts involving numbers, reals, integers,
> limits, etc. and insisted we learn methods that handled them all.
> I have forgotten a good deal of it. He was better at it than I am.

Can you tell me what computer had floating point some 60 years ago?

> And no, I am not including subnormals, NaNs, INFs, etc. Zero is a
> unique thing in floating implementations, necessary because
> multiplication (and division) by zero needs to be recognized. We
> can't just use the smallest representable normalized real.

Oh, well, ICL did get away with it (indeed, it had no zero in floating-point).

Keith Thompson · May 18, 2009

Dik T. Winter said:
You should ask IEEE for that, it is part of the IEEE standard. But I think
it is because it is a convenient way to use a single instruction to find
the next number in the direction of infinity, in the direction of -infinity
and in the direction of 0. All three can be useful at times.

Ok, that makes sense. But INFINITY isn't necessarily available (C99
7.12p4). I suppose you could use one of the HUGE_VAL macros instead.

Flash Gordon · May 18, 2009

CBFalconer said:
Flash Gordon wrote:
.... snip ...

No, but it provides another means of discovering the 'range'. What
I mean by that 'range' of x is those real number values that, when
stored in a fp-object, will result in a stored value exactly equal
to x.

OK, so given the C line

double x = some_constant_expression;

If the result is that x contains the exact floating point value 1.0,
then some_constant_expression is in the range for 1.0.
If it sets x to the value nextafter(1.0,2.0) then it is in the next
range up.

Is this correct? If so then I can ask my next question...

I am only using one fp-system.

I'm not sure if you mean you are only using one specific implementation,
or if you mean you are only using one specific model.

The existence of EPSILON is
also ephemeral.

I don't know what you mean, but I don't think it is relevant.

The critical thing is the variation in the
significand of the fp value, provided the exponent remains
unchanged (normalized).

C allows for unnormalised numbers, so any model which claims to allow
for all valid C implementations (and even be what the standard means)
MUST allow for unnormalised numbers.

You can see this easily in hex dumps of
the fp-object.

Irrelevant.

The point of the 'range' is that it defines the inherent error
levels due to the storage format.

Limitations are not necessarily errors. The storage format can be used
with absolutely no error under the right conditions. These are
conditions which *are* met by some applications written in C.

Flash Gordon · May 18, 2009

CBFalconer said:
Not quite. x represents >xmin to <xmax
y represents >ymin to <ymax
x < y
xmax > ymin

and the 'represents' means what happens when you store the value
and then read it back from the fp-object. x can't represent xmax,
because storing xmax produces y, which is > x.

xmax and xmin have the sole purposes of defining the 'range' for x,
when manipulated in the fp-system under discussion.

I admit I have diddled about with the specifications as it became
clearer what I was talking about.

No, they don't overlap. Remember that the store/retrieve cycle is
necessary to make the system exhibit the characteristics. In fact,
to use the fp-system.

No, they are calculated values that will force representation of
the 'adjacent' fp-object when stored. Nothing more.

I haven't kept track. It is quite possible.

xmax is calculated from x to generate y. ymin is calculated from y
to generate x.

Because we have to keep track of what we are talking about. Are
you claiming there is nothing confusing about this discussion?

I thought I was answering that.

I don't know what you are getting at.

I was trying to get a concrete example that we could talk about. For
example the ranges could be 0.9<=valone<1.1 and 1.1<=valoneandabit<1.2.
Obviously those would not be valid for a C implementation, and the
decimal numbers specified are mathematical real numbers, as are the
comparison operators mathematical operators.

The answer is 100.0.

No it is not. In fact, that is not even related to the question.

If by
ranges you mean what I defined as ranges, for valone the range is

1.0+DBL_EPSILON .. 1.0-DBL_EPSILON/2 (not including ends)

I want specific numerical values. If the bounds you specify are not
included that is important.

for valoneandabit

100.0*(1+DBL_EPSILON) .. 100.0*(1-DBL_EPSILON/2)

No it isn't. It will be the range of the next representable number above
1.0 in the direction of 100, i.e. a range only a very small amount
larger than 1.0.

again not including the end points.

Which, when you get the correct values, will be an important point, so
think carefully.

These are running into the
anomaly (for most systems) where the basic value is exactly a power
of 2 and the epsilons change.

I want the values for a REAL implementation, one you have to hand by
preference. Not symbols, but actual real numbers. I accept that other
implementations will be different, but I believe I can prove some of my
points on one specific implementation of *your* choice.

So don't say 1.0+DBL_EPSILON etc, do the calculation and give the exact
number.

Flash Gordon · May 18, 2009

CBFalconer said:
Flash Gordon wrote:
.... snip ...

No. You have the value 0.333333333333. You assume this is the
result of dividing 1.0 by something, and you use the limits of the
'range' of the stored value to find the range of possible
divisors.

No you don't. Please actually read up on the details. This is a well
known technique that was not devised by you or me, but by people FAR
more knowledgeable than either of us. The technique is that you start
with the exact answer you have and work back to find out the exact
question which would have given it. You then compare this question with
the question that was actually asked to determine the error. You
explicitly do NOT start with a range.

Try it, and you will find that is a very narrow range.

For a simple situation your method works. This method works for very
complex situations as well.

Then store those computed possibilities and extract them again,
thus finding out the fp-object-values that will do those things for
you.

If you wish set the divisor to 3, and mess with the numerator to
find a range of possibilities. Don't play with both - that leads
to madness.

This is a method of error analysis, NOT what the C code does. Please
actually read the papers on it.

Currently you are basically saying that the experts are doing completely
the wrong thing without even having looked at what they are doing.
Here is the link I provided before
http://www.physics.arizona.edu/~restrepo/475A/Notes/sourcea/node13.html
Read it, and note that this is what Physicists do in the real world,
since you have been mentioning physics in this thread. This link even
explains *why* this method is used.

Flash Gordon · May 18, 2009

CBFalconer said:
Keith said:

CBFalconer said:

Keith Thompson wrote:
... snip ...
Let's try a simple concrete question. Given the declaration
double x = 1.0;
what *exactly* is the range of real numbers represented by the
stored value of x? Assume a typical FP system with FLT_RADIX==2.
If some aspects of the range are implementation-defined, please
say so.
The upper point (which I have been calling xmax) is exactly at
1.0+DBL_EPSILON. This is spelled out by the C standard. Everything
else is implementation defined. The lower point (xmin) is probably
at 1.0-DBL_EPSILON/2. Note that these two numbers are not within
the range for 1.0, but do delimit it.

[...]

Thank you for trying to define what this "range" is. You've still
got it wrong, as far as I can tell.

Here's a graph showing five consecutive FP numbers, each of which is
exactly representable as a double; we can use nextafter() to define
their relationships. (View this in a fixed-width font.)

      ***************
|----|----|----------|----------|
a    b    c          d          e
          1.0

a is 1.0-DBL_EPSILON
b is 1.0-DBL_EPSILON/2
c is 1.0
d is 1.0+DBL_EPSILON
e is 1.0+DBL_EPSILON*2

You're saying that the range represented by y goes all the way to both
of its neighbors, covering the range marked by asterisks. Unless your
ranges substantially overlap with each other, this doesn't make much
sense.

Forget a and e.

They are relevant to the discussion.

You have calculated b and d via DBL_EPSILON.

That is obvious.

These mark numbers that cannot be stored in the object c.

<snip>

The C standard states that 1.0+DBL_EPSILON *can* be stored in a C
object of type double. Specifically it is the smallest value greater
than 1.0 that *can* be stored.

Since your response was fundamentally wrong at this point there is no
point in going further.

Flash Gordon · May 18, 2009

CBFalconer said:
This whole thing is a tricky area.

Not really. If you treat floating point numbers as another type of
number (just as integers and rationals are treated in maths as different
types of numbers) with their own limitations it becomes easy.

The standard does not keep up
with the trickery.

It does not need to.

This normally doesn't matter, until you are
worrying about exactly what the fp system can do for you.

The model in the standard *does* define what it does.

My
argument is primarily based on maths and physics,

Yet you seem to have failed to read the paper on error analysis in
physics that I pointed you at.

and the standard
when it agrees. It doesn't disagree, but it is failing to separate
various views.

The standard describes something other than real maths.

I have straightened out various ideas I had at the opening of
this. For example, when we started I didn't know (or care) whether
those values that defined the edges of ranges belonged to the
range. I now KNOW that they do not belong.

That leads to issues which I will go into else-thread when I have your
position clarified more.

I wasn't being snide with you. I simply assumed you hadn't been
reading everything.

I could not have pointed out a number of things I have without reading.

I have the (probably erroneous) idea that
everything I write is crystal clear and now obvious to all. As I
have said before, I make a lousy teacher.

It isn't clear, hence more than one person (including me) trying to get
you to specify things exactly.

The model in the C standard, on the other hand, is perfectly clear.

Keith Thompson · May 18, 2009

CBFalconer said:
Flash Gordon wrote: [...]

OK, if that is all you can cope with, then I will ask my question given
EXPLICIT EXACT VALUES. First I need values you will accept. So..

Here are two lines of C code.

int main(void)
{
    double valone = 1.0;
    double valoneandabit = (valone, 100.0);
    /* rest of program irrelevant to discussion */
    return 0;
}

Now, I know you have access to a C compiler. If on YOUR SPECIFIC
compiler you were to compile and run the above, what are the EXACT
numerical ranges for valone and valoneandabit. I.e. I am asking for a
concrete numerical answer.

I don't know what you are getting at. The answer is 100.0.

There was an error in Flash's code. Rather than

double valoneandabit = (valone, 100.0);

he meant to write:

double valoneandabit = nextafter(valone, 100.0);

[...]

CBFalconer · May 18, 2009

.... snip ...

Doesn't that leave holes in the real number range?

Far from it. The only purpose, and use, for those extreme values
(which I called xmax and xmin) is to induce the fp-system to
generate the next larger and smaller fp-values. Thus the 'range'
for x CANNOT include xmax, as it generates y. Similarly for xmin,
which generates the smaller than x value. The only relationship
known is that xmax > ymin, which may seem anti-social.

Note that the 'range' for x is <xmax through x to >xmin.

CBFalconer · May 18, 2009

Flash said:
No you don't. Please actually read up on the details. This is a well
known technique that was not devised by you or me, but by people FAR
more knowledgeable that either of us. The technique is that you start
with the exact answer you have and work back to find out the exact
question which would have given it. You then compare this question with
the question that was actually asked to determine the error. You
explicitly do NOT start with a range.

You aren't considering actuality. You don't have an 'exact
answer'. You have an fp-value, which can represent any real in the
'range' for that value. That 'range' may be tight enough so that
you can ignore it, but if not you have to consider it.

CBFalconer · May 18, 2009

Keith said:
And how exactly does one store a real number value in a floating-
point object? Can you demonstrate with some C code? I assert
that only a floating-point value can be stored in a floating-
point object (where floating-point values are a subset of real
values).

For the value 1/3 you can pass the integers 1 and 3 to a fp
division routine, and store whatever appears.

float storeratio(int num, int denom) {
return (float)num / denom;
}

You don't really need the function, but it clears up what is being
done. The fp-system received two integers, and converted them into
a representation of their ratio. Is there any argument about the
value of the ratio of 1 to 3?
