Float comparison

Keith Thompson · May 21, 2009

CBFalconer said:
You snipped the demo, which I am re-inserting below. Note the
first sentence.

I acknowledge that the last part of my statement wasn't entirely
correct. Given an assignment:
obj = expr;
the value of expr is converted to the type of obj. If they're not
already of the same type, that conversion can change the value.

But this is not relevant to your claim that FP values represent
ranges.

[big snip]

No. That's true. Just look at the output of the demo. I inserted
3.33333333333333315e-01 into the float. I found 3.33333343e-01
there. The float can't hold 3.33333333333333315e-01. If you look
at the binary construction of the float, the reason becomes
obvious. It has nothing to do with processing. It has everything
to do with storing in a float. Bits have been chopped off.

Of course a float can't necessarily hold a value of type double. This
is not news to anyone.

How does this imply that, as you claim, it's not possible to
understand what a floating-point value is unless it's stored in an
object?

Given:
double d = some_value;
float f;
the assignment
f = d;
does two things: it converts the value of d from double to float, and
it stores the result of the conversion in f.

Yes, conversions can lose bits.

But a double value that is the result of evaluating an expression is
not altered when it is stored in a double object. The value you store
and the value you retrieve are the same.

Keith Thompson · May 21, 2009

CBFalconer said:
For all those things, just substitute extreme range values for the
input, apply the function (such as multiplication) and see the
extremes of the answer. But don't forget to put that calculated
extreme into the fp-value, so that it becomes a fp-object value.

It would be helpful if you could provide an answer that doesn't
require us to understand and accept your model.

Keith Thompson · May 21, 2009

CBFalconer said:
Keith said:

CBFalconer said:

Keith Thompson wrote: [...]
Then the division is evaluated. The division operation takes two
double operands, with values 1.0 and 3.0, and yields a floating-point
result; on my system, that result is exactly
0.333333333333333314829616256247390992939472198486328125 .

Which is NOT 1.0/3.0.

It's not 1.0/3.0 if you take 1.0/3.0 to be a real expression. It
*is* 1.0/3.0 if you take 1.0/3.0 to be a C expression of type
double. (On my system; the exact value may vary on other systems.)

1/3 (integer division), 1.0/3.0 (FP division), and 1.0/3.0
(mathematical real division, not supported in C) are three
different expressions with three different values.

You can supply the two integers as integers
in a structure, for example. You can use that to build a whole
rational arithmetic system, but we won't bother.

Good, because it's irrelevant. There are numerous ways to
represent numbers in C. The way we're talking about here is the
built-in floating-point facility, not some other facility that
might be implemented in C.

No, it isn't. If you want to talk only about what is going on in
the fp-processor, ensure you write all values in terms it
understands. That means hex displays of the content of the
fp-object. No exceptions.

Any other version requires conversions and approximations. Then
you have to keep track of the errors.

Wrong. 0.333333333333333314829616256247390992939472198486328125 is
not an approximation.

I'm not talking about the "fp-processor". I'm talking about the
semantics of floating-point types in C.

Keith Thompson · May 21, 2009

CBFalconer said:
Oh? Are you denying that 1.0 is an integer? Or that 3.0 is an
integer?

Yes. 1.0 and 3.0 are of type double, not of any integer type.

Remember, 1.0/3.0 was presented as a C expression.

Keith Thompson · May 21, 2009

CBFalconer said:
Look at what the foo becomes when all bits meaning something less
than 1/16 are dropped, and it is then rounded on the basis of the
LS bit remaining. It becomes precisely equal to ymin, which when
stored creates an x value. So it prints as 1.0. Isn't that
obvious?

No.

If it was so bloody obvious, why didn't you just say so in the first
place? And what happened to your claim that "You are not going to be
able to form the fp-value foo"?

Dik T. Winter · May 21, 2009

> "Dik T. Winter" wrote: ....
>
> My opinion has to include a picture of the processor, which is what
> I am trying to avoid in all of this.

But when you are talking about "storing", you *do* include the processor.

> If you insist, I would expect
> the change to occur when the divide routine runs out of bits to
> generate (from some counter) and rounds the result on the basis of
> the next bit computed and discarded.

Well, apparently you do not know how processors do division. A processor
as you describe it can not correctly implement IEEE arithmetic.

> Before that the divide
> algorithm is in process, and there is no definitive point for
> approximation involved. This is only an opinion.

But even in your model the result is not rounded when it is stored, but
before it is delivered by the processor, and so is not an error due to
storing but due to the algorithm.

You should really have a look at the paper by David Goldberg with the title:
What Every Computer Scientist Should Know About Floating-Point Arithmetic.
I think the most accessible version of it is at:
<http://docs.sun.com/source/806-3568/ncg_goldberg.html>

Ben Bacarisse · May 21, 2009

CBFalconer said:
EPSILON is the smallest amount that can be added to 1.0 and form a
fp-object-value larger than 1.0. In other words all smaller values
are ignored. For the tinyfloat system, this is 1/16, which will
get rounded up to 1/8 in processing.

So what are the ranges of the numbers in question? I have been trying
to get this simple information from you.

I don't dare try to do the calculations myself because I don't know
how you interpret the rounding of 1-EPSILON, nor do I know whether the
calculation is to be all floating point, or exact up until the final
rounding.

So, can you please tell me the ranges you are thinking of for the
numbers I gave:

  f      range of f
 --------------------
  7/8    6/8     8/8
  8/8    8/8     9/8
  9/8    8/8    10/8
 10/8    9/8    11/8
 11/8   10/8    12/8

You know the rounding mode need not be fixed in C, so the ranges
depend on the rounding mode in effect at some time; but what time? Does
the meaning of a float change retrospectively when the rounding mode
changes or do I have to re-store it to make it "mean" the new range?
Is x = x; enough to do this?

This is slightly more rigid than the actual C standard wording, but
the standard phrasing is compatible.

No it is not. There is no ambiguity about the C definition. Even if
the phrasing were ambiguous, an exact formula is given: b^(1-p), with
b and p defined in the same section, has one and only one meaning. For
your system it is 1/8, not 1/16.

You are entitled to invent a new term and define it how you like, but
maybe you can appreciate how frustrating it is to be told again and
again that you mean what C means by the term, only to find out now
that you don't.

Keith Thompson · May 21, 2009

CBFalconer said:
I don't recall that, and as a concession to my failing short-term
memory I am going to stick with my definitions.

They are values that, when stored in an fp-object, will produce the
adjacent (to x, or y) fp-object-value. They are the smallest (or
largest) such value. That is their only purpose.

You snipped the majority of my article, in which I asked you several
specific questions. You did so without marking the snippage,
which I find rude.

Given your definitions of xmin and xmax, I do not care about them.
I find your definitions arbitrary and incoherent.

I asked about the bounds (real numbers) of the range represented
by the FP number x. In text that you ignored, I suggested calling
these bounds x_lower and x_upper. If a floating-point number
represents a real range, surely it makes sense to ask about the
bounds of that range. I'd say it makes very little sense to
talk about the range without considering the bounds that define it.

Here, I'll copy-and-paste my questions again:

[BEGIN QUOTE]

In your model, each floating-point number (call it an
"fp-object-value" if you like) represents a range of real numbers,
correct? If x is a floating-point number, then the real values xmin
and xmax are the lower and upper bounds of that range. Likewise for
y, ymin, and ymax.

If you want to use the names {x,y}{min,max} for some other purpose, we
can refer to the range bounds as x_lower, x_upper, y_lower, and
y_upper.

Let's be clear. x is a floating-point number. In your model, x
represents a range of real numbers. I'm calling the bounds of that
range x_lower and x_upper.

If x and y are consecutive representable floating-point numbers
(nextafter(x, +INFINITY) == y), then one of the following must be
true:

(1) x_upper == y_lower, and the ranges meet at just one point
(2) x_upper < y_lower, and there's a gap between the ranges
(3) x_upper > y_lower, and there's an overlap between the ranges

[END QUOTE]

So, which is it?

Dik T. Winter · May 21, 2009

....
A bit harsh, William:

> I'm a little surprised by this response. I had taken
> you to be someone who is interested in technical
> accuracy and curious about learning new things.
> Lebesgue measure is a concept that accurately
> captures the view you are trying to represent
> with the naive term "vanishingly small", and
> one which is encountered very early in one's
> study of mathematics.

Mark the word "mathematics" here. Most computer scientists and programmers
do have a mathematical education; this did change about 35 years ago.
But even mathematicians will occasionally refer to "vanishingly small" without
reference to exact terminology.

Moreover, as far as I know, the term is not restricted to Lebesgue measure but
can be used with any kind of measure. When you have a set S, a nonempty
subset S1, and a measure M on sets, S1 is vanishingly small with respect to S
when M(S1) = 0 and M(S) > 0.

Dik T. Winter · May 21, 2009

>
> So, in short, neither of these are the ranges you have in mind. It
> would be simpler if you just told us the ranges. Several people are
> trying to get this answer.

I think what CBF has learned back in 1949 was the use of the "ulp" (although
I do not think that term was in use at that time, because I think Wilkinson
coined the term in the 1950's).

Take a floating-point number in a certain representation. Assume no hidden
bits (that makes it a bit easier to understand, but it can be
adjusted to work with hidden bits). One "ulp" of that floating-point number
is the value you get when the least significant digit is set to 1 and all
other digits to 0 (this also works if the f-p system is not binary).
("ulp" means "unit of the least significant position".)

An "ulp" is a semi-relative error. That is, it remains the same for a range
of values but jumps up or down when going out of it. In the 1950's and
1960's much error-analysis was done in terms of "ulp"s. In some reports I
wrote in the 1970's, I specify that the result of a specific routine is
(e.g.) 4 "ulps" from the mathematical value.

Hope this clears something up.

Dik T. Winter · May 21, 2009

> Heh heh. You know, in the other thread, where I pointed out that
> neither Kiki nor Chuckie have any sense of humor, I almost wrote that
> neither of them could even spell the word.

Can you spell the word? My dictionaries say it is "humour".

It was pretty amusing when I had some text in British spelling read
out by Apple's pronunciation software (it was in the time of the Mac, some
20 years ago). Alas, it did not show the difference between "-ise" and
"-ize" endings, or between single and double "l"s in words like travel(l)ed.

Yes, spelling.

"I before e except after c, and except in weird"
"A pint is a pound the world around except outside the USA".

CBFalconer · May 21, 2009

Keith said:
.... snip ...

But a double value that is the result of evaluating an expression is
not altered when it is stored in a double object. The value you store
and the value you retrieve are the same.

For heaven's sake, forget the double. I just used that as a way to
process something that had more precision than a float, to show how
the float lost the precision. The thing is that the float object
has a different range than the double object. It is possible to
have them both holding the identical number - that just requires a
bunch of trailing 0 bits in the double, and a suitable exponent.

And, that 'expression' can be of any type. When you process the
expression and store it in a float, it may lose precision. It
stops representing a single value (unless you consider the
processing) and represents a range of values.

CBFalconer · May 21, 2009

Keith said:
It would be helpful if you could provide an answer that doesn't
require us to understand and accept your model.

But I have presented the model to enable understanding of the
problems.

Ben Bacarisse · May 21, 2009

Dik T. Winter said:
I think what CBF has learned back in 1949 was the use of the "ulp" (although
I do not think that term was in use at that time, because I think Wilkinson
coined the term in the 1950's).

Take a floating-point number in a certain representation. Assume no hidden
bits (that makes it a bit easier to understand, but it can be
adjusted to work with hidden bits). One "ulp" of that floating-point number
is the value you get when the least significant digit is set to 1 and all
other digits to 0 (this also works if the f-p system is not binary).
("ulp" means "unit of the least significant position".)

An "ulp" is a semi-relative error. That is, it remains the same for a range
of values but jumps up or down when going out of it. In the 1950's and
1960's much error-analysis was done in terms of "ulp"s. In some reports I
wrote in the 1970's, I specify that the result of a specific routine is
(e.g.) 4 "ulps" from the mathematical value.

Hope this clears something up.

If that is what he means why not talk in these terms? The concept of
ulp is well known. What is more, one can extend the analysis (as you
well know) to the relative errors in the results of computations;
something that CBFalconer has not claimed for his "ranges".

I have already suggested that he might have intended a range that
expresses, roughly, the concept of 1/2ulp (see the messages about the
boundary between numbers that convert to x and those that convert to
its successor) and I was told in no uncertain terms that this was not
what was intended.

Now all I am hoping for is to get a direct answer to my question about
one example system.

CBFalconer · May 21, 2009

Joe said:
No, it doesn't. The value of EPSILON is used to compute what?
What do you mean you prefer the msb of the mantissa be the sign bit? Who
asked your preference? The representation of your computer and mine put
the sign bit at b31 (b63 for double) and use b23 (or b52) as the lsb of
the exponent. That's the way it is.

Or are you making this up as we go along?

Please just read the previous answers carefully.

BTW, there is NO such thing as a mantissa. That is for
logarithms. FP systems have a significand.

CBFalconer · May 21, 2009

Keith said:
.... snip ...

Wrong. 0.333333333333333314829616256247390992939472198486328125 is
not an approximation.

Not so. Look in float.h, and read the value of DBL_DIG. That will
show you that all your digits after the last of the succession of
3s are junk, with the authority of the C standard behind it. Form
the values of xmax and xmin for x=that value, and that will give
you the range of values that fp-object can represent. They will
include the value 1/3. Tuck the value of xmax into another
fp-object, and then calculate its range. It will not include 1/3.
The same goes for xmin. Create any value between xmax and xmin,
non-inclusive, and tuck it into another fp-object. That object
will be identical with x, and its range will include 1/3, but it
obviously isn't representing 1/3, because that is not what you put
into it (assuming you didn't pick 1/3 as the value to try). Herein
1/3 means the real value created by dividing 1 by 3 (not
fp-dividing).

Note that you can stuff any number between xmax and xmin
(non-inclusive), including 1/3, into a fp-object and get the same
result. So I am not fussy about which you pick. But if you are
going to add/subtract significands (after reinserting the leading
1, and shifting to match exponents) do it in binary form so you can
see what matters.

CBFalconer · May 21, 2009

Keith said:
Yes. 1.0 and 3.0 are of type double, not of any integer type.

Remember, 1.0/3.0 was presented as a C expression.

1.0 is an integer. 3.0 is an integer. I normally write them as 1
and 3. I am not specifying fp division, nor int division, just
arithmetic. I don't care where you get the value 1/3. I do care
what happens to it when stuffed into a fp-object.

CBFalconer · May 21, 2009

Keith said:
No.

If it was so bloody obvious, why didn't you just say so in the
first place? And what happened to your claim that "You are not
going to be able to form the fp-value foo"?

You couldn't. Didn't I just show that? You formed 1.0 in the
fp-object. At any rate, I gather that you understand that now.

I AM getting impatient.

CBFalconer · May 21, 2009

Keith said:
You snipped the majority of my article, in which I asked you several
specific questions. You did so without marking the snippage,
which I find rude.

I snipped the rest because, as I stated, I was not going to change
nomenclature again.

Given your definitions of xmin and xmax, I do not care about them.
I find your definitions arbitrary and incoherent.

Therefore I didn't want to thrash through another set of names,
attaching the (probably) wrong meanings to them.

I asked about the bounds (real numbers) of the range represented
by the FP number x. In text that you ignored, I suggested calling
these bounds x_lower and x_upper. If a floating-point number
represents a real range, surely it makes sense to ask about the
bounds of that range. I'd say it makes very little sense to
talk about the range without considering the bounds that define it.

Here, I'll copy-and-paste my questions again:

[BEGIN QUOTE]

In your model, each floating-point number (call it an
"fp-object-value" if you like) represents a range of real numbers,
correct? If x is a floating-point number, then the real values xmin
and xmax are the lower and upper bounds of that range. Likewise for
y, ymin, and ymax.

If you want to use the names {x,y}{min,max} for some other purpose,
we can refer to the range bounds as x_lower, x_upper, y_lower, and
y_upper.

Here I don't want to worry about what means what.

Let's be clear. x is a floating-point number. In your model, x
represents a range of real numbers. I'm calling the bounds of that
range x_lower and x_upper.

And I called them xmin and xmax. They are NOT included in the
range. Their only purpose is to be jammed into a fp-object to
force generation of a y > x, or a z < x. They have been calculated
with knowledge of the bit values of the significand. See the 4 bit
significand example.

If we install any value greater than xmin and less than xmax into a
fp-object, we get a copy of x. ymin is less than xmax, so it
generates an x. xmax is greater than ymin, so it generates a y.

If x and y are consecutive representable floating-point numbers
(nextafter(x, +INFINITY) == y), then one of the following must be
true:

(1) x_upper == y_lower, and the ranges meet at just one point
(2) x_upper < y_lower, and there's a gap between the ranges
(3) x_upper > y_lower, and there's an overlap between the ranges

[END QUOTE]

So, which is it?

None, because you are misusing xmax and xmin.

CBFalconer · May 21, 2009

Ben said:
So what are the ranges of the numbers in question? I have been
trying to get this simple information from you.

I don't dare try to do the calculations myself because I don't
know how you interpret the rounding of 1-EPSILON nor do I know
if the calculation is to be all floating point, or exact up
until the final rounding.

I don't understand your problem. In tinyfloat, EPSILON is
precisely 1/16. This is the weight of the bit that will be rounded
in preparing a new fp-object value.

....

So, can you please tell me the ranges you are thinking of for the
numbers I gave:

  f      range of f

And these are of the form 1.0+something.

  9/8    1 + 1/8          In each case the EPSILON is 1/16.
 10/8    1 + 1/4          It only changes when x is an exact
 11/8    1 + 1/4 + 1/8    power of two.

Those ranges are from < f+EPSILON down to > f-EPSILON

Think about WHY EPSILON changes at the power of two. It has to do
with the rounding in computing the real value x*(1+EPSILON) when
that expression is handed to the fp-system.

You know the rounding mode need not be fixed in C, so the ranges
depend on the rounding mode in effect at some time; but what time? Does
the meaning of a float change retrospectively when the rounding mode
changes or do I have to re-store it to make it "mean" the new range?
Is x = x; enough to do this?

Somewhere I specified the rounding. If the dropped bit is a 1,
increase the significand. If a 0, don't increase.
