Float comparison

C

CBFalconer

Dik T. Winter said:
I think what CBF has learned back in 1949 was the use of the
"ulp" (although I do not think that term was in use at that time,
because I think Wilkinson coined the term in the 1950's).

Given a floating-point number in a certain representation.
Assume no hidden bits (that makes it a bit more easier to
understand, but it can be adjusted to work with hidden bits).
One "ulp" of that floating point number is the value you get
when the least significant digit is set to 1 and all other digits
to 0 (this also works if the f-p system is not binary). ("ulp"
means "unit of the least significant position".)

I am worrying about the bit that controls the rounding, not the lsb
of the significand. That control bit is about to be dropped, and
the rounding makes the significand as close as possible to the
desired result.

I never went through any formal training in fp-systems. I built my
first one in 1965, and that was in hardware, which made it harder
to modify. Later I build at least two different ones, the last was
used in embedded systems for about 20 years and continuously
improved.

You can see the machine with the 1965 system at:

<http://cbfalconer.home.att.net/firstpc/>

The 1973 or so system you will have to get from Intel. It won a
prize when submitted for the 8080. It wasn't very good. The last
was published in DDJ, about 1979. It was good, but got quite a few
improvements since publication. I lost the source in a disk crash,
combined with theft of backup floppies. I have somebodies retype
of the DDJ article, but I know there are errors. Those last two
are assembly source.
 
C

CBFalconer

Dik T. Winter said:
You may think so, but it is false on current processors. The
rounding rules are as follows:
(1) if the mathematical result is more than halfway two successive
representable numbers, the result will be the larger of those.
(2) if the mathematical result is less than halfway two successive
representable numbers, the result will be the smaller of those.
(3) if the mathematical result is exactly halfway two successive
representable numbers, the result is the number with a least
significant bit of 0.
Note the "exactly halfway". This means that *all* bits in the
mathematical result count. So, in binary, using a four bit
mantissa the rounding is as follows (first column mathematical
result, second column the fp result):

1101.1001 1110
1101.0111 1101
1101.1000 1110
1110.1000 1110

I would have the same results except for the last entry. It
certainly eases some calculations, but fouls the statistics.
 
K

Keith Thompson

CBFalconer said:
Not so. Look in float.h, and read the value of DBL_DIG. That will
show you that all your digits after the last of the succession of
3s are junk, with the authority of the C standard behind it.

Oh? Does the standard use the word "junk"?

My opinion, as you should know by now, is that a floating-point value
represents a single real value. On my system, given:

double x = 1.0/3.0;

that value happens to be exactly
0.333333333333333314829616256247390992939472198486328125 . It is not
0.333333333333333314829616256247390992939472198486328124 , neither is it
0.333333333333333314829616256247390992939472198486328126 .

Now if I write:
double x = 0.333333333333333314829616256247390992939472198486328125;
then a certain 64-bit FP value will be stored in x. There are a
number of other constants I could put there that would cause the same
FP value to be stored, including 0.3333333333333333 .

But 0.333333333333333314829616256247390992939472198486328125 (or
equivalent) is unique in that no rounding is performed when the value
is stored.
Form
the values of xmax and xmin for x=that value, and that will give
you the range of values that fp-object can represent.

Oh, is that what xmax and xmin mean? You recently gave a fairly long
explanation that seemed inconsistent with that.

[snip]
Note that you can stuff any number between xmax and xmin
(non-inclusive), including 1/3, into a fp-object and get the same
result. So I am not fussy about which you pick. But if you are
going to add/subtract significands (after reinserting the leading
1, and shifting to match exponents) do it in binary form so you can
see what matters.

I know what matters, and I don't need to use binary.

But if you'd prefer hexadecimal,
0.333333333333333314829616256247390992939472198486328125
is equivalent to
0x1.5555555555555p-2 .

Note that both of these are numeric representations. I suspect you
wanted a binary representation of the sign, exponent, and significand,
but it's not necessary to go to that level of detail. An FP value
represents a number.
 
K

Keith Thompson

CBFalconer said:
1.0 is an integer. 3.0 is an integer. I normally write them as 1
and 3. I am not specifying fp division, nor int division, just
arithmetic. I don't care where you get the value 1/3. I do care
what happens to it when stuffed into a fp-object.

You don't get to change the context to suit your argument.

The expression 1.0/3.0 appeared as a C expression. If you want to
talk about "just arithmetic", please take it to sci.math. We're
talking about the semantics of C floating-point.

If you're not specifying fp division or int division then you are
off-topic.

1.0 and 3.0 are expressions of type double. Do I need to cite the
standard to prove it?
 
K

Keith Thompson

CBFalconer said:
I don't understand your problem. In tinyfloat, EPSILON is
precisely 1/16. This is the weight of the bit that will be rounded
in preparing a new fp-object value.

I thought that 1 and 1+1/8 were consecutive tinyfloat values, i.e,
nextaftert(1.0, +INFINITY) == 1.0+1.0/8.0 (where nextaftert is the
tinyfloat version of the nextafter function). Then EPSILON is 1.0/8.0
by definition.

[...]
Think about WHY EPSILON changes at the power of two. It has to do
with the rounding in computing the real value x*(1+EPSILON) when
that expression is handed to the fp-system.

EPSILON doesn't change; it's a constant. What you call epsilon does
change.
 
K

Keith Thompson

CBFalconer said:
I snipped the rest because, as I stated, I was not going to change
nomenclature again.

If you snip quoted text, *please* mark it. How difficult is it to
type "[...]"?
Given your definitions of xmin and xmax, I do not care about them.
I find your definitions arbitrary and incoherent.

Therefore I didn't want to thrash through another set of names,
attaching the (probably) wrong meanings to them.
[...]
Here, I'll copy-and-paste my questions again:

[BEGIN QUOTE]

In your model, each floating-point number (call it an
"fp-object-value" if you like) represents a range of real numbers,
correct? If x is a floating-point number, then the real values xmin
and xmax are the lower and upper bounds of that range. Likewise for
y, ymin, and ymax.

If you want to use the names {x,y}{min,max} for some other purpose,
we can refer to the range bounds as x_lower, x_upper, y_lower, and
y_upper.

Here I don't want to worry about what means what.

Um, isn't that the whole point of this interminable thread?
And I called them xmin and xmax.

So xmin and xmax, both of which are real numbers, *are* the bounds of
the real range that is represented by the FP number x? Is that really
what you're saying?
They are NOT included in the
range.
Ok.

Their only purpose is to be jammed into a fp-object to
force generation of a y > x, or a z < x. They have been calculated
with knowledge of the bit values of the significand. See the 4 bit
significand example.

And now you've gone off on a tangent.

The purpose of xmin and xmax is to define the range of values
represented by x. They are fundamental to your conceptual model.
If we install any value greater than xmin and less than xmax into a
fp-object, we get a copy of x. ymin is less than xmax, so it
generates an x. xmax is greater than ymin, so it generates a y.
If x and y are consecutive representable floating-point numbers
(nextafter(x, +INFINITY) == y), then one of the following must be
true:

(1) x_upper == y_lower, and the ranges meet at just one point
(2) x_upper < y_upper, and there's a gap between the ranges
(3) x_upper > y_upper, and there's an overlap between the ranges

[END QUOTE]

So, which is it?

None, because you are misusing xmax and xmin.

Huh???

You've agreed that xmin and xmax are the bounds of the range
represented by the FP number x, right? And ymin and ymax are the
bounds of the range represented by y, where nextafter(x, +INFINITY)
== y.

So given these real numbers xmax and ymin, you're now saying that
*none* of
xmax == ymin
xmax < ymin
xmax > ymin
is true? Or are you saying that the statements about the ranges do
not follow from those relationships?
 
B

Beej Jorgensen

You guys really need to start with your common ground and work from
there. Set up some definitions to set the context for the discussion
(e.g. you should be able to agree if 1.0 is an integer or not, but only
once you know the context!)

It seems you are only diverging at this point.

-Beej
 
G

gwowen

For all those things, just substitute extreme range values for the
input, apply the function (such as multiplication) and see the
extremes of the answer.  But don't forget to put that calculated
extreme into the fp-value, so that it becomes a fp-object value.

There are two extremes. Which one should I store?

You continue to evade the question with your cod-philosophical
non-answers.

Answer the question. When
x = 1.0;
y = 1.0;
z = x * y;

Does z represent the same range as x and y?
 
F

Flash Gordon

CBFalconer said:
Look at what the foo becomes when all bits meaning something less
than 1/16 are dropped, and it is then rounded on the basis of the
LS bit remaining. It becomes precisely equal to ymin, which when
stored creates an x value. So it prints as 1.0. Isn't that
obvious?

So you have a value greater than xmax (the upper limit of your range for
the fp-value 1.0) which when converted to an fp-value gets converted to
1.0, i.e. a range which does not include it?

Just to restate, we have the range around 1.0 only goes up to 1 + 1/8,
and a value greater than 1 + 1/8 which you have just stated gets
converted to 1.0.

This, as far as I can see, meens that your model has all the problems
you associate with the model everyone else has been expounding, the one
in the C standard with each fp-value representing exactly one real value
and there being an infinite number of real numbers which cannot be
represented. In your model you still have an infinite number of real
numbers which cannot be represented (the number above being one of
them), but instead get rounded in to one of your ranges, so where is its
advantage?
 
F

Flash Gordon

CBFalconer said:
I don't recall that, and as a concession to my failing short-term
memory I am going to stick with my definitions.

I will track down the message ID where I first introduced the terms if
you want. I still have the original post somewhere in my sent box.
They are values that, when stored in an fp-object, will produce the
adjacent (to x, or y) fp-object-value. They are the smallest (or
largest) such value. That is their only purpose.

They don't seem to work for that. See the messages with actual numbers
in them.
 
F

Flash Gordon

Beej said:
You guys really need to start with your common ground and work from
there.

We have been trying to establish what CBF's model is by trying to get
him to state it clearly so we can actually find such a point.
Set up some definitions to set the context for the discussion

I did, and CBF changed the meaning of the terms xmin, xmax, ymin, ymax I
defined in his first reply and then proceeded to not answer the question
I asked using the terms I explicitly defined.
(e.g. you should be able to agree if 1.0 is an integer or not, but only
once you know the context!)

It seems you are only diverging at this point.

Feel free to see if you can get any form of consistent useful answers.
 
F

Flash Gordon

CBFalconer said:
Flash said:
CBFalconer said:
Keith Thompson wrote:
[...]
Alright, this has already gone over the calculation of xmax and
then generating y. Summarized:

x = 1.0
xmax = 1.0 + 1/16 (= x + EPSILON = x*(1+EPSILON) = x*(1+1/16)
y = 1.0 + 1/8

now we calculate ymin by using y*(1-EPSILON). Substute the value
of y:

ymin = (1.0+1/8)*(1-1/16)

This is NOT the same as xmax. It differs by 1/8 * 1/16.

I hope this answers your question.
And here we have a result that I have difficulty believing is
what you intended.

Let's put all this in decimal notation, so we can compare
things more easily:

x = 1.0
xmax = 1.0625
ymin = 1.0546875
y = 1.125
I don't find the decimals easier.
They make it easier to see the gap. However...

Given tinyfloat is the type described above and the line of C
tinyfloat z = 1.0 + 1.0/16.0 - (1.0/8.0 * 1.0/16.0) /2.0;
What value is stored in z and why? This should be the same as the number
you would get if you implemented it.

To save you calculating it, that is half way between your xmax and ymin.
It is easily expressible as a decimal number, the one Kieth used, i.e.
tinyfloat z = 1.05859375;
Or using real arithmetic as (xmax+ymin)/2.

But it can't be used in the tinyfloat system.

You CAN type the above line of C code (or the equivalent for double) in
to a text file. You CAN then pass that C code through a compiler and
produce an executable. And YOU are the person saying that the real value
of one third can exist in a C program. You cannot have it both ways and
say that one third can exist but a number that can be exactly typed in
decimal cannot.
For values in the
range from 1.0 to less than 2.0 the least significant bit
represents 1/8. The next bit, which will control rounding,
represents 1/16. All other bits represent smaller quantities, and
are discarded. Remember, this is a digital system.

We already know that. Remember that I have already stated that I have
implemented some floating point arithmetic in assembler. Just to be
clear, this was on a processor which did not have a floating point
processor (although it did have a couple of assembler instructions to
help in doing floating point arithmetic, basically a normalise and a
denormalise instruction).
So your z is handled as 1 + 0. i.e. as x. We can't tell the
difference between:

1000000000
and 1000011111
^_ because this bit controls rounding.

So you are saying that a value greater than the upper bound of your
range for 1.0 which gets converted to 1.0

Now please provide an exact numerical definition of the range of number
which when represented in C code will to converted to 1.0 and explain
why this is not your range. We know that this range is larger than the
range xmin to xmax because we have already found a value greater than
xmax which fits in to this range.
 
F

Flash Gordon

CBFalconer said:
Keith Thompson wrote:
.... snip ...

Almost. EPSILON is the smallest amount that, when added to 1.0 in
the fp-system, results in the next higher fp-value.

That is explicitly NOT how FLT_EPSILON, DBL_EPSILON and LDBL_EPSILON are
defined in the C standard. The C standard defined them as, "the
difference between 1 and the least value greater than 1 that is
representable in the given floating point type, b**(1−p)" (where **
represents, raised to the power of). The number you have to add to 1.0
to get an larger floating-point value could be considerably smaller
(with round-to-nearest it would be approximately half the value of
xxx_EPSILON. Unless, of course, you are using some strange meaning of
the phrase, "the smallest amount that, when added to" that I was not
previously aware of.

Believe me, I haven't been trying to confuse things. I have been
trying to straighten out when I am talking about reals,
fp-object-values, EPSILON, epsilon, etc.

Dik, Keith and I all understand floating point arithmetic. What we
cannot understand is your model because the more you say about it the
less sense it seems to make.
 
R

Richard Tobin

"A pint is a pound the world around except outside the USA".

And even in the USA, it's only true at 95C. US fluid ounces of water
do not weigh an ounce at room temperature.

-- Richard
 
A

Antoninus Twink

You claim it's a "basic idea". Can you explain it with mathematical
precision and yet in a way that a basic person like me can understand?
I suspect you'll find that the task is harder than it looks.

On the off-chance that you really are interested in expanding your mind,
I'll give it a go.

First of all, the term "measure" is a bit arcane - whenever you see it,
think "way of defining the integral of a function". The basic problem
is: we have an intuitive notion of what the integral should be, namely
the (signed) area under the graph of the function. How can we turn this
into a rigorous mathematical definition? (I hope there's no need to
convince you that calculus is an extremely useful thing in all sorts of
engineering etc., so being able to put integration on a solid foundation
is important.)

How would you work out the area under a curve? One way would be to draw
the curve on squared paper, and count the number of squares in the
region under the curve. Of course, the curve may actually pass through
some of the squares, so in that case you'd need to decide whether to
include those squares in your count or not. This would give you an upper
bound (when you include the extra squares) and a lower bound (when you
don't include them) for the area under the curve: you just multiply the
number of squares you counted by the area of each square.

Now you could do the same thing with more finely divided squared papers.
This would give you another, better, upper and lower bound for the area
you're interested in. And then you could use even finer paper again, and
get better bounds, and so on. And you could imagine taking a limit, as
the squares get "infinitely fine". If the limiting upper and lower
bounds agree, you could consider that to be a reasonable value for the
area under the curve.

This is essentially the "Riemann integral".

It works just fine if the curve you're trying to find the area under is
nice enough (say the graph of a continuous function on a bounded
interval). But it has some shortcomings.

For example, consider the function f on the interval [0,1] that takes
value 1 on rational numbers, and 0 on irrational numbers. Intuitively,
there should be no area under this function, since the rationals are "so
sparse" in the reals. But because any interval (in this case, the
horizontal side of any of your squares) contains a rational, any squared
paper you use to measure the area will give you an upper bound of 1, and
a lower bound of 0. As these disagree (in the limit as the squares get
finer), the function f is not integrable in the Riemann sense.

OK, you may say, but f seems like a pretty artificial sort of function
that's been constructed to screw up the definition, and it would never
actually arise in real life. But here's why f is interesting. Enumerate
the rationals in [0,1], say as q_0, q_1, ..., and consider the functions
f_n, where f_n(q_i) = 1 for i=0,...,n and f_n is zero everywhere else.
Each of these functions is Riemann integrable with integral 0 (exercise
for the reader), and the f_n converge to f in an obvious sense
("pointwise convergence"). So we've taken a limit of integrable
functions, and hit something non-integrable!

That ties our hands if we're doing calculus. We'd really like to be able
to "swap" limits and integrals under mild assumptions, and say

lim Int f_n(x) dx = Int lim f_n(x) dx (*)
n->infty n->infty

The above example shows that the Riemann integral is bad for this sort
of thing: not only do the two sides not have to be equal, but even if
the f_n are nice functions, the right hand side need not even exist!

So the point of Lebesgue measure is to give a more general definition of
the integral of a function with better properties with respect to limits
and other things. It should agree with our intuitive squared paper
definition for "nice" functions, but it should be robust enough to cope
with limiting constructions we want to perform when doing calculations.

As you've found by reading the (not exactly enlightening) Wikipedia
article, it's a bit complicated to construct Lebesgue measure, but
that's because once you /have/ constructed it, it's very powerful, and
you don't get something for nothing in this world. In fact, if you
accept the above as motivation, once you strip away the technicalities
you'll find that the Lebesgue integral is defined exactly in such a way
as to make (*) true in the case when (f_n) is a non-decreasing sequence
of positive integrable functions. (And in that case, (*) is called the
Monotone Convergence Theorem.)
 
G

gwowen

You claim it's a "basic idea". Can you explain it with mathematical
precision and yet in a way that a basic person like me can
understand? I suspect you'll find that the task is harder than it
looks.

If we intuitively understand the length of an interval,
the measure of an arbitrary set S is

"What's the length of the smallest interval that I can
slice it up and move around the pieces to form a set
of which S is a subset"

Proviso: you're only allowed to form pieces which
are themselves intervals.

Considering only the real line.

Firstly, define the measure of an interval [A,B) as B-A

Suppose we have subset of the real line S.
We can always create a set T, made up of a countably
many finite disjoint intervals [A_i,B_i) such that S is a
subset of T.

Define the measure of T to be sum of all the (B_i - A_i)
(which might be infinite).

Define the measure of S to biggest number less than the
measures of all T's that cover S. (The infinum, technically)

For example:

So, if S is the countable set {p_n}, pick an arbitrarily
small number e>0, and define T_e as:
For i from 1 to infinity -- if p_n is not already covered,
cover p_n with the interval [p_n - e/2^n, p_n + e/2^n)

So S is a subset of T_e for any e.
The measure of T_e is just the total length these intervals
and is =< e*(1 + 1/2 + 1/4 + ...) = 2e
So we can cover S by a T of arbitrarily small measure.
Therefore S has measure 0.
 
G

Guest

For all those things, just substitute extreme range values for the
input, apply the function (such as multiplication) and see the
extremes of the answer.

call the above CBF's theorum (CBFT)
 But don't forget to put that calculated
extreme into the fp-value, so that it becomes a fp-object value.

I don't follow that bit


so by the CBFT
z_min = x_min * y_min;
and
z_max = x_max * y_max;

Now consider
double z1 = 1.0;
double z2 = 1.0 * 1.0;

z1 and z2 represent different ranges. Are they equal?


--
Nick Keighley

A good designer must rely on experience, on precise, logic thinking;
and on pedantic exactness. No magic will do.
(Wirth)
 
G

Guest

On 22 May 2009 at  6:26, Richard Heathfield wrote:

Lebesgue measure

I read the Wiki page and it seemed fairly understandable (but I have
fragments of a mathematical education).

On the off-chance that you really are interested in expanding your mind,
I'll give it a go.

First of all, the term "measure" is a bit arcane - whenever you see it,
think "way of defining the integral of a function". The basic problem
is: we have an intuitive notion of what the integral should be, namely
the (signed) area under the graph of the function. How can we turn this
into a rigorous mathematical definition? (I hope there's no need to
convince you that calculus is an extremely useful thing in all sorts of
engineering etc., so being able to put integration on a solid foundation
is important.)

How would you work out the area under a curve? One way would be to draw
the curve on squared paper, and count the number of squares in the
region under the curve. Of course, the curve may actually pass through
some of the squares, so in that case you'd need to decide whether to
include those squares in your count or not. This would give you an upper
bound (when you include the extra squares) and a lower bound (when you
don't include them) for the area under the curve: you just multiply the
number of squares you counted by the area of each square.

Now you could do the same thing with more finely divided squared papers.
This would give you another, better, upper and lower bound for the area
you're interested in. And then you could use even finer paper again, and
get better bounds, and so on. And you could imagine taking a limit, as
the squares get "infinitely fine". If the limiting upper and lower
bounds agree, you could consider that to be a reasonable value for the
area under the curve.

This is essentially the "Riemann integral".

It works just fine if the curve you're trying to find the area under is
nice enough (say the graph of a continuous function on a bounded
interval). But it has some shortcomings.

For example, consider the function f on the interval [0,1] that takes
value 1 on rational numbers, and 0 on irrational numbers. Intuitively,
there should be no area under this function, since the rationals are "so
sparse" in the reals. But because any interval (in this case, the
horizontal side of any of your squares) contains a rational, any squared
paper you use to measure the area will give you an upper bound of 1, and
a lower bound of 0. As these disagree (in the limit as the squares get
finer), the function f is not integrable in the Riemann sense.

OK, you may say, but f seems like a pretty artificial sort of function
that's been constructed to screw up the definition, and it would never
actually arise in real life. But here's why f is interesting. Enumerate
the rationals in [0,1], say as q_0, q_1, ..., and consider the functions
f_n, where f_n(q_i) = 1 for i=0,...,n and f_n is zero everywhere else..
Each of these functions is Riemann integrable with integral 0 (exercise
for the reader), and the f_n converge to f in an obvious sense
("pointwise convergence"). So we've taken a limit of integrable
functions, and hit something non-integrable!

That ties our hands if we're doing calculus. We'd really like to be able
to "swap" limits and integrals under mild assumptions, and say

lim         Int f_n(x) dx   =  Int lim       f_n(x) dx           (*)
n->infty                           n->infty

The above example shows that the Riemann integral is bad for this sort
of thing: not only do the two sides not have to be equal, but even if
the f_n are nice functions, the right hand side need not even exist!

So the point of Lebesgue measure is to give a more general definition of
the integral of a function with better properties with respect to limits
and other things. It should agree with our intuitive squared paper
definition for "nice" functions, but it should be robust enough to cope
with limiting constructions we want to perform when doing calculations.

As you've found by reading the (not exactly enlightening) Wikipedia
article, it's a bit complicated to construct Lebesgue measure, but
that's because once you /have/ constructed it, it's very powerful, and
you don't get something for nothing in this world. In fact, if you
accept the above as motivation, once you strip away the technicalities
you'll find that the Lebesgue integral is defined exactly in such a way
as to make (*) true in the case when (f_n) is a non-decreasing sequence
of positive integrable functions. (And in that case, (*) is called the
Monotone Convergence Theorem.)

Seems like a sensible reply. I've just re-posted in the vain belief
that less people have me kill-filed than Mr Twink.
 
G

Guest

...but if you consider Lebesgue measure to be a "basic idea", then I
suspect that your mathematical sophistication exceeds mine by
several orders of magnitude. I took a look at a supposedly simple
explanation of Lebesgue measure, and conclude that either I'm a lot
denser than average or the writer is not very good at explaining a
"basic idea" in such a way as to make it /seem/ basic.

I think basic means "fundamental" not "easy" here.

<snip>
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,432
Messages
2,571,682
Members
48,796
Latest member
Greg L.

Latest Threads

Top