inaccurate floating point and DBL_MAX

Siemel Naran · Mar 10, 2005

About inaccurate floating point and DBL_MAX.

double x = 2;
double y = 1 + 1;
assert(x == y);

Because of inaccurate floating point representation, x and y may not be
equal (i.e. they may differ by 0.0000000001 or some other small number) on
some implementations.

But how about DBL_MAX?

double x = DBL_MAX;
f(x);
assert(x != DBL_MAX);

Is the above comparison exact, or should we write assert(x < DBL_MAX)?

Thanks.

Ron Natalie · Mar 10, 2005

Siemel said:
About inaccurate floating point and DBL_MAX.

double x = 2;
double y = 1 + 1;
assert(x == y);

Because of inaccurate floating point representation, x and y may not be
equal (i.e. they may differ by 0.0000000001 or some other small number) on
some implementations.

Actually, I've never seen such an implementation. The standard says that
if you have a value that cannot be precisely represented in a floating
point variable, then one of the two adjacent representable values is chosen.
THIS IS WHERE THE IMPRECISION COMES FROM, along with the mistaken assumption
that non-repeating decimal fractions are non-repeating binary fractions as
well.

I've never seen an implementation that can't precisely represent 1 and 2.
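
To make the point about representable values concrete, here is a minimal sketch. The helper names (sum_matches, show_stored_value, check_exact_cases) are made up for illustration and are not from the thread. Values such as 1, 2, 0.5 and 0.25 have exact binary representations, so sums of them compare exactly; a literal such as 0.1 is stored as the nearest representable double.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: true when a + b compares equal to the expected value.
bool sum_matches(double a, double b, double expected) {
    return a + b == expected;
}

// Hypothetical helper: prints the stored value with more digits than the
// literal was written with, so any rounding to an adjacent value is visible.
void show_stored_value(double value) {
    std::printf("%.20f\n", value);
}

// Operands and results here are exactly representable, so nothing is rounded.
void check_exact_cases() {
    assert(sum_matches(1.0, 1.0, 2.0));
    assert(sum_matches(0.5, 0.25, 0.75));
}
```

On the common IEEE 754 implementations, sum_matches(0.1, 0.2, 0.3) is the kind of call that typically returns false, and show_stored_value(0.1) shows digits beyond the ones written in the literal.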

But how about DBL_MAX?

double x = DBL_MAX;
f(x);
assert(x != DBL_MAX);

Since DBL_MAX is of type double, it's already got a double representation.
There's no imprecision. You can test it for equality (provided you haven't
converted it to some other type or done calculations with it).
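
A small sketch of that comparison, assuming f takes its argument by reference (the thread never shows f, so the body below is a hypothetical stand-in that steps x down to the next representable double):

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Hypothetical stand-in for the thread's f(x): moves x one representable
// step toward zero.
void f(double& x) {
    x = std::nextafter(x, 0.0);
}

// A double initialized from DBL_MAX and left alone compares equal to it.
bool unchanged_compares_equal() {
    double x = DBL_MAX;
    return x == DBL_MAX;
}

// After f changes x, both forms of the test agree that x is no longer DBL_MAX.
bool changed_compares_unequal() {
    double x = DBL_MAX;
    f(x);
    return x != DBL_MAX && x < DBL_MAX;
}
```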

Jerry Coffin · Mar 11, 2005

Siemel said:
About inaccurate floating point and DBL_MAX.

double x = 2;
double y = 1 + 1;
assert(x == y);

Because of inaccurate floating point representation, x and y may not
be equal (i.e. they may differ by 0.0000000001 or some other small
number) on some implementations.

This is incorrect for a couple of reasons. First of all, '1+1' is an
integer expression, so it is done as an integer computation and then the
result is converted to a double. In other words, in both cases you're
creating 2 (as an integer) and then converting that integer to a double.
The result is clearly the same in both cases.

Even if you changed it to something like:

double x = 2.0;
double y = 1.0 + 1.0;
assert (x==y);

the assertion still can't fail on any conforming implementation of C++.
Even for a floating point type, there is a range of integers that must
be represented exactly, and 1 and 2 fall (well) inside of that range.
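
As a rough illustration of that range, the sketch below derives a limit from std::numeric_limits<double>::digits. The function names are invented for this example, and the exact value of the limit depends on the implementation (with the usual 53-bit mantissa it is 2 to the power 53).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// radix^digits: integers up to this magnitude fit in the mantissa exactly.
double exact_integer_limit() {
    return std::scalbn(1.0, std::numeric_limits<double>::digits);
}

// Counts upward in double steps of 1.0 and checks each value against the
// same count done in int arithmetic.
bool integers_exact_up_to(int n) {
    double d = 0.0;
    for (int i = 0; i < n; ++i) {
        if (d != static_cast<double>(i)) return false;
        d += 1.0;
    }
    return true;
}

// The example from the post above, with the typo fixed.
void check_thread_example() {
    double x = 2.0;
    double y = 1.0 + 1.0;
    assert(x == y);
}
```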

But how about DBL_MAX?

double x = DBL_MAX;
f(x);
assert(x != DBL_MAX);

Is the above comparison exact, or should we write assert(x < DBL_MAX)?

DBL_MAX is (by definition) the largest possible double. That means
x!=DBL_MAX and x<DBL_MAX mean exactly the same thing unless you want to
deal with infinity, NaNs, etc. As far as precision goes, however,
DBL_MAX is already a double value, so it's not rounded during
assignment to a double.
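
For the infinity and NaN cases, a minimal sketch of how the two tests come apart, assuming the implementation provides both values (std::numeric_limits<double>::has_infinity and has_quiet_NaN are true). The Comparison struct and compare_with_max are names made up for this example.

```cpp
#include <cassert>
#include <cfloat>
#include <limits>

// Results of the two candidate tests for one value.
struct Comparison {
    bool not_equal;  // x != DBL_MAX
    bool less;       // x <  DBL_MAX
};

Comparison compare_with_max(double x) {
    return Comparison{x != DBL_MAX, x < DBL_MAX};
}

// For an ordinary finite value below DBL_MAX the two tests agree.
void check_finite_case() {
    Comparison c = compare_with_max(1.0);
    assert(c.not_equal && c.less);
}
```

Passing std::numeric_limits<double>::infinity() or quiet_NaN() gives not_equal true but less false, so the choice between != and < only matters if f can produce those values.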
