Trouble with integer floating point conversion


jacob navia

James said:
Well, actually, the numerical limits section does specify the precision.
However, getting an inaccurate value with high precision is not
something most numerical analysts would be happy about.

It is completely meaningless.
 

rembremading

Thanks for your replies. They help a lot.
To conclude:

double any_function(void);   /* some function returning a computed double */

(1) double temp1, temp2;
(2) temp1 = any_function();
(3) temp2 = temp1 - any_function();

can lead to a result temp2 != 0.0, because temp1 is stored in memory and
the result of any_function() in (2) is therefore rounded to the closest
64-bit number, whereas the result of any_function() in (3) is kept in a
register only, such that the difference in (3) can have a finite outcome.

Because of the precision discussion, I would like to point out that in terms
of C doubles this result of ~1e-17 *is* effectively equal to 0.0 (in the
context of the subtraction of two order-1 numbers).
In this sense it is not an issue of missing precision in the '-' operation.
I think that the comparison
if(floating_point_number_1==floating_point_number_2)
makes little sense if the floating-point variables are the result of some
calculation, and I am doing this comparison only for performance
reasons.

On the other hand in

double temp1, temp2, temp3;
temp1 = any_function();
temp2 = any_function();
temp3 = temp1 - temp2;

temp3 == 0.0, because temp1 and temp2 are both stored in memory and the
result of any_function() is therefore rounded in the same way.
Is this correct?

Unfortunately, when I define temp1 and temp2 as register variables,
register double temp1, temp2;
then, surprisingly, I still get a finite temp1 (of course, register
variables are not guaranteed to actually end up in registers).

The solution for my concrete problem will probably be to choose the
numerical constants such that all numbers occurring are exactly
representable, which should be possible.
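
For instance, a hypothetical grid function built only from dyadic constants
(the name and the values below are made up for illustration, not taken from
the thread) keeps every grid point exactly representable, so equality tests
behave as expected:

double grid_point_exact(int i)
{
    /* 0.125 = 1/8 and 0.25 = 1/4 are exact in binary floating point,
       so the sum below is exact for |i| well below 2^50. */
    return 0.125 + 0.25 * (double)i;
}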

-Andreas
 

jacob navia

rembremading wrote:
[snip]

Can you tell me whether the result is better if you use the following program?

This function sets the FPU so that results are rounded to 64-bit double
precision instead of 80-bit extended precision. (If I have no bugs... )

void SetPrecision64(void)
{
    /* x86-specific: rewrites the x87 precision-control bits so that
       results are rounded to a 53-bit significand (double) instead of
       the 64-bit significand of extended precision.  Running machine
       code out of a data array like this also assumes the data region
       is executable. */
    void (*fn)(void);
    unsigned char code[] = {
        0x50,                          // push %eax
        0xd9,0x3c,0x24,                // fnstcw (%esp,1)
        0x8b,0x04,0x24,                // mov (%esp,1),%eax
        0x0f,0xba,0x34,0x24,0x08,      // btrl $0x8,(%esp,1)
        0x66,0x81,0x0c,0x24,0x00,0x02, // orw $0x200,(%esp,1)
        0xd9,0x2c,0x24,                // fldcw (%esp,1)
        0x59,                          // pop %ecx
        0xc3                           // ret
    };
    fn = (void (*)(void))code;
    fn();
}
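
For comparison, here is a sketch of the same idea written against glibc's
<fpu_control.h>, similar to the gccintro page cited later in this thread.
It assumes glibc on x86 hardware and is not portable C; the macro and
constant names are glibc's, not standard ones.

#include <fpu_control.h>

static void set_fpu_double_precision(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);                           /* read the x87 control word */
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE; /* select the 53-bit significand */
    _FPU_SETCW(cw);                           /* write it back */
}

Calling such a function once at program start would have the same effect as
the machine-code version above.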
 

James Kuyper

jacob said:
It is completely meaningless.

3.9975222353 is a very precise but inaccurate answer to the question
"what is 2.0 + 2.0?".

The numerical limits section describes and puts minimum requirements on
the precision, while section 5.2.4.2.2 makes explicit the fact that the
standard imposes no requirements on the accuracy.
 

Ben Bacarisse

rembremading said:
(1) double temp1, temp2;
(2) temp1 = any_function();
(3) temp2 = temp1 - any_function();

can lead to a result temp2 != 0.0, because temp1 is stored in memory and
the result of any_function() in (2) is therefore rounded to the closest
64-bit number, whereas the result of any_function() in (3) is kept in a
register only, such that the difference in (3) can have a finite
outcome.

<obsession type="favourite">
You mean non-zero rather than finite both here and later on.
</obsession>
 

Dik T. Winter

>
> If there isn't even an accepted standard that is enforced, how
> can you guarantee anything.

Have a look at the Ada standard, where that is precisely the case. The
C standard only goes halfway there: it properly defines a model
for floating-point representation, but gives no guarantee about the
precision within that model, and that is something the Ada standard
does do. What happens in Ada systems is that, if the maximal model for
the floating-point hardware does not meet the precision requirements,
the model is reduced.
 

Thad Smith

rembremading said:
> The following piece of code has (for me) completely unexpected behaviour.

This is cross-posted to comp.std.c. See the final paragraph for a
standards question. This is the second posting attempt.
> (I compile it with gcc-Version 4.0.3)
> Something goes wrong with the integer to float conversion.
> Maybe somebody out there understands what happens.
> Essentially, when I subtract the (double) function value GRID_POINT(2) from
> a variable which has been assigned the same value before this gives a
> non-zero result and I really do not understand why.
>
> The program prints
> 5.000000000000000000e-01; -2.775557561562891351e-17;
> 0.000000000000000000e+00
>
> And the comparison
> if(temp1==GRID_POINT(2))
> has negative outcome.
>
> When I replace
> ((double)(i)) by 2.0
> everything behaves as expected. So the output is
> 5.000000000000000000e-01; 0.000000000000000000e+00; 0.000000000000000000e+00
>
> But: even if the integer to float conversion is inexact (which, I think,
> should not be the case) something like
> temp1 = GRID_POINT(2);
> temp3 = temp1-GRID_POINT(2);
> should still result in temp3==0.0, whatever function GRID_POINT does.
>
> What do You think?
>
> Thank you!
>
> ---------------------------------------------------
> #include <stdio.h>
>
> double GRID_POINT(int i);
>
> double GRID_POINT(int i)
> {
> return ( 0.1 + ( (80.1-0.1)/(400.0) )*((double)(i)) );
> }
>
> int main (void) {
>
> double temp1, temp2, temp3;
>
> temp1 = GRID_POINT(2);
> temp2 = GRID_POINT(2);
> temp3 = temp1-GRID_POINT(2);
>
> printf("%.18e; %.18e; %.18e\n", temp1, temp3, temp1-temp2 );
>
> if(temp1==GRID_POINT(2)){
> printf("these two are equal\n");
> }
>
> return 0;
> }
> ---------------------------------------------------
In the original code, temp3 may be non-zero because of extra precision:

5.2.4.2.2:
The values of operations with floating operands and values subject to the
usual arithmetic conversions and of floating constants are evaluated to a
format whose range and precision may be greater than required by the
type. The use of evaluation formats is characterized by the
implementation-defined value of FLT_EVAL_METHOD.

The standard requires that values be rounded to their specified
precision when they are stored into an object of the type or cast to the
type (see 5.1.2.3 footnote 4). Neither of those happens to the value of
GRID_POINT(2) within the statement computing temp3.

The compiler is probably passing the evaluated value back in a floating
point register with more precision than a double variable.
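
As a sketch of how one might force that narrowing at the point of use
(this assumes an implementation that honours the C99 footnote to 6.3.1.8p2,
which says casts and assignments are still required to remove all extra
range and precision; some compilers historically needed options such as
gcc's -ffloat-store for this):

temp3 = temp1 - (double)GRID_POINT(2);   /* the cast must drop any extra range and precision */

or, equivalently, through an intermediate object:

{
    volatile double t = GRID_POINT(2);   /* the store rounds to 64-bit double */
    temp3 = temp1 - t;
}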

Regarding the effect of replacing (double)i by the constant 2, you
permit the compiler to evaluate the expression in GRID_POINT() at
compile time. The compiler probably is then propagating that constant
to the assignments in main. What is apparently different in the
compile-time evaluation is that the value computed in GRID_POINT() gets
rounded to double precision before it is used in main.

Is the compiler rounding of the constant value, compared with the
register passing with extra precision, conforming? 5.1.2.3 footnote 4
says "an implicit spilling of a register is not permitted to alter the
value." Is the compile time evaluation, vs. runtime evaluation, an
implicit spilling of a register? If not, what is an "implicit spilling
of a register"?
 

James Kuyper

jacob said:
That is fine but we are speaking about the accuracy here

While the original poster's problem was with the accuracy of the
calculations, your very first response was about the precision, not the
accuracy. An 80-bit number can hold an inaccurate value with much
greater precision than a 64-bit number; in itself that doesn't make the
number any more accurate.

The standard does say things about the precision. The values of
LDBL_EPSILON, DBL_EPSILON, and FLT_EPSILON depend very directly on the
precision of floating-point operations, and the standard sets a maximum
value for the magnitude of each of those constants, which implies a
minimum number of bits of precision.

The standard also says things about when it's permissible to use a
higher precision type for intermediate steps in calculations, and it
seems to be precisely that issue which is the actual reason for the
discrepancies that the original poster reported. See 5.2.4.2.2p8. If
FLT_EVAL_METHOD were 0, then he shouldn't have seen those discrepancies.
For any other value of FLT_EVAL_METHOD, such discrepancies are always a
possibility.
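
A small check of both points on a given implementation (a sketch; the exact
values printed are implementation-defined, but the standard caps FLT_EPSILON
at 1E-5 and DBL_EPSILON at 1E-9, and an IEEE double gives about 2.22e-16):

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* 0: evaluate to the range/precision of the type; 1: evaluate float and
       double as double; 2: evaluate everything as long double;
       -1: indeterminable. */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    printf("FLT_EPSILON = %g\n", (double)FLT_EPSILON);
    printf("DBL_EPSILON = %g\n", DBL_EPSILON);
    return 0;
}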
 

jacob navia

Mark said:
jacob said:
rembremading wrote:
[snip]

Can you tell me whether the result is better if you use the following program?

http://www.network-theory.co.uk/docs/gccintro/gccintro_70.html
(previously cited) gives a briefer version, which may be specific to
gcc, but that's what the OP has the issue with ....

Yes, the one there is better adapted for gcc.

That document CONFIRMS what I have been saying from the start.

This whole problem can probably be solved by
changing the precision of the machine!
 

rembremading

Yes, your assembler code works fine, although I don't understand what it
does :)
For my purposes, however, it will be better to choose the constants such
that they are exactly representable, as outlined below.
Every additional function call at this place in the code is unwanted
(although an inlined version of this function would be quite fast, I guess).
I would not want to activate the 64-bit computations for the entire program.
Actually, for most of it, calculations with higher precision are very
welcome, as long as they do not affect the speed.
Nevertheless it is an interesting piece of code, which might be useful in
the future.
 

CBFalconer

James said:
(e-mail address removed) wrote:
.... snip ...


Well, actually, the numerical limits section does specify the
precision. However, getting an inaccurate value with high
precision is not something most numerical analysts would be
happy about.

Please describe how you implement infinite precision in a finite
number of bits.
 

jacob navia

rembremading said:
Yes, your assembler code works fine, although I don't understand what it
does :)
For my purposes, however, it will be better to choose the constants such
that they are exactly representable, as outlined below.
Every additional function call at this place in the code is unwanted
(although an inlined version of this function would be quite fast, I guess).
I would not want to activate the 64-bit computations for the entire program.
Actually, for most of it, calculations with higher precision are very
welcome, as long as they do not affect the speed.
Nevertheless it is an interesting piece of code, which might be useful in
the future.

You need to call it ONCE when your program STARTS.

Then, all calculations are done in 64 bits only
and not in 80.

That code sets the precision field of the floating-point unit
to 64 bits, i.e. double precision.
 

rembremading

Yes, I understand that I would have to call it only once, but in the main
triply nested for loop of my program I have to do something like

res = ( (1.0+f1)*(1.0+f2)*f3*f4 - f1*f2*(1.0+f3)*(1.0+f4) );

where f1..f4 are function values (not of the GRID_POINT function), which can
take values from 1.0e-100 to 1.0e2.
In this case, presumably, the intermediate values are kept in registers,
and it is obvious that res is highly sensitive to the numerical precision
of f1..f4 and of the intermediate results of the computation.
Therefore, if I can get the computation with 80-bit precision instead of
64-bit (at the same cost), I definitely prefer the former.
Although the final result is rounded to 64-bit precision, these extra digits
can change a lot.

Because my system has dimension ~1e3 and the loop is triply nested, I cannot
use arbitrary-precision math libraries. I cannot even use long double
variables on my machine.

Now, the GRID_POINT function has to be evaluated when f1..f4 are computed.
Therefore I would have to switch 64-bit precision on (and off again) in
every single step. That is what I meant.
That code sets the precision field of the floating-point unit
to 64 bits, i.e. double precision.

I understood that, but I do not understand how the function does it
(possibly a lack of assembler knowledge).

There might also be a lack of English knowledge, which could be the reason
for misunderstandings. Sorry for that.

-Andreas
 

jacob navia

rembremading said:
Yes, I understand that I would have to call it only once, but in the main
triply nested for loop of my program I have to do something like

res = ( (1.0+f1)*(1.0+f2)*f3*f4 - f1*f2*(1.0+f3)*(1.0+f4) );

where f1..f4 are function values (not of the GRID_POINT function), which can
take values from 1.0e-100 to 1.0e2.
In this case, presumably, the intermediate values are kept in registers,
and it is obvious that res is highly sensitive to the numerical precision
of f1..f4 and of the intermediate results of the computation.
Therefore, if I can get the computation with 80-bit precision instead of
64-bit (at the same cost), I definitely prefer the former.
Although the final result is rounded to 64-bit precision, these extra digits
can change a lot.

Because my system has dimension ~1e3 and the loop is triply nested, I cannot
use arbitrary-precision math libraries. I cannot even use long double
variables on my machine.

Now, the GRID_POINT function has to be evaluated when f1..f4 are computed.
Therefore I would have to switch 64-bit precision on (and off again) in
every single step. That is what I meant.


I understood that, but I do not understand how the function does it
(possibly a lack of assembler knowledge).

There might also be a lack of English knowledge, which could be the reason
for misunderstandings. Sorry for that.

-Andreas
Your problem is a very complicated one, Andreas. Maybe you will get
better advice
in sci.math.num-analysis, a group that is dedicated to this kind
of problem. I have seen discussions about problems of this type there,
and they have a math level that goes WAY beyond what we have here in
comp.lang.c.

I think you have to rearrange the terms in your equation to make them
less sensitive to roundoff errors. I think in sci.math.num-analysis
you will find people that can help you to do that.
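
One possible rearrangement, offered only as a sketch (it is derived
algebraically here, not taken from the thread): expanding both products and
cancelling the common f1*f2*f3*f4 term gives

res = f3*f4*(1.0 + f1 + f2) - f1*f2*(1.0 + f3 + f4);

This does not remove the cancellation when the two remaining terms are
nearly equal, but it uses fewer operations and therefore fewer intermediate
roundings per evaluation.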
 

Keith Thompson

jacob navia said:
Flash Gordon wrote: [...]
Since the standard doesn't force even IEEE754, there is *nothing*
the C language by itself can guarantee the user.

(I think jacob wrote the above; the attribution was snipped.)
Are you seriously trying to claim that the only way to provide any
form of guarantee on floating point is to enforce IEEE754?
If there isn't even an accepted standard that is enforced, how
can you guarantee anything.
[...]

How could the standard guarantee ANYTHING about the precision of
floating point calculations when it doesn't even guarantee a common
standard?
Yes I am seriously saying that the absence of an enforced standard
makes any guarantee IMPOSSIBLE.
[...]
Yes, jacob, it's true that the C standard makes no guarantees about
floating-point precision. It does make some guarantees about
floating-point range, which seems to contradict your claim above that
"there is *nothing* the C language by itself can guarantee the user".
(Perhaps in context it was sufficiently clear that you were talking
only about precision; I'm not going to bother to go back up the thread
to check.)

You quoted it: "How could the standard guarantee ANYTHING about
the precision of floating point calculations..."

I was referring to jacob's earlier statement, which did not mention
precision:
| Since the standard doesn't force even IEEE754, there is *nothing*
| the C language by itself can guarantee the user.
[...]
This is an example of how the "regulars" spread nonsense just with the
only objective of "contradicting jacob" as they announced here
yesterday.

Tedious rant ignored.

So you just demonstrated that JN's paranoia is not *quite* paranoia.
Your post is exactly what he refers to (using strong words like
"nonsense", since JN gets pretty emotional in this grand war).

World will stop, and all newbies will get confused and die if you
stop arguing with Jacob, sure. You simply have to do it again and
again. Even when you don't have a point! Oh well.

Look again, I *agreed* with some of what jacob was saying; the
C standard doesn't make any guarantees about the precision of
floating-point operations.
 

Keith Thompson

Keith Thompson said:
I was referring to jacob's earlier statement, which did not mention
precision:
| Since the standard doesn't force even IEEE754, there is *nothing*
| the C language by itself can guarantee the user.

And, to be clear, we should have been referring to accuracy rather
than precision.
 

Chris Torek

... read http://www.network-theory.co.uk/docs/gccintro/gccintro_70.html

It answers your questions fairly comprehensively and gives a simple
way of disabling extended precision for the duration of a program's
execution.

It also notes, in passing, that setting the precision field in
the FPU control word does *not* affect the exponent range. This
means that the in-FPU computations are *still* not quite what
one might expect from an IEEE implementation.

I have the same note (with an example) at the
end of <http://web.torek.net/torek/c/numbers.html>.
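
A minimal sketch of the kind of effect being described (this is not the
example from that page; it assumes x87 hardware with FLT_EVAL_METHOD != 0
and the precision field set to double): the intermediate x*x below is about
1e600, which overflows a true double but still fits the extended exponent
range, so the program may print 1e300 rather than inf when the intermediate
stays in the FPU.

#include <stdio.h>

int main(void)
{
    volatile double x = 1e300;
    double y = (x * x) / 1e300;   /* intermediate may be held with extended range */
    printf("%g\n", y);            /* 1e300 or inf, depending on where x*x lives */
    return 0;
}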

As someone else noted else-thread, one can get "proper" IEEE floating
point on the x86 using gcc's "-ffloat-store" flag. This does,
however, slow down computation.
 
