32-bit IEEE float multiplication

Andy

Hi,
I don't know if this is the correct group to post this, but
when I multiply a huge floating point value by a really
small (non-zero) floating point value, I get 0 (zero) for the result.
This creates a big hole in a 32-bit timer routine I wrote.

Questions.
1. Why does this happen?
2. Are there C macros/functions I can call to tell me
when two non-zero numbers are multiplied and the
result will be zero?

TIA
Andy
 
Arthur J. O'Dwyer

> I don't know if this is the correct group to post this, but
> when I multiply a huge floating point value by a really
> small (non-zero) floating point value, I get 0 (zero) for the result.
> This creates a big hole in a 32-bit timer routine I wrote.
>
> Questions.
> 1. Why does this happen?

It's in the FAQ. In general, Computers Are Not Math. Computers,
even big fast ones, deal exclusively in fixed-width data. That
means you only get a finite amount of precision to work with.
Keep dividing X by 2 over and over again, and eventually X will
get so small that it's indistinguishable from zero. And on a
computer, that means it's *equal* to zero. It's called "truncation"
or "underflow" or "overflow" or "rounding error" (depending on
exactly what's going on), and it's something that every numeric
programmer should understand.
Your subject line suggests you understand what a "bit" is; so
think about it this way. You're using 32-bit numbers -- 32 bits
can hold 2**32 different values -- so if the number you want to
compute isn't one of those 2**32 particular values representable
by your 32-bit numbers, then you'll get rounding error. If your
number is very close to zero, then it'll *become* zero, just because
that's the nearest representation the computer can get.
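If you want to watch that happen, here is a quick sketch (assuming your
'float' is IEEE single precision):

#include <stdio.h>

int main(void)
{
    float x = 1.0f;
    int halvings = 0;

    /* Keep halving until the value underflows to zero. */
    while (x > 0.0f) {
        x /= 2.0f;
        halvings++;
    }
    printf("x became zero after %d halvings\n", halvings);
    return 0;
}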
> 2. Are there C macros/functions I can call to tell me
> when two non-zero numbers are multiplied and the
> result will be zero?

    if (a != 0 && b != 0 && a*b == 0)
        puts("duh");

-Arthur
 
Christian Bau

> Hi,
> I don't know if this is the correct group to post this, but
> when I multiply a huge floating point value by a really
> small (non-zero) floating point value, I get 0 (zero) for the result.
> This creates a big hole in a 32-bit timer routine I wrote.
>
> Questions.
> 1. Why does this happen?
> 2. Are there C macros/functions I can call to tell me
> when two non-zero numbers are multiplied and the
> result will be zero?

This is unusual. Could you post which compiler you are using, and source
code that demonstrates the problem? Do you mean something like

    double result = 1e300 * 1e-300;

?
 
Sidney Cadot

Andy said:
> Hi,
> I don't know if this is the correct group to post this, but
> when I multiply a huge floating point value by a really
> small (non-zero) floating point value, I get 0 (zero) for the result.
> This creates a big hole in a 32-bit timer routine I wrote.
>
> Questions.
> 1. Why does this happen?
> 2. Are there C macros/functions I can call to tell me
> when two non-zero numbers are multiplied and the
> result will be zero?
>
> TIA
> Andy

This is not really on-topic here, but I wouldn't know where it would be,
so let's give it a go.

Assuming 32-bit IEEE floating point numbers (from your subject line),
their product should be the representable number closest to the exact
mathematical product, subject to the rounding mode. To quote the
first paragraph of Section 4 of IEEE 754-1985:

"Rounding takes a number regarded as infinitely precise and, if
necessary, modifies it to fit in the destination’s format while
signaling the inexact exception (7.5). Except for binary <---> decimal
conversion (whose weaker conditions are specified in 5.6), every
operation specified in Section 5 shall be performed as if it first
produced an intermediate result correct to infinite precision and with
unbounded range, and then rounded that result according to one of the
modes in this section. "

(and, of course, multiplication is specified in Section 5).

So if you are working with IEEE 754-compliant floating point, this is
how your multiplication should behave. To work this out, please show the
exact (hex) representations of the two numbers being multiplied and of
the result (is the result a single- or double-precision number)?

Further, it would help to know the hardware and software you're running
this on.

And lastly, the minimal program you can produce to show the problem
would be of assistance in understanding this.


Best regards,

Sidney Cadot
 
Sidney Cadot

Arthur said:
> It's in the FAQ. In general, Computers Are Not Math. Computers,
> even big fast ones, deal exclusively in fixed-width data. That
> means you only get a finite amount of precision to work with.
> Keep dividing X by 2 over and over again, and eventually X will
> get so small that it's indistinguishable from zero.

That being true of course in general, IEEE-754 has some pretty strict
rules on what to expect, and I think they preclude the behavior that the
OP describes. Probably a dud, but worth checking out if only because it
beats the 'Indian C programmers and "u"' hands-down for interestingness.

Best regards,

Sidney
 
John Smith

Arthur said:
> think about it this way. You're using 32-bit numbers -- 32 bits
> can hold 2**32 different values -- so if the number you want to
> compute isn't one of those 2**32 particular values representable
> by your 32-bit numbers, then you'll get rounding error. If your
> number is very close to zero, then it'll *become* zero, just because
> that's the nearest representation the computer can get.

I have always been a little hazy on this issue (some fractional base 10
values not representable in base 2). I have "almost" understood it --
until hearing it expressed in this way. It is like the understanding you
acquire after finally hearing the sound of one hand clapping. Thanks,
Arthur.

JS
 
Keith Thompson

> I don't know if this is the correct group to post this, but
> when I multiply a huge floating point value by a really
> small (non-zero) floating point value, I get 0 (zero) for the result.
> This creates a big hole in a 32-bit timer routine I wrote.
>
> Questions.
> 1. Why does this happen?
> 2. Are there C macros/functions I can call to tell me
> when two non-zero numbers are multiplied and the
> result will be zero?

Floating-point arithmetic is strange, but it's not usually *that*
strange. I'll call your huge number huge_num, and your small number
small_num. Presumably they meet the following conditions:

small_num > 0.0
small_num < 1.0
huge_num > 1.0

I would expect the following to be true, even in the presence of
rounding errors:

small_num * 1.0 == small_num
small_num * X >= small_num, for any X >= 1.0 (barring overflow)
small_num * huge_num > small_num
therefore
small_num * huge_num > 0.0

I can imagine small_num * huge_num yielding 0.0 if small_num is a
denorm (maybe), or if small_num and huge_num are of different types.

Are you sure that the result of the multiplication is 0.0? If you're
displaying it with something like a printf "%f" format, it may be
rounding to zero on output, not in the computation.

What are the actual types and values of small_num and huge_num? Can
you post a small self-contained program that exhibits the problem?
Try using printf's "%g" format for output, or even something like
"%.50g" to display more digits.
 
Dik T. Winter

> Floating-point arithmetic is strange, but it's not usually *that*
> strange. I'll call your huge number huge_num, and your small number
> small_num. Presumably they meet the following conditions:
>
> small_num > 0.0
> small_num < 1.0
> huge_num > 1.0
>
> I would expect the following to be true, even in the presence of
> rounding errors:
>
> small_num * 1.0 == small_num
> small_num * X >= small_num, for any X >= 1.0 (barring overflow)

Overflow cannot occur when 0.0 < small_num < 1.0, because the product
is smaller than or equal to X (equality is possible due to rounding
when small_num is very close to 1.0).
> small_num * huge_num > small_num
> therefore
> small_num * huge_num > 0.0
>
> I can imagine small_num * huge_num yielding 0.0 if small_num is a
> denorm (maybe), or if small_num and huge_num are of different types.

Not even when small_num is a denormalised number. Because huge_num > 1.0
the product must be larger than or equal to small_num (equality is possible
when some rounding occurs and huge_num is very close to 1.0).
> Are you sure that the result of the multiplication is 0.0? If you're
> displaying it with something like a printf "%f" format, it may be
> rounding to zero on output, not in the computation.

I think this is the most likely cause.

Under IEEE 0.0 can only be the result of a multiplication when one of the
operands is 0.0 (obviously) or when both operands are in absolute value
smaller than 1.0 and their mathematical product is in absolute value
smaller than the smallest representable number.
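
A quick sketch of that last case, assuming IEEE single precision and the
default round-to-nearest mode:

#include <stdio.h>
#include <float.h>

int main(void)
{
    float a = 1.0e-23f;     /* both non-zero... */
    float b = 1.0e-23f;
    float c = a * b;        /* exact product ~1e-46 is below half the
                               smallest denormal (~1.4e-45), so it
                               rounds to zero */

    printf("FLT_MIN = %g\n", FLT_MIN);
    printf("a = %g, b = %g, a * b = %g\n", a, b, c);
    return 0;
}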
 
Arthur J. O'Dwyer

> I have always been a little hazy on this issue (some fractional base 10
> values not representable in base 2). I have "almost" understood it --
> until hearing it expressed in this way. It is like the understanding you
> acquire after finally hearing the sound of one hand clapping. Thanks,
> Arthur.

You're welcome!
But it occurs to me (after reading some of the other replies)
that there is at least one point I should have mentioned in my
response, and one more point that you still seem to be slightly
"hazy" on:

First, as several others have pointed out, if we have

    double a, b, c;   /* initialized somehow */
    assert(a > 1.0);
    assert(b > 0.0);
    assert(b < 1.0);
    c = a*b;

then it is always the case that

    assert(c != 0.0);

However, it is plausible that the OP might have initialized
'b' in such a way as to make him *think* it was a small positive
value, while in fact it had already been corrupted by round-off
error. For example, I think

b = 1e-400; /* Really small positive number? -- NOT! */

is likely to set 'b' to 0.0 on many common platforms. (If I'm
mistaken, just add one more zero to the end of that exponent. ;)
So that's where the OP should be looking in this case. (Or he
should look and see whether he's using the %f or %g format
specifiers to display the numbers, as others have said.)

Secondly, you talk about "fractional base 10 values not
representable in base 2." Such values do exist (modulo certain
objections about the nature of "representable") -- for example,

0.1 (base 10) = 0.00011001100110011... (base 2)

That is to say, 1/10 has a binary representation that can't be
precisely represented with any finite number of bits -- as opposed
to, say, 1/2 or 3/16, which can.
However, I was talking mostly about numbers that *can* be
represented with a finite number of bits -- the problem is just
that they require *more* bits than the OP's computer is willing
to allocate. For example,

pow(2, -10000)

is mathematically representable as

0.000000... (several thousand more zeroes) ...000001 (base 2),
or just
1 (base 2) multiplied by 2 to the -10011100010000 (base 2) power.

But the OP's 32-bit floating-point format doesn't have enough
bits in the exponent to represent this number (we need 14 bits
to store "-10000" in binary, and the IEEE format only has 8 bits
in its exponent field).
So even though the number pow(2, -10000) has a finite binary
expansion, it's *still* not perfectly representable by the OP's
computer. And so it rounds off to 0.0, and we have problems. :)
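
If you want to see it for yourself, a small check (assuming your
'double' is an IEEE double):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double b = pow(2.0, -10000.0);  /* finite binary expansion, but the
                                       exponent won't fit: underflows */

    printf("pow(2, -10000) = %g\n", b);          /* prints 0 */
    printf("b == 0.0 is %s\n", b == 0.0 ? "true" : "false");
    return 0;
}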

HTH,
-Arthur
 
Dan Pop

Andy said:
> Hi,
> I don't know if this is the correct group to post this, but
> when I multiply a huge floating point value by a really
> small (non-zero) floating point value, I get 0 (zero) for the result.
> This creates a big hole in a 32-bit timer routine I wrote.
>
> Questions.
> 1. Why does this happen?

Show us the code. A minimal, but complete, program illustrating your
problem.

> 2. Are there C macros/functions I can call to tell me
> when two non-zero numbers are multiplied and the
> result will be zero?

If the (positive) mathematical result is too small to be represented as
a floating point number at all (well below FLT_MIN, once even the
denormals run out), it is zero.

Dan
 
Andy

The compiler is Keil for Intel 8051 and derivatives. It's
an embedded compiler. It was actually my mistake. I think the
code is actually working, but the test code I wrote had a race
condition on the variable gzGCLK with the ISR that actually
increments this variable. The actual code is appended at the
end of my message.
But here's a real question, given the code below

    unsigned long ticks = 0;   /* 32-bit */
    float fVal;                /* 32-bit IEEE-754 */

    while (++ticks) {
        fVal = (float)ticks * 0.004;
    }

1. Can I expect fVal to be always non-decreasing as ticks is
incremented? (until, of course, ticks wraps around to 0)
2. Does fVal cover all integral values? i.e., could it
go from, say, 56 to 60, skipping 57, 58, and 59?
3. In this example, each integral value corresponds to 250
ticks. Are all intervals between any two consecutive
integral values of fVal always 250 ticks (within tolerance,
of course)? i.e., between fVal of 250 and 251 there are
250 ticks. Is this true for all integral intervals of fVal?
4. How about a smaller number like 0.00004?

Sorry to have asked so many questions. I'm not too good with
this IEEE float thing.

/* the actual test code I wrote */
typedef unsigned long GCLK_T;
#define GCLK_SEC_PER_TICK 0.004

volatile GCLK_T gzGCLK;            /* incremented in an ISR */

GCLK_T RealElapsedTicks(GCLK_T zStart) {
    return (gzGCLK - zStart);      /* gzGCLK gets incremented in an ISR */
}

GCLK_T RealElapsedSecs(GCLK_T zStart) {
    float fVal;

    gzGCLK = 0xffffffff;           /* ERROR. Race condition with ISR */
    zStart = 0x00000000;

    fVal = (float)RealElapsedTicks(zStart) * ((float)GCLK_SEC_PER_TICK);
    return (fVal);                 /* truncates to a whole number of seconds */
}

TIA
Andy
 
Dik T. Winter

> But here's a real question, given the code below
>
> unsigned long ticks = 0; /* 32-bit */
> float fVal; /* 32-bit IEEE-754 */
>
> while(++ticks) {
> fVal = (float)ticks * 0.004;
> }
>
> 1. Can I expect fVal to be always non-decreasing as ticks is
> incremented? (until, of course, ticks wraps around to 0)
Yes.

> 2. Does fVal cover all integral values? i.e., could it
> go from, say, 56 to 60, skipping 57, 58, and 59?

No. It will not go from 56 to 60, but it may go from slightly less than
59 to slightly more than 59, skipping 59 itself.
> 3. In this example, each integral value corresponds to 250
> ticks. Are all intervals between any two consecutive
> integral values of fVal always 250 ticks (within tolerance,
> of course)? i.e., between fVal of 250 and 251 there are
> 250 ticks. Is this true for all integral intervals of fVal?

Because of the above observation, no, not necessarily.
> 4. How about a smaller number like 0.00004?

Similar answer. 0.004 and 0.00004 are not exactly representable as
floating point numbers, so rounding occurs both when the representation
of those numbers is created and when that value is used in the
multiplication. A better way would be to calculate:

    (float)ticks / 250.0

but that may be slower on your system. (250.0 is exactly representable,
so IEEE mandates that when ticks is an integer multiple of 250 the
result should be exact.)

Another problem with your code is that when ticks exceeds 2**24 the
number is no longer exactly representable as a floating point number,
so all bets are off.

To get the best possible answer you need something like:

    (float)(ticks / 250) + (float)(ticks % 250) / 250.0
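
For what it's worth, here is a throwaway comparison of the two (assuming
IEEE single precision, and using an f suffix on the divisor to keep
everything in float):

#include <stdio.h>

int main(void)
{
    unsigned long ticks = 50000000UL;   /* a multiple of 250 above 2**24 */
    float naive = (float)ticks * 0.004f;
    float split = (float)(ticks / 250) + (float)(ticks % 250) / 250.0f;

    printf("naive: %.6f\n", naive);     /* not exactly 200000 */
    printf("split: %.6f\n", split);     /* exactly 200000.000000 */
    return 0;
}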
 
Dan Pop

Andy said:
> The compiler is Keil for Intel 8051 and derivatives. It's
> an embedded compiler. It was actually my mistake. I think the
> code is actually working, but the test code I wrote had a race
> condition on the variable gzGCLK with the ISR that actually
> increments this variable. The actual code is appended at the
> end of my message.
> But here's a real question, given the code below
>
> unsigned long ticks = 0;   /* 32-bit */
> float fVal;                /* 32-bit IEEE-754 */
>
> while (++ticks) {
>     fVal = (float)ticks * 0.004;

You're evaluating this expression using double precision, which is
probably the last thing you want on an 8051 or any other processor with
no hardware support for floating point. The right expression is:

    fVal = ticks * 0.004f;

> }
>
> 1. Can I expect fVal to be always non-decreasing as ticks is
> incremented? (until, of course, ticks wraps around to 0)

Yes.

> 2. Does fVal cover all integral values? i.e., could it
> go from, say, 56 to 60, skipping 57, 58, and 59?

If fVal ever reaches the value 56, its next values will be 56.004,
56.008 and so on. Actually, approximations of these values, because
IEEE-754 cannot represent them exactly.
> 3. In this example, each integral value corresponds to 250
> ticks. Are all intervals between any two consecutive
> integral values of fVal always 250 ticks (within tolerance,
> of course)? i.e., between fVal of 250 and 251 there are
> 250 ticks. Is this true for all integral intervals of fVal?

0.004 cannot be represented exactly in IEEE-754, because it is not a
finite sum of negative powers of two. Therefore, your computation will
seldom yield an exact integer value even when the mathematically exact
result is an integer.

Things would be different for 0.00390625 (256 ticks per integral value)
because this value can be represented exactly.
> 4. How about a smaller number like 0.00004?

No difference.
> Sorry to have asked so many questions. I'm not too good with
> this IEEE float thing.

You may want to understand it, to avoid surprises. Binary floating point
arithmetic doesn't work the same way as normal arithmetic, mostly because
most real numbers cannot be represented exactly.

Dan
 
Eric Sosman

Dik T. Winter said:
> > But here's a real question, given the code below
> >
> > unsigned long ticks = 0;   /* 32-bit */
> > float fVal;                /* 32-bit IEEE-754 */
> >
> > while (++ticks) {
> >     fVal = (float)ticks * 0.004;
> > }
> [...]
> Another problem with your code is that when ticks exceeds 2**24 the
> number is no longer exactly representable as a floating point number,
> so all bets are off.

Not "all" bets, but it certainly scuttles any hope of
strict monotonicity.

Here's the scenario: `ticks' counts up to a large power
of two, about 2**24 or 2**25 (I'd have to pull out my IEEE
reference to be sure of the value, but the exact location
of this boundary isn't essential to understanding the effect).
Up to this point, the expression `(float)ticks' has produced
an exact conversion: the result is exactly equal to the
original value of `ticks'.

But at the next upward count a problem arises: `ticks'
now has one more significant bit than a `float' can handle.
(Imagine counting upwards in decimal arithmetic with three
significant digits: 999 is fine, 1000==1e3 is fine, but
1001 has too many digits.) So the conversion is inexact,
and if "round to even" is in effect the result will be a
hair too small -- in fact, it will be the same result as was
obtained from the preceding exact conversion. That is, `ticks'
increased but `(float)ticks' did not.

On the next count, the problem disappears momentarily:
the low-order bit of `ticks' is now a zero, so the number of
significant bits is once again within the capacity of `float'.
The conversion is again exact -- but look what's happened: the
value `(float)ticks' has advanced two steps at once. You've
entered a regime where `(float)ticks' "sticks" at one value
for two ticks before advancing to the correct result; it
"increments every other time."

As you go higher still, `ticks' will eventually attain
two more significant bits than a `float' can handle, and
will "stick" at one value for four cycles before increasing.
And then you'll get to three bits too many and an eight-
cycle plateau, and so on. (Consider the decimal analogy
again: after you reach 1000000==100e4, you've got to count
upward many more times before reaching 101e4==1010000.)
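
You can watch the plateau happen with a few lines of throwaway code
(assuming IEEE single precision):

#include <stdio.h>

int main(void)
{
    unsigned long ticks;

    /* Around 2**24, consecutive counts start converting to the
       same float value. */
    for (ticks = 16777215UL; ticks <= 16777220UL; ticks++)
        printf("ticks = %lu   (float)ticks = %.1f\n", ticks, (float)ticks);
    return 0;
}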

However, the cure for your case seems obvious: Whether
you know it or not, you're actually employing `double'
arithmetic in this expression because the constant `0.004'
has type `double'. So why convert to `float', losing a few
bits in the process, only to go ahead and re-convert that
mangled value to a `double'? Just make `fVal' a `double'
to begin with and get rid of the `(float)' cast, and you
should be immune to this effect.

There may be other problems elsewhere, of course, but
this problem, at least, will cease to bother you.
 
Dan Pop

Eric Sosman said:
> However, the cure for your case seems obvious: Whether
> you know it or not, you're actually employing `double'
> arithmetic in this expression because the constant `0.004'
> has type `double'. So why convert to `float', losing a few
> bits in the process, only to go ahead and re-convert that
> mangled value to a `double'? Just make `fVal' a `double'
> to begin with and get rid of the `(float)' cast, and you
> should be immune to this effect.

Given its target platform, what he may really want to do is to replace
0.004 by 0.004f so that double precision arithmetic is completely
avoided. That is, unless his application has plenty of spare cpu cycles
to burn...

Single precision IEEE-754 floating point is already painfully slow
on an 8-bit micro with no hardware floating point support. Any trick
that allows avoiding floating point completely is a big win (and a big
saver of ROM memory space).

Dan
 
Eric Sosman

Dan said:
> Given its target platform, what he may really want to do is to replace
> 0.004 by 0.004f so that double precision arithmetic is completely
> avoided. That is, unless his application has plenty of spare cpu cycles
> to burn...
>
> Single precision IEEE-754 floating point is already painfully slow
> on an 8-bit micro with no hardware floating point support. Any trick
> that allows avoiding floating point completely is a big win (and a big
> saver of ROM memory space).

I'm not familiar with his platform; maybe `double'
is out of the question. If so, I don't see how he can
avoid the "stair-step" problem that occurs with large
counts, except possibly by breaking the count into two
pieces and extracting two `float' quantities instead
of one. E.g.,

    float hi, lo;
    lo = (ticks & 0xFFFF) * 0.004f;
    hi = (ticks >> 16) * (0.004f * 65536);

Of course, then he's stuck with two `float' values and the
necessity to handle them both, more or less doubling the
amount of work that needs to be done with them elsewhere.
`double' might, perhaps, turn out to be cheaper after all.

To the O.P.: What is the purpose of this floating-point
result? What do you do with it; what decisions are based
upon it? Perhaps we can come up with a way to avoid floating-
point altogether, and stay strictly in integer-land.
 
Christian Bau

Andy said:
> The compiler is Keil for Intel 8051 and derivatives. It's
> an embedded compiler. It was actually my mistake. I think the
> code is actually working, but the test code I wrote had a race
> condition on the variable gzGCLK with the ISR that actually
> increments this variable. The actual code is appended at the
> end of my message.
> But here's a real question, given the code below
>
> unsigned long ticks = 0;   /* 32-bit */
> float fVal;                /* 32-bit IEEE-754 */
>
> while (++ticks) {
>     fVal = (float)ticks * 0.004;
> }
>
> 1. Can I expect fVal to be always non-decreasing as ticks is
> incremented? (until, of course, ticks wraps around to 0)
> 2. Does fVal cover all integral values? i.e., could it
> go from, say, 56 to 60, skipping 57, 58, and 59?
> 3. In this example, each integral value corresponds to 250
> ticks. Are all intervals between any two consecutive
> integral values of fVal always 250 ticks (within tolerance,
> of course)? i.e., between fVal of 250 and 251 there are
> 250 ticks. Is this true for all integral intervals of fVal?
> 4. How about a smaller number like 0.00004?

You will have a problem with large numbers. IEEE 32-bit float has 24 bits
for the mantissa, including the leading "1" bit of the mantissa which is
never stored.

So for floating point numbers 1 <= x < 2, the resolution is 2^-23. That
means the difference between x and the next largest floating point
number is 2^-23. For 2^23 <= x < 2^24 the resolution is 1. For 2^31 <= x
< 2^32 the resolution is 2^8 = 256, so the difference between one
floating point number and the next larger floating point number is 256.
256 * 0.004 = 1.024 > 1, so no, you will not cover all integral values
when ticks is large.
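
You can check those spacings directly with C99's nextafterf(); a quick
sketch:

#include <stdio.h>
#include <math.h>

int main(void)
{
    float a = 1.0f;
    float b = 3.0e9f;   /* somewhere between 2^31 and 2^32 */

    printf("gap above 1.0f: %g\n", nextafterf(a, 2.0f * a) - a);  /* 2^-23 */
    printf("gap above 3e9f: %g\n", nextafterf(b, 2.0f * b) - b);  /* 256 */
    return 0;
}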

Is there a good reason why you don't write

    ticks / 250

?
 
Dik T. Winter

> Dan Pop wrote: ....
>
> I'm not familiar with his platform; maybe `double'
> is out of the question.

Yup. Actually normal fp is also out of the question...
> If so, I don't see how he can
> avoid the "stair-step" problem that occurs with large
> counts, except possibly by breaking the count into two
> pieces and extracting two `float' quantities instead
> of one. E.g.,
>
> float hi, lo;
> lo = (ticks & 0xFFFF) * 0.004f;
> hi = (ticks >> 16) * (0.004f * 65536);

This ignores something he wants. When ticks is a multiple of 250
the exact integer value should be produced as a floating point number.
What I wrote in a previous post still stands:
(float)(ticks / 250) + (float)(ticks % 250) * 0.004f.
I am looking for a way to do "ticks / 250" faster on an 8-bit micro
than just division (which, again, is slow).
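
For what it's worth, the usual trick is multiplication by a precomputed
reciprocal: n / 250 == (n * 2199023256) >> 39 for every 32-bit n, where
2199023256 is ceil(2**39 / 250). A sketch follows; it needs a
32x32 -> 64-bit multiply (or at least its top bits), so whether it
actually beats the compiler's division routine on an 8051 would have to
be measured.

#include <stdint.h>

/* Unsigned division by the constant 250 via reciprocal multiplication.
   Correct for all 32-bit n: 250 * 2199023256 == 2**39 + 112, and the
   resulting error term never reaches 2**39. */
static uint32_t div250(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 2199023256UL) >> 39);
}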
 
