Multiplications by powers of 2

Lisa Simpson

I am writing a program (specifically, an image-processing filter) that computes
many multiplications of floating-point numbers by constant powers of 2. This
amounts to adding an integer constant to the exponent without changing the
mantissa, which I assumed would be simpler than multiplying two floating-point
numbers. Yet when I write the program using the C "ldexp" function, it runs
much slower while producing the same output.
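
For concreteness, the two variants look roughly like this (the scale factor
2^-3 here is just an example):

#include <math.h>

/* Variant 1: ordinary multiplication by a constant power of 2
   (0.125 = 2^-3 is exactly representable, so there is no rounding). */
float scale_mul(float x)
{
    return x * 0.125f;
}

/* Variant 2: adjust the exponent through the standard library. */
float scale_ldexp(float x)
{
    return ldexpf(x, -3);
}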

Since the program should run on a Pentium 4 processor, I downloaded a manual
from Intel's website
http://www.intel.com/design/pentium4/manuals/248966.htm
and indeed floating-point multiplication ("fmul" in assembler) has a latency
of 8 clock cycles, while changing the exponent ("fscale" in assembler) has a
latency of 60 cycles. This makes no sense to me.

I would like to ask if anyone knows a fast method for multiplying a
floating-point number by a constant power of 2, or if I should use standard
multiplication. I must use floating-point arithmetic because the input data
can span several orders of magnitude and the program also computes square
roots and exponentials. Speed is important, because the program should
process high-resolution video.

I tried adding an integer to the exponent through a union, but this is still
a bit slower than pure floating-point multiplication. I guess this happens
because the processor uses different registers for integer and floating-point
operations, so some clock cycles are wasted moving data around. Also, with
this approach underflows give incorrect results: e.g. "multiplying" 0.0 by
2^(-1) gives -Inf. Maybe I could gain some speed by using SSE instructions,
which (as far as I know) use the same set of registers for both integer and
floating-point operations, but that would not solve the underflow problem.
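
For reference, the union version I tried is essentially this (assuming
32-bit IEEE 754 floats):

#include <stdint.h>

/* Add k to the biased exponent field of a 32-bit IEEE 754 float.
   The exponent occupies bits 23..30.  Zero, denormals, Inf and NaN
   are not handled: the exponent field of 0.0 is already 0, so
   subtracting from it wraps into the sign bit, which is exactly
   how 0.0 "times" 2^(-1) came out as -Inf. */
float scale_pow2(float x, int k)
{
    union { float f; uint32_t u; } v;
    v.f = x;
    v.u += (uint32_t)k << 23;
    return v.f;
}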

I hope someone can help me. Thanks in advance.

An Electronics Engineering PhD student
 
pete

Lisa Simpson wrote:
I would like to ask if anyone knows a fast method for multiplying a
floating-point number by a constant power of 2,
or if I should use standard multiplication.

Use standard multiplication.
If one of the operands of the multiplication operator
is a constant expression with a value of 2,
then your compiler has all the information it needs
to do any optimization that there is to be done.
 
Skarmander

Lisa said:
I am writing a program (specifically, an image-processing filter) that computes
many multiplications of floating-point numbers by constant powers of 2. This
amounts to adding an integer constant to the exponent without changing the
mantissa, which I assumed would be simpler than multiplying two floating-point
numbers. Yet when I write the program using the C "ldexp" function, it runs
much slower while producing the same output.

Since the program should run on a Pentium 4 processor, I downloaded a manual
from Intel's website
http://www.intel.com/design/pentium4/manuals/248966.htm
and indeed floating-point multiplication ("fmul" in assembler) has a latency
of 8 clock cycles, while changing the exponent ("fscale" in assembler) has a
latency of 60 cycles. This makes no sense to me.

Why not? Common operations are optimized; uncommon ones are not. Just
because you think one operation ought to be faster because it's conceptually
simpler doesn't mean that will be the case.
I would like to ask if anyone knows a fast method for multiplying a
floating-point number by a constant power of 2, or if I should use standard
multiplication.

You should use multiplication, since you're trying to multiply. :)

Leave optimization to the compiler. If timing shows you an program isn't
fast enough for your purposes, use profiling to determine where the
bottleneck is, and find ways to remove it. The bottlenecks are almost never
primitive operations, but rather the algorithm that uses them.
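
For example, with GCC and gprof (the file names here are made up):

gcc -pg -O2 filter.c -o filter    # build with profiling instrumentation
./filter frame.raw out.raw        # run on representative input
gprof filter gmon.out             # report where the time is actually spent
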
I must use floating-point arithmetic because the input data
can span several orders of magnitude and the program also computes square
roots and exponentials. Speed is important, because the program should
process high-resolution video.

"Speed is important" is not the same as "I know how important the speed is,
what the current speed is, and where the speed isn't sufficient".

Speed is almost never *unimportant*, but that doesn't mean you should
optimize prematurely.

I tried adding an integer to the exponent through a union, but this is still
a bit slower than pure floating-point multiplication. I guess this happens
because the processor uses different registers for integer and floating-point
operations, so some clock cycles are wasted moving data around. Also, with
this approach underflows give incorrect results: e.g. "multiplying" 0.0 by
2^(-1) gives -Inf. Maybe I could gain some speed by using SSE instructions,
which (as far as I know) use the same set of registers for both integer and
floating-point operations, but that would not solve the underflow problem.

SSE may very well speed things up, but not for the reasons you mention. SSE
is designed to quickly apply a single operation to multiple sets of data.
Rather than multiply two numbers, SSE could be used to multiply vectors of
numbers, effectively doing many multiplications in concert. Modern compilers
can even make use of this transparently in some circumstances.
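
As a minimal sketch of the idea with SSE intrinsics (this assumes an x86
compiler providing <xmmintrin.h>, a length that is a multiple of 4, and
16-byte-aligned data):

#include <stddef.h>
#include <xmmintrin.h>

/* Multiply n floats in place by a constant factor, four at a time. */
void scale_array(float *data, size_t n, float factor)
{
    __m128 f = _mm_set1_ps(factor);   /* broadcast factor to 4 lanes */
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        __m128 v = _mm_load_ps(data + i);         /* aligned load      */
        _mm_store_ps(data + i, _mm_mul_ps(v, f)); /* multiply + store  */
    }
}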

How to use SSE and comparable vectorizing technologies is off-topic to this
ng, but there's plenty of information out there. This seems much more
relevant to your problem than trying to optimize multiplication.

S.
 
Roberto Waltman

Lisa Simpson said:
I am writing a program (specifically, an image-processing filter) that computes
many multiplications of floating-point numbers by constant powers of 2. This
amounts to adding an integer constant to the exponent without changing the
mantissa, which I assumed would be simpler than multiplying two floating-point
numbers. Yet when I write the program using the C "ldexp" function, it runs
much slower while producing the same output.

Since the program should run on a Pentium 4 processor, I downloaded a manual
from Intel's website
http://www.intel.com/design/pentium4/manuals/248966.htm
and indeed floating-point multiplication ("fmul" in assembler) has a latency
of 8 clock cycles, while changing the exponent ("fscale" in assembler) has a
latency of 60 cycles. This makes no sense to me.

I would like to ask if anyone knows a fast method for multiplying a
floating-point number by a constant power of 2, or if I should use standard
multiplication. I must use floating-point arithmetic because the input data
can span several orders of magnitude and the program also computes square
roots and exponentials. Speed is important, because the program should
process high-resolution video.

Two very generic, very vague, and probably useless suggestions:

(a) Could you perform a chain of calculations knowing what error is
accumulating, and correct the result only once at the end? (A rough
sketch follows below.)

(b) Maybe you could get faster results using fixed-point / multiple-
precision arithmetic, converting to floating point only when
needed. (Doubtful, but worth checking nevertheless.)
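
As a sketch of (a), assuming the chain is linear at that point (sums and
products by constants; this would not commute with sqrt() or exp()), and
with made-up coefficients:

/* Instead of scaling every sample by 2^-8 up front, run the linear
   part of the chain unscaled and apply the power of 2 once at the end. */
float filter_tap(const float *x)
{
    float acc = 3.0f * x[0] + 5.0f * x[1] + 7.0f * x[2];
    return acc * 0.00390625f;   /* one deferred multiply by 2^-8 */
}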
 
inmatarian

Lisa said:
Since the program should run on a Pentium 4 processor, I downloaded a manual
from Intel's website
http://www.intel.com/design/pentium4/manuals/248966.htm
and indeed floating-point multiplication ("fmul" in assembler) has a latency
of 8 clock cycles, while changing the exponent ("fscale" in assembler) has a
latency of 60 cycles. This makes no sense to me.

It's probably a case of Intel optimizing their processors to handle
common operations at the expense of uncommon ones. Benchmark
programs tend to use only the common operations, which
translates into better performance results against the competition.

Inmatarian.
 
Morris Dovey

Lisa Simpson (in [email protected]) said:

| Since the program should run on a Pentium 4 processor, I downloaded
| a manual from Intel's website
| http://www.intel.com/design/pentium4/manuals/248966.htm
| and indeed floating-point multiplication ("fmul" in assembler) has
| a latency of 8 clock cycles, while changing the exponent ("fscale"
| in assembler) has a latency of 60 cycles. This makes no sense to me.

I suspect that you're more qualified to answer this question than I;
but it does appear that 8 clocks constitute a "bargain".
 
pnreddy1976

Hi,
Look into the Q15/Q31 fixed-point formats for these operations.
I hope these formats will work well for you.
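
For reference, a Q15 multiply in C looks roughly like this (no rounding
or saturation, just the idea):

#include <stdint.h>

/* Q15: value = raw / 2^15.  The 32-bit product has 30 fractional
   bits, so shift 15 of them away to get back to Q15. */
int16_t q15_mul(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b) >> 15);
}
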
Reddy
 
Lisa Simpson

pete wrote:
If one of the operands of the multiplication operator
is a constant expression with a value of 2,
then your compiler has all the information it needs
to do any optimization that there is to be done.

I examined the assembly code generated by the compiler (GCC 4.1.0, with
optimization turned on). If I multiply by 2, the compiler is smart enough to
replace the multiplication with an addition, but if I multiply by 0.5 (or
divide by 2) it uses standard floating-point multiplication. Unfortunately,
we can't afford to buy a commercial compiler.
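
My test case was essentially the following (compiled with "gcc -O2 -S" and
inspecting the generated assembly):

/* times_two() comes out as an addition; times_half() stays a multiply. */
float times_two(float x)  { return x * 2.0f; }
float times_half(float x) { return x * 0.5f; }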
 
pete

Lisa said:
I examined the assembly code generated by the
compiler (GCC 4.1.0, with
optimization turned on). If I multiply by 2,
the compiler is smart enough to
replace the multiplication with an addition,
but if I multiply by 0.5 (or divide by 2)
it uses standard floating-point multiplication. Unfortunately,
we can't afford to buy a commercial compiler.

If you examine the bit representation
of your floating point numbers,
then maybe you can figure something out.

/* BEGIN output from bitstr.c */

0.001100 = 00111010100100000010110111100000
0.002200 = 00111011000100000010110111100000
0.004400 = 00111011100100000010110111100000
0.008800 = 00111100000100000010110111100000
0.017600 = 00111100100100000010110111100000
0.035200 = 00111101000100000010110111100000
0.070400 = 00111101100100000010110111100000
0.140800 = 00111110000100000010110111100000
0.281600 = 00111110100100000010110111100000
0.563200 = 00111111000100000010110111100000
1.126400 = 00111111100100000010110111100000
2.252800 = 01000000000100000010110111100000
4.505600 = 01000000100100000010110111100000
9.011200 = 01000001000100000010110111100000
18.022400 = 01000001100100000010110111100000
36.044800 = 01000010000100000010110111100000
72.089600 = 01000010100100000010110111100000
144.179199 = 01000011000100000010110111100000
288.358398 = 01000011100100000010110111100000
576.716797 = 01000100000100000010110111100000
1153.433594 = 01000100100100000010110111100000
2306.867188 = 01000101000100000010110111100000
4613.734375 = 01000101100100000010110111100000
9227.468750 = 01000110000100000010110111100000
18454.937500 = 01000110100100000010110111100000

/* END output from bitstr.c */

/* BEGIN bitstr.c */

#include <limits.h>
#include <stdio.h>

#define STRING "%12f = %s\n"
#define E_TYPE float
#define P_TYPE double
#define INITIAL 0.0011f
#define FINAL 10000
#define INC(E) ((E) *= 2)

typedef E_TYPE e_type;
typedef P_TYPE p_type;

void bitstr(char *str, const void *obj, size_t n);

int main(void)
{
    e_type e;
    char ebits[CHAR_BIT * sizeof e + 1];

    puts("\n/* BEGIN output from bitstr.c */\n");
    e = INITIAL;
    bitstr(ebits, &e, sizeof e);
    printf(STRING, (p_type)e, ebits);
    while (FINAL > e) {
        INC(e);
        bitstr(ebits, &e, sizeof e);
        printf(STRING, (p_type)e, ebits);
    }
    puts("\n/* END output from bitstr.c */");
    return 0;
}

/* Write the bits of the n-byte object at obj into str,
   last byte first (most significant first on little-endian machines). */
void bitstr(char *str, const void *obj, size_t n)
{
    unsigned mask;
    const unsigned char *byte = obj;

    while (n-- != 0) {
        mask = ((unsigned char)-1 >> 1) + 1;    /* highest bit of a byte */
        do {
            *str++ = (char)(mask & byte[n] ? '1' : '0');
            mask >>= 1;
        } while (mask != 0);
    }
    *str = '\0';
}

/* END bitstr.c */
 
Morris Dovey

Lisa Simpson (in [email protected]) said:

|| If one of the operands of the multiplication operator
|| is a constant expression with a value of 2,
|| then your compiler has all the information it needs
|| to do any optimization that there is to be done.
|
| I examined the assembly code generated by the compiler (GCC 4.1.0,
| with optimization turned on). If I multiply by 2, the compiler is
| smart enough to replace the multiplication with an addition, but if
| I multiply by 0.5 (or divide by 2) it uses standard floating-point
| multiplication. Unfortunately, we can't afford to buy a commercial
| compiler.

Very interesting. Have you considered working with the maintainer(s)
of the relevant gcc modules to complete the symmetry? Getting what
you'd like to have /may/ be easier than you might expect - and the
benefits could extend to a great many people...
 
Tim Prince

Lisa said:
I examined the assembly code generated by the compiler (GCC 4.1.0, with
optimization turned on). If I multiply by 2, the compiler is smart enough to
replace the multiplication with an addition, but if I multiply by 0.5 (or
divide by 2) it uses standard floating-point multiplication. Unfortunately,
we can't afford to buy a commercial compiler.

Doesn't that look like a reasonable thing for gcc to do? In some
special circumstances, if you are doing no other floating point
operations on the data, and you know there will be no exponent range
problems, you might get slightly faster results by integer addition. It
seems unlikely that such situations could be recognized by a compiler;
if you find an advantage in it, you could write the highly non-portable
code yourself.
 
Stephen Sprunk

Lisa Simpson said:
I examined the assembly code generated by the compiler (GCC 4.1.0, with
optimization turned on). If I multiply by 2, the compiler is smart enough
to replace the multiplication with an addition, but if I multiply by 0.5
(or divide by 2) it uses standard floating-point multiplication.
Unfortunately, we can't afford to buy a commercial compiler.

I'd suggest posting a question on a GCC-specific list, like gnu.gcc.help.
The community is pretty responsive to "simple" optimizations like this once
someone points them out.

S
 
Dik T. Winter

Stephen Sprunk said:
> I'd suggest posting a question on a GCC-specific list, like gnu.gcc.help.
> The community is pretty responsive to "simple" optimizations like this once
> someone points them out.

Multiplication by 2 is easy to transform into addition of the number to
itself. There is no such easy transformation for division by 2. You
can try manipulating the exponent, but that works only if the
exponent is within certain ranges, something the compiler does not
know. I do not think there is any compiler that would transform
division by 2 into something else.
 
Al Balmer

Lisa Simpson said:
I examined the assembly code generated by the compiler (GCC 4.1.0, with
optimization turned on). If I multiply by 2, the compiler is smart enough to
replace the multiplication with an addition, but if I multiply by 0.5 (or
divide by 2) it uses standard floating-point multiplication.

I must be missing something - what did you expect it to do? As you
have already discovered, manipulating the exponent is not efficient.
 
Dann Corbit

Dik T. Winter said:
Multiplication by 2 is easy to transform to addition of the number with
itself. There is no such easy transformation for division by 2. You
can try manipulation of the exponent, but that works only if the
exponent is within certain ranges, something the compiler does not
know. I do not think there is any compiler that would division by 2
transform to something else.

How about multiplication by 0.5? (Since we are talking about floating
point).
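
Since 0.5 is exactly representable, x / 2.0f and x * 0.5f round identically
for every float, which a brute-force loop can confirm (a sketch; it walks
all 2^32 bit patterns, so it takes a while):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t u = 0;

    do {
        float x;
        memcpy(&x, &u, sizeof x);             /* reinterpret bits as float */
        if (x == x && x / 2.0f != x * 0.5f) { /* x != x skips NaNs         */
            printf("mismatch at bit pattern %08lx\n", (unsigned long)u);
            return 1;
        }
    } while (u++ != UINT32_MAX);
    puts("x / 2.0f == x * 0.5f for every non-NaN float");
    return 0;
}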
 
