Fast return of two longs


Jonathan Lee

Hi all,
A petty optimization question:
I'm working on optimizing a multi-precision integer library and
wonder what the fastest way to return two unsigned longs would be.
Part of my code is a schoolbook multiplication algorithm which calls a
word-size multiplication algorithm and returns the double width result
in an array. For example

unsigned long result[2];
...
do {
mulUL(factor1, factor2, result);
...
} while(still_computing);

In an ideal world, mulUL() is just a wrapper for the assembly language
instructions and it would be great if "result" stayed in the
registers. They're used right away and then discarded.

I thought about writing a small class that contained two unsigned
longs (faking unsigned long long) and writing instead

ull_t res = mulUL(factor1, factor2);

Or even "constructing" a product
mul_result_t product(factor1, factor2);

But I figure this will just cost me a stack pointer adjustment every
loop iteration.
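For illustration, such a two-word type might look like the sketch below (ull_t and mulUL are the names from above; the 32-bit word size and the use of uint64_t as the portable fallback are my assumptions):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-word result type (the "ull_t" of the post),
// assuming 32-bit words for illustration.
struct ull_t {
    std::uint32_t lo;
    std::uint32_t hi;
};

// Portable fallback: widen to 64 bits, then split. On many ABIs a
// compiler is free to return this 8-byte POD entirely in registers.
inline ull_t mulUL(std::uint32_t a, std::uint32_t b) {
    std::uint64_t p = std::uint64_t(a) * b;
    ull_t r;
    r.lo = std::uint32_t(p);
    r.hi = std::uint32_t(p >> 32);
    return r;
}
```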

Any advice on how to encourage the "best" behaviour here?

--Jonathan

PS The advice about using a profiler can be considered understood. My
more general question still stands: can a small class with a short
lifetime be passed efficiently?
 

Victor Bazarov

Jonathan said:
Hi all,
A petty optimization question:
I'm working on optimizing a multi-precision integer library and
wonder what the fastest way to return two unsigned longs would be.
Part of my code is a schoolbook multiplication algorithm which calls a
word-size multiplication algorithm and returns the double width result
in an array. For example

unsigned long result[2];
...
do {
mulUL(factor1, factor2, result);
...
} while(still_computing);

In an ideal world, mulUL() is just a wrapper for the assembly language
instructions and it would be great if "result" stayed in the
registers. They're used right away and then discarded.

I thought about writing a small class that contained two unsigned
longs (faking unsigned long long) and writing instead

ull_t res = mulUL(factor1, factor2);

Which is what a C++ programmer should do...
Or even "constructing" a product
mul_result_t product(factor1, factor2);

Uh... Not sure what's going on here.
But I figure this will just cost me a stack pointer adjustment every
loop iteration.

I am not sure what it means. Generate the assembly code; it will help
you avoid guessing.
Any advice on how to encourage the "best" behaviour here?

What if you just pass them by reference:

unsigned long result[2]; // not sure it needs to be an array
...
mulUL(factor1, factor2, result[0], result[1]);

? Some compilers optimize passing arguments by keeping them in
registers instead of pushing addresses around, etc. Make your function
a template or inline to see if it makes any difference.
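A minimal sketch of this by-reference variant (assuming 32-bit words; the hi/lo parameter names are illustrative, not from the thread):

```cpp
#include <cassert>
#include <cstdint>

// Double-width multiply returning both halves through references.
// An inlining compiler can often keep hi/lo in registers and never
// materialize the references as pointers.
inline void mulUL(std::uint32_t a, std::uint32_t b,
                  std::uint32_t& hi, std::uint32_t& lo) {
    std::uint64_t p = std::uint64_t(a) * b;  // portable widening multiply
    hi = std::uint32_t(p >> 32);
    lo = std::uint32_t(p);
}
```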

The optimization that you're after is not necessarily best on all
platforms. You might want to have several versions of your function,
each appropriate for the particular target architecture and compiler. I
mean, if you are trying to save nanoseconds that way, that is.
--Jonathan

PS The advice about using a profiler can be considered understood. My
more general question still stands: can a small class with a short
lifetime be passed efficiently?

It can, if the hardware supports (or provides) some tricks. The mantra
here is "trust your compiler" (along with "use a profiler").

V
 

Jonathan Lee

Hi Victor,

Uh...  Not sure what's going on here.

Sorry. I was trying to suggest that if I'm going to make a class to
hold the double long result of a multiplication, I may as well build
the multiplication into the constructor. Then I wouldn't have to worry
about the copy elision happening... or not.
I am not sure what it means.  Generate the assembly code, it will help
you avoid guessing.

True for my platform and compiler. But I always like to know what
other people's experiences are.
What if you just pass them by reference:

That sounds good. It lacks the "C++ style" that you mentioned earlier
("what a C++ programmer should do..."), but I can use two locals
instead of an array and hope they aren't put on the stack.

Thanks for your comments

--Jonathan
 

Victor Bazarov

Jonathan said:
Hi Victor,



Sorry. I was trying to suggest that if I'm going to make a class to
hold the double long result of a multiplication, I may as well build
the multiplication into the constructor. Then I wouldn't have to worry
about the copy elision happening... or not.

Aha... Got it.
True for my platform and compiler. But I always like to know what
other people's experiences are.


That sounds good. It lacks the "C++ style" that you mentioned earlier
("what a C++ programmer should do..."), but I can use two locals
instead of an array and hope they aren't put on the stack.

As always, "what a C++ programmer should do" is to be portable and
generic enough so the intention is understood, and to "trust your
compiler". If you want to be fancy, you need to *test* those
combinations. Now, imagine that some compiler doesn't store those
variables in registers and instead does exactly as written, passing
references (pointers for this case) in the stack, which are later
dereferenced... Now the positioning of those objects can play some role
(and not always positive), and you have to consider how the CPU cache is
maintained, etc.. Headache upon headache if you ask me.

V
 

Marcel Müller

Hi,

Jonathan said:
A petty optimization question:
I'm working on optimizing a multi-precision integer library and
wonder what the fastest way to return two unsigned longs would be.
Part of my code is a schoolbook multiplication algorithm which calls a
word-size multiplication algorithm and returns the double width result
in an array. For example

unsigned long result[2];
...
do {
mulUL(factor1, factor2, result);
...
} while(still_computing);

In an ideal world, mulUL() is just a wrapper for the assembly language
instructions and it would be great if "result" stayed in the
registers. They're used right away and then discarded.

You should pass the result as a 64-bit integer. Most compilers can deal
with that, even quite old ones. If your platform supports it,
registers are used for 64-bit integers.
Since you are talking about inline assembly you will have some platform
dependent includes anyway. So if uint64_t is not an option you can use
an appropriate typedef.

Furthermore, your implementation should be inline. If you are in an inner
loop and care about a stack pointer adjustment, you should not do any
function call at all; it would eat up the advantage of your imul
instruction. Depending on your compiler it might be possible to define
inline assembly functions too. E.g. Watcom has had very flexible handling
of inline assembly for a long time.
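As an illustration of such an inline wrapper, here is a sketch in GCC/Clang extended-asm style (x86 only, and not Watcom syntax; the portable branch is the fallback the thread discusses):

```cpp
#include <cassert>
#include <cstdint>

// 32x32 -> 64 widening multiply via the x86 'mull' instruction
// (GCC/Clang extended asm). The "0" constraint ties 'a' to the same
// register (eax) as the low-half output; 'b' may sit in a register
// or in memory. Flags are clobbered by mull, hence "cc".
inline void mulUL_asm(std::uint32_t a, std::uint32_t b,
                      std::uint32_t& hi, std::uint32_t& lo) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("mull %3" : "=a"(lo), "=d"(hi) : "0"(a), "rm"(b) : "cc");
#else
    std::uint64_t p = std::uint64_t(a) * b;   // portable fallback
    hi = std::uint32_t(p >> 32);
    lo = std::uint32_t(p);
#endif
}
```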

I thought about writing a small class that contained two unsigned
longs (faking unsigned long long) and writing instead

ull_t res = mulUL(factor1, factor2);

Or even "constructing" a product
mul_result_t product(factor1, factor2);

But I figure this will just cost me a stack pointer adjustment every
loop iteration.

No, this is not directly related to the way you write it in C++. The
constructor of product may be inlined as well as any other function. The
variable lifetime analysis may always use the same storage for different
objects if product is still a POD type.
Any advice on how to encourage the "best" behaviour here?

If performance matters to that degree, you will not get around
platform-specific solutions. Usually the whole inner loops are
implemented as assembly fragments, because alignment is also an
important issue.


Marcel
 

Jonathan Lee

Hi Marcel,

You should pass the result as a 64-bit integer. Most compilers can deal
with that, even quite old ones. If your platform supports it,
registers are used for 64-bit integers.

Unfortunately this is a slippery slope situation. If I assume I have
long longs available to me then I would be doing the multiplication in
long longs... so I would need a long long long... :) But I get what
you're saying.
Since you are talking about inline assembly you will have some platform
dependent includes anyway.

Well, no, not really. Any place I have assembly code I have portable
C++ code as a fallback. The code can be compiled w/o any assembly. But
maybe that's beside the point.
Furthermore your implementation should be inline.

Currently I'm defining the low level multiplies and divides as static
member functions to an "Integer" class. I've added the inline keyword
before definition to encourage the inlining. I felt this was the best
choice considering C++ design goals and my performance goals.
variable lifetime analysis may always use the same storage for different
objects if product is still a POD type.

Good to know. I'll do as Victor suggested and check the disassembly
in that case.

--Jonathan
 

Pavel

Jonathan said:
Hi all,
A petty optimization question:
I'm working on optimizing a multi-precision integer library and
wonder what the fastest way to return two unsigned longs would be.
Part of my code is a schoolbook multiplication algorithm which calls a
word-size multiplication algorithm and returns the double width result
in an array. For example

unsigned long result[2];
...
do {
mulUL(factor1, factor2, result);
...
} while(still_computing);

In an ideal world, mulUL() is just a wrapper for the assembly language
instructions and it would be great if "result" stayed in the
registers. They're used right away and then discarded.

I thought about writing a small class that contained two unsigned
longs (faking unsigned long long) and writing instead

ull_t res = mulUL(factor1, factor2);

Or even "constructing" a product
mul_result_t product(factor1, factor2);

But I figure this will just cost me a stack pointer adjustment every
loop iteration.

Any advice on how to encourage the "best" behaviour here?

--Jonathan

PS The advice about using a profiler can be considered understood. My
more general question still stands: can a small class with a short
lifetime be passed efficiently?

I did something similar recently. What I found out was that, on modern
Intel platforms at least, the time required by multiplications
themselves clearly dominated. Thus, you can do it the "right way" without
much thought (within reason, of course; a virtual call overhead may well
be at least demonstrable). This will depend a lot on your FSB
frequency and the number of transfers per cycle though. My results were
for a 1033 MHz FSB and it was never a bottleneck.

And, if your multi-precision package needs divisions.. just forget about
those subtleties.. the best spent time would be to understand how you
can minimize the number of divisions.

See Appendix C in Intel® 64 and IA-32 Architectures Optimization
Reference Manual for examples of instruction timings (available online
on Intel Web Site).

Also, "trust your compiler" was kinda the right advice. "Kinda" because I
do not trust it; it is too smart for me now (any modern compiler I have
seen recently). For example, gcc used with -O3 and some -f options like
unroll-loops will inline many more functions than you would want (and
that is not always good for your code size and cache locality). And no,
it would not be only functions that are inline in the C++ sense. Also, it
will use all the registers it can and then some; so tuning your C++ code
to let the compiler know what to keep in registers may have unexpected
effects (possibly even adverse ones, although I did not see such effects
myself).

On a side note, I could see duplication of some computations (that is,
re-computing or re-loading from memory values that were already
available in registers). So the old saying "A program has at least two
bugs" seems to hold for the gcc optimizer...

Hope this will help,
-Pavel
 

tni

Pavel said:
I did something similar recently. What I found out was that, on modern
Intel platforms at least, the time required by multiplications
themselves clearly dominated.

Modern Intel CPUs are quite good for 32 -> 64 bit multiplications. For
something like:

uint32_t v1, v2;
uint64_t result = uint64_t(v1) * uint64_t(v2);

both GCC 3/4 and Visual Studio 2005/2008 generate a single 'mul'
instruction (that will take 1-5 clock cycles on a core2), so any
function call would dominate the timing. (Visual Studio does generate
very lousy 32-bit code for the signed version though.)

If you have a 64-bit platform, Visual Studio and GCC offer support for
64 -> 128 bit multiplies.
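On GCC or Clang targeting a 64-bit platform, for example, this is exposed through the non-standard unsigned __int128 type (a sketch of mine, not tni's code; the mul64wide name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// 64x64 -> 128 multiply via the unsigned __int128 extension
// (GCC/Clang on 64-bit targets); on x86-64 the compiler emits a
// single widening MUL for this.
inline void mul64wide(std::uint64_t a, std::uint64_t b,
                      std::uint64_t& hi, std::uint64_t& lo) {
    unsigned __int128 p = (unsigned __int128)a * b;
    lo = std::uint64_t(p);
    hi = std::uint64_t(p >> 64);
}
```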
 

James Kanze

Hi all,
A petty optimization question:
I'm working on optimizing a multi-precision integer library and
wonder what the fastest way to return two unsigned longs would be.
Part of my code is a schoolbook multiplication algorithm which calls a
word-size multiplication algorithm and returns the double width result
in an array. For example
unsigned long result[2];
...
do {
mulUL(factor1, factor2, result);
...
} while(still_computing);
In an ideal world, mulUL() is just a wrapper for the assembly
language instructions and it would be great if "result" stayed
in the registers. They're used right away and then discarded.
I thought about writing a small class that contained two
unsigned longs (faking unsigned long long) and writing instead
ull_t res = mulUL(factor1, factor2);
Or even "constructing" a product
mul_result_t product(factor1, factor2);
But I figure this will just cost me a stack pointer adjustment
every loop iteration.
Any advice on how to encourage the "best" behaviour here?

It depends on the compiler. In theory, a compiler could return
a small structure in several registers, rather than in memory.
In practice, I don't know if any do. (Note that in the general
case, this is determined by the platform ABI, so it's difficult
for a single compiler to do much here.)
 

robertwessel2

Hi all,
  A petty optimization question:
  I'm working on optimizing a multi-precision integer library and
wonder what the fastest way to return two unsigned longs would be.
Part of my code is a schoolbook multiplication algorithm which calls a
word-size multiplication algorithm and returns the double width result
in an array. For example
    unsigned long result[2];
    ...
    do {
        mulUL(factor1, factor2, result);
        ...
    } while(still_computing);
In an ideal world, mulUL() is just a wrapper for the assembly
language instructions and it would be great if "result" stayed
in the registers. They're used right away and then discarded.
I thought about writing a small class that contained two
unsigned longs (faking unsigned long long) and writing instead
     ull_t res = mulUL(factor1, factor2);
Or even "constructing" a product
     mul_result_t product(factor1, factor2);
But I figure this will just cost me a stack pointer adjustment
every loop iteration.
Any advice on how to encourage the "best" behaviour here?

It depends on the compiler.  In theory, a compiler could return
a small structure in several registers, rather than in memory.
In practice, I don't know if any do.  (Note that in the general
case, this is determined by the platform ABI, so it's difficult
for a single compiler to do much here.)


FWIW, the standard Win32 calling conventions will return a small
(eight byte, and POD) struct in eax/edx.

http://msdn.microsoft.com/en-us/library/984x0h58.aspx
 

Pavel

tni said:
Modern Intel CPUs are quite good for 32 -> 64 bit multiplications. For
something like:

uint32_t v1, v2;
uint64_t result = uint64_t(v1) * uint64_t(v2);

both GCC 3/4 and Visual Studio 2005/2008 generate a single 'mul'
I am not sure what instruction you refer to. Do you mean IMUL?
instruction (that will take 1-5 clock cycles on a core2), so any
1-5 clock cycles (actually 1-3 for ) is the throughput of these
instructions (that is, how fast you can issue the next back-to-back IMUL
instruction); however, the latency is at least 10 cycles (depending on
the core model). The OP's problem will require 4 IMUL instructions
(unless he resorts to assembler, which does not seem to be in the
cards), and intermediate ADDs may make the next IMUL wait till the end
of the ADD.
function call would dominate the timing. (Visual Studio does generate
very lousy 32-bit code for the signed version though.)

If you have a 64-bit platform, Visual Studio and GCC offer support for
64 -> 128 bit multiplies.

This would help to only do a single IMUL. That will still take 10 cycles
though (actually, up to 14, depending on a particular model of the
processor architecture) on Intel.

-Pavel
 

tni

Pavel said:
I am not sure what instruction you refer to. Do you mean IMUL?

No, I mean 'mul' (unsigned multiply).
1-5 clock cycles (actually 1-3 for ) is the throughput of these
instructions

The throughput is 1/cycle on an Intel Core 2, with a latency of 5 clock
cycles (I'm not talking about a P4 which has horrible latencies for
pretty much everything). Multiply is fully pipelined.
(that is, how fast you can issue the next back-to-back IMUL
instruction); however, the latency is at least 10 cycles (depending on
the core model).

No, it doesn't, assuming a 'long' is 32-bit, and a 'long long' 64-bit on
IA-32.

-------------------------------------------------
#include "pstdint.h"
#include <cstdlib>   // for std::rand
#include <iostream>

__declspec(noinline)
uint64_t mul(uint32_t a, uint32_t b) {
return uint64_t(a) * uint64_t(b);
}

int main(int argc, char* argv[]) {
uint32_t a = std::rand();
uint32_t b = std::rand();
std::cout << mul(a, b) << std::endl;
return 0;
}
--------------------------------------------------

Visual Studio 2005, built for IA-32, disassembled:

uint64_t mul(uint32_t a, uint32_t b) {
return uint64_t(a) * uint64_t(b);
00401000 mul eax,dword ptr [esp+4]
}
00401004 ret


The result is 64 bits, in eax:edx

\\

On x64, the 64-bit x 64-bit -> 128-bit multiply is also a single
instruction (but you have to use a non-standard compiler intrinsic).

------------------------------------------------
#include "pstdint.h"
#include <iostream>
#include <intrin.h>

__declspec(noinline)
void mul128(int64_t a, int64_t b, int64_t& res1, int64_t& res2) {
res1 = _mul128(a, b, &res2);
}
------------------------------------------------

Visual Studio 2005, built for x64:

void mul128(int64_t a, int64_t b, int64_t& res1, int64_t& res2) {
res1 = _mul128(a, b, &res2);
0000000140001000 mov rax,rdx
0000000140001003 imul rcx
0000000140001006 mov qword ptr [r9],rdx
0000000140001009 mov qword ptr [r8],rax
}
000000014000100C ret

The result is 128 bits, qword ptr [r9] : qword ptr [r8].

\\
> OP's problem will require 4 IMUL instructions (unless
> he resorts to assembler which does not seem to be in the cards)

As per above, on IA-32/x64, the compiler does the job - no need for
assembly.
> and intermediate ADDs may make next IMUL wait till the end of ADD.

Then you are doing something wrong, e.g. look at:
http://www.mips.com/media/files/white-papers/64 bit architecture speeds RSA4x.pdf

for an algorithm. The 4 required multiplies are independent and can run
in parallel on current IA-32/x64 CPUs (non-x64 P4's don't have a fully
pipelined multiply, they can only do 0.2 muls/clock throughput and have
high-latency shifts which doesn't help either, since those will be in
the dependency chain).
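The four independent 32x32 multiplies of the schoolbook 64x64 -> 128 scheme can be sketched portably as follows (illustrative code of mine, not taken from the linked paper):

```cpp
#include <cassert>
#include <cstdint>

// Portable 64x64 -> 128 multiply built from four 32x32 -> 64
// products; the four multiplies are mutually independent, so a
// pipelined multiplier can issue them back to back. Only the carry
// propagation at the end serializes.
inline void mul64x64(std::uint64_t a, std::uint64_t b,
                     std::uint64_t& hi, std::uint64_t& lo) {
    std::uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
    std::uint64_t b0 = b & 0xFFFFFFFFu, b1 = b >> 32;

    std::uint64_t p00 = a0 * b0;   // the four independent multiplies
    std::uint64_t p01 = a0 * b1;
    std::uint64_t p10 = a1 * b0;
    std::uint64_t p11 = a1 * b1;

    std::uint64_t mid = p01 + (p00 >> 32);        // cannot overflow
    mid += p10;                                   // may carry
    std::uint64_t carry = (mid < p10) ? 1u : 0u;  // detect the carry

    hi = p11 + (mid >> 32) + (carry << 32);
    lo = (mid << 32) | (p00 & 0xFFFFFFFFu);
}
```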
> That will still take 10 cycles though (actually, up to 14, depending
> on a particular model of the processor architecture) on Intel.

Only the P4 has latencies that high; Pentium Pro, P2, P3, PM, Core
Solo/Duo, Core 2, and Core i7 all have fast multiplies; AMD Athlon,
Athlon 64, Opteron and Phenom are fast as well.
 

Pavel

tni said:
Pavel said:
I am not sure what instruction you refer to. Do you mean IMUL?

No, I mean 'mul' (unsigned multiply).
1-5 clock cycles (actually 1-3 for ) is the throughput of these
instructions

The throughput is 1/cycle on an Intel Core 2, with a latency of 5 clock
cycles (I'm not talking about a P4 which has horrible latencies for
pretty much everything). Multiply is fully pipelined.
(that is, how fast you can issue the next back-to-back IMUL
instruction); however, the latency is at least 10 cycles (depending on
the core model).

No, it doesn't, assuming a 'long' is 32-bit, and a 'long long' 64-bit on
IA-32.

-------------------------------------------------
#include "pstdint.h"
#include <cstdlib>   // for std::rand
#include <iostream>

__declspec(noinline)
uint64_t mul(uint32_t a, uint32_t b) {
return uint64_t(a) * uint64_t(b);
}

int main(int argc, char* argv[]) {
uint32_t a = std::rand();
uint32_t b = std::rand();
std::cout << mul(a, b) << std::endl;
return 0;
}
--------------------------------------------------

Visual Studio 2005, built for IA-32, disassembled:

uint64_t mul(uint32_t a, uint32_t b) {
return uint64_t(a) * uint64_t(b);
00401000 mul eax,dword ptr [esp+4]
}
00401004 ret


The result is 64 bits, in eax:edx

\\

On x64, the 64-bit x 64-bit -> 128-bit multiply is also a single
instruction (but you have to use a non-standard compiler intrinsic).

------------------------------------------------
#include "pstdint.h"
#include <iostream>
#include <intrin.h>

__declspec(noinline)
void mul128(int64_t a, int64_t b, int64_t& res1, int64_t& res2) {
res1 = _mul128(a, b, &res2);
}
------------------------------------------------

Visual Studio 2005, built for x64:

void mul128(int64_t a, int64_t b, int64_t& res1, int64_t& res2) {
res1 = _mul128(a, b, &res2);
0000000140001000 mov rax,rdx
0000000140001003 imul rcx
0000000140001006 mov qword ptr [r9],rdx
0000000140001009 mov qword ptr [r8],rax
}
000000014000100C ret

The result is 128 bits, qword ptr [r9] : qword ptr [r8].

\\
OP's problem will require 4 IMUL instructions (unless
he resorts to assembler which does not seem to be in the cards)

As per above, on IA-32/x64, the compiler does the job - no need for
assembly.
and intermediate ADDs may make next IMUL wait till the end of ADD.

Then you are doing something wrong, e.g. look at:
http://www.mips.com/media/files/white-papers/64 bit architecture speeds RSA4x.pdf


for an algorithm. The 4 required multiplies are independent and can run
in parallel on current IA-32/x64 CPUs (non-x64 P4's don't have a fully
pipelined multiply, they can only do 0.2 muls/clock throughput and have
high-latency shifts which doesn't help either, since those will be in
the dependency chain).
That will still take 10 cycles though (actually, up to 14, depending
on a particular model of the processor architecture) on Intel.

Only the P4 has latencies that high; Pentium Pro, P2, P3, PM, Core
Solo/Duo, Core 2, and Core i7 all have fast multiplies; AMD Athlon,
Athlon 64, Opteron and Phenom are fast as well.

For MUL, "INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES" gives a 10-cycle
latency for the model they refer to as used for Core 2 (from C.3.1: "The
column represented by 0xF3n also applies to Intel processors with CPUID
signature 0xF4n and 0xF6n" .. "'n' indicates it applies to any value in
the stepping encoding" .. "Intel Core Solo and Intel Core Duo processors
are represented by 06_0EH"). See the bottom of table C-13 there; the
lowest latency for MUL shown there is for the model 0xF3H, and it says
10 cycles. And my testing attested to each of those 10 cycles,
unfortunately :(.

-Pavel
 
