C / asm / long ints

F

fermineutron

A while back i tried to calculate factorials of large numbers using
arrays in C, the array encoded integer arithemetic that i wrote in C
was very slow, it would take almost a second to multiply 2 array
encoded integers. resently i looked at large precision libraries for C,
in particular GMP but i was unable to get it to run with my compiler,
aperantly some header files were not found. I was curious what is the
best way to interface C and asm so that i could write simmilar library
in asm and use it in C.

Any ideas?
 
R

Richard Heathfield

fermineutron said:
A while back i tried to calculate factorials of large numbers using
arrays in C, the array encoded integer arithemetic that i wrote in C
was very slow, it would take almost a second to multiply 2 array
encoded integers. resently i looked at large precision libraries for C,
in particular GMP but i was unable to get it to run with my compiler,
aperantly some header files were not found. I was curious what is the
best way to interface C and asm so that i could write simmilar library
in asm and use it in C.

Assembly language is not a magic wand you can wave to turn bad algorithms
into good ones. If your C routines weren't quick enough, your assembly
language routines are very unlikely to be quick enough either.

Choose Better Algorithms. Then implement them in C.
 
M

mensanator

fermineutron said:
A while back i tried to calculate factorials of large numbers using
arrays in C, the array encoded integer arithemetic that i wrote in C
was very slow, it would take almost a second to multiply 2 array
encoded integers. resently i looked at large precision libraries for C,
in particular GMP but i was unable to get it to run with my compiler,
aperantly some header files were not found. I was curious what is the
best way to interface C and asm so that i could write simmilar library
in asm and use it in C.

Any ideas?

It would probably be easier to figure out how to make GMP
work on your system. And worth the effort.
 
J

Jack \Abram\ Off

fermineutron said:
A while back i tried to calculate factorials of large numbers using
arrays in C, the array encoded integer arithemetic that i wrote in C
was very slow

You probably could write assembly to to the manipulations that you do in
your multiply procedure "a little less slowly", but I suspect what you
*really* want is something borrowed from higher maths: The FFT multiply.

http://numbers.computation.free.fr/Constants/Algorithms/fft.html

Variations on this theme are found in all kinds of programming
situations where really big numbers are involved, and in your own
experiments you stumbled upon the itch that is scratched by this approach.

If approximate solutions to n! would be useful in your application,
Stirling gave us something in the 18th century that finds all kinds
of uses in computing today:
http://mathworld.wolfram.com/StirlingsApproximation.html

If you study computer science in a university setting, you will see
Stirling's formula in the first course of discrete maths, and the DFT
("Discrete Fourier Transform") will come back to haunt you a few times
in various algorithm and automata courses. (Or at least it *should*;
some schools appear to not actually teach Computer Science.)
 
N

Nishu

fermineutron said:
A while back i tried to calculate factorials of large numbers using
arrays in C, the array encoded integer arithemetic that i wrote in C
was very slow, it would take almost a second to multiply 2 array
encoded integers. resently i looked at large precision libraries for C,
in particular GMP but i was unable to get it to run with my compiler,
aperantly some header files were not found. I was curious what is the
best way to interface C and asm so that i could write simmilar library
in asm and use it in C.

Some compilers support embedded asm but it is waste of time unless you
understand the underlined processor architecture, its related
instruction set, how your compiler generates the corresponding asm for
your optimized C routine and so on. It is big time investment.
Relatively, less time consuming method with appreciable return is to
optimize the C routine itself; and there are various techinques to do
so which are easily available on the net and also in some previous
threads in this group.

-Nishu
 
C

Chris Thomasson

Nishu said:
fermineutron wrote:
[...]
I was curious what is the
best way to interface C and asm so that i could write simmilar library
in asm and use it in C.

[...]

You make you libraries ABI follow the C calling convention for the processor
your targeting. For x86 prams are passed on the stack left-to-right, and
SPARC passed prams mostly in registers', ect...

Here are two examples of how to use IA-32 assembly language to build a
library that has a strict C API, and an ABI that follows the C calling
convention for the IA-32:

http://appcore.home.comcast.net/

http://appcore.home.comcast.net/vzoom/refcount/
 
R

Richard Bos

Jack \"Abram\" Off said:
You probably could write assembly to to the manipulations that you do in
your multiply procedure "a little less slowly", but I suspect what you
*really* want is something borrowed from higher maths: The FFT multiply.

http://numbers.computation.free.fr/Constants/Algorithms/fft.html

That's nice and fast, but it's a floating-point process and introduces
errors. All very well for floating point computations which are already
imprecise, but if you start out with a nice, exact integer array it's
probably not what you want.

Richard
 
R

Rod Pemberton

fermineutron said:
A while back i tried to calculate factorials of large numbers using
arrays in C, the array encoded integer arithemetic that i wrote in C
was very slow, it would take almost a second to multiply 2 array
encoded integers. resently i looked at large precision libraries for C,
in particular GMP but i was unable to get it to run with my compiler,
aperantly some header files were not found. I was curious what is the
best way to interface C and asm so that i could write simmilar library
in asm and use it in C.

Any ideas?

About the only realistic way you're going to get x86 assembly to outperform
highly optimized C, is to get a true x86 assembly expert, like Terje
Mathisen, to code it for you.

Most current C compilers are extremely efficient in generating assembly.
I've gone to great lengths to outcode C compilers for x86 in a couple
special situations. The best I could do was almost match the C compiler...
The problem for you is that the C optimizer can take full advantage of
extremely complicated situations and special instructions to generate the
best code. In fact, some C optimizers generate hundreds of trial
combinations. Most people just can't handle such complexity or convoluted
situations.

Are there ways to make your C code faster? Yes.

1) buy a new computer, a 2Ghz AMD is roughly 1000 times faster than a 500
Mhz AMD x86 CPU
2) find a better algorithm, a brute force factorization may take a second or
two, many times slower than an elliptic curve factorization
3) switch from, say GCC, to a compiler which is known for more efficient
code, say OpenWatcom or Digital Mars
4) completely unroll any loops, this reduces branching which is always
expensive in assembly
5) completely unroll any loops, occasionally the loop size can be reduced by
one, depending on how the loop was coded
6) precompute as many operations as possible, even extremely large lookup
tables are much faster than computation
7) make an attempt to reduce the number of variables used in the
calculations
8) replace multiplications and divisions with additions, subtractions,
bitshifts
9) don't attempt to access integer data smaller than the largest assembly
integer type of the CPU (32-bits for 32-bit cpu, 64-bit for...)
10) play around with a decent number of compiler optimizations, usually a
small number of them will other the most improvement
11) although C compilers are very good with optimization, they aren't
perfect. Forcing the use of a register can improve the code's speed


Rod Pemberton
 
R

Richard Heathfield

Rod Pemberton said:

Are there ways to make your C code faster? Yes.

1) buy a new computer, a 2Ghz AMD is roughly 1000 times faster than a 500
Mhz AMD x86 CPU

That ought to be unnecessary.
2) find a better algorithm

Of all your suggestions, this is the best. When the OP first posted on this
subject (factorial calculations using bignums), it was immediately evident
that his algorithms were suspect, since their implementations were very
very very much slower (by orders of magnitude, IIRC) than my own routines,
which I wrote with more regard to clarity than to speed.

<snip>
 
C

CBFalconer

Chris said:
.... snip ...

You make you libraries ABI follow the C calling convention for the
processor your targeting. For x86 prams are passed on the stack
left-to-right, and SPARC passed prams mostly in registers', ect...

Is that rather hard on the babies in the prams? Child abuse? :)
 
R

Rod Pemberton

Richard Heathfield said:
Rod Pemberton said:




That ought to be unnecessary.


Of all your suggestions, this is the best. When the OP first posted on this
subject (factorial calculations using bignums), it was immediately evident
that his algorithms were suspect, since their implementations were very
very very much slower (by orders of magnitude, IIRC) than my own routines,
which I wrote with more regard to clarity than to speed.

<snip>


Of all your suggestions, this is the best.

I read your post. I originally had the line: "I have to agree with
Healthfield." but thought better of it... ;)

A poor algorithm may be the _cause_ of his problems, but, the _best_
suggestion, IMO, is to continue to use C and not switch to assembly. He
could spend his lifetime attempting to get his assembly to outperform
optimized C.

All of the other methods work and are easy enough to implement too. If he
applies all of them, except for two that offer the most improvement:
1) finding a better algorithm
2) buying a new computer
he could still easily see a 300-900% improvement in speed.

One has to ask why he wants large to compute large factorials. The usual
answer leads indirectly to something which could be used to break
encryption, or directly to the breaking of some encryption scheme. So, I
made the assumption that the OP wanted the absolutely fastest implementation
he could obtain without spending years coding assembly and without spending
a fortune. The easiest way is to spend some money for faster equipment.
You can buy an extremely fast PC for leas than $2k USD, or a 16-core Tyan
Typhoon PSC for 10-16$k USD.


Rod Pemberton
 
M

Mark F. Haigh

Rod Pemberton wrote:
About the only realistic way you're going to get x86 assembly to outperform
highly optimized C, is to get a true x86 assembly expert, like Terje
Mathisen, to code it for you.
<snip>

Agreed for the most part, but one can often realize speed gains of an
order of magnitude or more by using special-purpose assembly
instructions that current-generation compilers generally do not emit.
A compiler will generally optimize the hell out of loops dealing with
bits in unsigned integers, but will generally not (yet) understand that
the entire thing could be perhaps replaced by a popcntbd (POWER5
population count) or a bsr (Pentium 4 bit scan reverse) depending on
the loop, etc.

If such a thing is at the heart of something like a packet classifier
or image filter, even an assembly non-expert can affect a measurable
performance gain.


Mark F. Haigh
(e-mail address removed)
 
R

Rod Pemberton

Mark F. Haigh said:
Rod Pemberton wrote:

<snip>

Agreed for the most part, but one can often realize speed gains of an
order of magnitude or more by using special-purpose assembly
instructions that current-generation compilers generally do not emit.
A compiler will generally optimize the hell out of loops dealing with
bits in unsigned integers, but will generally not (yet) understand that
the entire thing could be perhaps replaced by a popcntbd (POWER5
population count) or a bsr (Pentium 4 bit scan reverse) depending on
the loop, etc.

If such a thing is at the heart of something like a packet classifier
or image filter, even an assembly non-expert can affect a measurable
performance gain.

Although I no problem with your point there may be some situations where
assembly is better, I currently believe that most of the instructions which
aren't emitted by x86 compilers is because they aren't a good choice for
speed. Using BSR as an example of a special-purpose instruction which could
give performance gains was probably a bad example. The timings I have
access to indicate that BSR can be very slow and it is non-pairable on the
P4. It's not likely any compiler would use it in a situation requiring
speed, but they may use it to reduce code size.

cycle timings for BSR reg,reg:
386 10+3n
486 6-103
P4 7-73


Rod Pemberton
 
M

Mark F. Haigh

Rod said:
Although I no problem with your point there may be some situations where
assembly is better, I currently believe that most of the instructions which
aren't emitted by x86 compilers is because they aren't a good choice for
speed. Using BSR as an example of a special-purpose instruction which could
give performance gains was probably a bad example. The timings I have
access to indicate that BSR can be very slow and it is non-pairable on the
P4.

The timings that I have access to indicate that bsr completes in a
single cycle on the newer Core 2 Intel chips (Family 6, Model 0xF), so
I will disagree with you on that. Whether or not this is a real
"Pentium 4" is open to question, I suppose, but that's starting to
drift off topic.

I just pulled bsr off the top of my head anyways, as I have personally
seen it dramatically decrease bitmap scanning times compared to a C
language bit-scanning loop.

Mark F. Haigh
(e-mail address removed)
 
S

stevenj

Richard said:
That's nice and fast, but it's a floating-point process and introduces
errors. All very well for floating point computations which are already
imprecise, but if you start out with a nice, exact integer array it's
probably not what you want.

You can perform convolutions of integer arrays (i.e.
arbitrary-precision multiplies) *exactly* with a floating-point FFT as
long as each integer is stored in a floating-point number with much
larger precision, and this is precisely what is done in practice by
most arbitrary-precision arithmetic packages.

Steven
 
R

Richard Bos

You can perform convolutions of integer arrays (i.e.
arbitrary-precision multiplies) *exactly* with a floating-point FFT as
long as each integer is stored in a floating-point number with much
larger precision, and this is precisely what is done in practice by
most arbitrary-precision arithmetic packages.

Myes. But are _much_ larger floating-point numbers really that much
faster than appropriately handled integers? Even if you don't have these
much larger FPs yet, and will have to emulate them? Don't get me wrong,
I see the value of the method, but I'm not so sure of its value to the
OP, who probably will have the same problems handling extended-precision
FPs that he has handling extended-size integers.

Richard
 
R

Richard Bos

Chris Thomasson said:
You make you libraries ABI follow the C calling convention for the processor
your targeting. For x86 prams are passed on the stack left-to-right,

Oh, are they? That'll be news to the ISO C Standard, which allows
implementations to pull any trick they like to speed up function calls.

And that's important; if you want an assembly function to speed up the
program, you really don't want to use the slowest way available to call
that function.

Richard
 
A

Ancient_Hacker

fermineutron said:
A while back i tried to calculate factorials of large numbers using
arrays in C, the array encoded integer arithemetic that i wrote in C
was very slow, it would take almost a second to multiply 2 array
encoded integers.


Whoa! Did you just stuff one decimal digit per array element? Or one
bit? It's hard to make arithmetic that slow.

Try using an array of unsigned ints, keeping up to sizeof( int ) * 8
bits per element. The code should run many many times faster. For
instance if you were storing one decimal digit per element, using
32-bit ints will make it run at least 400 million times faster.
Really. You eliminate having to do a mod and divide per loop, and you
have 2^32/10 times fewer loops.

Or better yet find a working bignum library, there must be dozens of
them for C.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,008
Latest member
HaroldDark

Latest Threads

Top