# Re: Python speed vs csharp

Discussion in 'Python' started by Bengt Richter, Aug 1, 2003.

1. ### Bengt RichterGuest

On Wed, 30 Jul 2003 23:09:22 -0700, Mike <> wrote:

>Bear with me: this post is moderately long, but I hope it is relatively
>succinct.
>
>I've been using Python for several years as a behavioral modeling tool for
>the circuits I design. So far, it's been a good trade-off: compiled C++
>would run faster, but the development time of Python is so much faster, and
>the resulting code is so much more reliable after the first pass, that I've
>never been tempted to return to C++. Every time I think stupid thoughts
>like, "I'll bet I could do this in C++," I get out my copy of Scott Meyers'
>"Effecive C++," and I'm quickly reminded why it's better to stick with
>Python (Meyers is a very good author, but points out lots of quirks and
>pitfalls with C++ that I keep thinking that I shouldn't have to worry
>about, much less try to remember). Even though Python is wonderful in that
>regard, there are problems.
>
>Here's the chunk of code that I'm spending most of my time executing:
>
># Rational approximation for erfc(x) (Abramowitz & Stegun, Sec. 7.1.26)
># Fifth order approximation. |error| <= 1.5e-7 for all x
>#
>def erfc( x ):
> p = 0.3275911
> a1 = 0.254829592
> a2 = -0.284496736
> a3 = 1.421413741
> a4 = -1.453152027
> a5 = 1.061405429
>
> t = 1.0 / (1.0 + p*float(x))
> erfcx = ( (a1 + (a2 + (a3 +
> (a4 + a5*t)*t)*t)*t)*t ) * math.exp(-(x**2))
> return erfcx
>
>This is an error function approximation, which gets called around 1.5
>billion times during the simulation, and takes around 3500 seconds (just
>under an hour) to complete. While trying to speed things up, I created a
>simple test case with the code above and a main function to call it 10
>million times. The code takes roughly 210 seconds to run.

I recoded the above and also in-lined the code, and also made a C DLL module version
and timed them roughly:

====< erfcopt.py >=====================================================
import math
def erfc_old( x ):
p = 0.3275911
a1 = 0.254829592
a2 = -0.284496736
a3 = 1.421413741
a4 = -1.453152027
a5 = 1.061405429

t = 1.0 / (1.0 + p*float(x))
erfcx = ( (a1 + (a2 + (a3 +
(a4 + a5*t)*t)*t)*t)*t ) * math.exp(-(x**2))
return erfcx

# You know that all those "constant" assignments are
# actually executed every time erfc is called,
# right? To move that to def-time instead of
# call-time, you can write

def erfc( x, # constants bound at def time, rparen moved down, commas added ):
p = 0.3275911,
a1 = 0.254829592,
a2 = -0.284496736,
a3 = 1.421413741,
a4 = -1.453152027,
a5 = 1.061405429,
exp = math.exp
): # just to make it stand out
t = 1.0 / (1.0 + p*x) #XXX# don't need float(), since p is float
return ( (a1 + (a2 + (a3 +
(a4 + a5*t)*t)*t)*t)*t ) * exp(-x**2) #XXX# elim attr lookup on global math.exp
# elim intermediate store of erfcx #XXX# return erfcx

def test():
from time import clock
from erfc import erfc as erfc_in_c

# sanity check
for x in range(20): assert erfc_old(x) == erfc(x) and abs(erfc_in_c(x)-erfc_old(x))<1e-18

t1=clock()
for i in xrange(100000): erfc_old(1.2345)
t2=clock()
for i in xrange(100000): erfc(1.2345)
t3=clock()
# inlining code from old version:
# constant assignments hoisted out of loop, of course
p = 0.3275911
a1 = 0.254829592
a2 = -0.284496736
a3 = 1.421413741
a4 = -1.453152027
a5 = 1.061405429
x = 1.2345
for i in xrange(100000):
t = 1.0 / (1.0 + p*float(x))
erfcx = ( (a1 + (a2 + (a3 +
(a4 + a5*t)*t)*t)*t)*t ) * math.exp(-(x**2))
t4 = clock()
for i in xrange(100000): erfc_in_c(1.2345)
t5=clock()
print 'old: %f, new: %f, inline: %f, in_c %f' %(t2-t1, t3-t2, t4-t3, t5-t4)

import dis
print '---- erfc_old code -----------------------------------'
dis.dis(erfc_old)
print '---- erfc --------------------------------------------'
dis.dis(erfc)

if __name__== '__main__':
test()
=======================================================================

Result is:

[19:48] C:\pywk\cstuff>erfcopt.py
old: 4.490975, new: 3.123005, inline: 3.051674, in_c 0.806907

I was surprised that the inline gain was not more, but I guess there was enough computation
to obscure that cost. Anyway, moving some of the work to def-time paid off about 30%. But
going to C cut 82%.

I added the C version to the sanity check, and had to allow an epsilon. I suspect that this
is because the C version probably keeps the whole problem in the FPU with 64 fractional bits
of precision until returning the final double (which has 53 bits), wherease the Python
algorithm presumably stores all intermediate values in memory with the 53-bit precision,
and so loses in the overall. That's my theory, anyway.

To see the extra work the old code is doing vs the new, you can see below:

---- erfc_old code -----------------------------------
0 SET_LINENO 2

3 SET_LINENO 3
9 STORE_FAST 3 (p)

12 SET_LINENO 4
18 STORE_FAST 2 (a1)

21 SET_LINENO 5
27 STORE_FAST 5 (a2)

30 SET_LINENO 6
36 STORE_FAST 4 (a3)

39 SET_LINENO 7
45 STORE_FAST 7 (a4)

48 SET_LINENO 8
54 STORE_FAST 6 (a5)

Everything above here was unnecessary to do at run-time

57 SET_LINENO 10
75 CALL_FUNCTION 1
78 BINARY_MULTIPLY
80 BINARY_DIVIDE
81 STORE_FAST 8 (t)

84 SET_LINENO 11
105 BINARY_MULTIPLY
110 BINARY_MULTIPLY
115 BINARY_MULTIPLY
120 BINARY_MULTIPLY
125 BINARY_MULTIPLY
138 BINARY_POWER
139 UNARY_NEGATIVE
140 CALL_FUNCTION 1
143 BINARY_MULTIPLY
144 STORE_FAST 1 (erfcx) <<--+
|
147 SET_LINENO 13 <<--+
150 LOAD_FAST 1 (erfcx) <<--+-- all eliminated
153 RETURN_VALUE
157 RETURN_VALUE
---- erfc --------------------------------------------
0 SET_LINENO 20

3 SET_LINENO 29
18 BINARY_MULTIPLY
20 BINARY_DIVIDE
21 STORE_FAST 8 (t)

24 SET_LINENO 30
45 BINARY_MULTIPLY
50 BINARY_MULTIPLY
55 BINARY_MULTIPLY
60 BINARY_MULTIPLY
65 BINARY_MULTIPLY
75 BINARY_POWER
76 UNARY_NEGATIVE
77 CALL_FUNCTION 1
80 BINARY_MULTIPLY
81 RETURN_VALUE <<--
85 RETURN_VALUE

[19:49] C:\pywk\cstuff>

Regards,
Bengt Richter

Bengt Richter, Aug 1, 2003

2. ### John MachinGuest

(Bengt Richter) wrote in message news:<bgds3g\$f95\$0@216.39.172.122>...
> erfcx = ( (a1 + (a2 + (a3 +
> (a4 + a5*t)*t)*t)*t)*t ) * exp(-pow(x,2.0));

Wouldn't (x*x) be better than pow(x,2.0) ?

John Machin, Aug 2, 2003