Re: Python speed vs csharp

Discussion in 'Python' started by Bengt Richter, Aug 1, 2003.

  1. On Wed, 30 Jul 2003 23:09:22 -0700, Mike <> wrote:

    >Bear with me: this post is moderately long, but I hope it is relatively
    >succinct.
    >
    >I've been using Python for several years as a behavioral modeling tool for
    >the circuits I design. So far, it's been a good trade-off: compiled C++
    >would run faster, but the development time of Python is so much faster, and
    >the resulting code is so much more reliable after the first pass, that I've
    >never been tempted to return to C++. Every time I think stupid thoughts
    >like, "I'll bet I could do this in C++," I get out my copy of Scott Meyers'
    >"Effecive C++," and I'm quickly reminded why it's better to stick with
    >Python (Meyers is a very good author, but points out lots of quirks and
    >pitfalls with C++ that I keep thinking that I shouldn't have to worry
    >about, much less try to remember). Even though Python is wonderful in that
    >regard, there are problems.
    >
    >Here's the chunk of code that I'm spending most of my time executing:
    >
    ># Rational approximation for erfc(x) (Abramowitz & Stegun, Sec. 7.1.26)
    ># Fifth order approximation. |error| <= 1.5e-7 for all x
    >#
    >def erfc( x ):
    > p = 0.3275911
    > a1 = 0.254829592
    > a2 = -0.284496736
    > a3 = 1.421413741
    > a4 = -1.453152027
    > a5 = 1.061405429
    >
    > t = 1.0 / (1.0 + p*float(x))
    > erfcx = ( (a1 + (a2 + (a3 +
    > (a4 + a5*t)*t)*t)*t)*t ) * math.exp(-(x**2))
    > return erfcx
    >
    >This is an error function approximation, which gets called around 1.5
    >billion times during the simulation, and takes around 3500 seconds (just
    >under an hour) to complete. While trying to speed things up, I created a
    >simple test case with the code above and a main function to call it 10
    >million times. The code takes roughly 210 seconds to run.


    I recoded the above and also in-lined the code, and also made a C DLL module version
    and timed them roughly:

    ====< erfcopt.py >=====================================================
    import math
    def erfc_old( x ):
    p = 0.3275911
    a1 = 0.254829592
    a2 = -0.284496736
    a3 = 1.421413741
    a4 = -1.453152027
    a5 = 1.061405429

    t = 1.0 / (1.0 + p*float(x))
    erfcx = ( (a1 + (a2 + (a3 +
    (a4 + a5*t)*t)*t)*t)*t ) * math.exp(-(x**2))
    return erfcx

    # You know that all those "constant" assignments are
    # actually executed every time erfc is called,
    # right? To move that to def-time instead of
    # call-time, you can write

    def erfc( x, # constants bound at def time, rparen moved down, commas added ):
    p = 0.3275911,
    a1 = 0.254829592,
    a2 = -0.284496736,
    a3 = 1.421413741,
    a4 = -1.453152027,
    a5 = 1.061405429,
    exp = math.exp
    ): # just to make it stand out
    t = 1.0 / (1.0 + p*x) #XXX# don't need float(), since p is float
    return ( (a1 + (a2 + (a3 +
    (a4 + a5*t)*t)*t)*t)*t ) * exp(-x**2) #XXX# elim attr lookup on global math.exp
    # elim intermediate store of erfcx #XXX# return erfcx

    def test():
    from time import clock
    from erfc import erfc as erfc_in_c

    # sanity check
    for x in range(20): assert erfc_old(x) == erfc(x) and abs(erfc_in_c(x)-erfc_old(x))<1e-18

    t1=clock()
    for i in xrange(100000): erfc_old(1.2345)
    t2=clock()
    for i in xrange(100000): erfc(1.2345)
    t3=clock()
    # inlining code from old version:
    # constant assignments hoisted out of loop, of course
    p = 0.3275911
    a1 = 0.254829592
    a2 = -0.284496736
    a3 = 1.421413741
    a4 = -1.453152027
    a5 = 1.061405429
    x = 1.2345
    for i in xrange(100000):
    t = 1.0 / (1.0 + p*float(x))
    erfcx = ( (a1 + (a2 + (a3 +
    (a4 + a5*t)*t)*t)*t)*t ) * math.exp(-(x**2))
    t4 = clock()
    for i in xrange(100000): erfc_in_c(1.2345)
    t5=clock()
    print 'old: %f, new: %f, inline: %f, in_c %f' %(t2-t1, t3-t2, t4-t3, t5-t4)

    import dis
    print '---- erfc_old code -----------------------------------'
    dis.dis(erfc_old)
    print '---- erfc --------------------------------------------'
    dis.dis(erfc)

    if __name__== '__main__':
    test()
    =======================================================================

    Result is:


    [19:48] C:\pywk\cstuff>erfcopt.py
    old: 4.490975, new: 3.123005, inline: 3.051674, in_c 0.806907

    I was surprised that the inline gain was not more, but I guess there was enough computation
    to obscure that cost. Anyway, moving some of the work to def-time paid off about 30%. But
    going to C cut 82%.

    I added the C version to the sanity check, and had to allow an epsilon. I suspect that this
    is because the C version probably keeps the whole problem in the FPU with 64 fractional bits
    of precision until returning the final double (which has 53 bits), wherease the Python
    algorithm presumably stores all intermediate values in memory with the 53-bit precision,
    and so loses in the overall. That's my theory, anyway.

    To see the extra work the old code is doing vs the new, you can see below:

    ---- erfc_old code -----------------------------------
    0 SET_LINENO 2

    3 SET_LINENO 3
    6 LOAD_CONST 1 (0.32759110000000002)
    9 STORE_FAST 3 (p)

    12 SET_LINENO 4
    15 LOAD_CONST 2 (0.25482959199999999)
    18 STORE_FAST 2 (a1)

    21 SET_LINENO 5
    24 LOAD_CONST 3 (-0.28449673599999997)
    27 STORE_FAST 5 (a2)

    30 SET_LINENO 6
    33 LOAD_CONST 4 (1.4214137410000001)
    36 STORE_FAST 4 (a3)

    39 SET_LINENO 7
    42 LOAD_CONST 5 (-1.453152027)
    45 STORE_FAST 7 (a4)

    48 SET_LINENO 8
    51 LOAD_CONST 6 (1.0614054289999999)
    54 STORE_FAST 6 (a5)

    Everything above here was unnecessary to do at run-time

    57 SET_LINENO 10
    60 LOAD_CONST 7 (1.0)
    63 LOAD_CONST 7 (1.0)
    66 LOAD_FAST 3 (p)
    69 LOAD_GLOBAL 6 (float)
    72 LOAD_FAST 0 (x)
    75 CALL_FUNCTION 1
    78 BINARY_MULTIPLY
    79 BINARY_ADD
    80 BINARY_DIVIDE
    81 STORE_FAST 8 (t)

    84 SET_LINENO 11
    87 LOAD_FAST 2 (a1)
    90 LOAD_FAST 5 (a2)
    93 LOAD_FAST 4 (a3)
    96 LOAD_FAST 7 (a4)
    99 LOAD_FAST 6 (a5)
    102 LOAD_FAST 8 (t)
    105 BINARY_MULTIPLY
    106 BINARY_ADD
    107 LOAD_FAST 8 (t)
    110 BINARY_MULTIPLY
    111 BINARY_ADD
    112 LOAD_FAST 8 (t)
    115 BINARY_MULTIPLY
    116 BINARY_ADD
    117 LOAD_FAST 8 (t)
    120 BINARY_MULTIPLY
    121 BINARY_ADD
    122 LOAD_FAST 8 (t)
    125 BINARY_MULTIPLY
    126 LOAD_GLOBAL 9 (math) <<--+-- replaced by a single LOAD_FAST
    129 LOAD_ATTR 10 (exp) <<--+
    132 LOAD_FAST 0 (x)
    135 LOAD_CONST 8 (2)
    138 BINARY_POWER
    139 UNARY_NEGATIVE
    140 CALL_FUNCTION 1
    143 BINARY_MULTIPLY
    144 STORE_FAST 1 (erfcx) <<--+
    |
    147 SET_LINENO 13 <<--+
    150 LOAD_FAST 1 (erfcx) <<--+-- all eliminated
    153 RETURN_VALUE
    154 LOAD_CONST 0 (None)
    157 RETURN_VALUE
    ---- erfc --------------------------------------------
    0 SET_LINENO 20

    3 SET_LINENO 29
    6 LOAD_CONST 1 (1.0)
    9 LOAD_CONST 1 (1.0)
    12 LOAD_FAST 1 (p)
    15 LOAD_FAST 0 (x)
    18 BINARY_MULTIPLY
    19 BINARY_ADD
    20 BINARY_DIVIDE
    21 STORE_FAST 8 (t)

    24 SET_LINENO 30
    27 LOAD_FAST 2 (a1)
    30 LOAD_FAST 3 (a2)
    33 LOAD_FAST 4 (a3)
    36 LOAD_FAST 5 (a4)
    39 LOAD_FAST 6 (a5)
    42 LOAD_FAST 8 (t)
    45 BINARY_MULTIPLY
    46 BINARY_ADD
    47 LOAD_FAST 8 (t)
    50 BINARY_MULTIPLY
    51 BINARY_ADD
    52 LOAD_FAST 8 (t)
    55 BINARY_MULTIPLY
    56 BINARY_ADD
    57 LOAD_FAST 8 (t)
    60 BINARY_MULTIPLY
    61 BINARY_ADD
    62 LOAD_FAST 8 (t)
    65 BINARY_MULTIPLY
    66 LOAD_FAST 7 (exp) <<--
    69 LOAD_FAST 0 (x)
    72 LOAD_CONST 2 (2)
    75 BINARY_POWER
    76 UNARY_NEGATIVE
    77 CALL_FUNCTION 1
    80 BINARY_MULTIPLY
    81 RETURN_VALUE <<--
    82 LOAD_CONST 0 (None)
    85 RETURN_VALUE

    [19:49] C:\pywk\cstuff>

    Regards,
    Bengt Richter
     
    Bengt Richter, Aug 1, 2003
    #1
    1. Advertising

  2. Bengt Richter

    John Machin Guest

    (Bengt Richter) wrote in message news:<bgds3g$f95$0@216.39.172.122>...
    > erfcx = ( (a1 + (a2 + (a3 +
    > (a4 + a5*t)*t)*t)*t)*t ) * exp(-pow(x,2.0));


    Wouldn't (x*x) be better than pow(x,2.0) ?
     
    John Machin, Aug 2, 2003
    #2
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Martin v. =?iso-8859-15?q?L=F6wis?=

    Re: Python speed vs csharp

    Martin v. =?iso-8859-15?q?L=F6wis?=, Jul 31, 2003, in forum: Python
    Replies:
    1
    Views:
    456
    Bengt Richter
    Jul 31, 2003
  2. Richie Hindle

    Re: Python speed vs csharp

    Richie Hindle, Jul 31, 2003, in forum: Python
    Replies:
    3
    Views:
    466
    Richie Hindle
    Aug 1, 2003
  3. David M. Cooke

    Re: Python speed vs csharp

    David M. Cooke, Jul 31, 2003, in forum: Python
    Replies:
    4
    Views:
    481
    Andrew Dalke
    Aug 1, 2003
  4. Greg Brunet

    Re: Python speed vs csharp

    Greg Brunet, Aug 1, 2003, in forum: Python
    Replies:
    13
    Views:
    619
    Michele Simionato
    Aug 4, 2003
  5. Mike

    Re: Python speed vs csharp

    Mike, Aug 2, 2003, in forum: Python
    Replies:
    2
    Views:
    325
Loading...

Share This Page