Re: Speed-up for loops

Discussion in 'Python' started by David Cournapeau, Sep 5, 2010.

  1. On Thu, Sep 2, 2010 at 7:02 PM, Michael Kreim <> wrote:
    > Hi,
    >
    > I was comparing the speed of a simple loop program between Matlab and
    > Python.
    >
    > My Codes:
    > $ cat addition.py
    > imax = 1000000000
    > a = 0
    > for i in xrange(imax):
    >    a = a + 10
    > print a
    >
    > $ cat addition.m
    > imax = 1e9;
    > a = 0;
    > for i=0:imax-1
    >    a = a + 10;
    > end
    > disp(a);
    > exit;
    >
    > The results look like this:
    > $ /usr/bin/time --verbose python addition.py
    > 10000000000
    >        Command being timed: "python addition.py"
    >        User time (seconds): 107.30
    >        System time (seconds): 0.08
    >        Percent of CPU this job got: 97%
    >        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:50.09
    >        [...]
    >
    > $ /usr/bin/time --verbose matlab -nodesktop -nosplash -r "addition"
    > [...]
    >    1.0000e+10
    >        Command being timed: "matlab -nodesktop -nosplash -r addition"
    >        User time (seconds): 7.65
    >        System time (seconds): 0.18
    >        Percent of CPU this job got: 94%
    >        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.25
    >        [...]
    >
    > Unfortunately my Python Code was much slower and I do not understand why.


    Getting the above kind of code fast requires the interpreter to be
    clever enough to use native machine operations on an int type instead
    of converting back and forth between internal representations. Matlab
    has had a JIT to do just that since version 6, I believe. There is no
    mature JIT-like implementation of Python that will give you the same
    speed-up for this exact case today.

    > Or do I have to live with the fact that Matlab beats Python in this example?


    Yes. Without a JIT, Python cannot hope to get the same kind of speeds
    for this kind of example.

    That being said, neither Matlab nor Python is especially good at
    doing what you do in your example - for this exact operation, doing it
    in C or another compiled language will be at least one order of
    magnitude faster. Generally, you use Matlab's vectorized operations,
    and in that case numpy gives you similar performance (sometimes
    faster, sometimes slower, but in the same ballpark in general).
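    To illustrate David's point (a sketch, not from the original thread): in modern Python with NumPy, the per-element loop can be replaced by a single vectorized operation that runs in compiled code. The size is reduced here so it finishes quickly.

```python
# Sketch: the OP's scalar loop vs. a vectorized NumPy equivalent.
# Sizes reduced from the post's 1e9 so this runs in a moment.
import numpy as np

imax = 1_000_000

# Loop version: one Python-level iteration per element.
a = 0
for i in range(imax):
    a = a + 10

# Vectorized version: the loop over elements runs inside NumPy's
# compiled code, not in the bytecode interpreter.
b = int(np.full(imax, 10, dtype=np.int64).sum())

print(a == b)  # True: both compute 10 * imax
```

    On loops like this, the vectorized form is typically one to two orders of magnitude faster, which is the "same ballpark as Matlab" behaviour described above.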

    cheers,

    David
     
    David Cournapeau, Sep 5, 2010
    #1

  2. David Cournapeau

    BartC Guest

    "David Cournapeau" <> wrote in message
    news:...
    > On Thu, Sep 2, 2010 at 7:02 PM, Michael Kreim <>
    > wrote:


    >> imax = 1000000000
    >> a = 0
    >> for i in xrange(imax):
    >>    a = a + 10
    >> print a


    >> Unfortunately my Python Code was much slower [than Matlab] and I do not
    >> understand why.

    >
    > Getting the above kind of code fast requires the interpreter to be
    > clever enough so that it will use native machine operations on an int
    > type instead of converting back and forth between internal
    > representations.


    Writing for i in xrange(1000000000) you'd think would give it a clue, but it
    makes no difference.

    > Matlab since version 6 I believe, has a JIT to do
    > just that. There is no mature JIT-like implementation of python which
    > will give you the same speed up for this exact case today.


    >> Or do I have to live with the fact that Matlab beats Python in this
    >> example?

    >
    > Yes. Without a JIT, Python cannot hope to get the same kind of speeds
    > for this kind of example.
    >
    > That being said, neither Matlab nor Python is especially good at
    > doing what you do in your example - for this exact operation, doing it
    > in C or other compiled languages will be at least one order of
    > magnitude faster.


    One order of magnitude (say 10-20x slower) wouldn't be so bad. That's what
    you might expect for a dynamically typed, interpreted language.

    But on my machine this code was more like 50-200x slower than C, for
    unaccelerated Python.

    > Generally, you use matlab's vectorized operations,
    > and in that case, numpy gives you similar performances (sometimes
    > faster, sometimes slower, but in the same ballpark in general).


    That would simply be relegating Python to a scripting language. It would
    be nice if you could directly code low-level algorithms in it without
    relying on accelerators, and not have to wait two and a half minutes (or
    whatever) for a simple test to complete.

    --
    bartc
     
    BartC, Sep 5, 2010
    #2

  3. On Sun, 05 Sep 2010 12:28:47 +0100, BartC wrote:

    >> Getting the above kind of code fast requires the interpreter to be
    >> clever enough so that it will use native machine operations on an int
    >> type instead of converting back and forth between internal
    >> representations.

    >
    > Writing for i in xrange(1000000000) you'd think would give it a clue,
    > but it makes no difference.


    CPython takes a very conservative view towards runtime optimizations.
    Optimizations don't happen for free, you know, they have costs. Memory is
    one cost, but human time and effort is another.

    But if you want a JIT compiler, see Psyco, or try PyPy, which is very
    exciting and which I hope will one day be ready to take over from
    CPython as the first choice for production use.

    [...]
    > One order of magnitude (say 10-20x slower) wouldn't be so bad. That's
    > what you might expect for a dynamically typed, interpreted language.
    >
    > But on my machine this code was more like 50-200x slower than C, for
    > unaccelerated Python.


    I'd say that 50-200 times slower than C is exactly what I'd expect from a
    dynamically typed language like Python without any fancy JIT tricks.
    Getting such a language to within an order of magnitude of C is quite an
    achievement.


    >> Generally, you use matlab's vectorized operations, and in that case,
    >> numpy gives you similar performances (sometimes faster, sometimes
    >> slower, but in the same ballpark in general).

    >
    > That would simply be relegating Python to a scripting language.


    That's sheer unadulterated nonsense.

    In any case, Python was designed as a glue language, specifically to be a
    high-level user-friendly language for gluing components written in C
    together. That's what Python *is* -- it provides a bunch of primitives,
    written in C (or Java, or dot-Net, pick whichever implementation you
    like) and manipulated in a friendly, safe language. Calling numpy for
    fast vectorized operations is *exactly* the right solution, if you need
    high-performance maths calculations.

    Use the right tool for the job, don't insist that your spanner should
    double as a screwdriver.


    > It would
    > be nice if you could directly code low-level algorithms in it without
    > relying on accelerators, and not have to wait two and a half minutes (or
    > whatever) for a simple test to complete.


    Yes, and it would be nice if my clothes washed and ironed themselves, but
    they don't.

    Somebody has to do the work. Are you volunteering to write the JIT
    compiler for CPython? Will you contribute to the PyPy project, or help
    maintain Psyco, or are you just bitching?

    The simple fact is that there are far more important things for Python
    developers to spend their time and effort on than optimizations like this.

    If such an optimization comes out of the PyPy project, I'll be cheering
    them on -- but it's a lot of effort for such a trivial gain.

    The example given by the Original Poster is contrived. Nobody sensibly
    writes an integer multiplication as a repeated addition like that, and
    any real code that would need to be put in a for loop is almost certainly
    going to be too complicated for the JIT compiler to benefit greatly. The
    loop overhead itself will almost certainly be overwhelmed by the work
    done in the loop:

    [steve@sylar ~]$ time python -c "a = 0
    > for i in xrange(10000000):
    >     a += 10
    > "


    real 0m6.906s
    user 0m5.820s
    sys 0m0.022s


    which is about double the time for an empty loop of the same size.
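    Steven's comparison can be reproduced with the stdlib `timeit` module (a sketch, not part of the original post; the iteration count is reduced from his 10 million):

```python
# Compare an empty loop against one doing the addition, to see how
# much of the total time is loop overhead vs. loop body.
import timeit

n = 1_000_000
empty = timeit.timeit("for i in range(n): pass",
                      globals={"n": n}, number=1)
add = timeit.timeit("a = 0\nfor i in range(n): a += 10",
                    globals={"n": n}, number=1)

# On CPython the ratio is typically around 2x, matching the post.
print(empty, add)
```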



    --
    Steven
     
    Steven D'Aprano, Sep 5, 2010
    #3
  4. David Cournapeau

    BartC Guest

    "Steven D'Aprano" <> wrote in message
    news:4c83b425$0$28657$...
    > On Sun, 05 Sep 2010 12:28:47 +0100, BartC wrote:


    >> It would
    >> be nice if you could directly code low-level algorithms in it without
    >> relying on accelerators, and not have to wait two and a half minutes (or
    >> whatever) for a simple test to complete.

    >
    > Yes, and it would be nice if my clothes washed and ironed themselves, but
    > they don't.
    >
    > Somebody has to do the work. Are you volunteering to write the JIT
    > compiler for CPython? Will you contribute to the PyPy project, or help
    > maintain Psyco, or are you just bitching?


    I've thought about it (writing an independent interpreter). But I don't know
    enough of the language, and a lot of it I don't understand (eg. OOP).
    Besides, I think the language itself deliberately makes it difficult to get
    it up to speed. Some of the reasons might be the ones set down here, in
    Chapter 2:

    http://codespeak.net/svn/user/antocuni/phd/thesis/thesis.pdf

    (Instead, I've confined my efforts to my own projects; the last interpreter
    I worked on did in fact get within a magnitude of C in performance, without
    using JIT or other fancy tricks.)

    > The simple fact is that there are far more important things for Python
    > developers to spend their time and effort on than optimizations like this.


    I don't know, there seems to be an awful lot of effort going into
    addressing exactly this issue (eg. PyPy, Psyco, Unladen Swallow,
    Shedskin....)

    > If such an optimization comes out of the PyPy project, I'll be cheering
    > them on -- but it's a lot of effort for such a trivial gain.
    >
    > The example given by the Original Poster is contrived. Nobody sensibly
    > writes an integer multiplication as a repeated addition like that, and
    > any real code that would need to be put in a for loop is almost certainly
    > going to be too complicated for the JIT compiler to benefit greatly. The
    > loop overhead itself will almost certainly be overwhelmed by the work
    > done in the loop:


    OK, you're saying there's no point in reducing the loop overhead, because
    the loop body is going to be even slower.

    All those compilers that offer loop unrolling are therefore wasting their
    time...

    --
    Bartc
     
    BartC, Sep 5, 2010
    #4
  5. BartC, 05.09.2010 19:09:
    > I've thought about it (writing an independent interpreter). But I don't
    > know enough of the language, and a lot of it I don't understand (eg.
    > OOP). Besides, I think the language itself deliberately makes it
    > difficult to get it up to speed. Some of the reasons might be the ones
    > set down here, in Chapter 2:
    >
    > http://codespeak.net/svn/user/antocuni/phd/thesis/thesis.pdf


    Second sentence of the abstract:

    """
    However, all the existing implementations
    prevent programmers from developing very efficient code.
    """

    For an incorrect statement, that's a pretty bold one, even in the context
    given by the abstract. The author is lucky not to be an English native
    speaker, which explains some of the offensive wording in the intro.

    Also, I'm impressed by an accepted Ph.D. thesis that comes along with just
    a bit more than four pages of references, and that fails to reference
    basically everything that the Python ecosystem provides for fast
    computation. I wouldn't mind a "faster Python", but as long as Python
    continues to be a language that allows you to very quickly get to the point
    where you can benchmark and optimise your code, and as long as CPython then
    makes it easy to do that, I won't be the one jumping up and down for a
    small factor of "general" improvement, especially when it comes at the
    price of switching to a completely different runtime. After all, it's
    really hard to appreciate that a program can now wait 5% more often on I/O
    than before. Larger performance improvements are usually due to algorithmic
    changes (including data structure adaptations and caching), rarely due to
    changes in the interpreter, especially when it's as old and well optimised
    as CPython.

    I think the CPython interpreter does a really good job of providing a
    stable and predictable runtime environment and executing code in it, and
    the CPython ecosystem does a really good job of providing tools that
    make the code that needs to run fast do so, be it through efficient
    usage of I/O, CPU, memory, or whatever.

    I'm not trying to put down the achievements of the author of the cited
    thesis, not at all. I'm sure the results are interesting for some people
    and for some kinds of applications, just like the different Python
    implementations are interesting for some people and some applications. But
    an awful lot of existing apps won't benefit at all from a fast CLI based
    Python implementation, simply because it doesn't have access (or at least
    no efficient access) to many of the existing tools in the CPython ecosystem.


    >> The simple fact is that there are far more important things for Python
    >> developers to spend their time and effort on than optimizations like
    >> this.

    >
    > I don't know, there seems to be an awful lot of effort going into
    > addressing exactly this issue (eg. PyPy, Psyco, Unladen Swallow,
    > Shedskin....)


    Those are very different efforts that address very different issues.


    > All those compilers that offer loop unrolling are therefore wasting
    > their time...


    Sometimes they do, yes.

    Stefan
     
    Stefan Behnel, Sep 5, 2010
    #5
  6. David Cournapeau

    BartC Guest

    "Stefan Behnel" <> wrote in message
    news:...
    > BartC, 05.09.2010 19:09:


    >> All those compilers that offer loop unrolling are therefore wasting
    >> their time...

    >
    > Sometimes they do, yes.


    Modifying the OP's code a little:

    a = 0
    for i in xrange(100000000):  # 100 million
        a = a + 10               # add 10 or 100
    print a

    Manually unrolling such a loop four times (ie. 4 copies of the body, and
    counting only to 25 million) increased the speed by between 16% and 47% (ie.
    runtime reducing by between 14% and 32%).

    This depended on whether I added +10 or +100 (ie. whether long integers are
    needed), whether it was inside or outside a function, and whether I was
    running Python 2 or 3 (BTW why doesn't Python 3 just accept 'xrange' as a
    synonym for 'range'?)

    These are just some simple tests on my particular machine and
    implementations, but they bring up some points:

    (1) Loop unrolling does seem to have a benefit, when the loop body is small.

    (2) Integer arithmetic seems to go straight from 32-bits to long integers;
    why not use 64-bits before needing long integers?

    (3) Since the loop variable is never used, why not have a special loop
    statement that repeats code so many times? This can be very fast, since the
    loop counter need not be a Python object, and probably there would be no
    need for unrolling at all:

    repeat 100000000:  # for example
        a = a + 10
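    There is no `repeat` statement, but (a sketch, not from the thread) the closest stock idiom in CPython is to loop over `itertools.repeat(None, n)`: the counter is managed in C and the same object is yielded each time, so no integer object is created per iteration. It is a well-known micro-optimization for loops whose index is unused.

```python
# Loop n times without materializing a Python int per iteration.
import itertools

n = 1_000_000  # reduced from the post's 100 million
a = 0
for _ in itertools.repeat(None, n):
    a = a + 10

print(a)  # 10000000
```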

    --
    Bartc
     
    BartC, Sep 6, 2010
    #6
  7. On Mon, Sep 6, 2010 at 4:08 PM, BartC <> wrote:
    > "Stefan Behnel" <> wrote in message
    > news:...
    >>
    >> BartC, 05.09.2010 19:09:

    >
    >>> All those compilers that offer loop unrolling are therefore wasting
    >>> their time...

    >>
    >> Sometimes they do, yes.

    >
    > Modifying the OP's code a little:
    >
    > a = 0
    > for i in xrange(100000000):      # 100 million
    >    a = a + 10                  # add 10 or 100
    > print a
    >
    > Manually unrolling such a loop four times (ie. 4 copies of the body, and
    > counting only to 25 million) increased the speed by between 16% and 47% (ie.
    > runtime reducing by between 14% and 32%).
    >
    > This depended on whether I added +10 or +100 (ie. whether long integers are
    > needed), whether it was inside or outside a function, and whether I was
    > running Python 2 or 3 (BTW why doesn't Python 3 just accept 'xrange' as a
    > synonym for 'range'?)
    >
    > These are just some simple tests on my particular machine and
    > implementations, but they bring up some points:
    >
    > (1) Loop unrolling does seem to have a benefit, when the loop body is small.
    >
    > (2) Integer arithmetic seems to go straight from 32-bits to long integers;
    > why not use 64-bits before needing long integers?
    >


    On 64-bit systems, integer arithmetic will go from 64-bit native
    integers to long. Will using any emulated 64-bit type on a 32-bit
    system actually be better than the python long implementation?

    From my 64-bit Linux system:


    In [1]: n = 2 ** 40

    In [2]: type(n)
    Out[2]: <type 'int'>

    In [3]: n = 2 ** 80

    In [4]: type(n)
    Out[4]: <type 'long'>


    > (3) Since the loop variable is never used, why not have a special loop
    > statement that repeats code so many times? This can be very fast, since the
    > loop counter need not be a Python object, and probably there would be no
    > need for unrolling at all:
    >
    > repeat 100000000:        # for example
    >   a = a + 10
    >


    --
    regards,
    kushal
     
    Kushal Kumaran, Sep 6, 2010
    #7
  8. On Mon, 06 Sep 2010 13:20:01 +0200
    Stefan Behnel <> wrote:
    >
    > > (2) Integer arithmetic seems to go straight from 32-bits to long
    > > integers; why not use 64-bits before needing long integers?

    >
    > You are making assumptions based on Python 2, I guess. Try Python 3.1 or
    > later instead, where the int and long types are unified. Also, the
    > implementation is a bit more complex than you appear to be thinking. Don't
    > forget that it has received serious benchmark-based optimisations.


    The optimizations are mainly related to big integer arithmetic,
    though. For small ints the main cost is interpretation and
    unboxing overhead, as always.

    Regards

    Antoine.
     
    Antoine Pitrou, Sep 6, 2010
    #8
  9. On Mon, 06 Sep 2010 11:38:22 +0100, BartC wrote:

    > Modifying the OP's code a little:
    >
    > a = 0
    > for i in xrange(100000000): # 100 million
    >     a = a + 10 # add 10 or 100
    > print a
    >
    > Manually unrolling such a loop four times (ie. 4 copies of the body, and
    > counting only to 25 million) increased the speed by between 16% and 47%
    > (ie. runtime reducing by between 14% and 32%).


    Or you could *really* optimize it, as well as simplifying the code, by
    writing:

    n = 100000000
    a = 10*n

    and doing a single multiplication rather than pointless repeated addition.

    (I assume that the number of loops is meant to be a variable rather than
    a constant, otherwise a = 1000000000 would be the correct optimization.)

    Of course, if 10 (or 100) is not a constant but is just standing in for
    something which varies each iteration:

    for i in xrange(100000000):
        a = a + f(i)


    then unrolling the loop is even less useful. The overhead of the loop
    itself is likely to be trivial compared to the cost of calling f() 100
    million times -- the added complexity to shave 3 seconds off a four
    minute calculation simply isn't worth it.

    Besides, loop unrolling is really only valuable for small loops;
    otherwise the overhead caused by the increased code size makes it a
    pessimization rather than an optimization. It's very easy for any gains
    to be lost to increased cache misses, the time needed to copy the code
    into memory, etc.

    http://en.wikipedia.org/wiki/Loop_unwinding

    There's a lot of subtlety in optimization, and what counts as an
    optimization for low-level operations, and what is an optimization for
    high-level languages like Python, are rarely the same.



    > This depended on whether I added +10 or +100 (ie. whether long integers
    > are needed), whether it was inside or outside a function, and whether I
    > was running Python 2 or 3 (BTW why doesn't Python 3 just accept 'xrange'
    > as a synonym for 'range'?)


    Why should it? But if you want it, you can do it:

    xrange = range

    There, that wasn't hard, was it?
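    In code meant to run on both Python 2 and 3, that one-liner is usually guarded so it only fires where the name is missing (a conventional sketch, not from the post):

```python
# Make 'xrange' usable on both Python 2 and Python 3.
try:
    xrange          # Python 2: the name already exists
except NameError:
    xrange = range  # Python 3: alias it to the lazy range

total = sum(xrange(5))
print(total)  # 10
```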


    > These are just some simple tests on my particular machine and
    > implementations, but they bring up some points:
    >
    > (1) Loop unrolling does seem to have a benefit, when the loop body is
    > small.
    >
    > (2) Integer arithmetic seems to go straight from 32-bits to long
    > integers; why not use 64-bits before needing long integers?


    Why? That adds 50% more code, 50% more testing, 50% more places for bugs
    to hide, and 50% more effort required to maintain it, for something
    which at best is a trivial optimization of interest to a small minority
    of Python coders.

    There's already code to deal with 32-bit ints, code to deal with
    longints, and code to deal with shifting transparently from one to the
    other. Adding code to deal with 64-bit ints doesn't happen for free.

    Besides, if you care about the difference between 32 and 64 bit ints,
    chances are you don't want Python blithely swapping from one to the other
    when you least expect it. So you'll be using a library that gives you
    access to whichever ints you want, probably implemented natively rather
    than as objects.

    Sounds rather like numpy really :)

    http://docs.scipy.org/doc/numpy/user/basics.types.html



    > (3) Since the loop variable is never used, why not have a special loop
    > statement that repeats code so many times? This can be very fast, since
    > the loop counter need not be a Python object, and probably there would
    > be no need for unrolling at all:
    >
    > repeat 100000000:  # for example
    >     a = a + 10


    Because now both the parser and all Python coders need to care about one
    more reserved word, all so that one Python program in ten thousand can
    save 0.1 ms occasionally.

    Optimizations don't happen for free. If the optimization doesn't carry
    its own weight, it's a pessimization. To me, it's more important to be
    able to use repeat as a name:

    connect_to_server("localhost", repeat=10)

    than to save a millisecond or so.


    Besides, if repeat were to become syntax, then I'd MUCH rather have it
    used for repeat...until (or repeat...while) loops, to avoid the anti-
    pattern of:


    x = f()
    while not condition(x):
        x = f(x)


    which would be better as:


    repeat:
        x = f()
    until condition(x)
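    In today's Python, that control flow is conventionally spelled with `while True` and `break` (a sketch; `f` and `condition` are illustrative stand-ins, not from the thread):

```python
# repeat ... until condition(x), spelled in standard Python.
# f() produces successive values; the loop body always runs at
# least once, and we leave as soon as the condition holds.
values = iter([5, 3, 8, 1, 7])

def f():
    return next(values)

def condition(x):
    return x < 2

while True:
    x = f()
    if condition(x):
        break  # plays the role of "until condition(x)"

print(x)  # 1, the first value satisfying the condition
```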





    --
    Steven
     
    Steven D'Aprano, Sep 7, 2010
    #9
  10. David Cournapeau

    Aahz Guest

    In article <o4oho.85508$2>,
    BartC <> wrote:
    >"Steven D'Aprano" <> wrote in message
    >news:4c85adfe$0$11115$...
    >>
    >> xrange = range
    >>
    >> There, that wasn't hard, was it?

    >
    >I think I just learned more about Python than from months of reading this
    >group.
    >
    >So 'range' is just a class like any other. And that a class is something you
    >can blithely copy from one variable to another. And whenever you see 'range'
    >anywhere, you can't always be certain that someone hasn't done:
    >
    >range = 42
    >
    >at some point. That explains a lot about the difficulties of implementing
    >Python efficiently. (And the xrange=range trick works well thanks.)


    Actually, range() is a function. But the same point applies, squared --
    you really can never know what kind of object is hiding behind a name in
    the general case.
    --
    Aahz () <*> http://www.pythoncraft.com/

    "...if I were on life-support, I'd rather have it run by a Gameboy than a
    Windows box." --Cliff Wells
     
    Aahz, Sep 7, 2010
    #10
  11. David Cournapeau

    Aahz Guest

    In article <>,
    Roy Smith <> wrote:
    >
    >Imagine that you're looking at some code which was written years ago, by
    >people who are no longer around to answer questions. In one place, you
    >see:
    >
    >for i in range(n):
    >    blah
    >
    >and in another, you see:
    >
    >for j in xrange(n):
    >    blah
    >
    >If you are truly a Python expert, you'll say to yourself, "range and
    >xrange are synonyms", and move on to other things. If, however, you're
    >not really an expert, you'll look at this and say, "Hmmm, in one place
    >they used range(), and in another they used xrange(). Clearly, there's
    >some reason for the difference, I need to figure out what it is, because
    >that's probably key to my understanding why this code isn't working".
    >So, you spend the next two hours poring over reference manuals trying
    >to understand the subtle difference, until your friend comes over and
    >says, "You dolt, you just wasted half the afternoon. They're the same
    >thing. Move on, this is not the bug you're looking for".


    ....and if you're a Python guru, you might spend a little bit of time
    trying to figure out if the range() is causing the problem due to
    allocating a large chunk of memory....
    --
    Aahz () <*> http://www.pythoncraft.com/

     
    Aahz, Sep 7, 2010
    #11
  12. On Sun, Sep 5, 2010 at 8:28 PM, BartC <> wrote:

    >
    > One order of magnitude (say 10-20x slower) wouldn't be so bad. That's what
    > you might expect for a dynamically typed, interpreted language.


    10-20x slower than C is only reached by extremely well optimized
    dynamic languages. It would be a tremendous achievement. If that's
    what you are after, look at Lua with its JIT, or Scheme + Stalin.

    For cases where vectorization is indeed not applicable (recursive
    algorithms, for example), as in some signal processing, there are
    specialized tools which are very expressive while retaining good speed
    (Faust is an interesting one for audio signal processing).

    > That would simply be delegating Python to a scripting language.


    That's a strange thing to say if you compare it to matlab.

    > It would be
    > nice if you could directly code low-level algorithms in it without relying
    > on accelerators


    It would be nice, but the fact is that python cannot do it - and is
    quite far from being able to do it. I don't think it is as important
    as you think it is, because things like numpy are extremely powerful
    in many cases.

    cheers,

    David
     
    David Cournapeau, Sep 7, 2010
    #12
  13. David Cournapeau

    Terry Reedy Guest

    On 9/7/2010 6:00 AM, BartC wrote:

    >> Why should it? But if you want it, you can do it:
    >>
    >> xrange = range
    >>
    >> There, that wasn't hard, was it?

    >
    > I think I just learned more about Python than from months of reading this
    > group.
    >
    > So 'range' is just a class like any other. And that a class is something
    > you
    > can blithely copy from one variable to another.


    There is no copying of the class object. A new name is associated with
    the object. Any object can be given any number of names (aliases) that
    refer to the one and same object.
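    Terry's point can be shown directly (a small sketch, not from the post): assignment binds another name to the same object, for built-ins and classes alike.

```python
# Aliasing, not copying: both names refer to the one object.
xrange_alias = range
print(xrange_alias is range)   # True: two names, one object

class C:
    pass

D = C            # alias the class object itself; no new class is made
obj = D()
print(isinstance(obj, C))      # True: D() created an instance of C
```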

    --
    Terry Jan Reedy
     
    Terry Reedy, Sep 8, 2010
    #13
  14. David Cournapeau

    BartC Guest

    "David Cournapeau" <> wrote in message
    news:...
    > On Sun, Sep 5, 2010 at 8:28 PM, BartC <> wrote:
    >
    >>
    >> One order of magnitude (say 10-20x slower) wouldn't be so bad. That's
    >> what
    >> you might expect for a dynamically typed, interpreted language.

    >
    > 10/20x slower than C is only reached by extremely well optimized
    > dynamic languages. It would be a tremendous achievement. If that's


    Well, that is what I do (mess around with languages and stuff).

    Getting back to the OP's code again (trivial and pointless as it might
    seem), I got these results:

    C (gcc 3.4.5 -O3)      0.8 secs
    C (DMC -o)             2.3 secs
    C (lccwin32 -O)        2.9 secs
    My last interpreter   12.6 secs   (dynamically typed language;
                                      or 4.5 secs when told the type
                                      of 'a', but that's cheating a
                                      little..)
    Python 3             177.0 secs

    That's why I was questioning the latter's performance in for-loops. But now
    that I know a bit more about Python (having dynamic everything) the figure
    is not so surprising. However, it's still slow!

    > what you are after, look at Lua with its JIT, or Scheme + Stalin.


    I've seen LuaJIT in action. Its timing for this test is 1.5 secs: forget
    being only 10x slower than C, it's faster than some C versions! (I'm
    sure it must be cheating somewhere...)

    --
    bartc
     
    BartC, Sep 8, 2010
    #14
  15. David Cournapeau

    MRAB Guest

    On 08/09/2010 02:45, BartC wrote:
    >
    >
    > "David Cournapeau" <> wrote in message
    > news:...
    >> On Sun, Sep 5, 2010 at 8:28 PM, BartC <> wrote:
    >>
    >>>
    >>> One order of magnitude (say 10-20x slower) wouldn't be so bad. That's
    >>> what
    >>> you might expect for a dynamically typed, interpreted language.

    >>
    >> 10-20x slower than C is only reached by extremely well optimized
    >> dynamic languages. It would be a tremendous achievement. If that's

    >
    > Well, that is what I do (mess around with languages and stuff).
    >
    > Getting back to the OP's code again (trivial and pointless as it might
    > seem), I got these results:
    >
    > C (gcc 3.4.5 -O3)      0.8 secs
    > C (DMC -o)             2.3 secs
    > C (lccwin32 -O)        2.9 secs
    > My last interpreter   12.6 secs   (dynamically typed language;
    >                                   or 4.5 secs when told the type
    >                                   of 'a', but that's cheating a
    >                                   little..)
    > Python 3             177.0 secs
    >
    > That's why I was questioning the latter's performance in for-loops. But now
    > that I know a bit more about Python (having dynamic everything) the figure
    > is not so surprising. However, it's still slow!
    >
    >> what you are after, look at Lua with its JIT, or Scheme + Stalin.

    >
    > I've seen LuaJIT in action. Its timing for this test is 1.5 secs:
    > forget being only 10x slower than C, it's faster than some C versions!
    > (I'm sure it must be cheating somewhere...)
    >

    If you'd like to direct your skills to making CPython faster, without
    diminishing its flexibility, I'm sure it'll be welcomed. The source is
    all public, you know! :)
     
    MRAB, Sep 8, 2010
    #15
  16. BartC, 08.09.2010 03:45:
    > Getting back to the OP's code again (trivial and pointless as it might
    > seem), I got these results:
    >
    > C (gcc 3.4.5 -O3) 0.8 secs
    > C (DMC-o) 2.3 secs
    > C (lccwin32 -O) 2.9 secs
    >[...]
    > I've seen LuaJIT in action. It's timing for this test is 1.5 secs:
    > forget being only 10x slower than C, it's faster than some C versions!
    > (I'm sure it must be cheating somewhere...)


    Sure it does. C is statically compiled, while LuaJIT is a JIT compiler. It
    unjustly uses *runtime* information to compile the code. You can get a
    similar advantage with some C compilers by using profile based optimisation.

    BTW, I wonder why the code takes a whole 0.8 seconds to run in your gcc
    test. Maybe you should use a newer GCC version. It shouldn't take more than
    a couple of milliseconds (for program startup, OS calls, etc.), given that
    the output is a constant.

    Stefan
     
    Stefan Behnel, Sep 8, 2010
    #16
  17. BartC wrote:
    > So 'range' is just a class like any other. And that a class is something
    > you can blithely copy from one variable to another. And whenever you see
    > 'range' anywhere, you can't always be certain that someone hasn't done:
    >
    > range = 42
    >
    > at some point.


    True. I read an explanation here that IMHO pretty well explains what's going
    on: The "range" above is a label, attached with a piece of string to an
    object. The object in this case is the integer 42. The same object can have
    multiple labels attached to it, so "foo = bar" will just create a new
    label "foo" and attach it to the object that the label "bar" is already
    attached to.

    Note that I said object, which includes class instances like the int 42
    above, but also the classes themselves, functions (or bound functions[1])
    and modules (I hope I didn't miss any).

    > That explains a lot about the difficulties of implementing Python
    > efficiently.


    Yes, "range" is not a keyword or something intrinsic to the interpreter. It
    is just a name that is looked up whenever it is encountered, and that costs
    time. However, you also get some nice flexibility at that cost.
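    The label picture above can be demonstrated directly. A minimal sketch
    (Python 3, where the default names live in the stdlib `builtins` module):

```python
import builtins

# "range" is just a name looked up at use time, not a keyword.
r = range            # attach a second label to the same object
assert r is range

range = 42           # rebind the label "range" in this module...
assert range == 42
assert builtins.range is r   # ...the builtin object itself is untouched

del range            # drop the shadowing label;
assert range is r    # lookup now falls through to builtins again
```

    This is exactly the lookup cost mentioned above: every use of `range`
    has to check the local and global namespaces before falling back to
    `builtins`, because any of those labels may have been rebound.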

    Cheers!

    Uli

    [1] bound method = a function of a class with the instance parameter
    already bound.
    Example:

    x = []
    a = x.append
    a(42) # calls x.append(42)

    --
    Sator Laser GmbH
    Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
     
    Ulrich Eckhardt, Sep 8, 2010
    #17
  18. On Tue, 07 Sep 2010 11:00:03 +0100, BartC wrote:

    >> for i in xrange(100000000):
    >> a = a + f(i)
    >>
    >> then unrolling the loop is even less useful. The overhead of the loop
    >> itself is likely to be trivial compared to the cost of calling f() 100
    >> million times -- the added complexity to shave 3 seconds off a four
    >> minute calculation simply isn't worth it.

    >
    > With Python 3 and def f(x): return x+1, unrolling this loop 4x improved
    > speed by 15%; 4.00 minutes reduces to 3.30 minutes.



    I'm afraid that I can't replicate those figures. In my test, unrolling
    the loop causes a massive SLOWDOWN of 37%, not a speed up. Here is my
    test code:


    def f(x):
        return x+1

    def test_loop(n):
        a = 0
        for i in range(n):
            a += f(i)
        return a

    def test_unrolled(n):
        a = 0
        for i in range(n//4):
            a += f(4*i)
            a += f(4*i+1)
            a += f(4*i+2)
            a += f(4*i+3)
        return a


    from timeit import Timer
    n = 10000000
    assert test_loop(n) == test_unrolled(n)
    t1 = Timer('test_loop(n)',
               'from __main__ import test_loop; n=%d' % n)
    t2 = Timer('test_unrolled(n)',
               'from __main__ import test_unrolled; n=%d' % n)

    And here are the results using Python 3.1. Times are the best of 5 for 10
    calls each to the loop_* functions, results given in seconds:

    >>> min(t1.repeat(number=10, repeat=5))/10

    5.97409288883209
    >>> min(t2.repeat(number=10, repeat=5))/10

    8.25656900405883


    I repeated it with a larger loop variable. Since the time per test was so
    large (over ten minutes per call on my machine!), I didn't see the need
    to run multiple trials:

    n *= 100
    assert test_loop(n) == test_unrolled(n)
    t3 = Timer('test_loop(n)',
               'from __main__ import test_loop; n=%d' % n)
    t4 = Timer('test_unrolled(n)',
               'from __main__ import test_unrolled; n=%d' % n)

    And the results:

    >>> t3.timeit(number=1)

    653.3572518825531
    >>> t4.timeit(number=1)

    864.6454808712006

    That's slightly better (32% slower instead of 37% slower), but still a
    massive performance hit. Given these results, I'm prepared to say that
    loop unrolling in Python is almost certainly going to be a pessimization,
    not an optimization, no matter what you have inside the loop.



    --
    Steven
     
    Steven D'Aprano, Sep 8, 2010
    #18
  19. Steven D'Aprano <> writes:

    >> With Python 3 and def f(x): return x+1, unrolling this loop 4x improved
    >> speed by 15%; 4.00 minutes reduces to 3.30 minutes.


    > I'm afraid that I can't replicate those figures. In my test, unrolling
    > the loop causes a massive SLOWDOWN of 37%, not a speed up. Here is my
    > test code:
    >
    > def f(x):
    >     return x+1
    >
    > def test_loop(n):
    >     a = 0
    >     for i in range(n):
    >         a += f(i)
    >     return a
    >
    > def test_unrolled(n):
    >     a = 0
    >     for i in range(n//4):
    >         a += f(4*i)
    >         a += f(4*i+1)
    >         a += f(4*i+2)
    >         a += f(4*i+3)
    >     return a
    >
    > from timeit import Timer
    > n = 10000000
    > assert test_loop(n) == test_unrolled(n)
    > t1 = Timer('test_loop(n)',
    >            'from __main__ import test_loop; n=%d' % n)
    > t2 = Timer('test_unrolled(n)',
    >            'from __main__ import test_unrolled; n=%d' % n)
    >
    > And here are the results using Python 3.1. Times are the best of 5 for 10
    > calls each to the loop_* functions, results given in seconds:
    >
    >>>> min(t1.repeat(number=10, repeat=5))/10

    > 5.97409288883209
    >>>> min(t2.repeat(number=10, repeat=5))/10

    > 8.25656900405883


    Running this test with python 2.6 (on my laptop) led to:

    >>> min(t1.repeat(number=10, repeat=5))/10

    2.10715539455
    >>> min(t2.repeat(number=10, repeat=5))/10

    2.43037149906

    That's a 15% slowdown, which is reasonable since the unrolled version adds
    four multiplies to the loop body. Changing your unrolled loop to:

    def test_unrolled(n):
        a = 0
        for i in range(n//4):
            b = 4*i
            a += f(b)
            a += f(b+1)
            a += f(b+2)
            a += f(b+3)
        return a

    makes both versions run in approximately the same time (2.135 vs.
    2.136).

    > That's slightly better (32% slower instead of 37% slower), but still a
    > massive performance hit. Given these results, I'm prepared to say that
    > loop unrolling in Python is almost certainly going to be a pessimization,
    > not an optimization, no matter what you have inside the loop.


    I don't really see why it should be the case. Do you have any idea?

    I don't think either that it should speed things up significantly. Loop
    unrolling in binary code is relevant mostly because it allows better
    instruction scheduling (i.e., scheduling has fewer constraints in
    longer instruction sequences). Python programs are way too far from
    binary code for scheduling opts to apply.
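    This can be seen from Python itself: unrolling doesn't remove any
    per-iteration work, it just duplicates the same expensive bytecode. A
    sketch using the stdlib `dis` module on a hypothetical pair of 2x-unrolled
    functions (not the exact code from the thread):

```python
import dis

def rolled(n, f):
    a = 0
    for i in range(n):
        a += f(i)
    return a

def unrolled(n, f):
    a = 0
    for i in range(n // 2):
        a += f(2 * i)
        a += f(2 * i + 1)
    return a

# Count the compiled bytecode instructions: the unrolled body contains
# strictly more of them (duplicated call sequences plus index arithmetic),
# and CPython executes every one through the interpreter loop.
n_rolled = len(list(dis.get_instructions(rolled)))
n_unrolled = len(list(dis.get_instructions(unrolled)))
assert n_unrolled > n_rolled
```

    There is no later scheduling or register-allocation pass that could turn
    the longer straight-line sequence into faster machine code, which is where
    unrolling pays off for a native compiler.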

    -- Alain.
     
    Alain Ketterlin, Sep 8, 2010
    #19
  20. "Steven D'Aprano" <> wrote in message
    news:4c878be5$0$11113$...
    > On Tue, 07 Sep 2010 11:00:03 +0100, BartC wrote:
    >
    >>> for i in xrange(100000000):
    >>> a = a + f(i)


    >> With Python 3 and def f(x): return x+1, unrolling this loop 4x improved
    >> speed by 15%; 4.00 minutes reduces to 3.30 minutes.

    >
    >
    > I'm afraid that I can't replicate those figures. In my test, unrolling
    > the loop causes a massive SLOWDOWN of 37%, not a speed up. Here is my
    > test code:


    You're absolutely right. I completely forgot about modulating the i index
    for each duplicated line.

    > def test_unrolled(n):
    > a = 0
    > for i in range(n//4):
    > a += f(4*i)
    > a += f(4*i+1)
    > a += f(4*i+2)
    > a += f(4*i+3)
    > return a


    Although tidying up your calculations (as already posted) gives code that is
    neither faster nor slower.

    I'd hoped that using the following would help, but it did nothing in
    Python 3, and gave only an 8-10% speedup in Python 2:

    for i in xrange(0,n,4):
        a=a+f(i)
        a=a+f(i+1)
        a=a+f(i+2)
        a=a+f(i+3)

    (On my other example of setting list elements to 0, this did improve speed
    by some 10% in Python 3, and 28% in '2'.)

    So using manual unrolling for an indexed loop is not going to be the right
    approach (it's also fiddly, and there is the problem of N not being a
    multiple of 4 or whatever).
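    On the "N not a multiple of 4" point: the usual fix is a cleanup loop for
    the leftover iterations. A sketch (the helper name and structure are mine,
    not from the thread):

```python
def f(x):
    return x + 1

def unrolled_any_n(n):
    """Sum f(0)..f(n-1) with a 4x-unrolled main loop plus a cleanup loop."""
    a = 0
    main_end = n - n % 4          # last index covered by the unrolled part
    for i in range(0, main_end, 4):
        a += f(i)
        a += f(i + 1)
        a += f(i + 2)
        a += f(i + 3)
    for i in range(main_end, n):  # 0-3 leftover iterations
        a += f(i)
    return a

# Works whether or not n is a multiple of 4:
assert unrolled_any_n(10) == sum(f(i) for i in range(10))
```

    Which is exactly the fiddliness being complained about: the user-level
    bookkeeping grows while the per-call cost it was meant to hide stays put.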

    We're trying to avoid the iteration overhead, yet we're adding it in the
    code anyway, and as user-level code. But, I still think that internal
    support for such a thing might be worthwhile, when it can make certain
    assumptions about the loop range and index type.

    --
    Bartc
     
    BartC, Sep 8, 2010
    #20
