Re: Is LOAD_GLOBAL really that slow?

Discussion in 'Python' started by Carsten Haese, Aug 30, 2007.

  1. On Wed, 2007-08-29 at 19:23 -0600, Adam Olsen wrote:
    > It seems a common opinion that global access is much slower than local
    > variable access. However, my benchmarks show a relatively small
    > difference:
    >
    > ./python -m timeit -r 10 -v -s 'x = [None] * 10000
    > def foo():
    >     for i in x:
    >         list; list; list; list; list; list; list; list; list; list' 'foo()'
    > 10 loops -> 0.0989 secs
    > 100 loops -> 0.991 secs
    > raw times: 0.999 0.985 0.987 0.985 0.985 0.982 0.982 0.982 0.981 0.985
    > 100 loops, best of 10: 9.81 msec per loop
    >
    > ./python -m timeit -r 10 -v -s 'x = [None] * 10000
    > def foo():
    >     mylist = list
    >     for i in x:
    >         mylist; mylist; mylist; mylist; mylist; mylist; mylist; mylist; mylist; mylist' 'foo()'
    > 10 loops -> 0.0617 secs
    > 100 loops -> 0.61 secs
    > raw times: 0.603 0.582 0.582 0.583 0.581 0.583 0.58 0.583 0.584 0.582
    > 100 loops, best of 10: 5.8 msec per loop
    >
    > So global access is about 70% slower than local variable access. To
    > put that in perspective, two local variable accesses will take longer
    > than a single global variable access.
    >
    > This is a very extreme benchmark though. In practice, other overheads
    > will probably drop the difference to a few percent at most. Not that
    > important in my book.


    Your comparison is flawed, because the function call and the inner for
    loop cause a measurement offset that makes the locals advantage seem
    smaller than it is. In the interest of comparing the times for just the
    local lookup versus just the global lookup, I think the following
    timings are more appropriate:

    $ python2.5 -mtimeit -r10 -s"y=42" -s"def f(x): pass" "f(42)"
    1000000 loops, best of 10: 0.3 usec per loop
    $ python2.5 -mtimeit -r10 -s"y=42" -s"def f(x): x" "f(42)"
    1000000 loops, best of 10: 0.331 usec per loop
    $ python2.5 -mtimeit -r 10 -s"y=42" -s"def f(x): y" "f(42)"
    1000000 loops, best of 10: 0.363 usec per loop

    There is no loop overhead here, and after subtracting the function call
    overhead, I get 31 nanoseconds per local lookup and 63 nanoseconds per
    global lookup, so local lookups are just about twice as fast as global
    lookups.

    True, whether this difference is significant does depend on how many
    name lookups your code makes and how much else it's doing, but if you're
    doing a lot of number crunching and not a lot of I/O, the difference
    might be significant. Also, even if using local names is only slightly
    faster than using globals, it's still not slower, and the resulting code
    is still more readable and more maintainable. Using locals is a win-win
    scenario.
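
    For reference, the difference being measured here is visible in the bytecode: a
    local name compiles to LOAD_FAST (an index into the frame's local-variable
    array), while a global compiles to LOAD_GLOBAL (a dict probe). A quick sketch
    using the `dis` module (illustrative, not from the original post):

```python
import dis

y = 42

def use_global():
    return y      # compiles to LOAD_GLOBAL: hash the name, probe the
                  # module dict, then the builtins dict on a miss

def use_local(x=42):
    return x      # compiles to LOAD_FAST: a plain array index

dis.dis(use_global)
dis.dis(use_local)
```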

    Hope this helps,

    --
    Carsten Haese
    http://informixdb.sourceforge.net
    Carsten Haese, Aug 30, 2007
    #1

  2. On Aug 29, 8:33 pm, Carsten Haese <> wrote:
    > On Wed, 2007-08-29 at 19:23 -0600, Adam Olsen wrote:
    > > It seems a common opinion that global access is much slower than local
    > > variable access. However, my benchmarks show a relatively small
    > > difference:

    >
    > > ./python -m timeit -r 10 -v -s 'x = [None] * 10000
    > > def foo():
    > >     for i in x:
    > >         list; list; list; list; list; list; list; list; list; list' 'foo()'
    > > 10 loops -> 0.0989 secs
    > > 100 loops -> 0.991 secs
    > > raw times: 0.999 0.985 0.987 0.985 0.985 0.982 0.982 0.982 0.981 0.985
    > > 100 loops, best of 10: 9.81 msec per loop

    >
    > > ./python -m timeit -r 10 -v -s 'x = [None] * 10000
    > > def foo():
    > >     mylist = list
    > >     for i in x:
    > >         mylist; mylist; mylist; mylist; mylist; mylist; mylist; mylist; mylist; mylist' 'foo()'
    > > 10 loops -> 0.0617 secs
    > > 100 loops -> 0.61 secs
    > > raw times: 0.603 0.582 0.582 0.583 0.581 0.583 0.58 0.583 0.584 0.582
    > > 100 loops, best of 10: 5.8 msec per loop

    >
    > > So global access is about 70% slower than local variable access. To
    > > put that in perspective, two local variable accesses will take longer
    > > than a single global variable access.

    >
    > > This is a very extreme benchmark though. In practice, other overheads
    > > will probably drop the difference to a few percent at most. Not that
    > > important in my book.

    >
    > Your comparison is flawed, because the function call and the inner for
    > loop cause a measurement offset that makes the locals advantage seem
    > smaller than it is. In the interest of comparing the times for just the
    > local lookup versus just the global lookup, I think the following
    > timings are more appropriate:


    That's why I used far more name lookups, to minimize the overhead.


    > $ python2.5 -mtimeit -r10 -s"y=42" -s"def f(x): pass" "f(42)"
    > 1000000 loops, best of 10: 0.3 usec per loop
    > $ python2.5 -mtimeit -r10 -s"y=42" -s"def f(x): x" "f(42)"
    > 1000000 loops, best of 10: 0.331 usec per loop
    > $ python2.5 -mtimeit -r 10 -s"y=42" -s"def f(x): y" "f(42)"
    > 1000000 loops, best of 10: 0.363 usec per loop


    On my box, the best results I got after several runs were 0.399,
    0.447, and 0.464. Even less difference than my original results.


    > There is no loop overhead here, and after subtracting the function call
    > overhead, I get 31 nanoseconds per local lookup and 63 nanoseconds per
    > global lookup, so local lookups are just about twice as fast as global
    > lookups.
    >
    > True, whether this difference is significant does depend on how many
    > name lookups your code makes and how much else it's doing, but if you're
    > doing a lot of number crunching and not a lot of I/O, the difference
    > might be significant. Also, even if using local names is only slightly
    > faster than using globals, it's still not slower, and the resulting code
    > is still more readable and more maintainable. Using locals is a win-win
    > scenario.


    You get very small speed gains (assuming your code is doing anything
    significant), for a lot of effort (trying out different options,
    seeing if they're actually faster on different boxes.) The
    readability cost is there, even if it is smaller than many of the
    other obfuscations people attempt. If the speed gains were really
    that important you should rewrite in C, where you'd get far greater
    speed gains.

    So it only seems worthwhile when you really, *really* need to get a
    slight speedup on your box, you don't need to get any more speedup
    than that, and C is not an option.

    Fwiw, I posted this after developing yet another patch to optimize
    global lookups. It does sometimes show an improvement on specific
    benchmarks, but overall it harms performance. Looking into why, it
    doesn't make sense that a Python dictionary lookup can cost less
    than two simple array indexes, but there you go. Python dictionaries
    are already damn fast.

    --
    Adam Olsen, aka Rhamphoryncus
    Rhamphoryncus, Aug 30, 2007
    #2

  3. Chris Mellon (Guest)

    On 8/30/07, Rhamphoryncus <> wrote:
    > On Aug 29, 8:33 pm, Carsten Haese <> wrote:
    > > On Wed, 2007-08-29 at 19:23 -0600, Adam Olsen wrote:
    > > There is no loop overhead here, and after subtracting the function call
    > > overhead, I get 31 nanoseconds per local lookup and 63 nanoseconds per
    > > global lookup, so local lookups are just about twice as fast as global
    > > lookups.
    > >


    __builtins__ lookups are an extra dict lookup slower than just global
    variables, too. Don't forget those.


    > > True, whether this difference is significant does depend on how many
    > > name lookups your code makes and how much else it's doing, but if you're
    > > doing a lot of number crunching and not a lot of I/O, the difference
    > > might be significant. Also, even if using local names is only slightly
    > > faster than using globals, it's still not slower, and the resulting code
    > > is still more readable and more maintainable. Using locals is a win-win
    > > scenario.

    >
    > You get very small speed gains (assuming your code is doing anything
    > significant), for a lot of effort (trying out different options,
    > seeing if they're actually faster on different boxes.) The
    > readability cost is there, even if it is smaller than many of the
    > other obfuscations people attempt. If the speed gains were really
    > that important you should rewrite in C, where you'd get far greater
    > speed gains.
    >


    I've doubled the speed of a processing loop by moving globals lookups
    out of the loop. Rewriting in C would have taken at least a day, even
    with Pyrex, localizing the lookup took about 2 minutes.
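
    The transformation is just hoisting the lookups into local names before the
    loop; a hypothetical sketch (`math.sin` stands in for whatever the real loop
    called):

```python
import math

def process_slow(values):
    total = 0.0
    for v in values:
        # LOAD_GLOBAL for 'math' plus an attribute lookup,
        # repeated on every iteration
        total += math.sin(v)
    return total

def process_fast(values):
    sin = math.sin   # both lookups done once, before the loop
    total = 0.0
    for v in values:
        total += sin(v)   # LOAD_FAST inside the loop
    return total
```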

    > So it only seems worthwhile when you really, *really* need to get a
    > slight speedup on your box, you don't need to get any more speedup
    > than that, and C is not an option.
    >


    It's not a huge optimization, but it's really easy to write if you
    don't mind adding fake kwargs to your functions. Just for the heck of
    it I also wrote a decorator that will re-write the bytecode so that
    any global that can be looked up at function definition will be
    re-written as a local (actually with LOAD_CONST). You can see it at
    http://code.google.com/p/wxpsvg/wiki/GlobalsOptimization. Disclaimer:
    While I've tested it with a variety of functions and it's never broken
    anything, I've never actually used this for anything except an
    intellectual exercise. Use at your own risk.
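
    The "fake kwargs" idiom binds a global or builtin as a default argument, which
    is evaluated once at definition time and becomes a local inside the function;
    a sketch of the trick (not code from the decorator linked above):

```python
def tight_loop(data, _len=len, _min=min):
    # _len and _min are locals (LOAD_FAST) bound when the function
    # was defined; callers are never meant to pass them.
    return [_min(_len(item), 10) for item in data]

print(tight_loop([[1, 2], [0] * 15]))  # -> [2, 10]
```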

    > Fwiw, I posted this after developing yet another patch to optimize
    > global lookups. It does sometimes show an improvement on specific
    > benchmarks, but overall it harms performance. Looking into why, it
    > doesn't make sense that a python dictionary lookup can have less cost
    > than two simple array indexes, but there you go. Python dictionaries
    > are already damn fast.
    >


    I certainly believe that changes to Python's internals to make
    LOAD_GLOBAL itself faster can be difficult, with even "obvious"
    optimizations ending up slower. However, LOAD_FAST (and LOAD_CONST)
    are faster than LOAD_GLOBAL, and for the reason you just stated that
    is unlikely to change.
    Chris Mellon, Aug 30, 2007
    #3
  4. On Aug 30, 12:04 pm, "Chris Mellon" <> wrote:
    > On 8/30/07, Rhamphoryncus <> wrote:
    >
    > > On Aug 29, 8:33 pm, Carsten Haese <> wrote:
    > > > On Wed, 2007-08-29 at 19:23 -0600, Adam Olsen wrote:
    > > > There is no loop overhead here, and after subtracting the function call
    > > > overhead, I get 31 nanoseconds per local lookup and 63 nanoseconds per
    > > > global lookup, so local lookups are just about twice as fast as global
    > > > lookups.

    >
    > __builtins__ lookups are an extra dict lookup slower than just global
    > variables, too. Don't forget those.


    Heh right, I forgot that. That's what my benchmark was actually
    testing.


    > > > True, whether this difference is significant does depend on how many
    > > > name lookups your code makes and how much else it's doing, but if you're
    > > > doing a lot of number crunching and not a lot of I/O, the difference
    > > > might be significant. Also, even if using local names is only slightly
    > > > faster than using globals, it's still not slower, and the resulting code
    > > > is still more readable and more maintainable. Using locals is a win-win
    > > > scenario.

    >
    > > You get very small speed gains (assuming your code is doing anything
    > > significant), for a lot of effort (trying out different options,
    > > seeing if they're actually faster on different boxes.) The
    > > readability cost is there, even if it is smaller than many of the
    > > other obfuscations people attempt. If the speed gains were really
    > > that important you should rewrite in C, where you'd get far greater
    > > speed gains.

    >
    > I've doubled the speed of a processing loop by moving globals lookups
    > out of the loop. Rewriting in C would have taken at least a day, even
    > with Pyrex, localizing the lookup took about 2 minutes.


    I'm guessing that was due to deep voodoo involving your processor's
    pipeline, branch prediction, caching, etc. I'd be interested in
    seeing small examples of it though.


    > > So it only seems worthwhile when you really, *really* need to get a
    > > slight speedup on your box, you don't need to get any more speedup
    > > than that, and C is not an option.

    >
    > It's not a huge optimization, but it's really easy to write if you
    > don't mind adding fake kwargs to your functions. Just for the heck of
    > it I also wrote a decorator that will re-write the bytecode so that
    > any global that can be looked up at function definition will be
    > re-written as a local (actually with LOAD_CONST). You can see it at
    > http://code.google.com/p/wxpsvg/wiki/GlobalsOptimization. Disclaimer:
    > While I've tested it with a variety of functions and it's never broken
    > anything, I've never actually used this for anything except an
    > intellectual exercise. Use at your own risk.


    Doubling the throughput while doing *real work* is definitely more
    significant than maybe-or-maybe-not-quite-doubling without any real
    work.

    --
    Adam Olsen, aka Rhamphoryncus
    Rhamphoryncus, Aug 30, 2007
    #4
  5. Chris Mellon (Guest)

    On 8/30/07, Rhamphoryncus <> wrote:
    > On Aug 30, 12:04 pm, "Chris Mellon" <> wrote:
    > > On 8/30/07, Rhamphoryncus <> wrote:
    > >
    > > > On Aug 29, 8:33 pm, Carsten Haese <> wrote:
    > > > > On Wed, 2007-08-29 at 19:23 -0600, Adam Olsen wrote:
    > > > > There is no loop overhead here, and after subtracting the function call
    > > > > overhead, I get 31 nanoseconds per local lookup and 63 nanoseconds per
    > > > > global lookup, so local lookups are just about twice as fast as global
    > > > > lookups.

    > >
    > > __builtins__ lookups are an extra dict lookup slower than just global
    > > variables, too. Don't forget those.

    >
    > Heh right, I forgot that. That's what my benchmark was actually
    > testing.
    >
    >
    > > > > True, whether this difference is significant does depend on how many
    > > > > name lookups your code makes and how much else it's doing, but if you're
    > > > > doing a lot of number crunching and not a lot of I/O, the difference
    > > > > might be significant. Also, even if using local names is only slightly
    > > > > faster than using globals, it's still not slower, and the resulting code
    > > > > is still more readable and more maintainable. Using locals is a win-win
    > > > > scenario.

    > >
    > > > You get very small speed gains (assuming your code is doing anything
    > > > significant), for a lot of effort (trying out different options,
    > > > seeing if they're actually faster on different boxes.) The
    > > > readability cost is there, even if it is smaller than many of the
    > > > other obfuscations people attempt. If the speed gains were really
    > > > that important you should rewrite in C, where you'd get far greater
    > > > speed gains.

    > >
    > > I've doubled the speed of a processing loop by moving globals lookups
    > > out of the loop. Rewriting in C would have taken at least a day, even
    > > with Pyrex, localizing the lookup took about 2 minutes.

    >
    > I'm guessing that was due to deep voodoo involving your processor's
    > pipeline, branch prediction, caching, etc. I'd be interested in
    > seeing small examples of it though.
    >


    There's certainly deep voodoo involved in the function. For example, I
    used to have a test for early exit, but the branch was slower than the
    extra 3 function calls it prevented. I was very surprised to get that
    result at such a high level, but I tested (over and over) and it was
    very consistent. This was with psyco as well, so perhaps it's some
    characteristic of the generated assembly.

    Psyco of course is even less work and more benefit than localizing
    globals, so that's my first step for "make it faster".

    >
    > > > So it only seems worthwhile when you really, *really* need to get a
    > > > slight speedup on your box, you don't need to get any more speedup
    > > > than that, and C is not an option.

    > >
    > > It's not a huge optimization, but it's really easy to write if you
    > > don't mind adding fake kwargs to your functions. Just for the heck of
    > > it I also wrote a decorator that will re-write the bytecode so that
    > > any global that can be looked up at function definition will be
    > > re-written as a local (actually with LOAD_CONST). You can see it at
    > > http://code.google.com/p/wxpsvg/wiki/GlobalsOptimization. Disclaimer:
    > > While I've tested it with a variety of functions and it's never broken
    > > anything, I've never actually used this for anything except an
    > > intellectual exercise. Use at your own risk.

    >
    > Doubling the throughput while doing *real work* is definitely more
    > significant than maybe-or-maybe-not-quite-doubling without any real
    > work.
    >


    I wouldn't actually recommend that anyone use this, I just wrote it for fun.
    Chris Mellon, Aug 30, 2007
    #5
