Speed: bytecode vs C API calls

  • Thread starter Jacek Generowicz

Terry Reedy

You presented an example function as taking a type (including new classes)
as a parameter. If your type domain is boundedly finite and preknowable*, you
could precompute the function for *all* inputs and replace the call with a
dict lookup (or a bound method, for map).

(*) To make the domain 'preknowable' even when classes are created
dynamically, you could either create the classes with a factory function
that adds them to the cache, or give them a metaclass that does the same.
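
For example, a minimal sketch of the precomputation idea (the function f
and the domain list here are invented for illustration):

def f(t):
    return t.__name__                      # stand-in for the real computation

domain = [int, float, str]                 # the finite, preknowable inputs
table = dict([(t, f(t)) for t in domain])  # precompute the function for all inputs

result = table[int]                        # the call becomes a dict lookup
results = map(table.__getitem__, domain)   # or a bound method, for map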

Terry
 

Aahz

I tried something similar, namely inlining the

try: return cache[args]
except KeyError: return cache.setdefault(args, callable(*args))

rather than using the proxy. This gives a little over a factor of 3
speedup, but it's not something I can use everywhere, as often the
proxy is called inside map(...).

Why are you using map()? Generally speaking, map() won't be any faster
than a for loop.
 

Aahz

[Too busy to snip, sorry]


(Maybe this is where we are misunderstanding each other. On first
reading, I thought that "memoizer" was just a typo. I don't return a
memoizer function, I return a memoized function.)
Same way you decide which function to call currently.

The way I decide which function to call currently is to type its name
directly into the source at the call site. Deciding which function to
call is not the problem: calling any function at all, once I've
reduced myself to having _just_ a dictionary, is the problem.

Maybe I didn't make the use of "memoize" clear enough.

You use it thus:

def foo():
    ...

foo = memoize(foo)

And from now on foo is a lot faster than it was before[*].

That's why I changed the Subject: to "Function unrolling"; you're going
to need to change your code to avoid function calls in inner loops. One
technique you might try is to create a generalized inner loop function:

def innerloop(iterable, func, cache):
    l = []
    for args in iterable:
        try:
            l.append(cache[args])
        except KeyError:
            l.append(cache.setdefault(args, func(args)))

    return l

Yes, it's a performance hack, but I bet it's a lot clearer than C++
code.
 

Jacek Generowicz

Generally speaking, map() won't be any faster than a for loop.

This is at odds with my own experience (including a lot of timing runs
in the last two days), countless posts on this newsgroup over the
years, and, for example, this article

http://www.python.org/doc/essays/list2str.html

by Guido, in which he concludes that

"An implied loop in map() is faster than an explicit for loop"
 

Jacek Generowicz

That's why I changed the Subject: to "Function unrolling"; you're going
to need to change your code to avoid function calls in inner loops. One
technique you might try is to create a generalized inner loop function:

def innerloop(iterable, func, cache):
    l = []
    for args in iterable:
        try:
            l.append(cache[args])
        except KeyError:
            l.append(cache.setdefault(args, func(args)))

    return l

OK, I see what you mean.

I agree, I can do this, in some places, and toy tests suggest that
getting rid of the function call could give me a factor of 3
improvement. I'll have to change the code in many places, but that's
simpler than going to C.

(However, we disagree on the specific formula you give above: I think
that map is faster: see other post in the thread. [A quick test
suggests that a map is faster than the loop by a factor of 2.])
Yes, it's a performance hack, but I bet it's a lot clearer than C++
code.

No doubt about that :)


Cheers,
 

Jacek Generowicz

Terry Reedy said:
You presented an example function as taking a type (including new classes)
as a parameter. If your type domain is boundedly finite

It's unboundedly finite :)
and preknowable*, you could precompute the function for *all* inputs
and replace the call with a dict lookup (or a bound method, for map).

(*) To make the domain 'preknowable' even when classes are created
dynamically, you could either create the classes with a factory function
that adds them to the cache, or give them a metaclass that does the same.

Hmmm ... all the non-builtin types in my domain are actually
dynamically created, and are themselves instances of a metaclass,
which could be instructed to add information about the new classes
to the caches as the classes are being created.
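
A rough sketch of that idea (the cache and compute names here are
invented for illustration, not taken from any real code):

cache = {}

def compute(cls):
    return cls.__name__.upper()          # stand-in for the real per-type work

class Registered(type):
    def __init__(cls, name, bases, ns):
        super(Registered, cls).__init__(name, bases, ns)
        cache[(cls,)] = compute(cls)     # prime the cache at class creation

class Base(object):
    __metaclass__ = Registered           # every new subclass lands in the cache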

However, there are a number of problems I foresee (which I don't think
are particularly interesting to discuss here). Still, it could be put
to good use in some of the cases. I'll bear it in mind, thanks.

But first, I think that I'll try inlining the exception handler in
those cases where it can be done.


Cheers,
 

Aahz

This is at odds with my own experience (including a lot of timing runs
in the last two days), countless posts on this newsgroup over the
years, and, for example, this article

http://www.python.org/doc/essays/list2str.html

All right, what I should have said is that even where map() is faster
than a for loop, it isn't enough faster to make much difference; there
are definitely cases where it's slower than the equivalent for loop, and
it's usually less readable than some other way of getting speed.
 

Francis Avila

Aahz wrote in message ...
All right, what I should have said is that even where map() is faster
than a for loop, it isn't enough faster to make much difference; there
are definitely cases where it's slower than the equivalent for loop, and
it's usually less readable than some other way of getting speed.

That essay is a bit old, so not everything it says may still be true.

But in 2.3 at least, one case in which map is almost always much faster is
when the function used is a builtin. In this case, map (which is also
builtin) calls the C function from C, and I'm pretty sure it gets this speed
by staying outside of the Python eval loop. This can be wickedly fast (in
Python terms), and faster than list comps or for-loops with explicit list
appending, as a simple timeit will show.

For *Python* functions and lambdas (non-builtins), map can very often be
slower than a for-loop or list comprehension, for reasons not quite clear to
me. And of course, where a for-loop or list comp can avoid calls
altogether, it will always be faster than the equivalent function-using map
(unless map uses None as the function).
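
A quick sketch of the kind of timeit comparison meant here (the exact
numbers will vary with machine and Python version):

import timeit

setup = "data = range(1000)"

# map with a builtin: both the loop and the calls stay in C
print timeit.Timer("map(str, data)", setup).timeit(1000)

# list comprehension: loop in bytecode, one call per element
print timeit.Timer("[str(x) for x in data]", setup).timeit(1000)

# explicit for loop with append: loop, call and append all in bytecode
print timeit.Timer("l = []\nfor x in data: l.append(str(x))", setup).timeit(1000)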

I think in the OP's case, any little drop of speed he can get more than
justifies the very, very slightly un-idiomatic use of map instead of a
for-loop. Be warned that map() may go away in future Pythons, though....
 

Jacek Generowicz

Francis Avila said:
Aahz wrote in message ...

[about map]

Actually, I use map in my code because I find it very elegant and very
readable[*]. Usually, the speed has nothing to do with it. But if I
already have a map, and I'm interested in speed, then I'm loath to
replace it with something slower.
I think in the OP's case, any little drop of speed he can get more than
justifies the very, very slightly un-idiomatic use of map instead of a
for-loop.

Why do you consider my use of map to be un-idiomatic? I want to
transform a sequence into another one, by applying a function to each
element of the original. This is exactly the point of map. What am I missing?
Be warned that map() may go away in future Pythons, though....

Yes, I've heard idle rumours to this effect. I think it would be a
great shame.

Yes, I realize that there are list comprehensions. I appreciate both,
and find that sometimes one looks more natural than the other, and
sometimes it's the other way around. I find this ability to pick the
right-looking idiom worth more than a One Way To Do it principle.
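
For the record, the two equivalent spellings in question:

squares = map(lambda x: x * x, range(10))
squares = [x * x for x in range(10)]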


[*] There seems to be a faction in the Python community which
considers map, reduce, filter and lambda to be evil infiltrators
from those perfidious inventions that are functional languages,
and therefore the claim is made that they have no place in
Python. Some newbies believe this **** and the myth spreads. Just
my 2 cents.
 

Peter Otten

Jacek said:
One key feature of this is that I can use it to replace any[*]
function with a functionally equivalent but faster one, without having
to change any of the call sites of the original.

Understood, but not until this post.
Strictly speaking, all optimizations that fulfill the above criterion
should be built into the compiler, while memoizing cannot be, because of
its memory tradeoffs and its dependency on the input data.
IIUC, Aahz suggested to replace the proxy function with a
dictionary. This sounds reasonable, as, after all, the proxy maps its
arguments to its return values, while a dictionary maps keys to
values. The whole point of this is that function call overhead is
large compared to dictionary lookup, which is (almost) all that the
proxy does. However, it's the "almost" that creates the problem. If
I'm prepared to mess around with the calls to the original functions
and replace them with your suggestion, then, of course the problem is
solved ... however, I _really_ do not want to replace the calls to the
original ...

See below for my ideas towards a generalized memoizing framework. I've
addressed the problem by writing a map() variant that knows about result
caches (essentially Aahz' innerloop(), I see). You could use it to shadow
the builtin.
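
As a rough illustration of the call overhead mentioned above, a minimal
timeit sketch (the names f and cache are illustrative; numbers will vary):

import timeit

setup = """
def f(*args):
    return args

cache = {(1, 2): (1, 2)}
"""

print timeit.Timer("f(1, 2)", setup).timeit()        # function call
print timeit.Timer("cache[(1, 2)]", setup).timeit()  # dict lookup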
def foo(some_type):
    ...
    return something


foo = memoize(foo)

data = [int, float, my_class, his_class, str, other_class] * 10**6

map(foo, data) [+]

Do you discard the result for real?
[+] Can you see why I don't want to mess with the call site ?

No.

Now to my code. The idea is to make memoized functions available to
different modules from a central registry, and to expose the cached results
for hand-optimized hotspots. Otherwise it was all mentioned in this thread,
if not the original post.

<memoize.py>
import __builtin__

class RegistryItem:
    """ provide the necessary infrastructure to transparently speed up
        one function
    """
    def __init__(self, func, argc):
        cache = self.cache = {}
        def memoized(*args):
            try:
                return cache[args]
            except KeyError:
                result = cache[args] = func(*args)
                return result

        assert argc > 0
        if argc == 1:
            def callseq(seq):
                result = []
                append = result.append
                for arg in seq:
                    try:
                        append(cache[arg])
                    except KeyError:
                        value = cache[arg] = func(arg)
                        append(value)
                return result
        else:
            def callseq(*seqs):
                result = []
                append = result.append
                if len(seqs) == 1:
                    # multiple args as one tuple in the sequence
                    for args in seqs[0]:
                        try:
                            append(cache[args])
                        except KeyError:
                            value = cache[args] = func(*args)
                            append(value)
                else:
                    # multiple args in multiple sequences
                    # XXX special case (too lazy for now)
                    return map(memoized, *seqs)
                return result
        self.wrapped = memoized
        self.callseq = callseq
        self.original = func

class Registry:
    """ Registry for function variants using a result cache
        fast = register(slow, <number of args>)
    """
    def __init__(self):
        self.data = {}

    def __call__(self, func, argc):
        assert argc > 0
        try:
            return self.data[func].wrapped
        except KeyError:
            ri = RegistryItem(func, argc)
            self.data[func] = ri
            self.data[ri.wrapped] = ri
            return ri.wrapped

    def __getitem__(self, func):
        return self.data[func]

    def getmap(self):
        """ provide a cache-enhanced version of map
        """
        data = self.data
        def memomap(func, *seqs):
            try:
                callseq = data[func].callseq
            except KeyError:
                return __builtin__.map(func, *seqs)
            return callseq(*seqs)
        return memomap

registry = register = Registry()
</memoize.py>

<example.py>
# for improved clarity of this example I have omitted
# rebinding the original names and instead used fastXXX
import memoize, time

def slow(arg):
    print "no cache or cache miss with %r" % arg
    time.sleep(0.1)
    return "<%s>" % arg

sample = range(10) * 100

# preparation code

fast = memoize.register(slow, 1)
# typically used like so:
# slow = memoize.register(slow, 1)

fastmap = memoize.registry.getmap()
# could also be used to shadow the builtin:
# map = memoize.registry.getmap()


# some tests
assert fast is memoize.registry[fast].wrapped
assert fast is memoize.registry[slow].wrapped
assert slow is memoize.registry[slow].original


# stray function call
print slow(10)

# stray function call using cache
print fast(20)

# built-in loop support
fastmap(slow, sample)
fastmap(fast, sample)

# hand-crafted code using the cache
cache = memoize.registry[slow].cache
for arg in sample:
    cache[arg]

</example.py>

Feel free to play with the above. Be warned that there will be bugs.
However, before working on the code any further I will have to profile it,
to make sure it at least does not slow things down... maybe this weekend.

Peter
 

Bengt Richter

Stuart Bishop wrote in message ...

Not legal. *args must always come after default args. There's no way to do

Never say never ;-)

A function has the __get__ method, which is what getattr magic uses to
make a bound method. You can use this to put cache in the "self" position
of a bound method, like

>>> def memoize(func): # callable is a builtin ;-)
...     def proxy(cache, *args):
...         try: return cache[args]
...         except KeyError: return cache.setdefault(args, func(*args))
...     return proxy.__get__({}, proxy.__class__)
...
>>> def sq(x): return x*x
...
>>> msq = memoize(sq)
>>> msq(3)
9
>>> msq(2)
4
>>> msq
<bound method function.proxy of {(3,): 9, (2,): 4}>
Looking at the code, it might be worth timing, if you have 90%+ cache
hits. I'd be curious to see what you get. The disassembly of the proxy
looks like this:
  3           0 SETUP_EXCEPT            12 (to 15)
              3 LOAD_FAST                0 (cache)
              6 LOAD_FAST                1 (args)
              9 BINARY_SUBSCR
             10 RETURN_VALUE
             11 POP_BLOCK
             12 JUMP_FORWARD            41 (to 56)

  4     >>   15 DUP_TOP
             16 LOAD_GLOBAL              2 (KeyError)
             19 COMPARE_OP              10 (exception match)
             22 JUMP_IF_FALSE           29 (to 54)
             25 POP_TOP
             26 POP_TOP
             27 POP_TOP
             28 POP_TOP
             29 LOAD_FAST                0 (cache)
             32 LOAD_ATTR                3 (setdefault)
             35 LOAD_FAST                1 (args)
             38 LOAD_DEREF               0 (func)
             41 LOAD_FAST                1 (args)
             44 CALL_FUNCTION_VAR        0
             47 CALL_FUNCTION            2
             50 RETURN_VALUE
             51 JUMP_FORWARD             2 (to 56)
        >>   54 POP_TOP
             55 END_FINALLY
        >>   56 LOAD_CONST               0 (None)
             59 RETURN_VALUE

this without changing the calling characteristics of the function, which the
OP has stated is vital.
Should be ok, but I wonder about non-hashable args. Never expect them?
It is probably possible to custom-build a code object which makes _cache a
local reference to the outer cache, and play with the bytecode to make the
LOAD_DEREF into a LOAD_FAST, but it's not worth the effort. I don't know
how fast/slow a LOAD_DEREF is compared to a LOAD_FAST, though. In any case,
it won't help more than a few percent.
The above seems to do it for the cache-hit case ;-)
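
For the curious, a rough timeit sketch of the LOAD_FAST vs LOAD_DEREF
difference (call overhead dominates, so expect the gap to be small):

import timeit

setup = """
def make_local():
    def f(x=1):
        return x          # x is a local: LOAD_FAST
    return f

def make_closure():
    x = 1
    def f():
        return x          # x is a free variable: LOAD_DEREF
    return f

local, closure = make_local(), make_closure()
"""

print timeit.Timer("local()", setup).timeit()
print timeit.Timer("closure()", setup).timeit()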
Jacek, I think this is as fast as pure Python will get you with this
approach. If the speed is simply not acceptable, there are some things
you need to consider: is this the *real* bottleneck?

Regards,
Bengt Richter
 

Jacek Generowicz

Peter Otten said:
See below for my ideas towards a generalized memoizing framework.

[I haven't the time to have a look at it in any detail right now, so
forgive me for not responding/commenting in full, just yet.]
def foo(some_type):
    ...
    return something


foo = memoize(foo)

data = [int, float, my_class, his_class, str, other_class] * 10**6

map(foo, data) [+]

Do you discard the result for real?

No, sorry, of course not.
[+] Can you see why I don't want to mess with the call site ?

No.

It would mean taking it out of the map and that would mean losing the
C-coded loop that map provides.
Now to my code. The idea is to make memoized functions available to
different modules from a central registry, and to expose the cached results
for hand-optimized hotspots. Otherwise it was all mentioned in this thread,
if not the original post.

I'll make sure to find the time to study this.
 

Jacek Generowicz

To be honest, I have trouble believing that a local variable lookup
would be significantly faster than a closure lookup ... or is there
some other point to this?
Not legal. *args must always come after default args. There's no way to do

Never say never ;-)

A function has the __get__ method, which is what getattr magic uses
to make a bound method. You can use this to put cache in the "self"
position of a bound method, like

>>> def memoize(func):
...     def proxy(cache, *args):
...         try: return cache[args]
...         except KeyError: return cache.setdefault(args, func(*args))
...     return proxy.__get__({}, proxy.__class__)

Looking at the code, it might be worth timing, if you have 90%+ cache hits.
I'd be curious to see what you get

I get your version being almost imperceptibly slower.

I used this:

import timing
import sys

def Amemoize(func):
    def proxy(cache, *args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, func(*args))
    return proxy.__get__({}, proxy.__class__)


def Bmemoize(func):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, func(*args))
    return proxy


N = int(sys.argv[1])

def foo(n):
    if n < 2: return 1
    return foo(n-1) + foo(n-2)

Afoo = Amemoize(foo)
Bfoo = Bmemoize(foo)


Acode = """
def Acall():
    """ + "Afoo(10);" * N + "\n"

Bcode = """
def Bcall():
    """ + "Bfoo(10);" * N + "\n"

exec Acode
exec Bcode

timing.start()
Acall()
timing.finish()
print timing.micro()

timing.start()
Bcall()
timing.finish()
print timing.micro()
 

Bengt Richter

To be honest, I have trouble believing that a local variable lookup
would be significantly faster than a closure lookup ... or is there
some other point to this?
No, I think you are right. I was mainly reacting to the "there's no way" statement ;-)
Not legal. *args must always come after default args. There's no way to do

Never say never ;-)

A function has the __get__ method, which is what getattr magic uses
to make a bound method. You can use this to put cache in the "self"
position of a bound method, like

>>> def sq(x): return x*x
...
>>> def memoize(func): # callable is a builtin ;-)
...     def proxy(cache, *args):
...         try: return cache[args]
...         except KeyError: return cache.setdefault(args, func(*args))
...     return proxy.__get__({}, proxy.__class__)

Looking at the code, it might be worth timing, if you have 90%+ cache hits.
I'd be curious to see what you get

I get your version being almost imperceptibly slower.

I used this:
When things are that close, it's very hard to draw conclusions except
that they are probably close ;-) But to split hairs you probably have to
run the two tests separately, in quiescent system situations, so that the
first is not improving or worsening the environment for the second (i.e.,
in terms of interpreter state re memory, or even CPU cache in some cases,
though that usually only applies to the first time through a loop). And
also keep the fastest of 5-10 trials, to eliminate glitches due to
spurious irrelevant system activity.

I don't have your "timing" module, so I don't know what it does. If it
randomly reordered multiloop trials of A vs B and took minimums, then
maybe interactions could be washed out, but otherwise A & B in the same
test source leaves some ambiguity about the real ranking (which of course
will not matter, practically speaking ;-)
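
For reference, the stdlib timeit module (new in 2.3) can apply that
advice directly: time each candidate in its own run and keep the fastest
of several trials. A minimal sketch, assuming Afoo and Bfoo are defined
in the running script:

import timeit

best_a = min(timeit.Timer("Afoo(10)",
                          "from __main__ import Afoo").repeat(5, 10000))
best_b = min(timeit.Timer("Bfoo(10)",
                          "from __main__ import Bfoo").repeat(5, 10000))
print best_a, best_b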
import timing
import sys
[...]

Regards,
Bengt Richter
 

Jacek Generowicz

No, I think you are right. I was mainly reacting to the "there's no
way" statement ;-)

Sure ... I was sort of answering to a few posts upthread, as I hadn't
done that yet.
When things are that close, it's very hard to draw conclusions
except that they are probably close ;-) But to split hairs you
probably have to run the two tests separately, in quiescent system
situations,

.... when it's that close, I just don't care :)

When it's that close, the Python version, CPU, temperature, phase of
the moon etc. are probably more significant than the algorithm
.... and, most importantly, my time is better spent on finding more
effective ways of speeding up my code.
I don't have your "timing" module,

Are you sure? It's on every Python installation I currently have
access to, so I assumed it was in the standard distribution. Hmm
.... OK, the library reference lists it under "Obsolete".
 
