Speed: bytecode vs C API calls

  • Thread starter Jacek Generowicz

Jacek Generowicz

I have a program in which I make very good use of a memoizer:

def memoize(callable):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy

which is functionally equivalent to

class memoize:

    def __init__(self, callable):
        self.cache = {}
        self.callable = callable

    def __call__(self, *args):
        try: return self.cache[args]
        except KeyError:
            return self.cache.setdefault(args, self.callable(*args))

though the latter is about twice as slow.
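
For illustration, a rough timeit sketch of that comparison (the class is
renamed Memoize here to avoid shadowing the closure version; the exact
factor will vary by machine and Python version):

import timeit

setup = """
def memoize(callable):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy

class Memoize:
    def __init__(self, callable):
        self.cache = {}
        self.callable = callable
    def __call__(self, *args):
        try: return self.cache[args]
        except KeyError:
            return self.cache.setdefault(args, self.callable(*args))

f = memoize(abs)   # closure-based proxy
g = Memoize(abs)   # class-based proxy
f(1); g(1)         # prime both caches so only the hit path is timed
"""

print(timeit.timeit('f(1)', setup=setup, number=1000000))
print(timeit.timeit('g(1)', setup=setup, number=1000000))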

I've got to the stage where my program is still not fast enough, and
calls to the memoizer proxy are topping profiler output table. So I
thought I'd try to see whether I can speed it up by recoding it in C.

The closure-based version seems impossible to recode in C (anyone know
a way?), so I decided to write an extension type equivalent to "class
memoize" above. This seems to run about 10% faster than the pure
Python version ... which is still a lot slower than the pure Python
closure-based version.

I wasn't expecting C extensions which make lots of calls to Python C
API functions to be spectacularly fast, but I'm still a little
disappointed with what I've got. Does this sort of speedup (or rather,
lack of it) seem right to those of you experienced with this sort of
thing? Or does it look like I'm doing it wrong?

Could anyone suggest how I could squeeze more speed out of the
memoizer? (I'll include the core of my memoize extension type below.)

[What's the current state of the art wrt profiling Python, and its
extension modules? I've tried using hotshot (though not extensively
yet), but it seems to show even less information than profile, at
first blush.]


The memoize extension type is based around the following:

typedef struct {
    PyObject_HEAD
    PyObject* x_attr;
    PyObject* cache;
    PyObject* fn;
} memoizeObject;


static int
memoize_init(memoizeObject* self, PyObject* args, PyObject* kwds) {
    PyArg_ParseTuple(args, "O", &(self->fn));
    Py_INCREF(self->fn);
    self->cache = PyDict_New();
    return 0;
}

static PyObject*
memoize_call(memoizeObject* self, PyObject* args) {
    PyObject* value = PyDict_GetItem(self->fn, args);
    if (! value) {
        PyDict_SetItem(self->cache, args,
                       PyObject_CallObject(self->fn, args));
        value = PyDict_GetItem(self->cache, args);
    }
    //Py_INCREF(value);
    return value;
}



Thanks,
 

Jacek Generowicz

Jacek Generowicz said:
I wasn't expecting C extensions which make lots of calls to Python C
API functions to be spectacularly fast, but I'm still a little
disappointed with what I've got. Does this sort of speedup (or rather,
lack of it) seem right to those of you experienced with this sort of
thing? Or does it look like I'm doing it wrong?
static int
memoize_init(memoizeObject* self, PyObject* args, PyObject* kwds) {
    PyArg_ParseTuple(args, "O", &(self->fn));

Obvious bug: &(self->cache) !!

    Py_INCREF(self->fn);
    self->cache = PyDict_New();
    return 0;
}

Fixing the bug gives me a factor of 3 speedup over the class-based
version.

That's a bit more like what I was expecting.
 

Michael Hudson

Jacek Generowicz said:
Could anyone suggest how I could squeeze more speed out of the
memoizer? (I'll include the core of my memoize extension type
below.)

You *might* get a speed bump by making your memoized callable a method
of the memoizer object rather than implementing tp_call.

Cheers,
mwh
 

Terry Reedy

Jacek Generowicz said:
I have a program in which I make very good use of a memoizer:

def memoize(callable):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy
...
I've got to the stage where my program is still not fast enough, and
calls to the memoizer proxy are topping profiler output table. So I
thought I'd try to see whether I can speed it up by recoding it in C.

Have you tried psyco on the above? Or rather, on the proxies?
Can any of your callables be memoized by list rather than dict?
(i.e., any count or restricted int args?)

If you have just a few wrapped functions, faster, special-cased proxies
(with exact arg lists) would be feasible and possibly easier for psyco.
Special-casing would include inlining the callable when possible
-- or rather, inlining the memoization within the function.
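
For illustration, a minimal sketch of that last suggestion, with fib()
as a hypothetical stand-in and the _cache default argument as an assumed
implementation detail:

def fib(n, _cache={}):
    # The cache lives in the shared default argument, so a hit costs one
    # call and one dict lookup, with no wrapper function in between.
    try:
        return _cache[n]
    except KeyError:
        if n < 2:
            result = n
        else:
            result = fib(n - 1) + fib(n - 2)
        _cache[n] = result
        return result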

TJR
 

Aahz

I have a program in which I make very good use of a memoizer:

def memoize(callable):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy

I've got to the stage where my program is still not fast enough, and
calls to the memoizer proxy are topping profiler output table. So I
thought I'd try to see whether I can speed it up by recoding it in C.

I'm wondering whether you're tackling something the wrong way or perhaps
you need to throw hardware at the problem. Dicts are one of the most
highly optimized parts of Python, and if your cache hit rate is higher
than 90%, you'll be dealing with exceptions only rarely, so the speed of
cache.setdefault() isn't where your time is going. I'm wondering whether
you're hitting some kind of memory issue or something, or possibly you're
having cache effects (in CPU/memory), or maybe even your key is poorly
constructed so that you're getting hash duplicates.

OTOH, if the cache hit rate is smaller than that, you'll be paying the
penalty for raised exceptions *and* the callable() call, in which case
redesigning that part will pay the highest dividends.
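
For illustration, a rough sketch of how one might measure that hit rate
by instrumenting the memoizer (the stats plumbing here is assumed, not
part of the original memoizer):

def memoize_counted(callable):
    cache = {}
    stats = {'hits': 0, 'misses': 0}
    def proxy(*args):
        try:
            result = cache[args]
        except KeyError:
            stats['misses'] += 1
            return cache.setdefault(args, callable(*args))
        stats['hits'] += 1
        return result
    proxy.stats = stats
    return proxy

f = memoize_counted(abs)
f(1); f(1); f(2)
print(f.stats)   # {'hits': 1, 'misses': 2}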
 

Jacek Generowicz

Terry Reedy said:
Jacek Generowicz said:
I have a program in which I make very good use of a memoizer:

def memoize(callable):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy
...
I've got to the stage where my program is still not fast enough, and
calls to the memoizer proxy are topping profiler output table. So I
thought I'd try to see whether I can speed it up by recoding it in C.

Have you tried psyco on the above?

Nope, but psyco would be politically unacceptable for my users, in
this case. OTOH, I've been wanting to play with psyco for ages ...
Can any of your callables be memoized by list rather than dict?
(i.e., any count or restricted int args?)

You've lost me there.
 

Jacek Generowicz

I have a program in which I make very good use of a memoizer:

def memoize(callable):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy

I've got to the stage where my program is still not fast enough, and
calls to the memoizer proxy are topping profiler output table. So I
thought I'd try to see whether I can speed it up by recoding it in C.

I'm wondering whether you're tackling something the wrong way

There _is_ always that possibility, but I don't think that's what my
problem is here (it may well be elsewhere in my program).
or perhaps you need to throw hardware at the problem.

The definition of "not fast enough" in this case is "much slower than
a competing C++ implementation", so throwing hardware at it advances
me not one bit, I'm afraid.
Dicts are one of the most highly optimized parts of Python,

Which is exactly why I'm using them extensively.
and if your cache hit rate is higher than 90%, you'll be dealing
with exceptions only rarely,

My cache hit rate is well over 90%.
so the speed of cache.setdefault() isn't where your time is going.
I'm wondering whether you're hitting some kind of memory issue or
something, or possibly you're having cache effects (in CPU/memory),
or maybe even your key is poorly constructed so that you're getting
hash duplicates.

No, I think that my memoizer works just fine. The "problem" is that
I've memoized a lot of functions, and quite a few of them are being
called in my inner loops. If I want to compete on speed with C++, I'm
simply going to have to eliminate Python bytecode from my inner loops[*].


[*] Unless my overall algorithm is completely wrong, of course.
 

Jacek Generowicz

Jacek Generowicz said:
Obvious bug: &(self->cache) !!

Apologies, I misreported the bug. self->fn is appropriate here
... using self->fn (instead of self->cache) as the dictionary in the
other function is what was wrong, of course.
Fixing the bug gives me a factor of 3 speedup over the class-based
version.

This is still true.
 

Michael Hudson

Terry Reedy said:
Jacek Generowicz said:
I have a program in which I make very good use of a memoizer:

def memoize(callable):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy
...
I've got to the stage where my program is still not fast enough, and
calls to the memoizer proxy are topping profiler output table. So I
thought I'd try to see whether I can speed it up by recoding it in C.

Have you tried psyco on the above? Or rather, on the proxies?

I doubt that would help -- the largest overhead is likely the function
call, and I think calling across the psyco/python boundary is
expensive.

Cheers,
mwh
 

Aahz

[email protected] (Aahz) said:
I have a program in which I make very good use of a memoizer:

def memoize(callable):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy

I've got to the stage where my program is still not fast enough, and
calls to the memoizer proxy are topping profiler output table. So I
thought I'd try to see whether I can speed it up by recoding it in C.

I'm wondering whether you're tackling something the wrong way

There _is_ always that possibility, but I don't think that's what my
problem is here (it may well be elsewhere in my program).
or perhaps you need to throw hardware at the problem.

The definition of "not fast enough" in this case is "much slower than
a competing C++ implementation", so throwing hardware at it advances
me not one bit, I'm afraid.
Dicts are one of the most highly optimized parts of Python,

Which is exactly why I'm using them extensively.
and if your cache hit rate is higher than 90%, you'll be dealing
with exceptions only rarely,

My cache hit rate is well over 90%.

All your comments taken as a set make no sense. There's just no way
that a bunch of dict accesses in Python can be much slower than a C++
program. (C program, *maybe* -- not C++)
so the speed of cache.setdefault() isn't where your time is going.
I'm wondering whether you're hitting some kind of memory issue or
something, or possibly you're having cache effects (in CPU/memory),
or maybe even your key is poorly constructed so that you're getting
hash duplicates.

No, I think that my memoizer works just fine. The "problem" is that
I've memoized a lot of functions, and quite a few of them are being
called in my inner loops. If I want to compete on speed with C++, I'm
simply going to have to eliminate Python bytecode from my inner loops[*].

[*] Unless my overall algorithm is completely wrong, of course.

Waitasec .... okay, I see what your problem is, assuming you're
describing it correctly. The problem is that you're calling Python
functions in an inner loop. I think you're going to need to write a
solution that doesn't call functions (function unrolling, so to speak).
Instead of returning a memoizer function, just use a dict of dicts.
 

Jacek Generowicz

Waitasec .... okay, I see what your problem is, assuming you're
describing it correctly. The problem is that you're calling Python
functions in an inner loop.

.... indeedy ...
I think you're going to need to write a solution that doesn't call
functions (function unrolling, so to speak). Instead of returning a
memoizer function, just use a dict of dicts.

Sounds interesting. But how do I get the dictionary to use the
function which it is memoizing, to calculate values which haven't been
cached yet?

Hmm ... maybe by subclassing dict, I can add the necessary
functionality ... but even then I don't see how to avoid wrapping the
dictionary access in a function call ... and then we're back to square
one.
 

Peter Otten

Jacek said:
... indeedy ...


Sounds interesting. But how do I get the dictionary to use the
function which it is memoizing, to calculate values which haven't been
cached yet?

Guessing into the blue:

try:
    result = cache[fun, args]
except KeyError:
    result = cache[fun, args] = fun(args)

should make a fast inner loop if cache misses are rare.
Hmm ... maybe by subclassing dict, I can add the necessary
functionality ... but even then I don't see how to avoid wrapping the
dictionary access in a function call ... and then we're back to square
one.

Subclassing dict will not make it faster. Maybe it's time to show some inner
loop lookalike to the wider public for more to the point suggestions...

Peter
 

Terry Reedy

Jacek Generowicz said:
You've lost me there.

Lookup tables. Dicts can memoize any function. Lists can memoize or
partially memoize functions with one or more periodic domains. Perhaps I am
being too obvious or simple for *you*, but I have seen people memoize fib()
or fac() with a generic dict-based wrapper, which is overkill compared to a
simple list.
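
For illustration, a minimal sketch of Terry's point using fib(), a
hypothetical example whose argument is a small non-negative integer, so
a plain list can stand in for the dict (no hashing, no argument tuple):

_fib_table = [0, 1]

def fib(n):
    # Grow the table up to n; the list index itself is the cache key.
    while len(_fib_table) <= n:
        _fib_table.append(_fib_table[-1] + _fib_table[-2])
    return _fib_table[n]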

In another post, you said
If I want to compete on speed with C++, I'm
simply going have to eliminate Python bytecode from my inner loops[*].
[*] Unless my overall algorithm is completely wrong, of course.

For speed, you need to eliminate *executed* and *expensive* bytecodes, not
static, possibly skipped-over, bytecodes. I'll take that as what you meant.
It is certainly what our comments are aimed at. Hence my other suggestion
to eliminate a layer of (expensive) function calls by combining value
calculation and storage in one function (more space, less time). More
specific suggestions will require more specific examples.

Terry J. Reedy
 

Aahz

Sounds interesting. But how do I get the dictionary to use the
function which it is memoizing, to calculate values which haven't been
cached yet?

Same way you decide which function to call currently. Essentially, you
do something like this:

def foo():
    pass

def bar():
    pass

lookup = {
    'foo': {'func': foo, 'cache': {}},
    'bar': {'func': bar, 'cache': {}},
}

for operation, args in WorkQueue:
    curr = lookup[operation]
    try:
        return curr['cache'][args]
    except KeyError:
        return curr['cache'].setdefault(args, curr['func'](*args))

(Note: untested; you'll have to fill in the blanks)
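
For illustration, the sketch above filled in just enough to run; the
function bodies and the WorkQueue contents are invented:

def foo(x):
    return x + 1

def bar(x):
    return x * 2

lookup = {
    'foo': {'func': foo, 'cache': {}},
    'bar': {'func': bar, 'cache': {}},
}

# A made-up work queue of (operation, argument-tuple) pairs.
WorkQueue = [('foo', (1,)), ('bar', (2,)), ('foo', (1,))]

results = []
for operation, args in WorkQueue:
    curr = lookup[operation]
    try:
        results.append(curr['cache'][args])
    except KeyError:
        results.append(curr['cache'].setdefault(args, curr['func'](*args)))

print(results)   # [2, 4, 2] -- the second foo(1) is a cache hit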
 

Jacek Generowicz

Terry Reedy said:
Lookup tables. Dicts can memoize any function. Lists can memoize or
partially memoize functions with one or more periodic domains. Perhaps I am
being too obvious or simple for *you*, but I have seen people memoize fib()
or fac() with a generic dict-based wrapper, which is overkill compared to a
simple list.

The arguments passed to the functions I have memoized are usually
types, so, no, I cannot memoize them by list.
In another post, you said
If I want to compete on speed with C++, I'm
simply going have to eliminate Python bytecode from my inner loops[*].
[*] Unless my overall algorithm is completely wrong, of course.

For speed, you need to eliminate *executed* and *expensive* bytecodes, not
static, possibly skipped-over, bytecodes. I'll take that as what you meant.
It is certainly what our comments are aimed at. Hence my other suggestion
to eliminate a layer of (expensive) function calls by combining value
calculation and storage in one function (more space, less time).

You are suggesting that I hand-memoize my functions, rather than using
the generic memoizer I posted at the top of the thread? in order to
save, in the case of cache misses, the function call overhead to the
memoized function? If I had a high proportion of cache misses, then I
suspect that would be worth while. In my case, however, I doubt that
would have any appreciable effect. Almost all the calls to the
memoized functions are cache hits, so effectively there is no value
calculation ... it's pretty much all lookup, so my "expensive"
function is *effectively*:


def proxy(*args):
    try:
        return cache[args]
    except KeyError: pass

(where proxy is a closure over cache). Yes, there is something other
than "pass" in the handler, but we (essentially) never go there.

There's really not much left to play with (though I'm open to
suggestions), which is why I wanted to try recoding in C.
 

Jacek Generowicz

Peter Otten said:
Jacek said:
Sounds interesting. But how do I get the dictionary to use the
function which it is memoizing, to calculate values which haven't been
cached yet?

Guessing into the blue:

try:
    result = cache[fun, args]
except KeyError:
    result = cache[fun, args] = fun(args)

Allow me to remind you of the memoizer I posted at the top of the thread:

def memoize(callable):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy

One key feature of this is that I can use it to replace any[*]
function with a functionally equivalent but faster one, without having
to change any of the call sites of the original.

IIUC, Aahz suggested to replace the proxy function with a
dictionary. This sounds reasonable, as, after all, the proxy maps its
arguments to its return values, while a dictionary maps keys to
values. The whole point of this is that function call overhead is
large compared to dictionary lookup, which is (almost) all that the
proxy does. However, it's the "almost" that creates the problem. If
I'm prepared to mess around with the calls to the original functions
and replace them with your suggestion, then, of course the problem is
solved ... however, I _really_ do not want to replace the calls to the
original ...
Maybe it's time to show some inner loop lookalike to the wider
public for more to the point suggestions...


Sure.


def foo(some_type):
    ...
    return something

foo = memoize(foo)

data = [int, float, my_class, his_class, str, other_class] * 10**6

map(foo, data)    [+]



[*] ... with hashable arguments, etc. etc.

[+] Can you see why I don't want to mess with the call site ?
 

Jacek Generowicz

Instead of returning a memoizer function, just use a dict of dicts.
                       ^^^^^^^^

(Maybe this is where we are misunderstanding each other. On first
reading, I thought that "memoizer" was just a typo. I don't return a
memoizer function, I return a memoized function.)
Same way you decide which function to call currently.

The way I decide which function to call currently is to type its name
directly into the source at the call site. Deciding which function to
call is not the problem: calling any function at all, once I've
reduced myself to having _just_ a dictionary, is the problem.

Maybe I didn't make the use of "memoize" clear enough.

You use it thus:

def foo():
    ...

foo = memoize(foo)

And from now on foo is a lot faster than it was before[*].
Essentially, you do something like this:

def foo():
    pass

def bar():
    pass

lookup = {
    'foo': {'func': foo, 'cache': {}},
    'bar': {'func': bar, 'cache': {}},
}

for operation, args in WorkQueue:
    curr = lookup[operation]
    try:
        return curr['cache'][args]
    except KeyError:
        return curr['cache'].setdefault(args, curr['func'](*args))

Nononononooooo ! :)

In any given loop, I know _which_ function I want to call, so the
dictionary of dictionaries is pointless. Let's recap the original
memoizer:

def memoize(callable):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy

I thought that you were suggesting to replace "proxy" with just the
dictionary "cache", in order to save the function call overhead, as
proxy is (almost) just a wrapper around the dictionary lookup. That's
great, but what do I do in the (very rare) cases where I'm presented
with an argument/key I haven't seen yet? Somehow, I need to recognize
that the key is not present, call the original function, and add the
key-value pair to the cache. I can't see how to do that without
re-introducing the function call. I could do it by inlining the try
block, but then I wouldn't be able to use the memoized function inside
map ... and turning a map into a for-loop would lose me more than I'd
gained.



[*] With a whole bunch of exceptions which are beside the point here.
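
For illustration, a sketch of the inlined form being contrasted with
map() here (the names and data are invented):

def expensive(x):          # stand-in for the real memoized function
    return x * x

cache = {}
data = [2, 3, 2, 5, 3, 2]

# Inlined lookup: no call overhead on a cache hit, but the cache has to
# be visible at the call site, so it cannot be dropped into map().
results = []
for x in data:
    try:
        y = cache[x]
    except KeyError:
        y = cache[x] = expensive(x)
    results.append(y)

print(results)   # [4, 9, 4, 25, 9, 4]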
 

Stuart Bishop



Peter Otten said:
Jacek said:
(e-mail address removed) (Aahz) writes:

I think you're going to need to write a solution that doesn't call
functions (function unrolling, so to speak). Instead of returning a
memoizer function, just use a dict of dicts.

Sounds interesting. But how do I get the dictionary to use the
function which it is memoizing, to calculate values which haven't
been cached yet?

Guessing into the blue:

try:
    result = cache[fun, args]
except KeyError:
    result = cache[fun, args] = fun(args)

Allow me to remind you of the memoizer I posted at the top of the
thread:

def memoize(callable):
    cache = {}
    def proxy(*args):
        try: return cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy

One key feature of this is that I can use it to replace any[*]
function with a functionally equivalent but faster one, without having
to change any of the call sites of the original.

The whole point of this is that function call overhead is large
compared to dictionary lookup, which is (almost) all that the
proxy does. However, it's the "almost" that creates the problem. If
I'm prepared to mess around with the calls to the original functions
and replace them with your suggestion, then, of course the problem is
solved ... however, I _really_ do not want to replace the calls to the
original ...

About the only thing that could be done then is:

def memoize(callable):
    cache = {}
    def proxy(*args, _cache=cache):
        try: return _cache[args]
        except KeyError: return cache.setdefault(args, callable(*args))
    return proxy

Might help a little bit, although probably not enough.

--
Stuart Bishop <[email protected]>
http://www.stuartbishop.net/
 

Francis Avila

Stuart Bishop wrote in message ...
About the only thing that could be done then is:

def memoize(callable):
    cache = {}
    def proxy(*args, _cache=cache):

Not legal. *args must always come after default args. There's no way to do
this without changing the calling characteristics of the function, which the
OP has stated is vital.

It is probably possible to custom-build a code object which makes _cache a
local reference to the outer cache, and play with the bytecode to make the
LOAD_DEREF into a LOAD_FAST, but it's not worth the effort. I don't know
how fast/slow a LOAD_DEREF is compared to a LOAD_FAST, though. In any case,
it won't help more than a few percent.

Jacek, I think this is as fast as pure Python will get you with this
approach. If the speed is simply not acceptable, there are some things
you need to consider. Is this the *real* bottleneck?

Consider this before you delve into C solutions, which greatly increase
headaches. An easy way to check is to cut out the map and memoization and
just use a dictionary preloaded with known argument:value pairs for a given
set of test data.

e.g.,

def func(...):
    ...

data = [foo, bar, ....]
cache = dict([(i, func(i)) for i in data])

def timetrial(_data=data, _cache=cache):
    return map(_cache.__getitem__, _data)
    # [operator.getitem(_cache, i) for i in _data] *might* be faster.

then timeit.py -s"import ttrial" "ttrial.timetrial()"

If it's still too slow, then either Python simply isn't for this task, or
you need to optimize your functions.

If this isn't the real bottleneck:
- Perhaps the functions I'm memoizing need optimization themselves?
- Are there any modules out there which implement these in C already?
- How does the competing C++ code do it? Does it get its speed from some
wicked data structure for the memoizing, or are the individual functions
faster, or what?

If this *is* the only significant bottleneck, but you want to try harder to
speed it up, you can pursue the C closure path you were mentioning. With
the caveat that I don't do any CPython hacking, I know that closures are
implemented using co_cellvars and the LOAD_DEREF bytecode. The Python C api
does describe an api to cell objects, so that will be the place to look, if
it's possible to do a closure in C. It may not be faster than your
straightforward approach, of course.
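
For illustration, a quick way to see the closure bytecode in question
from Python itself: dis shows LOAD_DEREF for the free variables (cache,
callable) inside proxy, and LOAD_FAST for the local args.

import dis

def memoize(callable):
    cache = {}
    def proxy(*args):
        try:
            return cache[args]
        except KeyError:
            return cache.setdefault(args, callable(*args))
    return proxy

# Look for LOAD_DEREF (cache, callable) versus LOAD_FAST (args).
dis.dis(memoize(abs))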

However, comparing your Python to a C++ version on the basis of speed is
most certainly the wrong way to go. If the C++ is well designed with good
algorithms, it's going to be faster than Python. Period. The advantage of
Python is that it's much faster and easier to *write* (and less likely to
introduce bugs), easier to *read* and maintain, and can model your problem
more intuitively and flexibly. But if you need maximum speed at all costs,
the best Python can do for you is act as a glue language to bits and pieces
of hand-crafted C.

Do you really *need* the speed, or are you just bothered that it's not as
fast as the C++ version?
 

Jacek Generowicz

Francis Avila said:
It is probably possible to custom-build a code object which makes _cache a
local reference to the outer cache, and play with the bytecode to make the
LOAD_DEREF into a LOAD_FAST, but it's not worth the effort.

I'm tempted to agree.
Jacek, I think this is as fast as pure Python will get you with this
approach.

That's pretty much what I thought ... which is why I was (reluctantly)
investigating the C approach.
If the speed is simply not acceptable, there are some things you
need to consider, Is this the *real* bottleneck?

Well, when I profile the code, proxy comes out at the top in terms of
total time. Yes, there are other functions that are hotspots too. Yes,
maybe I'm being completely stupid, and 99% of those calls are unnecessary.

However, I simply thought that if I can increase the speed of "proxy"
by a factor of 3 or 7 or something, without too much pain, by recoding
it in C, then I'd get a noticeable speed gain overall. If it involves
re-hacking the interpreter, then I won't bother.
Consider this before you delve into C solutions, which greatly increase
headaches.

You don't need to tell me that :)
An easy way to check is to cut out the map and memoization and
just use a dictionary preloaded with known argument:value pairs for a given
set of test data.

e.g.,

def func(...):
    ...

data = [foo, bar, ....]
cache = dict([(i, func(i)) for i in data])

def timetrial(_data=data, _cache=cache):
    return map(_cache.__getitem__, _data)
    # [operator.getitem(_cache, i) for i in _data] *might* be faster.

then timeit.py -s"import ttrial" "ttrial.timetrial()"

I tried something similar, namely inlining the

    try: return cache[args]
    except KeyError: return cache.setdefault(args, callable(*args))

rather than using proxy. This gives a little over a factor of 3
speedup, but it's not something I can use everywhere, as often the
proxy is called inside map(...).
If it's still too slow, then either Python simply isn't for this task, or
you need to optimize your functions.

If this isn't the real bottleneck:
- Perhaps the functions I'm memoizing need optimization themselves?

But, to a pretty good approximation, they _never_ get called (the
value is (almost) always looked up). How would optimizing a function
that doesn't get called help?
- Are there any modules out there which implement these in C
already?

Other than the competing C++ implementation, no.
- How does the competing C++ code do it? Does it get its speed from some
wicked data structure for the memoizing, or are the individual functions
faster, or what?

I have about 10 functions, written in Python, which the profiler
highlights as taking up most of the time, between them (of which
"proxy" is the main culprit). The C++ version doesn't have _any_
Python in its inner loops. I think it's as simple as that.

(Admittedly, a group of 4 of my functions was initially written with
development ease in mind, and should be replaced by a single mean and
lean C(++) function. That's probably where the real bottleneck is.)
If this *is* the only significant bottleneck,

No, it's not the _only_ bottleneck.

Yes, I am pursuing other paths too. But I feel more confident that I
know what I am doing in the other cases, so I won't bother the list
with those. If you remember, my original question was about what I
could expect in terms of speedup when recoding a Python function in C,
which will make many calls to the Python C API. It's not that I'm
determined to fix my whole speed problem just in this one area, but I
would like to understand a bit more about what to expect when going to
C.
but you want to try harder to speed it up, you can pursue the C
closure path you were mentioning.

Yes, I think that this could be edifying. It's also likely to be a
huge pain, so I'm not that determined to pursue it, as I do have more
productive things to be getting on with.
However, comparing your Python to a C++ version on the basis of speed is
most certainly the wrong way to go. If the C++ is well designed with good
algorithms, its going to be faster than Python. Period.

Not if my inner loops are C.

(BTW, I don't believe that the C++ version is all that wonderfully
designed, but enough about that :)
But if you need maximum speed at all costs, the best Python can do
for you is act as a glue language to bits and pieces of hand-crafted
C.

I think that it's the other way around, in this case. I build up a
whole structure in Python ... and just need C(++) to call a very small
subset of the functionality I provide, very efficiently.
Do you really *need* the speed, or are you just bothered that it's not as
fast as the C++ version?

Politically speaking, yes, I need the speed. If a silly benchmark (one
which completely misrepresents reality, BTW) shows that the Python
version is slower than the C++ one, then the Python one is terminated.

However, as you point out:
The advantage of Python is that it's much faster and easier to
*write* (and less likely to introduce bugs), easier to *read* and
maintain, and can model your problem more intuitively and flexibly.

.... and for these reasons I am very keen to ensure that the Python
version survives ... which, ironically enough, means that it must be
fast, even though I personally don't care that much about the speed.
 
