make faster Richards benchmark

P

Peter Maas

Duncan said:
I'd appreciate any suggestions on how to make faster Python
implementations of Richards benchmark. Perhaps there are obvious
problems that can be corrected?

The most obvious problem: how does the Richards benchmark work?
Care to post the source code?

Mit freundlichen Gruessen,

Peter Maas
 
J

Josef Meile

Duncan said:
I'd appreciate any suggestions on how to make faster Python
implementations of Richards benchmark. Perhaps there are obvious
problems that can be corrected?

http://www.lissett.com/ben/bench1.htm
What's about including a second python implementation of the Richards
benchmark using psyco? You don't have to modify your code, you only have
to add two lines. It would be also interesting to see the differences
between both source codes.

Regards,
Josef
 
D

Duncan Booth

(e-mail address removed) (Duncan Lissett) wrote in
I'd appreciate any suggestions on how to make faster Python
implementations of Richards benchmark. Perhaps there are obvious
problems that can be corrected?

That should say "the Richards benchmark", named for Martin Richards.

Well, if I was going to do a Python implementation of this I would want to
look at what the benchmark is trying to achieve, and then program it fresh
from the ground up instead of trying to ape some other language. In
particular I expect that the classes with 'run' methods would probably
become generators with most or all of their state held as local variables.

For example, imagining that all the 'run' methods have first been renamed
'next', IdleTask becomes (untested):

def IdleTask(scheduler, v1, v2):
for i in range(v2):
if v1 & 1:
v1 = (v1 >> 1) ^ 0xD008
yield scheduler.release(DEVICEB)
else:
v1 = v1 >> 1
yield scheduler.release(DEVICEA)
yield scheduler.holdCurrent()

Next most obvious thing is to junk the linked lists and replace them with
ordinary Python builtin lists. Pass the relevant list around instead of the
id constants 'DEVICEA', 'DEVICEB' and you can get rid of all that lookup
overhead. This involves quite a major restructuring of the scheduler
though: hence my comment about understanding what the code is trying to
achieve and starting from the ground up.

Packet.addTo should look more like:

def addTo(self, queue):
queue.append(self)

and then vapourise entirely.


Minor points:

Remove the pointless use of the 'global' keyword in various places. Replace
the traceOn variable with __debug__ so you get the same benefits as
compiled languages by optimising out the test for the trace statements.

Remove the pointless set/get methods and just access the members directly.
If you are writing a benchmark they will cripple performance.

Avoiding multiple accesses to the same instance variable, or assigning to
instance variables until you are about to return from a method: use a local
during the execution of the method.
 
P

Peter Hansen

Josef said:
What's about including a second python implementation of the Richards
benchmark using psyco? You don't have to modify your code, you only have
to add two lines. It would be also interesting to see the differences
between both source codes.

I get an immediate 38% speedup by doing "import psyco; psyco.full()"
at the start of the benchmark.

-Peter
 
P

Peter Maas

Michel said:
On the site, click on the word "Python #92"

Thanks. Oh, how silly of me. I didn't recognize these as links,
still tuned to underscore links.

Mit freundlichen Gruessen,

Peter Maas
 
D

Duncan Lissett

-snip-
Well, if I was going to do a Python implementation of this I would want to
look at what the benchmark is trying to achieve, and then program it fresh
from the ground up instead of trying to ape some other language. In
particular I expect that the classes with 'run' methods would probably
become generators with most or all of their state held as local variables.

For example, imagining that all the 'run' methods have first been renamed
'next', IdleTask becomes (untested):

def IdleTask(scheduler, v1, v2):
for i in range(v2):
if v1 & 1:
v1 = (v1 >> 1) ^ 0xD008
yield scheduler.release(DEVICEB)
else:
v1 = v1 >> 1
yield scheduler.release(DEVICEA)
yield scheduler.holdCurrent()

Next most obvious thing is to junk the linked lists and replace them with
ordinary Python builtin lists. Pass the relevant list around instead of the
id constants 'DEVICEA', 'DEVICEB' and you can get rid of all that lookup
overhead. This involves quite a major restructuring of the scheduler
though: hence my comment about understanding what the code is trying to
achieve and starting from the ground up.

Packet.addTo should look more like:

def addTo(self, queue):
queue.append(self)

and then vapourise entirely.


Minor points:

Remove the pointless use of the 'global' keyword in various places. Replace
the traceOn variable with __debug__ so you get the same benefits as
compiled languages by optimising out the test for the trace statements.

Thanks, these are all interesting suggestions.

Remove the pointless set/get methods and just access the members directly.
If you are writing a benchmark they will cripple performance.

Avoiding multiple accesses to the same instance variable, or assigning to
instance variables until you are about to return from a method: use a local
during the execution of the method.

The pointless set/get methods are pointless in the other language
implementations as well - that's the point. (And I'll cripple that
Oberon-2 implementation real soon by enforcing privacy with modules
and set/get procedures.)

OTOH there's definitely a place for a truly Pythonic implementation
here:
http://www.lissett.com/ben/bench3.htm

best wishes, Duncan
 
D

Duncan Lissett

Wilk said:
import psyco
psyco.full()

2 times faster :)

And Simon: "I just tried the benchmark with Psyco, and cut the run
time for input=10000 from 8.1 seconds to 3.4"

Joseph, Simon, Wilk: thanks for the Psyco suggestion, and special
thanks for going ahead and trying it - we get slightly better than x2.

Given how little I know about Python, my assumption is that there's
scope for x10 improvement by writing better code... a more Pythonic
version for http://www.lissett.com/ben/bench3.htm

Suggestions appreciated.

best wishes, Duncan
 
P

Peter Hansen

Duncan said:
And Simon: "I just tried the benchmark with Psyco, and cut the run
time for input=10000 from 8.1 seconds to 3.4"

Joseph, Simon, Wilk: thanks for the Psyco suggestion, and special
thanks for going ahead and trying it - we get slightly better than x2.

What platform is this on? On WinXP, with an AMD 2500+ chip, and lots
of memory etc., I'm still getting only 38% speedup, nowhere near 100%...

-Peter
 
J

Jack Diederich

The pointless set/get methods are pointless in the other language
implementations as well - that's the point. (And I'll cripple that
Oberon-2 implementation real soon by enforcing privacy with modules
and set/get procedures.)

I would leave python and some other languages out of the comparison.
You can do near line-for-line translations for languages in the same class
- say C++ and Java, or Assembly and C. Requiring the python/perl/ruby
versions to look like a C++ program just measures how badly C++ maps to
Python, and not much else.

I've even seen some line-for-line translations in production use that
use python list-of-strings where the orignal version used char arrarys.
You can imagine what that does for performance *shudder*.

Writing benchmarks is just hard, if you allow people to solve the problem
in whatever way they like you end up measuring how good a coder the
language A guy is compared to the language B submitter.

-jackdied
 
P

Peter Hansen

Jack said:
I would leave python and some other languages out of the comparison.
You can do near line-for-line translations for languages in the same class
- say C++ and Java, or Assembly and C. Requiring the python/perl/ruby
versions to look like a C++ program just measures how badly C++ maps to
Python, and not much else.

I've even seen some line-for-line translations in production use that
use python list-of-strings where the orignal version used char arrarys.
You can imagine what that does for performance *shudder*.

Writing benchmarks is just hard, if you allow people to solve the problem
in whatever way they like you end up measuring how good a coder the
language A guy is compared to the language B submitter.

In the case of the Richards benchmark, it's better to say "specifying
benchmarks is hard". The problem that I saw with it is that it
describes the problem in terms that are inappropriate for many of the
languages it has been translated to, effectively constraining the
implementation in ways that are more suitable to certain languages
than others.

One key example is the requirement for "global registers" which act
similarly to how the low-level registers in the CPU have to be handled,
specifically by saving and restoring them in a task-local context
whenever a task switch occurs. This is clearly what Richards really
wanted, but I'd argue that the approach is ill-conceived, especially
in this century...

On the other hand, if the specification had been more thoroughly
thought out, or perhaps modernized is a more appropriate way of looking
at it, then it would describe the expected *use and behaviour* of the
tasks in terms of black box testability... in other words, it shouldn't
matter that "global registers" are used, but merely that the
benchmark produces certain results.

Without those tests, it's fairly open to interpretation just how
much deviation from the original BCPL implementation is appropriate.
Duncan L's ideas for a generator-based approach are interesting,
though there are probably those who would argue that it wouldn't
follow the benchmark specs exactly. (Especially if it showed
performance much closer to C. <wink>)

-Peter
 
D

Duncan Lissett

-snip-
Next most obvious thing is to junk the linked lists and replace them with
ordinary Python builtin lists.

After the Packet links were replace with a Python list the code was
~1% slower...
Replace the traceOn variable with __debug__ so you get the same benefits as
compiled languages by optimising out the test for the trace statements.

Using __debug__ made the code ~1% faster
Remove the pointless set/get methods and just access the members directly.

As a special dispensation to Python, directly accessed Packet and Tcb
fields, ~10% faster

Avoiding multiple accesses to the same instance variable, or assigning to
instance variables until you are about to return from a method: use a local
during the execution of the method.

Seems we're pushing into code optimization rather than not coding
badly.

Being careful about the tests in conditionals, and reordering some
conditionals sped things up some more. (I should roll the same changes
into the other langauge implementations.)

The OO style implementation took 104.4s and now takes 87.1s

I took the OO style implementation, made the methods of the scheduler
class into functions, made the fields into globals, and made the Task
class run methods into top-level functions.

The C style implementation took 114.9s and the OO conversion takes
80.3s
http://www.lissett.com/ben/bench3.htm

Further suggestions are welcome (yes, I will try pysco later)
 
D

Duncan Lissett

Jack Diederich said:
I would leave python and some other languages out of the comparison.
You can do near line-for-line translations for languages in the same class
- say C++ and Java, or Assembly and C. Requiring the python/perl/ruby
versions to look like a C++ program just measures how badly C++ maps to
Python, and not much else.

There's no requirement that Python versions look like a C++ program.
Peter Hansen chose to write a Python version closely based on the C
implementation. I've now replaced that with a faster version initially
based on a Smalltalk OO implementation.

-snip-
Writing benchmarks is just hard, if you allow people to solve the problem
in whatever way they like you end up measuring how good a coder the
language A guy is compared to the language B submitter.

Maybe the hardest thing about benchmarks is keeping some perspective
on what they might and might not mean ;-)

Yes, the fine performance of the BCPL interpreter implementation (and
the C implementation) has a lot to do with programmer skill. But I'm
not that sure about the other versions ;-)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top