Adding a Par construct to Python?

jeremy

From a user point of view I think that adding a 'par' construct to
Python for parallel loops would add a lot of power and simplicity,
e.g.

par i in list:
    updatePartition(i)

There would be no locking and it would be the programmer's
responsibility to ensure that the loop was truly parallel and correct.

The intention of this would be to speed up Python execution on multi-
core platforms. Within a few years we will see 100+ core processors as
standard and we need to be ready for that.

There could also be parallel versions of map, filter and reduce
provided.

BUT...none of this would be possible with the current implementation
of Python with its Global Interpreter Lock, which effectively rules
out true parallel processing.

See: http://jessenoller.com/2009/02/01/python-threads-and-the-global-interpreter-lock/

What do others think?

Jeremy Martin
 
jeremy

From a user point of view I think that adding a 'par' construct to
Python for parallel loops would add a lot of power and simplicity,
e.g.

par i in list:
    updatePartition(i)
....actually, thinking about this further, I think it would be good to
add a 'sync' keyword which causes a thread rendezvous within a
parallel loop. This would allow parallel loops to run for longer in
certain circumstances without having the overhead of stopping and
restarting all the threads, e.g.

par i in list:
    for j in iterations:
        updatePartition(i)
        sync
        commitBoundaryValues(i)
        sync

This example is a typical iteration over a grid, e.g. a finite element
calculation, where the boundary values need to be read by neighbouring
partitions before they are updated. It assumes that the new boundary
values are held in temporary variables until they can be safely
committed.
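
In current Python (3.2 and later), threading.Barrier offers a similar
rendezvous, so the pattern can be sketched with ordinary threads. The
function bodies below are placeholders, and under CPython's GIL this
only illustrates the synchronisation structure rather than a CPU
speed-up:

    import threading

    NUM_PARTITIONS = 4
    ITERATIONS = 10

    def updatePartition(i):
        pass   # placeholder: compute new values into temporaries

    def commitBoundaryValues(i):
        pass   # placeholder: publish boundary values to neighbours

    barrier = threading.Barrier(NUM_PARTITIONS)

    def worker(i):
        for j in range(ITERATIONS):
            updatePartition(i)
            barrier.wait()            # first 'sync': all updates finished
            commitBoundaryValues(i)
            barrier.wait()            # second 'sync' before the next sweep

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(NUM_PARTITIONS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()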

Jeremy
 
Steven D'Aprano

From a user point of view I think that adding a 'par' construct to
Python for parallel loops would add a lot of power and simplicity, e.g.

par i in list:
    updatePartition(i)

There would be no locking and it would be the programmer's
responsibility to ensure that the loop was truly parallel and correct.

What does 'par' actually do there?

Given that it is the programmer's responsibility to ensure that
updatePartition was actually parallelized, couldn't that be written as:

for i in list:
    updatePartition(i)

and save a keyword?
 
bearophileHUGS

Jeremy Martin, nowadays a parallel for can be useful, and in future
I'll try to introduce similar things in D too, but syntax isn't
enough: you need a way to actually run things in parallel, and Python
has the GIL.
To implement a good parallel for, your language may also need more
immutable data structures (think about "finger trees"); pure functions
can improve the safety of your code a lot, and so on.

The multiprocessing module in Python 2.6 already does something like
what you are talking about. For example, I have used the parallel map
from that module to almost double the speed of a small program of mine.
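
A minimal sketch of that kind of parallel map using multiprocessing.Pool
(the worker function and inputs here are purely illustrative):

    from multiprocessing import Pool

    def heavy_work(x):
        # stand-in for a CPU-bound function
        return sum(i * i for i in range(x))

    if __name__ == '__main__':
        pool = Pool()          # one worker process per core by default
        results = pool.map(heavy_work, range(10000, 10100))
        pool.close()
        pool.join()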

Bye,
bearophile
 
Steven D'Aprano

My reading of the OP is that it tells the interpreter that it can
execute any/all iterations of updatePartition(i) in parallel (or
presumably serially in any order) rather than serially in a strict
sequence.


No, because a "for" loop is defined to execute it's iterations serially
in a specific order. OTOH, a "par" loop is required to execute once for
each value, but those executions could happen in parallel or in any
order.

At least that's how I understood the OP.

I can try guessing what the OP is thinking just as well as anyone else,
but "in the face of ambiguity, refuse the temptation to guess" :)

It isn't clear to me what the OP expects the "par" construct is supposed
to actually do. Does it create a thread for each iteration? A process?
Something else? Given that the rest of Python will be sequential (apart
from explicitly parallelized functions), and that the OP specifies that
updatePartition still needs to handle its own parallelization, does it
really matter if the calls to updatePartition happen sequentially?

If it's important to make the calls in arbitrary order, random.shuffle
will do that. If there's some other non-sequential and non-random order
to the calls, the OP should explain what it is. What else, if anything,
does par do, that it needs to be a keyword and statement rather than a
function? What does it do that (say) a parallel version of map() wouldn't
do?

The OP also suggested:

"There could also be parallel versions of map, filter and reduce
provided."

It makes sense to talk about parallelizing map(), because you can
allocate a list of the right size to slot the results into as they become
available. I'm not so sure about filter(), unless you give up the
requirement that the filtered results occur in the same order as the
originals.

But reduce()? I can't see how you can parallelize reduce(). By its
nature, it has to run sequentially: it can't operate on the nth item
until it has operated on the (n-1)th item.
 
MRAB

Steven said:
But reduce()? I can't see how you can parallelize reduce(). By its
nature, it has to run sequentially: it can't operate on the nth item
until it has operated on the (n-1)th item.

It can calculate the items in parallel, but the final result must be
calculated in sequence, although if the final operation is commutative then
some of them could be done in parallel.
 
Diez B. Roggisch

But reduce()? I can't see how you can parallelize reduce(). By its
nature, it has to run sequentially: it can't operate on the nth item
until it has operated on the (n-1)th item.

That depends on the operation in question. Addition for example would
work. My math-skills are a bit too rusty to qualify the exact nature of
the operation, commutativity springs to my mind.

Diez
 
Roy Smith

Steven D'Aprano said:
But reduce()? I can't see how you can parallelize reduce(). By its
nature, it has to run sequentially: it can't operate on the nth item
until it has operated on the (n-1)th item.

Well, if you're willing to impose the additional constraint that f() must
be associative, then you could load the items into a tree, and work your
way up from the bottom of the tree, applying f() pairwise to the left and
right child of each node, propagating upward.

It would take k1 * O(n) to create the (unsorted) tree, and if all the pairs
in each layer really could be done in parallel, k2 * O(lg n) to propagate
the intermediate values. As long as k2 is large compared to k1, you win.

Of course, if the items are already in some random-access container (such
as a list), you don't even need to do the first step, but in the general
case of generating the elements on the fly with an iterable, you do. Even
with an iterable, you could start processing the first elements while
you're still generating the rest of them, but that gets a lot more
complicated and assuming k2 >> k1, of limited value.

If k2 is about the same as k1, then the whole thing is pointless.

But, this would be something to put in a library function, or maybe a
special-purpose Python derivative, such as numpy. Not in the core language.
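
A small, purely sequential sketch of the pairwise scheme described
above; it shows the O(lg n) layering, and a real implementation would
hand each layer's independent pairs to a pool of workers (f is assumed
to be associative, and the sequence non-empty):

    def tree_reduce(f, items):
        layer = list(items)
        while len(layer) > 1:
            # each pair below is independent, so a whole layer could be
            # evaluated in parallel
            next_layer = [f(a, b) for a, b in zip(layer[0::2], layer[1::2])]
            if len(layer) % 2:            # odd element carried to next layer
                next_layer.append(layer[-1])
            layer = next_layer
        return layer[0]

    # e.g. tree_reduce(lambda a, b: a + b, range(1, 11)) == 55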
 
Gary Herron

MRAB said:
It can calculate the items in parallel, but the final result must be
calculated in sequence, although if the final operation is commutative then
some of them could be done in parallel.

That should read "associative" not "commutative".

For instance A+B+C+D could be calculated sequentially as implied by
((A+B)+C)+D
or with some parallelism as implied by
(A+B)+(C+D)
That's an application of the associativity of addition.

Gary Herron
 
Steven D'Aprano

That depends on the operation in question. Addition for example would
work.

You'd think so, but you'd be wrong. You can't assume addition is always
commutative.
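
A representative interpreter check (strictly, it is associativity
rather than commutativity that fails, as later posts point out):

    >>> a, b, c = 1e16, -1e16, 1.0
    >>> (a + b) + c
    1.0
    >>> a + (b + c)
    0.0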

My math-skills are a bit too rusty to qualify the exact nature of
the operation, commutativity springs to my mind.

And how is reduce() supposed to know whether or not some arbitrary
function is commutative?
 
Steven D'Aprano

It can calculate the items in parallel,

I don't understand what calculation you are talking about. Let's take a
simple example:

reduce(operator.sub, [100, 50, 25, 5]) => 100-50-25-5 = 20

What calculations do you expect to do in parallel?

but the final result must be
calculated in sequence, although if the final operation is commutative then
some of them could be done in parallel.

But reduce() can't tell whether the function being applied is commutative
or not. I suppose it could special-case a handful of special cases (e.g.
operator.add for int arguments -- but not floats!) or take a caller-
supplied argument that tells it whether the function is commutative or
not. But in general, you can't assume the function being applied is
commutative or associative, so unless you're willing to accept undefined
behaviour, I don't see any practical way of parallelizing reduce().
 
Diez B. Roggisch

And how is reduce() supposed to know whether or not some arbitrary
function is commutative?

I don't recall anybody saying it should know that - do you? The OP wants
to introduce parallel variants, not replace the existing ones.

Diez
 
MRAB

Steven said:
It can calculate the items in parallel,

I don't understand what calculation you are talking about. Let's take a
simple example:

reduce(operator.sub, [100, 50, 25, 5]) => 100-50-25-5 = 20

What calculations do you expect to do in parallel?

but the final result must be
calculated in sequence, although if the final operation is commutative then
some of them could be done in parallel.

But reduce() can't tell whether the function being applied is commutative
or not. I suppose it could special-case a handful of special cases (e.g.
operator.add for int arguments -- but not floats!) or take a caller-
supplied argument that tells it whether the function is commutative or
not. But in general, you can't assume the function being applied is
commutative or associative, so unless you're willing to accept undefined
behaviour, I don't see any practical way of parallelizing reduce().

I meant associative, not commutative.

I was thinking about calculating the sum of a list of expressions, where
the expressions could be calculated in parallel.
 
Diez B. Roggisch

But reduce() can't tell whether the function being applied is commutative
or not. I suppose it could special-case a handful of special cases (e.g.
operator.add for int arguments -- but not floats!) or take a caller-
supplied argument that tells it whether the function is commutative or
not. But in general, you can't assume the function being applied is
commutative or associative, so unless you're willing to accept undefined
behaviour, I don't see any practical way of parallelizing reduce().

def reduce(operation, sequence, startitem=None, parallelize=False)

should be enough. Approaches such as OpenMP also don't guess, they use
explicit annotations.
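
A rough sketch of how such an opt-in variant might be built on the
multiprocessing module: chunk the input, reduce each chunk in a worker
process, then reduce the partial results. It is only correct for
associative operations, which is precisely the promise the caller makes
by passing parallelize=True; the names are illustrative and the
startitem parameter is omitted for brevity, so this is not a concrete
stdlib proposal:

    from functools import reduce as _reduce
    from multiprocessing import Pool

    class _ChunkReducer(object):
        # a picklable callable, so it can be shipped to worker processes
        def __init__(self, operation):
            self.operation = operation
        def __call__(self, chunk):
            return _reduce(self.operation, chunk)

    def preduce(operation, sequence, parallelize=False, chunks=4):
        items = list(sequence)
        if not parallelize or len(items) < 2 * chunks:
            return _reduce(operation, items)
        step = (len(items) + chunks - 1) // chunks
        pieces = [items[i:i + step] for i in range(0, len(items), step)]
        with Pool(len(pieces)) as pool:
            partials = pool.map(_ChunkReducer(operation), pieces)
        return _reduce(operation, partials)   # relies on associativity

    # e.g. preduce(operator.add, range(1000), parallelize=True)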

Diez
 
Paul Boddie

From a user point of view I think that adding a 'par' construct to
Python for parallel loops would add a lot of power and simplicity,
e.g.

par i in list:
    updatePartition(i)

You can do this right now with a small amount of work to make
updatePartition a callable which works in parallel, and without the
need for extra syntax. For example, with the pprocess module, you'd
use boilerplate like this:

import pprocess
queue = pprocess.Queue(limit=ncores)
updatePartition = queue.manage(pprocess.MakeParallel(updatePartition))

(See http://www.boddie.org.uk/python/pprocess/tutorial.html#Map for
details.)

At this point, you could use a normal "for" loop, and you could then
"sync" for results by reading from the queue. I'm sure it's a similar
story with the multiprocessing/processing module.

There would be no locking and it would be the programmer's
responsibility to ensure that the loop was truly parallel and correct.

Yes, that's the idea.

The intention of this would be to speed up Python execution on multi-
core platforms. Within a few years we will see 100+ core processors as
standard and we need to be ready for that.

In what sense are we not ready? Perhaps the abstractions could be
better, but it's definitely possible to run Python code on multiple
cores today and get decent core utilisation.

There could also be parallel versions of map, filter and reduce
provided.

Yes, that's what pprocess.pmap is for, and I imagine that other
solutions offer similar facilities.

BUT...none of this would be possible with the current implementation
of Python with its Global Interpreter Lock, which effectively rules
out true parallel processing.

See: http://jessenoller.com/2009/02/01/python-threads-and-the-global-interpreter-lock/

What do others think?

That your last statement is false: true parallel processing is
possible today. See the Wiki for a list of solutions:

http://wiki.python.org/moin/ParallelProcessing

In addition, Jython and IronPython don't have a global interpreter
lock, so you have the option of using threads with those
implementations, too.

Paul
 
Steven D'Aprano

I don't recall anybody saying it should know that - do you? The OP wants
to introduce parallel variants, not replace the existing ones.

Did I really need to spell it out? From context, I'm talking about the
*parallel version* of reduce().
 
Steven D'Aprano

def reduce(operation, sequence, startitem=None, parallelize=False)

should be enough. Approaches such as OpenMP also don't guess, they use
explicit annotations.

It would be nice if the OP would speak up and tell us what he intended,
so we didn't have to guess what he meant. We're getting further and
further away from his original suggestion of a "par" loop.

If you pass parallize=True, then what? Does it assume that operation is
associative, or take some steps to ensure that it is? Does it guarantee
to perform the operations in a specific order, or will it potentially
give non-deterministic results depending on the order that individual
calculations come back?

As I said earlier, parallelizing map() sounds very plausible to me, but
the approaches that people have talked about for parallelizing reduce()
so far sound awfully fragile and magical to me. But at least I've
learned one thing: given an associative function, you *can* parallelize
reduce using a tree. (Thanks Roy!)
 
jeremy


Hi Paul and others,

Thanks for your responses to my original questions.

Paul, thanks for explaining about the pprocess module, which appears
very useful. I presume that it uses multiple operating-system processes
rather than threads, which would imply that it is suited to coarse-
grained rather than fine-grained parallel programming because of the
overhead of starting new processes and sharing objects. (How is that
done, by the way?) It probably has advantages and disadvantages
compared with thread-based parallelism.

My suggestion is primarily about using multiple threads and sharing
memory - something akin to the OpenMP directives that one of you has
mentioned. To do this efficiently would involve removing the Global
Interpreter Lock, or switching to Jython or IronPython as you
mentioned.

However, I *do* actually want to add syntax to the language. I think
that 'par' makes sense as an official Python construct - the Occam
programming language has had it for twenty-five years. The reason for
this is ease of use. I would like to make it easy for amateur
programmers to exploit the natural parallelism in their algorithms.
For instance, somebody who wishes to calculate a property of each
member of a list of chemical structures using the Python Daylight
interface could, with my suggestion, potentially get a massive
speed-up just by changing 'for' to 'par' or 'map' to 'pmap' (or map
with a parallel keyword argument, as suggested). At present they would
have to chop up their work manually and run it as multiple processes
to achieve the same - fine for expert programmers, but not reasonable
for people working in other domains who wish to use Python as a
utility because of its fantastic productivity and ease of use.

Let me clarify what I think par, pmap, pfilter and preduce would mean
and how they would be implemented. A par loop is like a for loop,
except that the programmer is saying that the order in which the
iterations are performed doesn't matter and that they might be
performed in parallel. The Python system then has the option to
allocate a number of threads to the task and share out the iterations
among the threads. (It might be that the programmer should be allowed
to explicitly define the number of threads to use, or to delegate that
decision to the system.) Parallel pmap and pfilter would be
implemented in much the same way, although the resultant list might
have to be reassembled from the partial results returned by each
thread. As people have pointed out, parallel reduce is a trickier
proposition because it requires the binary operation to be
associative, in which case the result can be calculated using a tree-
based evaluation strategy.
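
A rough sketch of the chunk-per-thread scheme just described (under
CPython's GIL this only helps I/O-bound work, which is why the proposal
hinges on removing the GIL or using a GIL-free implementation):

    import threading

    def par(func, items, num_threads=4):
        # deal the items out to num_threads worker threads; the order in
        # which func is applied is deliberately unspecified, as in 'par'
        items = list(items)
        chunks = [items[i::num_threads] for i in range(num_threads)]

        def run(chunk):
            for item in chunk:
                func(item)

        threads = [threading.Thread(target=run, args=(chunk,))
                   for chunk in chunks]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    # e.g. par(updatePartition, partitions), with updatePartition as in
    # the earlier (hypothetical) example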

I have used all of OpenMP, MPI, and Occam in the past. OpenMP adds
parallelism to programs by the use of special comment strings, MPI by
explicit calls to library routines, and Occam by explicit syntactical
structures. Each has its advantages. I like the simplicity of OpenMP,
the cross-language portability of MPI, and the fact that concurrency is
built into the Occam language. What I am proposing here is a hybrid
of the OpenMP and Occam approaches - a change to the language which is
very natural and yet is easy for programmers to understand.
Concurrency is generally regarded as the hardest concept for
programmers to grasp.

Jeremy
 
Diez B. Roggisch

Steven said:
Did I really need to spell it out? From context, I'm talking about the
*parallel version* of reduce().

From what context - your repeated argument that there is no way for reduce
to determine whether the passed operation is associative (which no one
claimed to be possible, or needed)? No, from *that* context it wasn't clear.

It's the programmer who decides which version to use.

Diez
 
George Sakkis

My suggestion is primarily about using multiple threads and sharing
memory - something akin to the OpenMP directives that one of you has
mentioned. To do this efficiently would involve removing the Global
Interpreter Lock, or switching to Jython or Iron Python as you
mentioned.

However I *do* actually want to add syntax to the language.

Good luck with that. The GIL is not going away any time soon (or
probably ever) and as long as CPython is the "official"
implementation, there are almost zero chances of adding syntax support
for this. Besides, Guido and other py-devs are not particularly keen
on threads as a parallelization mechanism.

George
 
