multiprocessing eats memory

Max Ivanov

I'm playing with the pyprocessing module and found that it eats a lot
of memory. I've made a small test case to show it. I pass ~45 MB of
data to the worker processes and then get it back slightly modified.
At any time there should be no more than two copies of the data in the
main process (the original data and one result). I ran it on an 8-core
server, and top shows that the main process eats ~220 MB and the worker
processes eat 90-150 MB. Isn't that too much?

A small test case is uploaded to pastebin: http://pastebin.ca/1210523
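
The paste above may no longer be available; here is a rough sketch of
the kind of test case described, assuming Pool.apply_async with a
polled set of AsyncResult objects (the modify function and data size
are placeholders, not the original code):

import multiprocessing
import time

def modify(data):
    # Hypothetical worker: gets a pickled copy of the data and
    # returns another full copy, slightly modified.
    return [x + 1 for x in data]

if __name__ == '__main__':
    data = list(range(6 * 1000 * 1000))  # stand-in for the ~45 MB payload
    pool = multiprocessing.Pool(processes=8)
    # Each apply_async call pickles its own copy of data, and the
    # asyncs set keeps every pending result (and its data) alive.
    asyncs = set(pool.apply_async(modify, (data,)) for _ in range(8))
    while asyncs:
        for a in list(asyncs):       # iterate over a copy of the set
            if a.ready():
                result = a.get()     # another full copy comes back
                asyncs.remove(a)
        time.sleep(0.1)
    pool.close()
    pool.join()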
 
Istvan Albert

At any time there should be no more than two copies of the data in the
main process (the original data and one result).

From the looks of it, you are storing a lot of references to various
copies of your data via the asyncs set.
 
redbaron

From the looks of it, you are storing a lot of references to various
copies of your data via the asyncs set.

How could I avoid storing them? I need something to check whether each
one is ready or not, and to retrieve the results when it is. I can't
see a way to achieve the same result without storing the asyncs set.
 
MRAB

How could I avoid storing them? I need something to check whether each
one is ready or not, and to retrieve the results when it is. I can't
see a way to achieve the same result without storing the asyncs set.

You could give each worker process an ID and then have it put that ID
into a queue to signal to the main process when it has finished.
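
For example, a minimal sketch of that signaling scheme, assuming a
shared multiprocessing.Queue (the worker function and the work itself
are illustrative):

import multiprocessing

def worker(worker_id, data, done_queue):
    result = [x + 1 for x in data]    # stand-in for the real work
    done_queue.put(worker_id)         # signal the main process by ID

if __name__ == '__main__':
    done_queue = multiprocessing.Queue()
    data = list(range(1000))
    procs = [multiprocessing.Process(target=worker,
                                     args=(i, data, done_queue))
             for i in range(8)]
    for p in procs:
        p.start()
    for _ in procs:
        # Blocks until some worker reports in; no asyncs set needed.
        print('worker %d finished' % done_queue.get())
    for p in procs:
        p.join()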

BTW, your test-case modifies the asyncs set while iterating over it,
which is a bad idea.
 
redbaron

You could give each worker process an ID and then have it put that ID
into a queue to signal to the main process when it has finished.
And how could I then retrieve the result from a worker process without
the async?
BTW, your test-case modifies the asyncs set while iterating over it,
which is a bad idea.
My fault; it was list(asyncs) originally.
 
Istvan Albert

How could I avoid storing them? I need something to check whether each
one is ready or not, and to retrieve the results when it is. I can't
see a way to achieve the same result without storing the asyncs set.

It all depends on what you are trying to do. The issue that you
originally brought up is that of memory consumption.

When processing data in parallel, you will use up as much memory as
there are datasets being processed at any given time. If you need to
reduce memory use, then you need to start fewer processes and use some
mechanism to distribute the work to them as they become free (see the
recommendation above that uses queues).
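
A sketch of that arrangement, assuming a fixed number of worker
processes fed from a shared task queue (all names and sizes are
illustrative); note that results can travel back through a queue as
well:

import multiprocessing

NUM_WORKERS = 2   # fewer processes means fewer live copies of the data

def worker(task_queue, result_queue):
    # Pull chunks until the None sentinel arrives; push results back.
    for chunk in iter(task_queue.get, None):
        result_queue.put([x + 1 for x in chunk])

if __name__ == '__main__':
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker,
                                       args=(task_queue, result_queue))
               for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()
    chunks = [list(range(i, i + 1000)) for i in range(0, 8000, 1000)]
    for chunk in chunks:
        task_queue.put(chunk)
    results = [result_queue.get() for _ in chunks]  # order not guaranteed
    for _ in workers:
        task_queue.put(None)     # one sentinel per worker shuts it down
    for w in workers:
        w.join()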
 
R

redbaron

When processing data in parallel, you will use up as much memory as
there are datasets being processed at any given time.
The worker processes eat 2-4 times more memory than I pass to them.

If you need to reduce memory use, then you need to start fewer
processes and use some mechanism to distribute the work to them as
they become free (see the recommendation above that uses queues).
I don't understand how I could use a Queue here. If a worker process
finishes computing, it puts its ID into the Queue, and in the main
process I retrieve that ID; but how do I then retrieve the result from
the worker process?
 
