os.wait() losing child?

Jason Zheng · Jul 10, 2007

This may be a silly question, but is it possible for os.wait() to lose
track of child processes? I'm running Python 2.4.4 on Linux kernel 2.6.20
(i686), gcc 4.1.1, and glibc-2.5.

Here's what happened in my situation. I first created a few child
processes with Popen, then in a while(True) loop waited for any of the
child processes to exit, and then restarted that child process:

import os
from subprocess import Popen

pids = {}

for i in xrange(3):
    p = Popen('sleep 1', shell=True, cwd='/home/user',
              stdout=file(os.devnull,'w'))
    pids[p.pid] = i

while (True):
    pid = os.wait()
    i = pids[pid]
    del pids[pid]
    print "Child Process %d terminated, restarting" % i
    if (someCondition):
        break
    p = Popen('sleep 1', shell=True, cwd='/home/user',
              stdout=file(os.devnull,'w'))
    pids[p.pid] = i

Soon after I started running this program, I discovered that some of the
processes stopped showing up, and eventually os.wait() gave an error
saying that there were no more child processes to wait on. Can anyone
tell me what I did wrong?

greg · Jul 10, 2007

Jason said:
while (True):
    pid = os.wait()
    ...
    if (someCondition):
        break
    ...

Are you sure that someCondition() always becomes true
when the list of pids is empty? If not, you may end
up making more wait() calls than there are children.

It might be safer to do

while pids:
    ...

Jason Zheng · Jul 10, 2007

Hate to reply to my own thread, but this is the working program that can
demonstrate what I posted earlier:

import os
from subprocess import Popen

pids = {}
counts = [0,0,0]

for i in xrange(3):
    p = Popen('sleep 1', shell=True, cwd='/home',
              stdout=file(os.devnull,'w'))
    pids[p.pid] = i
    print "Starting child process %d (%d)" % (i,p.pid)

while (True):
    (pid,exitstat) = os.wait()
    i = pids[pid]
    del pids[pid]
    counts[i]=counts[i]+1

    #terminate if count>10
    if (counts[i]==10):
        print "Child Process %d terminated." % i
        if reduce(lambda x,y: x and (y>=10), counts):
            break
        continue

    print "Child Process %d terminated, restarting" % i
    p = Popen('sleep 1', shell=True, cwd='/home',
              stdout=file(os.devnull,'w'))
    pids[p.pid] = i

Jason Zheng · Jul 10, 2007

greg said:
Are you sure that someCondition() always becomes true
when the list of pids is empty? If not, you may end
up making more wait() calls than there are children.

Regardless of the nature of someCondition, what I see from the print
output of my Python program is that some child processes never trigger
the unblocking of the os.wait() call.

~Jason

greg · Jul 11, 2007

Jason said:
Hate to reply to my own thread, but this is the working program that can
demonstrate what I posted earlier:

I've figured out what's going on. The Popen class has a
__del__ method which does a non-blocking wait of its own.
So you need to keep the Popen instance for each subprocess
alive until your wait call has cleaned it up.

The following version seems to work okay.

import os
from subprocess import Popen

pids = {}
counts = [0,0,0]
p = [None, None, None]

for i in xrange(3):
    p[i] = Popen('sleep 1', shell=True)
    pids[p[i].pid] = i
    print "Starting child process %d (%d)" % (i,p[i].pid)

while (True):
    (pid,exitstat) = os.wait()
    i = pids[pid]
    del pids[pid]
    counts[i]=counts[i]+1

    #terminate if count>10
    if (counts[i]==10):
        print "Child Process %d terminated." % i
        if reduce(lambda x,y: x and (y>=10), counts):
            break
        continue

    print "Child Process %d (%d) terminated, restarting" % (i, pid),
    p[i] = Popen('sleep 1', shell=True)
    pids[p[i].pid] = i
    print "(%d)" % p[i].pid

Jason Zheng · Jul 11, 2007

Greg,

That explains it! Thanks a lot for your help. I guess this is something
they do to prevent zombie processes?

~Jason

Jason Zheng · Jul 11, 2007

greg said:
I've figured out what's going on. The Popen class has a
__del__ method which does a non-blocking wait of its own.
So you need to keep the Popen instance for each subprocess
alive until your wait call has cleaned it up.

The following version seems to work okay.

It still doesn't work on my machine. I took a closer look at the Popen
class, and I think the problem is that the __init__ method always calls
a method _cleanup, which polls every existing Popen instance. The poll
method does a nonblocking wait.

If one of my child processes finishes as I create a new Popen instance,
then the _cleanup method effectively de-zombifies the child process, so
I can no longer expect to see that pid returned by os.wait().
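
The effect described above can be reproduced on a current Python 3 with a minimal sketch. This is not the Python 2.4 code from the thread: the function name and the exit code 7 are made up for illustration, and it calls poll() directly rather than relying on _cleanup. It only shows that once Popen's non-blocking wait has collected a child, the OS no longer reports that pid to the parent.

```python
import os
import subprocess
import sys
import time


def poll_then_waitpid():
    """Let Popen.poll() reap a child, then ask the OS about the same pid."""
    # Hypothetical child: a Python process that exits with status 7.
    child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(7)"])

    # poll() does a non-blocking waitpid; once it returns a value,
    # the child has been reaped by the subprocess module.
    while child.poll() is None:
        time.sleep(0.01)

    try:
        os.waitpid(child.pid, os.WNOHANG)
    except ChildProcessError:
        # ECHILD: the pid is no longer a waitable child of this process.
        return child.returncode, "already reaped"
    return child.returncode, "still waitable"


print(poll_then_waitpid())
```

The exit status is still available through the Popen object, but a separate os.wait() or os.waitpid() call will not see that child again.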

~Jason

Matthew Woodcraft · Jul 11, 2007

greg said:
I've figured out what's going on. The Popen class has a
__del__ method which does a non-blocking wait of its own.
So you need to keep the Popen instance for each subprocess
alive until your wait call has cleaned it up.

I don't think this will be enough for the poster, who has Python 2.4:
in that version, opening a new Popen object would trigger the wait on
all 'outstanding' Popen-managed subprocesses.

It seems to me that subprocess.py assumes that it will do all wait()ing
on its children itself; I'm not sure if it's safe to rely on the
details of how this is currently arranged.

Perhaps a better way would be for subprocess.py to provide its own
variant of os.wait() for people who want 'wait-for-any-child' (though
it would be hard to support programs which also had children not
managed by subprocess.py).

-M-

Jason Zheng · Jul 11, 2007

Matthew said:
I don't think this will be enough for the poster, who has Python 2.4:
in that version, opening a new Popen object would trigger the wait on
all 'outstanding' Popen-managed subprocesses.

It seems to me that subprocess.py assumes that it will do all wait()ing
on its children itself; I'm not sure if it's safe to rely on the
details of how this is currently arranged.

Perhaps a better way would be for subprocess.py to provide its own
variant of os.wait() for people who want 'wait-for-any-child' (though
it would be hard to support programs which also had children not
managed by subprocess.py).

-M-

Thanks, that's exactly what I need; my program really needs os.wait()
to be reliable. Perhaps I could pass a flag to Popen to tell it never to
os.wait() on the new pid (but it's ok to os.wait() on other Popen
instances upon _cleanup()).

Nick Craig-Wood · Jul 11, 2007

Jason Zheng said:
It still doesn't work on my machine. I took a closer look at the Popen
class, and I think the problem is that the __init__ method always calls
a method _cleanup, which polls every existing Popen instance. The poll
method does a nonblocking wait.

If one of my child processes finishes as I create a new Popen instance,
then the _cleanup method effectively de-zombifies the child process, so
I can no longer expect to see the return of that pid on os.wait()
any more.

The problem you are having is you are letting Popen do half the job
and doing the other half yourself.

Here is a way which works, done completely with Popen. Polling the
subprocesses is slightly less efficient than using os.wait() but does
work. In practice you want to do this anyway to see if your children
exceed their time limits etc.

import os
import time
from subprocess import Popen

processes = []
counts = [0,0,0]

for i in xrange(3):
    p = Popen('sleep 1', shell=True, cwd='/home', stdout=file(os.devnull,'w'))
    processes.append(p)
    print "Starting child process %d (%d)" % (i, p.pid)

while (True):
    for i,p in enumerate(processes):
        exitstat = p.poll()
        pid = p.pid
        if exitstat is not None:
            break
    else:
        time.sleep(0.1)
        continue
    counts[i]=counts[i]+1

    #terminate if count>10
    if (counts[i]==10):
        print "Child Process %d terminated." % i
        if reduce(lambda x,y: x and (y>=10), counts):
            break
        continue

    print "Child Process %d terminated, restarting" % i
    processes[i] = Popen('sleep 1', shell=True, cwd='/home', stdout=file(os.devnull,'w'))

Jason Zheng · Jul 11, 2007

Nick said:
The problem you are having is you are letting Popen do half the job
and doing the other half yourself.

Except that I never wanted Popen to do any process management for me to
begin with. The Popen class advertises itself as a replacement for
os.popen, popen2, popen4, etc., and IMHO it should leave the
clean-up to the users, or at least leave it as an option.

Here is a way which works, done completely with Popen. Polling the
subprocesses is slightly less efficient than using os.wait() but does
work. In practice you want to do this anyway to see if your children
exceed their time limits etc.

I think your polling way works; it seems there is no other way around
this problem other than polling or extending the Popen class.

thanks,

Jason

Nick Craig-Wood · Jul 12, 2007

Jason Zheng said:
Except that I never wanted Popen to do any process management for me to
begin with. The Popen class advertises itself as a replacement for
os.popen, popen2, popen4, etc., and IMHO it should leave the
clean-up to the users, or at least leave it as an option.

I think your polling way works; it seems there is no other way around
this problem other than polling or extending the Popen class.

I think polling is probably the right way of doing it...

Internally, subprocess uses os.waitpid(pid) to wait only for its own
specific pids. IMHO this is the right way of doing it, rather than
os.wait(), which waits for any child. os.wait() can reap children that
you weren't expecting (say some library uses os.system())...
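
To illustrate the difference, here is a small sketch for a current Python 3 on a Unix system. It is not taken from subprocess.py: the helper names and exit codes are made up, and it uses os.fork() directly so that no Popen object is involved. It waits for one specific child by pid and leaves the other child alone until it is asked for explicitly.

```python
import os


def spawn_child(exit_code):
    """Fork a child that exits at once with exit_code; return its pid."""
    pid = os.fork()
    if pid == 0:
        # Child: leave immediately without running cleanup handlers.
        os._exit(exit_code)
    return pid


def reap_in_chosen_order():
    first = spawn_child(3)
    second = spawn_child(5)

    # Ask for the second child by pid. The first child is not touched:
    # it stays a zombie (or keeps running) until it is waited for.
    pid_b, status_b = os.waitpid(second, 0)
    pid_a, status_a = os.waitpid(first, 0)

    return [(pid_b == second, os.WEXITSTATUS(status_b)),
            (pid_a == first, os.WEXITSTATUS(status_a))]


print(reap_in_chosen_order())
```

With os.wait() in place of the first os.waitpid() call, whichever child happened to exit first would be returned, including children the caller did not start.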

Hrvoje Niksic · Jul 12, 2007

Nick Craig-Wood said:
I think polling is probably the right way of doing it...

It requires the program to wake up every 0.1s to poll for freshly
exited subprocesses. That doesn't consume excess CPU cycles, but it
does prevent the kernel from swapping it out when there is nothing to
do. Sleeping in os.wait allows the operating system to know exactly
what the process is waiting for, and to move it out of the way until
those conditions are met. (Pedants would also notice that polling
introduces on average 0.1/2 seconds delay between the subprocess dying
and the parent reaping it.)

In general, a program that waits for something should do so in a
single call to the OS. OP's usage of os.wait was exactly correct.

Fortunately the problem can be worked around by hanging on to Popen
instances until they are reaped. If all of them are kept referenced
when os.wait is called, they will never end up in the _active list
because the list is only populated in Popen.__del__.

Internally, subprocess uses os.waitpid(pid) to wait only for its own
specific pids. IMHO this is the right way of doing it, rather than
os.wait(), which waits for any child. os.wait() can reap children
that you weren't expecting (say some library uses os.system())...

system calls waitpid immediately after the fork. This can still be a
problem for applications that call wait in a dedicated thread, but the
program can always ignore the processes it doesn't know anything
about.

Hrvoje Niksic · Jul 12, 2007

Jason Zheng said:
It still doesn't work on my machine. I took a closer look at the Popen
class, and I think the problem is that the __init__ method always
calls a method _cleanup, which polls every existing Popen
instance.

Actually, it's not that bad. _cleanup only polls the instances that
are no longer referenced by user code, but still running. If you hang
on to Popen instances, they won't be added to _active, and __init__
won't reap them (_active is only populated from Popen.__del__).

This version is a trivial modification of your code to that effect.
Does it work for you?

#!/usr/bin/python

import os
from subprocess import Popen

pids = {}
counts = [0,0,0]

for i in xrange(3):
    p = Popen('sleep 1', shell=True, cwd='/home', stdout=file(os.devnull,'w'))
    pids[p.pid] = p, i
    print "Starting child process %d (%d)" % (i,p.pid)

while (True):
    pid, ignored = os.wait()
    try:
        p, i = pids[pid]
    except KeyError:
        # not one of ours
        continue
    del pids[pid]
    counts[i] += 1

    #terminate if count>10
    if (counts[i]==10):
        print "Child Process %d terminated." % i
        if reduce(lambda x,y: x and (y>=10), counts):
            break
        continue

    print "Child Process %d terminated, restarting" % i
    p = Popen('sleep 1', shell=True, cwd='/home', stdout=file(os.devnull,'w'))
    pids[p.pid] = p, i

Jason Zheng · Jul 12, 2007

Hrvoje said:
Actually, it's not that bad. _cleanup only polls the instances that
are no longer referenced by user code, but still running. If you hang
on to Popen instances, they won't be added to _active, and __init__
won't reap them (_active is only populated from Popen.__del__).

Perhaps that's the difference between Python 2.4 and 2.5. In 2.4,
Popen's __init__ always appends self to _active:

def __init__(...):
    _cleanup()
    ...
    self._execute_child(...)
    ...
    _active.append(self)

This version is a trivial modification of your code to that effect.
Does it work for you?

Nope it still doesn't work. I'm running python 2.4.4, tho.

$ python test.py
Starting child process 0 (26497)
Starting child process 1 (26498)
Starting child process 2 (26499)
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated, restarting
Child Process 2 terminated.
Traceback (most recent call last):
  File "test.py", line 15, in ?
    pid, ignored = os.wait()
OSError: [Errno 10] No child processes

Nick Craig-Wood · Jul 12, 2007

Hrvoje Niksic said:
It requires the program to wake up every 0.1s to poll for freshly
exited subprocesses. That doesn't consume excess CPU cycles, but it
does prevent the kernel from swapping it out when there is nothing to
do. Sleeping in os.wait allows the operating system to know exactly
what the process is waiting for, and to move it out of the way until
those conditions are met. (Pedants would also notice that polling
introduces on average 0.1/2 seconds delay between the subprocess dying
and the parent reaping it.)

Sure!

You could get rid of this by sleeping until a SIGCHLD arrived maybe.

In general, a program that waits for something should do so in a
single call to the OS. OP's usage of os.wait was exactly correct.

Disagree for the reason below.

system calls waitpid immediately after the fork.

os.system probably wasn't the best example, but you take my point I
think!

This can still be a problem for applications that call wait in a
dedicated thread, but the program can always ignore the processes
it doesn't know anything about.

Ignoring them isn't good enough because it means that the bit of code
which was waiting for that process to die with os.getpid() will never
get called, causing a deadlock in that bit of code.

What is really required is a select() like interface to wait which
takes more than one pid. I don't think there is such a thing though,
so polling is your next best option.
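
The "sleep until a SIGCHLD arrives" idea from earlier in this post could look roughly like the sketch below, written for a current Python 3 on Linux. The function name, the fallback_timeout parameter and the example commands are assumptions for illustration; signal.pthread_sigmask and signal.sigtimedwait did not exist in Python 2.4. The parent still checks each Popen with poll(), but between checks it sleeps in the kernel until a child changes state instead of waking up every 0.1s.

```python
import signal
import subprocess
import sys


def wait_all_with_sigchld(commands, fallback_timeout=1.0):
    """Run commands; return their exit codes, sleeping on SIGCHLD between polls."""
    watched = {signal.SIGCHLD}
    # Block SIGCHLD so it stays pending until sigtimedwait() collects it.
    # Children inherit this mask, which is harmless for short commands.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, watched)
    try:
        procs = [subprocess.Popen(cmd) for cmd in commands]
        codes = [None] * len(procs)
        while None in codes:
            for idx, proc in enumerate(procs):
                if codes[idx] is None:
                    codes[idx] = proc.poll()
            if None in codes:
                # Sleep until some child changes state. The timeout is only
                # a safety net in case a signal is missed or coalesced.
                signal.sigtimedwait(watched, fallback_timeout)
        return codes
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    cmds = [[sys.executable, "-c", "raise SystemExit(%d)" % n] for n in (0, 2, 4)]
    print(wait_all_with_sigchld(cmds))
```

Each wake-up still costs one poll() per outstanding child, so this removes the fixed polling interval but not the O(n) check.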

Matthew Woodcraft · Jul 12, 2007

Jason Zheng said:
Perhaps that's the difference between Python 2.4 and 2.5. In 2.4,
Popen's __init__ always appends self to _active:

Yes, that changed between 2.4 and 2.5.

Note that if you take a copy of 2.5's subprocess.py, it ought to work
fine with 2.4.

-M-

Jason Zheng · Jul 12, 2007

Nick said:
Sure!

You could get rid of this by sleeping until a SIGCHLD arrived maybe.

Yah, I could also just dump Popen class and use fork(). But then what's
the point of having an abstraction layer any more?

Ignoring them isn't good enough because it means that the bit of code
which was waiting for that process to die with os.getpid() will never
get called, causing a deadlock in that bit of code.

Are you talking about something like os.waitpid(os.getpid())? If the
process has completed and been de-zombified by another os.wait() call, I
thought it would just throw an exception; it won't cause a deadlock by
hanging the process.

~Jason

Hrvoje Niksic · Jul 12, 2007

Nick Craig-Wood said:
Ignoring them isn't good enough because it means that the bit of
code which was waiting for that process to die with os.getpid() will
never get called, causing a deadlock in that bit of code.

It won't deadlock, it will get an ECHILD or equivalent error because
it's waiting for a PID that doesn't correspond to a running child
process. I agree that this can be a problem if and when you use
libraries that can call system. (In that case sleeping for SIGCHLD is
probably a good solution.)

What is really required is a select() like interface to wait which
takes more than one pid. I don't think there is such a thing
though, so polling is your next best option.

Except for the problems outlined in my previous message. And the fact
that polling becomes very expensive (O(n) per check) once the number
of processes becomes large. Unless one knows that a library can and
does call system, wait is the preferred solution.

Hrvoje Niksic · Jul 13, 2007

Jason Zheng said:
Hrvoje said:

Actually, it's not that bad. _cleanup only polls the instances that
are no longer referenced by user code, but still running. If you hang
on to Popen instances, they won't be added to _active, and __init__
won't reap them (_active is only populated from Popen.__del__).

Perhaps that's the difference between Python 2.4 and 2.5. [...]
Nope it still doesn't work. I'm running python 2.4.4, tho.

That explains it, then, and also why greg's code didn't work. You
still have the option to try to run 2.5's subprocess.py under 2.4.
