help with timed command, CHLD signals, return codes

axeman

Can anyone help me understand the behavior of the following code and
recommend a course of action? Thank you.


Background:

I have a daemon process which repeatedly spawns child processes with a
limited time to live, and collects the return code and output. It uses
the typical mechanisms outlined in Programming Perl and the Perl
Cookbook. Recently, over time, commands which return 0 "degrade" to
returning 16777215 each time (-1 before shift). I have witnessed this
on AIX 4.3 + Perl 5.005_03 and AIX 5.3 + Perl 5.8.2.

Narrowing it down to the simplified code snippet below, the convention
marked by line C: is what I have been using in the daemon. Research
has suggested a CHLD signal handler may be required, but when I use
either lines A: or B:, I get a return code of 16777215 every time.
Lastly, I do not want to use non-core modules, as this tool is widely
used and I don't want to mandate additional build/install. Thanks -
appreciate any insight/guidance.

==================================================================

#!/bin/perl

$SIG{'ALRM'} = \&timed_out;

use POSIX ":sys_wait_h";
#A: $SIG{'CHLD'} = \&REAPER;
#B: $SIG{'CHLD'} = 'IGNORE';

while(1){
    ($rc,@out) = eval {
        alarm(5);
        @eout = `echo hi 2>&1`;
        $erc = ($? >> 8);
        alarm(0);
        return ($erc,@eout);
    };
    if( $@ =~ /^timed out/ ) {
        $rc = 1;
        print "timed out\n";
    }
    print "rc = $rc, out = @out\n";
    C: while( ($stiff = waitpid(-1,&WNOHANG)) > 0 ) {}
    sleep(1);
}

sub REAPER {
    my $stiff;
    while( ($stiff = waitpid(-1,&WNOHANG)) > 0 ) {
        # handle if desired
    }
}
sub timed_out { # ALRM signal handler for command time-out
    die "timed out";
}
 
xhoster

axeman said:
> Can anyone help me understand the behavior of the following code and
> recommend a course of action? Thank you.
>
> Background:
>
> I have a daemon process which repeatedly spawns child processes with a
> limited time to live, and collects the return code and output. It uses
> the typical mechanisms outlined in Programming Perl and the Perl
> Cookbook. Recently, over time, commands which return 0 "degrade" to
> returning 16777215 each time (-1 before shift). I have witnessed this
> on AIX 4.3 + Perl 5.005_03 and AIX 5.3 + Perl 5.8.2.
>
> Narrowing it down to the simplified code snippet below, the convention
> marked by line C: is what I have been using in the daemon. Research
> has suggested a CHLD signal handler may be required,

What research is that? You are probably misunderstanding something,
but without knowing what it is, we can't correct it.

> but when I use
> either lines A: or B:, I get a return code of 16777215 every time.

Yes, if you intercept and throw away the exit value before it can get put
into $?, then you won't be able to get the correct value from $?.

> while(1){
> ($rc,@out) = eval {
> alarm(5);
> @eout = `echo hi 2>&1`;
> $erc = ($? >> 8);

Ask Perl to tell you what (the system told it) went wrong:

$erc = "error ? is $? and ! is $!";

> alarm(0);
> return ($erc,@eout);
> };
> if( $@ =~ /^timed out/ ) {
> $rc = 1;
> print "timed out\n";
> }

What happens if $@ is defined but does not match /^timed out/?
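
One way to make that case explicit is to compare $@ against the exact timeout message and re-throw anything else. The following is only a sketch: the helper name timed and its return convention are made up for illustration and are not from the code under discussion.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code under an alarm of $secs seconds.
# Returns (1, results...) on success and (0) on timeout.
# Any error other than the timeout is re-thrown to the caller.
sub timed {
    my ($secs, $code) = @_;
    my @result;
    my $ok = eval {
        # The trailing "\n" keeps Perl from appending "at FILE line N".
        local $SIG{ALRM} = sub { die "timed out\n" };
        alarm($secs);
        @result = $code->();
        alarm(0);
        1;
    };
    my $err = $@;
    alarm(0);                          # never leave an alarm pending
    return (1, @result) if $ok;
    return (0) if $err eq "timed out\n";
    die $err;                          # not a timeout: pass it on
}

my ($done, @out) = timed(5, sub { `echo hi 2>&1` });
print $done ? "out = @out" : "timed out\n";
```

With this shape a failure inside the eval that is not the alarm can no longer be mistaken for a clean run.
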
> print "rc = $rc, out = @out\n";
> C: while( ($stiff = waitpid(-1,&WNOHANG)) > 0 ) {}

Backticks automatically wait for the child, so this waitpid is unnecessary.

Xho
 
axeman

> What research is that? You are probably misunderstanding something,
> but without knowing what it is, we can't correct it.

Only meant to suggest that I had referenced the books mentioned and
another forum before posting here.

> Yes, if you intercept and throw away the exit value before it can get put
> into $?, then you won't be able to get the correct value from $?.

> Ask Perl to tell you what (the system told it) went wrong:
>
> $erc = "error ? is $? and ! is $!";

Ok, that returns "error ? is 0 and ! is Illegal seek". Not sure what
your point is here - according to Programming Perl, $? is the status
word returned by the wait for the backtick'd command, and ($?>>8) is
the exit value of the subprocess, which is what I want.

> What happens if $@ is defined but does not match /^timed out/?

Then things fall through to other code not included in the simplified
example.

> backticks automatically wait for the child, so this waitpid is unnecessary.

But the alarm timer pop is the whole point of allowing control back if
the subprocess doesn't return in time. At some point, the signal will
come. The waitpid reaps these zombies.

Could you recommend alternative code which "repeatedly spawns child
processes with a limited time to live, and collects the return code and
output"?


Thanks...
 
Charles DeRykus

The backticked command forks under the covers, and your handler may be
reaping its child along with the daemon's other children. Setting
SIGCHLD to 'IGNORE' will cause the children to be auto-reaped too, and
you'll get a -1 ("No child processes") return.

A better idiom for the child handler is simply to loop
inside the handler and ignore child processes that've
been reaped elsewhere:

$SIG{ CHLD } = sub { 1 while waitpid( -1, WNOHANG ) > 0; };

The "Perl Cookbook" by T. Christiansen has some helpful info
about reaping processes in the "zombie" discussion.
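
If a global reaper like that stays installed, it can still collect the backticks' own child before Perl's internal wait does, which would leave $? at -1. A common core-only workaround is to restore the default SIGCHLD disposition just for the duration of the command. This is a sketch under that assumption; the sub name run_capture is hypothetical.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

# Global reaper for children started elsewhere in the daemon.
$SIG{CHLD} = sub { 1 while waitpid(-1, WNOHANG) > 0 };

# Hypothetical helper: run a shell command and return ($?, output lines).
sub run_capture {
    my ($cmd) = @_;
    # Let the backticks do their own wait, so the status reaches $?.
    local $SIG{CHLD} = 'DEFAULT';
    my @out = `$cmd 2>&1`;
    return ($?, @out);
}

my ($status, @out) = run_capture('echo hi');
printf "status = %d, exit = %d, out = %s", $status, $status >> 8, "@out";
```

The local goes out of scope when the sub returns, so the reaper is back in place for every other child.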

hth,
 
xhoster

axeman said:
> Ok, that returns "error ? is 0 and ! is Illegal seek".
> Not sure what your point is here -

I thought the reason for your post was that $? was sometimes giving -1,
not 0. The point is that when $? is not 0, $! may tell you why it
isn't 0, which will help fix the problem.
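
A small decoder makes the distinction concrete. The 16777215 in the original post is 0xFFFFFF, which is what -1 looks like after >> 8 on a 32-bit Perl, so the -1 case has to be tested before shifting. This is a sketch and the sub name decode_status is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a wait status ($?) and an error string ($!)
# into a readable description.
sub decode_status {
    my ($status, $errno) = @_;
    # -1 means the wait itself failed; only then is $! meaningful.
    return "wait failed: $errno" if $status == -1;
    # The low 7 bits hold the signal that killed the child, if any.
    return sprintf("killed by signal %d", $status & 127) if $status & 127;
    # Otherwise the high byte is the child's exit value.
    return sprintf("exited with %d", $status >> 8);
}

my @out = `echo hi 2>&1`;
print decode_status($?, "$!"), "\n";
```

Note that $! is only worth reporting in the -1 case; after a successful wait it holds whatever error happened last, such as the "Illegal seek" seen above.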

Which brings up something I forgot to ask last time. Does the simplified
code you posted recapitulate the problem? i.e. does it eventually start
setting $? to -1 rather than 0? Or does that only happen in
the full code? Also, what is taking so long in the real code--is the child
spinning the CPU, or is it waiting on network IO that never comes?

> according to Programming Perl, $? is the status
> word returned by the wait for the backtick'd command, and ($?>>8) is
> the exit value of the subprocess, which is what I want.

Er, if you are getting what you want, what is the problem again?

> But the alarm timer pop is the whole point of allowing control back if
> the subprocess doesn't return in time. At some point, the signal will
> come. The waitpid reaps these zombies.

Ah, good point.

> Could you recommend alternative code which "repeatedly spawns child
> processes with a limited time to live, and collects the return code and
> output"?

One problem is that your child processes don't have a limited time to live.
Perl stops paying attention to them after 5 seconds, but that doesn't mean
they go away. So you may want to do a pipe open for reading, which returns
the pid of the child, so that you can kill the child if it times out.
Look in "perldoc perlipc". Maybe something like this:

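warn "untested code";
# open returns the child's pid in the parent, 0 in the child and
# undef on failure, so test for defined rather than truth.
my $pid = open(my $fh, "-|");
defined $pid or die "can't fork: $!";
if ($pid) { # parent
    my ($rc,@out) =
        eval {
            alarm(5);
            my @lines = <$fh>;
            close $fh;          # waits for the child and sets $?
            my $status = $?;
            warn "child status $status: $!" if $status;
            alarm(0);
            return ($status,@lines)
        };
    if( $@ =~ /^timed out/ ) {
        kill 15, $pid;          # SIGTERM the child that overran
        waitpid $pid, 0;        # reap it so it doesn't stay a zombie
        $rc = 1;
        print "timed out\n";
    }
    #whatever
} else { # child
    exec($program, @options, @args)
        or die "can't exec program: $!";
}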
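
The same idea can be packaged as a self-contained sub that uses only core modules. This is a sketch, not code from the thread: the name run_with_timeout, its return convention (undef for the exit value on timeout), and folding stderr into the pipe are all assumptions made for illustration. It also assumes a Perl new enough for lexical filehandles.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run @cmd, giving up after $timeout seconds.
# Returns (exit value, output lines), or (undef) if the command timed out.
sub run_with_timeout {
    my ($timeout, @cmd) = @_;

    my $pid = open(my $fh, '-|');       # fork; child's STDOUT feeds $fh
    defined $pid or die "can't fork: $!";

    if ($pid == 0) {                    # child
        open(STDERR, '>&', \*STDOUT);   # fold stderr in, like 2>&1
        # _exit avoids running the parent's cleanup code in the child.
        exec(@cmd) or POSIX::_exit(127);
    }

    my @out;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timed out\n" };
        alarm($timeout);
        @out = <$fh>;                   # blocks until EOF or the alarm
        alarm(0);
        1;
    };
    my $err = $@;
    alarm(0);

    unless ($ok) {
        kill 'TERM', $pid;              # stop the child we gave up on
        close($fh);                     # close waits for it, so no zombie
        die $err unless $err eq "timed out\n";
        return (undef);
    }

    close($fh);                         # sets $? to the child's status
    return ($? >> 8, @out);
}

my ($rc, @out) = run_with_timeout(5, 'echo', 'hi');
print defined $rc ? "rc = $rc, out = @out" : "timed out\n";
```

A child that ignores SIGTERM would make the close block, so a real daemon would probably want to escalate to SIGKILL after a grace period.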


Xho
 
axeman

> I thought the reason for your post was that $? was sometimes giving -1,
> not 0. The point is that when $? is not 0, $! may tell you why it
> isn't 0, which will help fix the problem.

Gotcha.

> Which brings up something I forgot to ask last time. Does the simplified
> code you posted recapitulate the problem? i.e. does it eventually start
> setting $? to -1 rather than 0? Or does that only happen in
> the full code? Also, what is taking so long in the real code--is the child
> spinning the CPU, or is it waiting on network IO that never comes?

I admittedly did not run the simplified code until the symptom showed,
but I tried to carefully rule out non-relevant and non-IPC stuff. The
full code is a typical remote system availability tester (ping script),
and may spawn a variety of tests - ping, port test, service test, etc.
Tests which don't return in time result in flagging a system as down.

> Er, if you are getting what you want, what is the problem again?

Lol. Simply meant within the context of the line you commented on. I
thought you were suggesting using the status word ($?) and error ($!)
instead of the child exit value ($?>>8) in the tool.

> Ah, good point.

> One problem is that your child processes don't have a limited time to live.
> Perl stops paying attention to them after 5 seconds, but that doesn't mean
> they go away. So you may want to do a pipe open for reading, which returns
> the pid of the child, so that you can kill the child if it times out.
> Look in "perldoc perlipc". Maybe something like this:

Yes, poor choice of words. Should have said the driver limits the time
it will wait for the child to return. The full code has a mechanism to
kill off hung children.

I like your recommendation and have implemented it in principle and am
testing now. Thanks a lot for steering me in the direction of a piped
command. I use those for other things, don't know why it didn't dawn on
me for this. Really appreciate it!
 
