help analyzing cause of return code

axeman

Synopsis:

A variant of a typical host availability / pinger script has performed
well for many years. Multiple daemons process various lists at various
intervals with various timeouts. The tool was recently modified to
support attempting sequences of tests (i.e. ping and TCP port test, ...
vs. just one test). The daemons will run fine for days, but then some
will suddenly receive non-zero return codes for every command/test they
perform. Specifically, return code 16777215 (-1 before shift >> 8).
Searches have suggested problems with CHLD signals, though they have
never been a problem before. Appreciate any insight.

Code versions:

AIX 4.3.3
Perl 5.005_03

Basic daemon model:

....

sub timed_out {    # ALRM signal handler for command time-out
    die "timed out";
}

....

$SIG{'HUP'} = 'IGNORE'; # don't die on these signals
$SIG{'PIPE'} = 'IGNORE';
$SIG{'TERM'} = 'IGNORE';
$SIG{'ALRM'} = \&timed_out;
$SIG{'USR1'} = \&quiesce;
use POSIX ":sys_wait_h";

....

foreach $test ( split(/;/,$TESTS) ) {

    # std wrapper for timed operation, return code in $rc, output in @out
    ($rc,@out) = eval {
        alarm($timeout);
        $test =~ s/HOST/$check/g;
        $test[$testCount] = $test;
        @eout = `$test 2>&1`;
        $erc = ($? >> 8);
        alarm(0);
        return ($erc,@eout);
    };
    if( $@ =~ /^timed out/ ) {
        $rc = 1;
        $timeouts++;
        $test_timeout[$testCount] = 1;
    }
    $test_rc[$testCount] = $rc;
    $test_console[$testCount] = join('',@out);
    $testCount++;
    $spawned++;

    last if( $rc == 0 ); # successful test
}

....

# clean up any hung children for every 10 or more spawned processes

if( $spawned > 10 ) {
    # NOTE - also new code - this recursively traverses the process tree
    # and kill KILL's any children
    reap;
    $spawned = 0;
}

# clean up zombies - not done w/signal handler due to unreliable signals

while( ($waitedPid = waitpid(-1, &WNOHANG)) > 0 ) {}

....
 
usenet

axeman said:
> vs. just one test). The daemons will run fine for days, but then some
> will suddenly receive non-zero return codes for every command/test they
> perform.

Is your process reaper reaping? For some odd reason, AIX has an
insanely-low default max-pid-per-user limitation (I think default is
256 - I usually run it at 1024). Check "smitty chgsys" and check your
process table.

You would have messages in /var/spool/mail if you were pid-starved.
And, of course, if the process is running as root, I don't think it
would matter, since (I believe) root is not limited.

FWIW, whatever is happening here probably (almost surely) has nothing
to do with Perl.
 
axeman

Thanks David.

Unfortunately, it is running as root (even though the limit is low -
128 - and no related mail). The reaper is misnamed (not my code); it
just kills hung test procs but does not reap their exit status. That's
what the asynchronous 'while( ($waitedPid = waitpid(-1, &WNOHANG)) > 0
) {}' line does. CHLD signals are not mapped (i.e. left to DEFAULT).
Curiously, if I do map them to a handler or IGNORE, the bad return code
always occurs.
 
xhoster

axeman said:
> Synopsis:
>
> A variant of a typical host availability / pinger script has performed
> well for many years. Multiple daemons process various lists at various
> intervals with various timeouts.

How often are the timeouts actually activated?

> The tool was recently modified to
> support attempting sequences of tests (i.e. ping and TCP port test, ...
> vs. just one test).

Did these changes change how often timeouts were actually activated?

> AIX 4.3.3
> Perl 5.005_03
> ...
> sub timed_out { # ALRM signal handler for command time-out
> die "timed out";
> }

Does the handler need to reinstall itself after being activated
on your system?

> ($rc,@out) = eval {
> alarm($timeout);
> $test =~ s/HOST/$check/g;
> $test[$testCount] = $test;
> @eout = `$test 2>&1`;
> $erc = ($? >> 8);
> alarm(0);
> return ($erc,@eout);
> };
> if( $@ =~ /^timed out/ ) {
> $rc = 1;
> $timeouts++;
> $test_timeout[$testCount] = 1;
> }

If $@ is set but is not a timeout, shouldn't you do something about it?
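
To illustrate the point about $@, here is a minimal sketch of a timed wrapper that treats a timeout and any other eval failure differently. It is not the original script: the helper name run_with_timeout and the hash keys it returns are made up for illustration, and it assumes a POSIX-like sh is available.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): run a shell command
# with a timeout and keep timeouts separate from other eval failures.
sub run_with_timeout {
    my ($cmd, $timeout) = @_;
    my ($status, @out);
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timed out\n" };
        alarm($timeout);
        @out    = `$cmd 2>&1`;
        $status = $?;          # raw wait status; -1 means wait/launch failed
        alarm(0);
        1;
    };
    if (!$ok) {
        my $err = $@;
        alarm(0);              # make sure no alarm is left pending
        return { timed_out => 1 } if $err eq "timed out\n";
        die $err;              # anything else is unexpected: re-raise it
    }
    return { timed_out => 0, status => $status, output => [@out] };
}

my $r = run_with_timeout(q{sh -c 'echo hello; exit 3'}, 10);
printf "timed_out=%d exit=%d output=%s",
    $r->{timed_out}, $r->{status} >> 8, join('', @{ $r->{output} });
```

The sketch returns the raw status so the caller can check for -1 before shifting. It does not kill a command that timed out; that is still left to whatever cleans up hung children.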

Xho
 
xhoster

axeman said:
> Thanks David.
>
> Unfortunately, it is running as root (even though the limit is low -
> 128 - and no related mail). The reaper is misnamed (not my code); it
> just kills hung test procs but does not reap their exit status. That's
> what the asynchronous 'while( ($waitedPid = waitpid(-1, &WNOHANG)) > 0
> ) {}' line does. CHLD signals are not mapped (i.e. left to DEFAULT).
> Curiously, if I do map them to a handler or IGNORE, the bad return code
> always occurs.

qx{} automatically waits for the job it spawns--that is how it sets $?.
If you set $SIG{CHLD}, it will interfere with qx{}'s wait.
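
When that wait fails (or the command cannot be started at all), Perl sets $? to -1, and shifting -1 right by 8 bits on a 32-bit perl yields the 16777215 reported at the top of the thread. A minimal sketch of checking for that case before shifting follows; the helper name decode_status is hypothetical, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a raw wait status (the value of $?) into text.
sub decode_status {
    my ($status) = @_;
    # -1 means the command could not be started or the wait itself failed;
    # $! holds the reason at the point where $? was set.
    return "wait or launch failed" if $status == -1;
    my $signal = $status & 127;          # low 7 bits: terminating signal
    return "killed by signal $signal" if $signal;
    return "exit code " . ($status >> 8);
}

# Masking to 32 bits reproduces the value seen on a 32-bit perl.
my $shifted = (-1 & 0xFFFFFFFF) >> 8;
print "$shifted\n";

my @out = `sh -c 'exit 3' 2>&1`;
print decode_status($?), "\n";
```

If the status is -1, printing $! right after the backticks should show why the wait failed.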

Xho
 
axeman

> > Multiple daemons process various lists at various
> How often are the timeouts actually activated?

Rarely, i.e. only when a test fails / system is down, and most are
usually up.

> Did these changes change how often timeouts were actually activated?

No.

> Does the handler need to reinstall itself after being activated
> on your system?

As mentioned, there is no handler; exit statuses are gathered
asynchronously.

> If $@ is set but is not a timeout, shouldn't you do something about it?

Yes, clearly. That code was left out (the ellipses ...) because it was
not relevant to the problem.

> qx{} automatically waits for the job it spawns--that is how it sets $?.
> If you set $SIG{CHLD}, it will interfere with qx{}'s wait.

Thanks, that makes sense.
 
xhoster

Note: snipped material restored with "] ]".

] ] > sub timed_out { # ALRM signal handler for command time-out
] ] > die "timed out";
] ] > }
> As mentioned, there is no handler, exit statuses are gathered
> asynchronously.

If the thing whose comment says "ALRM signal handler" is not a handler,
then what the heck is it? And why is it commented thusly?

Xho
 
axeman

Lol. Thought you meant a handler for CHLD. No, the ALRM handler does
not need to be reinstalled.
 
