waitpid woes on Solaris, Perl 5.8.8

A

I am getting intermittent unexpected results from waitpid on Solaris 9
running Perl 5.8.8.

Here is the scenario (the bare bones code is below).

Program_A, written in Perl, is invoked about a million times every
day. Most of the time, it invokes (using fork-exec) Program_B, which
is written in C++. Program_A uses waitpid to get the exit code of
Program_B.
It works fine most of the time, but a few dozen times every day the
waitpid apparently fails, and when it fails, I get

$? is -1
$! is "No child processes"

In all of the cases I have investigated, the child process, Program_B,
started and completed gracefully with "exit(0)" and, of course, the
pids match in the trace logs of both processes.

The output from the code below in such a case is

Child pid=5196, exitCode=0xffffffff (No child processes)

Program_A itself is transient and short lived, and, depending on its
input, executes Program_B at most once.

What am I doing wrong?
How can I detect and correct this?

Thanks for your help.

# ------------------------------------------- begin code -------------------------------------------------
#!/usr/local/bin/perl

# program_A

my $cpid;
my $ec = undef;
my $em = undef;

sub getChildStatus
{
    my $tc = undef;
    my $tm = undef;
    my $r = undef;

    while ( 1 ) {
        $r = waitpid($cpid, 0);
        $tc = $?;
        $em = $!;
        last if ( -1 == $r || $r == $cpid );
        print STDERR "waitpid($cpid, 0) returned $r ( $? )\n";
    }
    if ( $cpid == $r ) {
        $ec = $tc;
        $em = $tm;
    }
}

sub sigCLDhandler
{
    my $sig = shift;
    print STDERR "caught SIG $sig\n";
    getChildStatus;
}


sub runIt
{
    my $oldSigCld = $SIG{CLD};
    local $SIG{CLD} = \&sigChldHandler;

    $cpid = fork;
    if ( ! defined $cpid ) {
        print STDERR "fork failed [ $! ]\n";
        return;
    }

    if ( 0 == $cpid ) {
        print STDERR "child pid $$ starting\n";

        exec program_B, .. .. ..;

        print STDERR "child pid $$: exec failed [$!], exiting with -1\n";
        exit(-1);
    } # 0 == $cpid i.e. the child

    getChildStatus; # only the parent reaches here
    $SIG{CLD} = $oldSigCld ;
} # runIt

#
# main
#
runIt;
if ( $ec ) {
    printf STDERR "Child pid=$cpid exitcode=%#08x msg=(%s)\n", $ec, $em;
}

# ------------------------------------------- end code -------------------------------------------------
 
xhoster

A said:
I am getting intermittent unexpected results from waitpid on Solaris 9
running Perl 5.8.8.

Here is the scenario (the bare bones code is below).

Program_A, written in Perl, is invoked about a million times every
day. Most of the time, it invokes (using fork-exec) Program_B, which
is written in C++. Program_A uses waitpid to get the exit code of
Program_B.
It works fine most of the time, but a few dozen times every day the
waitpid apparently fails, and when it fails, I get

$? is -1
$! is "No child processes"

In all of the cases I have investigated, the child process, Program_B,
started and completed gracefully with "exit(0)" and, of course, the
pids match in the trace logs of both processes.

The output from the code below in such a case is

Child pid=5196, exitCode=0xffffffff (No child processes)

Program_A itself is transient and short lived, and, depending on its
input, executes Program_B at most once.

What am I doing wrong?

You are mucking with $SIG{CLD} when, as far as I can tell, you have
no need to. getChildStatus (and the waitpid in it) can get called twice,
once from the sig handler and once from the runIt. If it does get called
twice, the second time that child no longer exists, as it was already
waited on. Remove the $SIG{CLD} stuff.
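
For illustration only, here is a minimal sketch of what the fork/exec/wait sequence might look like with no SIGCHLD handler at all, so that waitpid is called exactly once per child. The helper name run_child is made up for this example, and $^X (the running perl binary) merely stands in for Program_B; neither comes from the original code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# run_child is a hypothetical helper: it forks, execs the given command,
# and waits for it without installing any SIGCHLD handler.
sub run_child {
    my @cmd = @_;

    my $pid = fork;
    die "fork failed: $!\n" unless defined $pid;

    if ($pid == 0) {
        # Child: replace this process with the command.
        exec { $cmd[0] } @cmd or do {
            print STDERR "child $$: exec failed [$!]\n";
            exit 255;
        };
    }

    # Parent: this is the only waitpid call made for this child.
    my $r      = waitpid($pid, 0);
    my $status = $?;      # copy immediately, before anything can change it
    my $errmsg = "$!";

    return { pid => $pid, error => $errmsg } if $r != $pid;
    return { pid => $pid, exit => $status >> 8, signal => $status & 127 };
}

# Assumption: $^X with a one-liner stands in for the real Program_B.
my $res = run_child($^X, '-e', 'exit 3');
print "pid=$res->{pid} exit=$res->{exit} signal=$res->{signal}\n";
```

Because nothing else in the process reaps children, the values copied from $? and $! right after waitpid belong to that one call.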

Xho
 
A

xhoster said:
You are mucking with $SIG{CLD} when, as far as I can tell, you have
no need to. getChildStatus (and the waitpid in it) can get called twice,
once from the sig handler and once from the runIt. If it does get called
twice, the second time that child no longer exists, as it was already
waited on. Remove the $SIG{CLD} stuff.

Xho


Thanks for your reply.

First, there's a typo in my original message.

The third line after the while(1) in getChildStatus should be
$tm = $!;
instead of
$em = $!;

Now, to the point that the waitpid could get called twice.

Please note that the code is designed to guard against this: the
assignments to the globals $ec and $em are done if and only if waitpid
returns the matching pid.
So, even if it is called twice, the second time waitpid returns -1,
and then getChildStatus returns without modifying the globals.

As for your advice to remove the $SIG{CLD} stuff, there are three
statements:

the first saves the existing handler,
the second installs the handler needed by this routine,
and the last re-installs the saved handler.

Which one(s) would you suggest I remove?

Yes, there's a deficiency (bug, if you will) in the code. The
$SIG{CLD} should be re-installed if fork fails, but that, I think, is
of no consequence to the problem at hand.

Thanks again.
 
xhoster

A said:
Thanks for your reply.

First, there's a typo in my original message.

The third line after the while(1) in getChildStatus should be
$tm = $!;
instead of
$em = $!;

Now, to the point that the waitpid could get called twice.

Please note that the code is designed to guard against this: the
assignments to the globals $ec and $em are done if and only if waitpid
returns the matching pid.

The waitpid of one getChildStatus returns the expected pid and sets the
global $? and $!. Before it can do anything else, the waitpid of the other
getChildStatus returns -1 and overwrites the global $? and $! with its
own values, but for this one $r does not satisfy the if, so it returns
control to the first getChildStatus. The first getChildStatus has the right
pid recorded in $r (as that was a lexical and didn't get overwritten), but
has the wrong $? and $! because they did get overwritten, and now those get
recorded into your $tc and $tm.

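
The overwriting is easy to see in isolation. The following is only a sketch of the effect, not of the race itself: it waits on the same child twice in a row and copies $? and $! after each call. All variable names are chosen for this example.

```perl
use strict;
use warnings;
use POSIX qw(ECHILD);

my $pid = fork;
die "fork failed: $!\n" unless defined $pid;
exit 0 if $pid == 0;    # child: exit cleanly, like Program_B

# First wait reaps the child; copy $? right away.
my $r1      = waitpid($pid, 0);
my $status1 = $?;

# Second wait on the same pid: the child has already been reaped.
my $r2      = waitpid($pid, 0);
my $status2 = $?;
my $errno2  = $! + 0;

printf "first:  r=%d status=%d\n", $r1, $status1;
printf "second: r=%d status=%d errno is ECHILD: %s\n",
    $r2, $status2, ($errno2 == ECHILD ? "yes" : "no");
```

The second call returns -1 and leaves $? at -1 with $! set to "No child processes", which matches the values in the original report. If that second call runs from a signal handler between the first waitpid returning and its caller copying $?, the caller ends up with the right pid and the wrong status.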
As for your advice to remove the $SIG{CLD} stuff, there are three
statements:

the first saves the existing handler,
the second installs the handler needed by this routine,
and the last re-installs the saved handler.

Which one(s) would you suggest I remove?

Probably all of them, but it is not really possible to know from what you
give. We would need to see the code that set the original handler that is
getting saved and then restored. If the handler you inherit is necessary,
then why would it be safe to overwrite it with something else for even the
duration of this routine? On the other hand, if the handler you inherit is
not necessary, then what is the point of saving and re-installing it? If
there is no other code which installs a handler in the first place, then I'd
remove all three of those things. (And even if not, remove at least two,
see below.)

Yes, there's a deficiency (bug, if you will) in the code. The
$SIG{CLD} should be re-installed if fork fails, but that, I think, is
of no consequence to the problem at hand.

Since you use local to install the handler, I think the old one will be
reinstalled upon fork failure anyway. Saving the old one explicitly and
reinstalling it explicitly seems to be unnecessary, assuming the local is
doing its job.
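
A small sketch of that behaviour, using two empty handlers whose names are invented for this example. It uses $SIG{CHLD}, the usual spelling of the same signal as CLD.

```perl
use strict;
use warnings;

sub outer_handler { }   # hypothetical handler installed by "other code"
sub inner_handler { }   # hypothetical handler wanted only inside run_it

$SIG{CHLD} = \&outer_handler;

my $seen_inside;
sub run_it {
    local $SIG{CHLD} = \&inner_handler;
    $seen_inside = $SIG{CHLD};
    return;   # leaving the sub by any path, early return included, undoes the local
}

run_it();

my $was_inner = ($seen_inside == \&inner_handler) ? 1 : 0;
my $restored  = ($SIG{CHLD}   == \&outer_handler) ? 1 : 0;
print "inside: inner handler installed = $was_inner\n";
print "after:  outer handler restored  = $restored\n";
```

The same restoration happens on the "fork failed" return path, so the explicit save and restore add nothing on top of the local.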

Xho
 
Mark

A said:
I am getting intermittent unexpected results from waitpid on Solaris 9

sub runIt
{
my $oldSigCld = $SIG{CLD};
local $SIG{CLD} = \&sigChldHandler;

I think you meant sigCLDhandler here.
 
A

xhoster said:
The waitpid of one getChildStatus returns the expected pid and sets the
global $? and $!. Before it can do anything else, the waitpid of the other
getChildStatus returns -1 and overwrites the global $? and $! with its
own values, but for this one $r does not satisfy the if, so it returns
control to the first getChildStatus. The first getChildStatus has the right
pid recorded in $r (as that was a lexical and didn't get overwritten), but
has the wrong $? and $! because they did get overwritten, and now those get
recorded into your $tc and $tm.

Thanks for your explanation. Yes, it has every indication of being a
race condition.

Probably all of them, but it is not really possible to know from what you
give. We would need to see the code that set the original handler that is
getting saved and then restored. If the handler you inherit is necessary,
then why would it be safe to overwrite it with something else for even the
duration of this routine? On the other hand, if the handler you inherit is
not necessary, then what is the point of saving and re-installing it? If
there is no other code which installs a handler in the first place, then I'd
remove all three of those things. (And even if not, remove at least two,
see below.)


Since you use local to install the handler, I think the old one will be
reinstalled upon fork failure anyway. Saving the old one explicitly and
reinstalling it explicitly seems to be unnecessary, assuming the local is
doing its job.

Xho

I am not sure I understand your remark that the rest of the code
should be a factor in determining what signal handling should be used
here. The rest of the code may or may not contain other routines which
may or may not install their own handlers, and the same may be true
for the calling routine.

I have been scouring our logs since my earlier posts, and here is
what I found:
This is a "bug" in Perl 5.8.8 itself, at least in Perl 5.8.8 on
Solaris 9.


The original Program_A (in Perl) and Program_B (C++) have been running
unchanged for about a year. We had been using the Perl 5.6.1 that came
with Solaris. We discovered a separate "bug" in that Perl (related
to file locking). Then we migrated to Perl 5.8.8, and the "errors"
I described in the OP started appearing at exactly the same time.

More tellingly, these applications run on a set of servers that are
diverse in both network and geographic location. The upgrade to
Perl 5.8.8 was done in stages, and the "problem" started on each
machine on precisely the date that machine was upgraded to Perl 5.8.8.

So, it appears that we traded one Perl bug for another.

Thanks.
 
xhoster

A said:
....

I am not sure I understand your remark that the rest of the code
should be a factor in determining what signal handling should be used
here.

If the rest of the code installed a sig-child handler, it probably did it
for a reason--it expects to get signaled upon the exit of a child it
started, and expects to do some needful thing upon that signal. So let's
say you uninstall that handler for a brief period, and before you
re-install it the child that the other part of the code spawned exits. Now
the originally installed handler is restored, but it is never going to get
called, because your code already ate that signal. That is probably bad.
Why is that bad? I don't know, because I don't know what the rest of the
code does. But presumably it wouldn't have installed a signal handler if
it didn't need to--and if it did need it then having it not get called is
bad. On the other hand, the code you showed us installed a signal handler
despite (apparently) not needing to, so maybe assuming that the rest of the
code would only install a signal handler if it needed it is a bad
assumption.

Xho
 
