waitpid woes on Solaris, Perl 5.8.8

Discussion in 'Perl Misc' started by A, Feb 13, 2007.

  1. A

    A Guest

    I am getting intermittent unexpected results from waitpid on Solaris 9
    running Perl 5.8.8.

    Here is the scenario (the bare bones code is below).

    Program_A, written in Perl, is invoked about a million times every
    day. Most of the time, it invokes (using fork-exec) Program_B, which
    is written in C++. Program_A uses waitpid to get the exit code of
    Program_B.
    It works fine most of the time, but a few dozen times every day
    the waitpid apparently fails, and when it fails, I get

    $? is -1
    $! is "No child processes"

    In all of the cases I have investigated, the child process, Program_B,
    started and completed gracefully with "exit(0)" and, of course, the
    pids match in the trace logs of both processes.

    The output, from the code below, in such case is

    Child pid=5196, exitCode=0xffffffff (No child processes)

    Program_A itself is transient and short lived, and, depending on its
    input, executes Program_B at most once.

    What am I doing wrong?
    How can I detect and correct this?

    Thanks for your help.

    # ------------------------------------------- begin code -------------------------------------------------
    #!/usr/local/bin/perl

    # program_A

    my $cpid;
    my $ec = undef;
    my $em = undef;

    sub getChildStatus
    {
        my $tc = undef;
        my $tm = undef;
        my $r = undef;

        while ( 1 ) {
            $r = waitpid($cpid, 0);
            $tc = $?;
            $em = $!;
            last if ( -1 == $r || $r == $cpid );
            print STDERR "waitpid($cpid, 0) returned $r ( $? )\n";
        }
        if ( $cpid == $r ) {
            $ec = $tc;
            $em = $tm;
        }
    }

    sub sigCLDhandler
    {
        my $sig = shift;
        print STDERR "caught SIG $sig\n";
        getChildStatus;
    }


    sub runIt
    {
        my $oldSigCld = $SIG{CLD};
        local $SIG{CLD} = \&sigChldHandler;

        $cpid = fork;
        if ( ! defined $cpid ) {
            print STDERR "fork failed [ $! ]\n";
            return;
        }

        if ( 0 == $cpid ) {
            print STDERR "child pid $$ starting\n";

            exec program_B, .. .. ..;

            print STDERR "child pid $$: exec failed [$!], exiting with -1\n";
            exit(-1);
        } # 0 == $cpid i.e. the child

        getChildStatus; # only the parent reaches here
        $SIG{CLD} = $oldSigCld ;
    } # runIt

    #
    # main
    #
    runIt;
    if ( $ec ) {
        printf STDERR "Child pid=$cpid exitcode=%#08x msg=(%s)\n", $ec, $em;
    }

    # ------------------------------------------- end code -------------------------------------------------
    A, Feb 13, 2007
    #1

  2. A

    Guest

    "A" <> wrote:
    > I am getting intermittent unexpected result from waitpid on Solaris 9
    > running Perl 5.8.8.
    >
    > Here is the scenario (the bare bones code is below).
    >
    > Program_A, written in Perl, is invoked about a million times every
    > day. Most of the times, it invokes (using fork-exec) Program_B which
    > is written in C++. Program_A uses waitpid to get the exit code of
    > Program_B.
    > It works fine most of the times, but about a few dozen times every
    > day, the waitpid apparently fails and when it fails, I get
    >
    > $? is -1
    > $! is "No child processes"
    >
    > In all of the cases I have investigated, the child process, Program_B,
    > started and completed gracefully with "exit(0)" and of course, the pid-
    > s match from the trace log of both processes.
    >
    > The output, from the code below, in such case is
    >
    > Child pid=5196, exitCode=0xffffffff (No child processes)
    >
    > Program_A itself is transient and short lived, and, depending on its
    > input, executes Program_B at most once.
    >
    > What am I doing wrong?


    You are mucking with $SIG{CLD} when, as far as I can tell, you have
    no need to. getChildStatus (and the waitpid in it) can get called twice,
    once from the sig handler and once from the runIt. If it does get called
    twice, the second time that child no longer exists, as it was already
    waited on. Remove the $SIG{CLD} stuff.
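
For illustration, here is a minimal sketch of the same fork-exec-wait sequence with no SIGCHLD handler at all, so that waitpid is called in exactly one place. The sub name run_and_wait is hypothetical, and the running perl binary ($^X) only stands in for Program_B, whose real command line is not shown in the post.

```perl
use strict;
use warnings;
use POSIX ();

# Run a command in a child process and return its raw wait status,
# or undef if fork or waitpid fails. No SIGCHLD handler is installed,
# so this waitpid is the only place the child can be reaped.
sub run_and_wait {
    my @cmd = @_;
    local $SIG{CHLD} = 'DEFAULT';    # make sure the child is not auto-reaped

    my $pid = fork;
    return unless defined $pid;      # fork failed

    if ($pid == 0) {                 # child
        # If exec fails, leave immediately without running END blocks.
        exec { $cmd[0] } @cmd or POSIX::_exit(127);
    }

    my $reaped = waitpid($pid, 0);   # parent blocks here
    my $status = $?;                 # copy before anything else can change it
    return $reaped == $pid ? $status : undef;
}

# Stand-in for Program_B (an assumption): perl itself, exiting with code 3.
my $status = run_and_wait($^X, '-e', 'exit 3');
if (defined $status) {
    printf "exit code %d, signal %d\n", $status >> 8, $status & 127;
}
```

The high byte of the status holds the exit code and the low seven bits hold the terminating signal, if any.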

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    , Feb 13, 2007
    #2

  3. A

    A Guest

    On Feb 13, 3:44 pm, wrote:
    >
    > You are mucking with $SIG{CLD} when, as far as I can tell, you have
    > no need to. getChildStatus (and the waitpid in it) can get called twice,
    > once from the sig handler and once from the runIt. If it does get called
    > twice, the second time that child no longer exists, as it was already
    > waited on. Remove the $SIG{CLD} stuff.
    >
    > Xho
    >


    Thanks for your reply.

    First, there's a typo in my original message.

    The third line after the while(1) in getChildStatus should be
    $tm = $!;
    instead of
    $em = $!;

    Now, to the point that the waitpid could get called twice.

    Please note that the code is designed to guard against this: the
    assignments to the globals $ec and $em are done if and only if waitpid
    returns the matching pid.
    So, even if it is called twice, the second time waitpid returns -1,
    and then getChildStatus returns without modifying the globals.

    On your advice to remove the $SIG{CLD}, there are 3 statements,

    the first statement saves the handler,
    the second statement installs the current one needed by this
    routine
    and the last one re-installs the saved handler.

    which one(s) would you suggest I remove?

    Yes, there's a deficiency (bug, if you will) in the code. The
    $SIG{CLD} should be re-installed if fork fails, but that I think, is
    of no consequence to the problem at hand.

    Thanks again.
    A, Feb 14, 2007
    #3
  4. A

    Guest

    "A" <> wrote:
    > On Feb 13, 3:44 pm, wrote:
    > >
    > > You are mucking with $SIG{CLD} when, as far as I can tell, you have
    > > no need to. getChildStatus (and the waitpid in it) can get called
    > > twice, once from the sig handler and once from the runIt. If it does
    > > get called twice, the second time that child no longer exists, as it
    > > was already waited on. Remove the $SIG{CLD} stuff.
    > >
    > > Xho
    > >

    >
    > Thanks for your reply.
    >
    > First, there's a typo in my original message.
    >
    > The third line after the while(1) in getChildStatus should be
    > $tm = $!;
    > instead of
    > $em = $!;
    >
    > Now, to the point that the waitpid could get called twice.
    >
    > Please note that the code is designed to guard against this, the
    > assignments to the globals $ec and $em are done if and only if waitpid
    > returns the matching pid.


    The waitpid of one getChildStatus returns the expected pid and sets the
    global $? and $!. Before it can do anything else, the waitpid of the other
    getChildStatus returns -1 and overwrites the global $? and $! with its
    own values, but for this one $r does not meet the if, and so it returns
    control to the first getChildStatus. The first getChildStatus has the right
    pid recorded in $r (as that was a lexical and didn't get overwritten), but
    has the wrong $? and $! because they did get overwritten, and now those get
    recorded into your $tc and $tm.
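
The values that the second call leaves behind can be seen in a small sketch that reaps a child and then waits on the same pid again. It is only an illustration: it calls waitpid twice in a row rather than from a signal handler, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Errno qw(ECHILD);

$SIG{CHLD} = 'DEFAULT';          # no handler, no auto-reaping

my $pid = fork;
die "fork failed: $!" unless defined $pid;
exit 0 if $pid == 0;             # child: finish gracefully, like Program_B

# First wait reaps the child.
my $first_ret    = waitpid($pid, 0);
my $first_status = $?;

# Second wait on the same pid: there is nothing left to reap.
my $second_ret = waitpid($pid, 0);
my ($second_status, $second_errno) = ($?, $! + 0);

printf "first:  got child pid? %s, \$? = %d\n",
    ($first_ret == $pid ? 'yes' : 'no'), $first_status;
printf "second: returned %d, \$? = %d, ECHILD? %s\n",
    $second_ret, $second_status, ($second_errno == ECHILD ? 'yes' : 'no');
```

The second call reports -1 in $? and "No child processes" in $!, which are the values from the original post. If those globals are read after a handler has made such a call, they no longer describe the child that was reaped.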

    >
    > On your advice to remove the $SIG{CLD}, there are 3 statements,
    >
    > the first statement saves the handler,
    > the second statement installs the current one needed by this
    > routine
    > and the last one re-installs the saved handler.
    >
    > which one(s) would you suggest I remove?


    Probably all of them, but it is not really possible to know from what you
    give. We would need to see the code that set the original handler that is
    getting saved and then restored. If the handler you inherit is necessary,
    then why would it be safe to overwrite it with something else for even the
    duration of this routine? On the other hand, if the handler you inherit is
    not necessary, then what is the point of saving and re-installing it? If
    there is no other code which installs a handler in the first place, then I'd
    remove all three of those things. (And even if not, remove at least two,
    see below)

    > Yes, there's a deficiency (bug, if you will) in the code. The
    > $SIG{CLD} should be re-installed if fork fails, but that I think, is
    > of no consequence to the problem at hand.


    Since you use local to install the handler, I think the old one will be
    reinstalled upon fork failure anyway. Saving the old one explicitly and
    reinstalling it explicitly seem to be unnecessary, assuming the local is doing
    its job.
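
A quick way to check that behaviour of local is a sketch like the following. The sub names are hypothetical, and it uses the POSIX name CHLD rather than the CLD alias from the posted code.

```perl
use strict;
use warnings;

sub original_handler { }         # stands in for an inherited handler

# Installs a temporary handler with local and optionally returns early,
# the way runIt returns when fork fails.
sub with_temporary_handler {
    my ($return_early) = @_;
    local $SIG{CHLD} = sub { };  # in effect only until this sub exits
    return 'early' if $return_early;
    return 'normal';
}

# True if %SIG currently holds a reference to original_handler.
sub handler_is_original {
    return ref($SIG{CHLD}) eq 'CODE' && $SIG{CHLD} == \&original_handler;
}

$SIG{CHLD} = \&original_handler;
for my $early (1, 0) {
    my $path = with_temporary_handler($early);
    print "$path return: original handler ",
        (handler_is_original() ? "restored" : "lost"), "\n";
}
```

Because local undoes its change whenever the enclosing scope exits, both the early return and the normal return leave the previous handler in place without any explicit save and restore.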

    Xho

    , Feb 14, 2007
    #4
  5. A

    Mark Guest

    On Feb 13, 11:22 am, "A" <> wrote:
    > I am getting intermittent unexpected result from waitpid on Solaris 9
    >
    > sub runIt
    > {
    > my $oldSigCld = $SIG{CLD};
    > local $SIG{CLD} = \&sigChldHandler;


    I think you meant sigCLDhandler here.
    Mark, Feb 15, 2007
    #5
  6. A

    A Guest

    On Feb 14, 7:54 pm, "Mark" <> wrote:
    > On Feb 13, 11:22 am, "A" <> wrote:
    >
    > > I am getting intermittent unexpected result from waitpid on Solaris 9

    >
    > > sub runIt
    > > {
    > > my $oldSigCld = $SIG{CLD};
    > > local $SIG{CLD} = \&sigChldHandler;

    >
    > I think you meant sigCLDhandler here.


    Yes!
    A, Feb 20, 2007
    #6
  7. A

    A Guest

    On Feb 14, 12:31 pm, wrote:
    > "A" <> wrote:
    > > On Feb 13, 3:44 pm, wrote:

    >
    > > > You are mucking with $SIG{CLD} when, as far as I can tell, you have
    > > > no need to. getChildStatus (and the waitpid in it) can get called
    > > > twice, once from the sig handler and once from the runIt. If it does
    > > > get called twice, the second time that child no longer exists, as it
    > > > was already waited on. Remove the $SIG{CLD} stuff.

    >
    > > > Xho

    >

    >
    > > Thanks for your reply.

    >
    > > First, there's a typo in my original message.

    >
    > > The third line after the while(1) in getChildStatus should be
    > > $tm = $!;
    > > instead of
    > > $em = $!;

    >
    > > Now, to the point that the waitpid could get called twice.

    >
    > > Please note that the code is designed to guard against this, the
    > > assignments to the globals $ec and $em are done if and only if waitpid
    > > returns the matching pid.

    >
    > The waitpid of one getChildStatus returns the expected pid and sets the
    > global $? and $!. Before it can do anything else, the waitpid of the other
    > getChildStatus returns -1 and overwrites the global $? and $! with its
    > own values, but for this one $r does not meet the if, and so it returns
    > control to the first getChildStatus. The first getChildStatus has the right
    > pid recorded in $r (as that was a lexical and didn't get overwritten), but
    > has the wrong $? and $! because they did get overwritten, and now those get
    > recorded into your $tc and $tm.
    >


    Thanks for your explanation. Yes, it has every indication of being a
    race condition.

    >
    > > On your advice to remove the $SIG{CLD}, there are 3 statements,

    >
    > > the first statement saves the handler,
    > > the second statement installs the current one needed by this
    > > routine
    > > and the last one re-installs the saved handler.

    >
    > > which one(s) would you suggest I remove?

    >
    > Probably all of them, but it is not really possible to know from what you
    > give. We would need to see the code that set the original handler that is
    > getting saved and then restored. If the handler you inherit is necessary,
    > then why would it be safe to overwrite it with something else for even the
    > duration of this routine? On the other hand, if the handler you inherit is
    > not necessary, then what is the point of saving and re-installing it? If
    > there is no other code which installs a handler in the first place, then I'd
    > remove all three of those things. (And even if not, remove at least two,
    > see below)
    >
    > > Yes, there's a deficiency (bug, if you will) in the code. The
    > > $SIG{CLD} should be re-installed if fork fails, but that I think, is
    > > of no consequence to the problem at hand.

    >
    > Since you use local to install the handler, I think the old one will be
    > reinstalled upon fork failure anyway. Saving the old one explicitly and
    > reinstalling it explicitly seem to be unnecessary, assuming the local is doing
    > its job.
    >
    > Xho
    >


    I am not sure I understand your remark that the rest of the code
    should be a factor in determining what signal handling should be used
    here. The rest of the code may or may not contain other routines which
    may or may not install their own handlers, and the same may be true
    for the calling routine.

    I have been scouring our logs since my earlier posts, and here is
    the finding:
    This is a "bug" in Perl 5.8.8 itself, at least in Perl 5.8.8 on
    Solaris 9.


    The original Program_A (in Perl) and Program_B (C++) have been running
    unchanged for about a year. We had been using the Perl 5.6.1 that came
    with Solaris. We discovered a separate "bug" in that Perl (related
    to file locking). Then we migrated to Perl 5.8.8. And the "errors"
    I described in the OP started appearing at exactly the same time.

    More tellingly, these applications run on a set of servers which are
    diverse both network-wise and in geographic location. The upgrade to
    Perl 5.8.8 was done in stages, and the "problem" started on each
    machine on precisely the date that machine was upgraded to Perl 5.8.8.

    So, it appears that we traded one Perl bug for another.

    Thanks.
    A, Feb 20, 2007
    #7
  8. A

    Guest

    "A" <> wrote:
    >
    > >
    > > > On your advice to remove the $SIG{CLD}, there are 3 statements,

    > >
    > > > the first statement saves the handler,
    > > > the second statement installs the current one needed by this
    > > > routine
    > > > and the last one re-installs the saved handler.

    > >
    > > > which one(s) would you suggest I remove?

    > >
    > > Probably all of them, but it is not really possible to know from what
    > > you give. We would need to see the code that set the original handler
    > > that is getting saved and then restored. If the handler you inherit is
    > > necessary, then why would it be safe to overwrite it with something
    > > else for even the duration of this routine? On the other hand, if the
    > > handler you inherit is not necessary, then what is the point of saving
    > > and re-installing it? If there is no other code which installs a
    > > handler in the first place, then I'd remove all three of those things.
    > > (And even if not, remove at least two, see below)


    ....

    > I am not sure I understand your remark that the rest of the code
    > should be a factor in determining what signal handling should be used
    > here.


    If the rest of the code installed a sig-child handler, it probably did it
    for a reason--it expects to get signaled upon the exit of a child it
    started, and expects to do some needful thing upon that signal. So let's
    say you uninstall that handler for a brief period, and before you
    re-install it the child that the other part of the code spawned exits. Now
    the originally installed handler is restored, but it is never going to get
    called, because your code already ate that signal. That is probably bad.
    Why is that bad? I don't know, because I don't know what the rest of the
    code does. But presumably it wouldn't have installed a signal handler if
    it didn't need to--and if it did need it then having it not get called is
    bad. On the other hand, the code you showed us installed a signal handler
    despite (apparently) not needing to, so maybe assuming that the rest of the
    code would only install a signal handler if it needed it is a bad
    assumption.
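
If some part of a program really does need a child handler, one common arrangement (along the lines of the reaper shown in perlipc) is a single handler that reaps every exited child and records each status by pid, so no caller has to compete for waitpid. The sketch below is an illustration under that assumption; %exit_status and reap_children are hypothetical names, not anything from the posted code.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

my %exit_status;                 # pid => raw wait status, filled in by the handler

# Reap every child that has exited, without blocking. Localizing $! and $?
# keeps the handler from clobbering them in whatever code it interrupts.
sub reap_children {
    local ($!, $?);
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $exit_status{$pid} = $?;
    }
}

$SIG{CHLD} = \&reap_children;

my $pid = fork;
die "fork failed: $!" unless defined $pid;
exit 7 if $pid == 0;             # child: stand-in for a real program

# Parent: never calls waitpid itself, it only looks up what the handler
# recorded. The loop gives up after roughly ten seconds.
for (1 .. 200) {
    last if exists $exit_status{$pid};
    select(undef, undef, undef, 0.05);
}

if (exists $exit_status{$pid}) {
    printf "child exited with code %d\n", $exit_status{$pid} >> 8;
}
```

With this layout each caller reads the status for its own pid from the hash, so a handler installed elsewhere cannot "eat" the result that another part of the code was waiting for.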

    Xho

    , Feb 21, 2007
    #8
