Fork() lies?!?

Discussion in 'Perl Misc' started by el.dodgero@gmail.com, Apr 26, 2006.

  1. Guest

    Hey, I just had this weird issue come up. A schroedinbug of some sort.

    I have a Perl script that grabs database updates from a screen scrape
    off a tn3270 connection. It gets the list of records (about 5,000 of
    them) and then gets the information from the tn3270 screen scrape,
    which generally amounts to about 30,000 rows (several rows per item,
    on average).

    To make this whole thing run faster, since I have 19 dedicated separate
    logins to the tn3270 connection, and since the screen scraping is SLOW
    compared to most other things, rather than connecting and updating one
    at a time, I slice the list up into 19 even chunks (the last one may be
    a little short) and then hand them off to child processes. I then
    write the PIDs I get back from the kids into a file as they return.

    A controller script that wraps around this runs this script, then
    checks the PID file against ps to determine whether those kids, by PID,
    are still running. Each of the kids is just doing its list and writing
    the results to the bottom of the same tab-delimited text file.

    When all the pids are done, i.e. when none of them are found in ps, the
    controller script then runs sqlldr and loads all the stuff into the
    database.

    The entire process takes about 12-20 minutes depending on how slow the
    network is running at any given time. Compared to the up to three
    hours it used to take, this is a real improvement, and the approach
    makes sense.

    It's also been working flawlessly for weeks.

    Until this week. Things started flaking out, and I was racking my
    brain to figure out why. I ran it, and it was acting like it was
    returning way too fast, and doing pretty much nothing, or only putting
    in a few rows (fewer than 100).

    Finally, I ran the first script manually. Then I copied the pid file
    into another file, opened it in vim, appended a kill -9 to the
    beginning of every line, made it executable, and tried to kill off all
    the kids.

    It didn't work. The PIDs were wrong.

    The PIDs -- on Solaris, mind you, not like Windows or something weird
    that would need to emulate fork() -- the PIDs returned from fork() were
    *wrong*, at least compared to what ps showed running.

    I can't for the life of me figure out why this would be. I checked for
    possible hacker attempts in case someone was running something that
    deliberately tricked the kernel into offsetting process IDs, but didn't
    see anything of the sort. All I know is that I got 19 process IDs back
    from fork() that were NOT the process IDs that were actually running
    and reported in the output of ps.

    Anyone know anything, or is this going to be one of those mystery posts
    that hangs around on the net where, a year later, when someone has the
    same issue they google for it, find this message and get all excited
    that they have found the solution, only to realise that it's just this
    empty, unanswered question? *

    * Blatant guilt-trip for anyone who knows the answer but doesn't feel
    like posting. Have some chicken soup and matzo! Eat, eat, ya so skinny!
    You nevah call anymoa!
     
    , Apr 26, 2006
    #1

  2. Guest

    They were process IDs that didn't point to anything else running. I
    mean, I'm not 100% sure they weren't running for a split second or
    something.

    It's just weird that it was all fine for three weeks or more.
     
    , Apr 26, 2006
    #2

  3. wrote:
    > They were process IDs that didn't point to anything else running. I
    > mean, I'm not 100% sure they weren't running for a split second or
    > something.


    Add logging, even if to start with it is only just something like

    child pid starts time
    child pid processed x records
    child pid ends time

    Check out Log::Log4perl.
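
    A minimal sketch of that kind of logging, using only core Perl rather
    than Log::Log4perl. The log_line helper and the three-record loop are
    made up for illustration; the point is just that every line carries a
    timestamp and $$, so the PID each child reports about itself can be
    compared with what the parent wrote to the PID file.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: format one log line with a timestamp and the
# PID of whichever process calls it.
sub log_line {
    my ($message) = @_;
    my $stamp = strftime('%Y-%m-%d %H:%M:%S', localtime);
    return "$stamp [$$] $message\n";
}

# Inside a child, the three events might be logged like this.
my $processed = 0;
print STDERR log_line('child starts');
$processed++ for 1 .. 3;    # stand-in for the real per-record work
print STDERR log_line("child processed $processed records");
print STDERR log_line('child ends');
```

    Log::Log4perl would give the same information with configurable
    layouts and appenders; the hand-rolled version is only meant to show
    what to record.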

    > It's just weird that it was all fine for three weeks or more.


    Or it wasn't working and you have only just noticed.

    Mark
     
    Mark Clements, Apr 26, 2006
    #3
  4. Guest

    wrote:
    > Hey, I just had this weird issue come up. A schroedinbug of some sort.
    >
    > I have a Perl script that grabs database updates from a screen scrape
    > off a tn3270 connection. It gets the list of records (about 5,000 of
    > them) and then gets the information from the tn3270 screen scrape,
    > which generally amounts to about 30,000 rows (several rows per item,
    > on average).
    >
    > To make this whole thing run faster, since I have 19 dedicated separate
    > logins to the tn3270 connection, and since the screen scraping is SLOW
    > compared to most other things, rather than connecting and updating one
    > at a time, I slice the list up into 19 even chunks (the last one may be
    > a little short) and then hand them off to child processes. I then
    > write the PIDs I get back from the kids into a file as they return.
    >
    > A controller script that wraps around this runs this script, then
    > checks the PID file against ps to determine whether those kids, by PID,
    > are still running.


    Why not just wait for those children from the parent?

    >
    > The PIDs -- on Solaris, mind you, not like Windows or something weird
    > that would need to emulate fork() -- the PIDs returned from fork() were
    > *wrong*, at least compared to what ps showed running.
    >
    > I can't for the life of me figure out why this would be.


    Neither can I. If you showed me the part of the code that does the
    forking and the pid writing, maybe I could.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Apr 26, 2006
    #4
  5. Guest

    Oh, it was definitely working. It was tested loading into a truncated
    table, with the .dat file deleted beforehand. Plus the script deletes
    the .dat file and replaces it. That data has to be coming from
    somewhere B^)

    Okay, for the code...
    In the control script that runs the other processes:

    my %pids;
    open PID, '/opt/access/scripts/newt3util.pid'
        or die "Can't read PID file newt3util.pid: $!";
    map {chomp; $pids{$_} = 1} (<PID>);
    close PID;

    while (grep $_, values %pids) {
        # while there are non zero values in the pids hash
        my %procs = map {$_ => 1} split /\n/, `ps -eo pid`;
        # store the current list of running system processes in the procs hash
        for my $pid (keys %pids) {
            # for every pid in the pids hash
            unless ($procs{$pid}) {
                # set the value to zero if it's not in the procs hash
                # because that means it's done
                $pids{$pid} = 0;
            }
        }
        sleep 1;
        # rest for a second before checking again
    }
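
    One thing that may be worth ruling out in a loop like this: ps usually
    right-aligns the PID column, so after split /\n/ a short PID can come
    back with leading spaces and never match the bare number read from the
    PID file. The sketch below is only an illustration of stripping that
    whitespace before building the hash. running_pids is a hypothetical
    helper and the sample text is invented, not captured from a real
    system, so whether this applies here is an assumption to check.

```perl
use strict;
use warnings;

# Hypothetical helper: turn `ps -eo pid` style output into a hash of
# PIDs, skipping the header line and any surrounding whitespace.
sub running_pids {
    my ($ps_output) = @_;
    my %procs = map { $_ => 1 } $ps_output =~ /^\s*(\d+)\s*$/mg;
    return \%procs;
}

# Sample text shaped like right-aligned ps output (illustrative only).
my $sample = "  PID\n    1\n  812\n12345\n";
my $procs  = running_pids($sample);
print "812 is ", ($procs->{812} ? "running" : "not running"), "\n";
```

    In the real loop the argument would be the output of `ps -eo pid`
    instead of the sample string.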


    These pids are coming from the PID file which is written to by the
    parent that kicks everything off:

    $util->logIt("Slicing to $num_per_child circuits per child process: covers " .
                 $num_per_child * $maxchildren . "\n");

    warn "Got ", scalar(@{$all_circuits}), " circuits.\n";

    while ($all_circuits) {
        if (scalar @{$all_circuits} > $num_per_child) {
            push @{$circuits}, [splice @{$all_circuits}, 0, $num_per_child];
        }
        else {
            push @{$circuits}, $all_circuits;
            last;
        }
    }
    undef $all_circuits;

    open PID, ">/opt/access/scripts/newt3util.pid";
    print PID "$$\n";

    my $connid = 9;
    for my $batch (@{$circuits}) {
        my $pid;
        $connid++;
        if ($connid > 28) {
            $util->logIt("Out of valid connection IDs. Existing processes will " .
                         "continue but no more circuits can be processed. Run in " .
                         "non-init mode (no -i) when this is finished to get the rest.\n");
            last;
        }
        if ($pid = fork) {
            print PID "$pid\n";
            next;
        }
        else {
            $util->logIt("Starting child process on id $connid with " .
                         scalar(@{$batch}) . " circuits.\n");
            $util->{userid} = sprintf '*******%03d', $connid;
            $util->setupConn; #
            $util->get_circuit_xx($batch, \&store_line);
            $util->logIt("Finished child process $connid.\n");
            last;
        }
    }


    Note there are some slight changes I had to make, i.e. some of the
    subroutines and warnings contained the name of the screen based system
    to connect to, and to post this on the net I did have to censor that,
    replacing the name with 'conn' where applicable, and a few tiny other
    similar things. Work rules, had to. Same with the starred out userid.
     
    , Apr 26, 2006
    #5
  6. Guest

    I just posted the code, with tiny alterations just to hide proprietary
    info, usernames, etc. It's in my reply to Mark Clements' post above.
     
    , Apr 26, 2006
    #6
  7. ....

    > These pids are coming from the PID file which is written to by the
    > parent that kicks everything off:
    >

    ....

    > for my $batch (@{$circuits}) {
    > my $pid;

    ....

    > if ($pid = fork) {
    > next;

    ....
    > }
    > else {

    ....
    > $util->logIt("Finished child process $connid.\n");
    > last;
    > }
    > }


    I guess the (or one) critical question is what will happen when the child
    lasts out of that loop? If the 'last' is effectively an 'exit', then I'm
    looking at the wrong lines, but if something else is going to happen, that
    something might give you trouble.

    I suggest you replace 'last' with 'exit' anyway.


    Manni
     
    Manni Heumann, Apr 27, 2006
    #7
  8. Guest

    wrote:
    >
    > open PID, ">/opt/access/scripts/newt3util.pid";
    > print PID "$$\n";


    I really don't see why you are doing it in this convoluted way in the
    first place, but I'll assume you have your reasons.

    >
    > my $connid = 9;
    > for my $batch (@{$circuits}) {
    > my $pid;
    > $connid++;
    > if ($connid > 28) {
    > $util->logIt("Out of valid connection IDs. Existing processes
    > will ".
    > "continue but no more circuits can be processed.
    > Run in ".
    > "non-init mode (no -i) when this is finished to
    > get the rest.\n");
    > last;
    > }
    > if ($pid = fork) {
    > print PID "$pid\n";
    > next;
    > }
    > else {


    die "failed to fork $!" unless defined $pid;
    my $oldpid=$$;

    > $util->logIt("Starting child process on id $connid with " .
    > scalar(@{$batch}) . " circuits.\n");


    #log the pid here, too, so you can compare with the one printed to PID.

    $util->logIt("Starting child process $$ on id $connid with " .
    scalar(@{$batch}) . " circuits.\n");


    > $util->{userid} = sprintf '*******%03d', $connid;
    > $util->setupConn; #


    die unless $$==$oldpid;

    > $util->get_circuit_xx($batch, \&store_line);


    die unless $$==$oldpid;

    > $util->logIt("Finished child process $connid.\n");


    die unless $$==$oldpid;

    > last;
    > }
    > }
    >
    > Note there are some slight changes I had to make, i.e. some of the
    > subroutines and warnings contained the name of the screen based system
    > to connect to, and to post this on the net I did have to censor that,
    > replacing the name with 'conn' where applicable, and a few tiny other
    > similar things.


    Did you re-run it after these changes to make sure it still exhibited
    similar behavior?

    Xho

     
    , Apr 27, 2006
    #8
  9. Guest

    It doesn't (or rather didn't) matter, because there was no code below
    the loop except subs.

    It's exit now, because I added into this script a loop over wait that
    stops when the wait result is -1. From what I can tell this should
    work. I'm still writing out the pid file if I need to kill things, but
    I don't trust it.
     
    , Apr 28, 2006
    #9
  10. Guest

    > Did you re-run it after these changes to make sure it still exhibited
    > similar behavior?


    No, of course not. I just changed those as I pasted the code in here to
    hide the stuff I have to hide. The logic is no different, just the
    names are changed so I'm not potentially disclosing anything.

    Anyway, it wouldn't work with all the changes, because one of the
    changes I made was to star out the password, and some of the method
    names have the name of the screen-based system in them; I fudged those
    method names to avoid revealing it, lest my employer feel I'm being a
    security risk.

    I've changed the parent script to call exit rather than last in the
    children, and then, after the spawning loop, to loop over wait until
    it returns -1 or until all the child processes I stashed in a hash
    have been returned, either case indicating that they have all died.
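
    For anyone finding this later, the wait-based version looks roughly
    like the sketch below. It is not the real script: process_batch and
    the sample batches are stand-ins invented for illustration, and the
    connection setup is left out entirely. It only shows the shape of it,
    with each child calling exit and the parent reaping with wait until it
    returns -1.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real per-batch work done by each child.
sub process_batch { my ($batch) = @_; return scalar @{$batch}; }

my @batches = ([1, 2], [3, 4], [5]);   # illustrative data only
my %children;                          # pid => batch index, for live children

for my $i (0 .. $#batches) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ($pid) {
        $children{$pid} = $i;          # parent: remember the child and move on
        next;
    }
    process_batch($batches[$i]);       # child: do the work...
    exit 0;                            # ...and exit, never fall out of the loop
}

# Reap children until wait reports there are none left (-1).
my %status;
while ((my $pid = wait) != -1) {
    next unless exists $children{$pid};
    $status{$pid} = $? >> 8;           # exit code of that child
    delete $children{$pid};
}
warn "unreaped children: @{[ keys %children ]}\n" if %children;
```

    Because the parent gets each PID straight back from wait, it no
    longer has to match anything against ps output to know when the
    children are done.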

    Thing is, I really wanted to figure out why fork() was lying to me. I
    mean, I know I can work around it like this, it's just... I don't like
    fork() lying.
     
    , Apr 28, 2006
    #10
