Fork() lies?!?

el.dodgero

Hey, I just had this weird issue come up. A schroedinbug of some sort.

I have a Perl script that grabs database updates from a screen scrape
off a tn3270 connection. It gets the list of records (about 5,000 of
them) and then gets the information from the tn3270 screen scrape,
which generally amounts to about 30,000 rows (several rows per item on
average).

To make this whole thing run faster, since I have 19 dedicated separate
logins to the tn3270 connection, and since the screen scraping is SLOW
compared to most other things, rather than connecting and updating one
at a time, I slice the list up into 19 even (or a little short on the
last chunk) chunks, and then hand them off to child processes. I then
write the PIDs I get back from the kids into a file as they return.

A controller script that wraps around this runs this script, then
checks the PID file against ps to determine whether those kids, by PID,
are still running. Each of the kids just works through its list and
appends the results to the same tab-delimited text file.

When all the pids are done, i.e. when none of them are found in ps, the
controller script then runs sqlldr and loads all the stuff into the
database.

The entire process takes about 12-20 minutes depending on how slow the
network is running at any given time. This, compared to the up to three
hours it used to take, is a real improvement, and the approach makes
sense.

It's also been working flawlessly for weeks.

Until this week. Things started flaking out, and I was racking my
brain to figure out why. I ran it, and it was acting like it was
returning way too fast, and doing pretty much nothing, or only putting
in a few rows (fewer than 100).

Finally, I ran the first script manually. Then I copied the pid file
into another file, opened it in vim, appended a kill -9 to the
beginning of every line, made it executable, and tried to kill off all
the kids.

It didn't work. The PIDs were wrong.

The PIDs -- on Solaris, mind you, not like Windows or something weird
that would need to emulate fork() -- the PIDs returned from fork() were
*wrong*, at least compared to what ps showed running.

I can't for the life of me figure out why this would be. I checked for
possible hacker attempts in case someone was running something that
deliberately tricked the kernel into offsetting process IDs, but didn't
see anything of the sort. All I know is that I got 19 process IDs back
from fork() that were NOT the process IDs that were actually running
and reported in the output of ps.

Anyone know anything, or is this going to be one of those mystery posts
that hangs around on the net where, a year later, when someone has the
same issue they google for it, find this message and get all excited
that they have found the solution, only to realise that it's just this
empty, unanswered question? *

* Blatant guilt-trip for anyone who knows the answer but doesn't feel
like posting. Have some chicken soup and matzo! Eat, eat, ya so skinny!
You nevah call anymoa!
 
el.dodgero

They were process IDs that didn't point to anything else running. I
mean, I'm not 100% sure they weren't running for a split second or
something.

It's just weird that it was all fine for three weeks or more.
 
Mark Clements

> They were process IDs that didn't point to anything else running. I
> mean, I'm not 100% sure they weren't running for a split second or
> something.

Add logging, even if to start with it is only something like

    child pid starts time
    child pid processed x records
    child pid ends time

Check out Log::Log4perl.

> It's just weird that it was all fine for three weeks or more.

Or it wasn't working and you have only just noticed.
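
The start/processed/end logging described above could look roughly like this. It is only a minimal sketch using core Perl rather than Log::Log4perl; the helper name log_line and the message texts are made up for illustration.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: builds one log line tagged with a timestamp and a PID.
sub log_line {
    my ($pid, $message) = @_;
    my $stamp = strftime('%Y-%m-%d %H:%M:%S', localtime);
    return "$stamp [pid $pid] $message\n";
}

# In a child, $$ is the child's own PID, which is the value to compare
# against whatever the parent wrote to the PID file.
print STDERR log_line($$, 'child starts');
print STDERR log_line($$, 'processed 0 records');    # placeholder count
print STDERR log_line($$, 'child ends');
```

Because each line carries the PID as the child itself sees it, the log can be checked directly against both the PID file and the ps output.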

Mark
 
xhoster

> A controller script that wraps around this runs this script, then
> checks the PID file against ps to determine if those kids, by pids, are
> still running.

Why not just wait for those children from the parent?
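
Waiting from the parent might look something like the sketch below. It assumes each batch can be handled by a callback; run_batches and the sample batches are hypothetical names, not taken from the original script.

```perl
use strict;
use warnings;

$| = 1;    # flush output so children do not inherit buffered text

# Hypothetical helper: forks one child per batch, runs $work->($batch) in
# the child, and returns a hash of child PID => exit code once all have exited.
sub run_batches {
    my ($work, @batches) = @_;
    my %running;
    for my $batch (@batches) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: do the work, then exit so it never runs parent code.
            my $ok = eval { $work->($batch); 1 };
            exit($ok ? 0 : 1);
        }
        $running{$pid} = 1;    # parent: remember the PID fork returned
    }

    my %exit_code;
    while (%running) {
        my $pid = wait();      # blocks until some child exits
        last if $pid == -1;    # no children left
        next unless delete $running{$pid};
        $exit_code{$pid} = $? >> 8;
    }
    return %exit_code;
}

my %result = run_batches(sub { my ($batch) = @_; return scalar @$batch },
                         [1, 2], [3, 4], [5]);
printf "child %d exited with code %d\n", $_, $result{$_}
    for sort { $a <=> $b } keys %result;
```

With this shape the parent never has to consult ps or a PID file to know when the children are done, and a failed fork is reported instead of being mistaken for the child branch.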
> The PIDs -- on Solaris, mind you, not like Windows or something weird
> that would need to emulate fork() -- the PIDs returned from fork() were
> *wrong*, at least compared to what ps showed running.
>
> I can't for the life of me figure out why this would be.

Neither can I. If you showed me the part of the code that does the
forking and the PID writing, maybe I could.

Xho
 
el.dodgero

Oh, it was definitely working. It was tested loading into a truncated
table, with the .dat file deleted beforehand. Plus the script deletes
the .dat file and replaces it. That data has to be coming from
somewhere B^)

Okay, for the code...
In the control script that runs the other processes:

    my %pids;
    open PID, '/opt/access/scripts/newt3util.pid'
        or die "Can't read PID file newt3util.pid: $!";
    map {chomp; $pids{$_} = 1} (<PID>);
    close PID;

    while (grep $_, values %pids) {
        # while there are non zero values in the pids hash
        my %procs = map {$_ => 1} split /\n/, `ps -eo pid`;
        # store the current list of running system processes in the procs hash
        for my $pid (keys %pids) {
            # for every pid in the pids hash
            unless ($procs{$pid}) {
                # set the value to zero if it's not in the procs hash
                # because that means it's done
                $pids{$pid} = 0;
            }
        }
        sleep 1;
        # rest for a second before checking again
    }
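
One thing that may be worth ruling out in a loop like the one above: ps implementations commonly print a header line and right-align the PID column, so splitting only on newlines can leave keys such as "  842" that never match the bare PIDs read from the file. Below is a minimal sketch of a whitespace-tolerant parse; parse_ps_pids is a hypothetical helper and the sample text only stands in for real ps output.

```perl
use strict;
use warnings;

# Hypothetical helper: turns the text printed by `ps -eo pid` into a
# lookup hash keyed by bare PID, ignoring the header and any padding.
sub parse_ps_pids {
    my ($ps_output) = @_;
    my %procs;
    for my $field (split ' ', $ps_output) {
        next unless $field =~ /^\d+$/;    # skips the "PID" header
        $procs{$field} = 1;
    }
    return %procs;
}

# Sample text standing in for real ps output (assumed layout).
my %procs = parse_ps_pids("  PID\n    1\n  842\n12345\n");
print join(',', sort { $a <=> $b } keys %procs), "\n";    # prints 1,842,12345
```

Perl's built-in kill with signal 0 (kill 0, $pid) is another way to ask whether a process can still be signalled, without parsing ps at all.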


These pids are coming from the PID file which is written to by the
parent that kicks everything off:

    $util->logIt("Slicing to $num_per_child circuits per child process: covers " .
                 $num_per_child * $maxchildren . "\n");

    warn "Got ", scalar(@{$all_circuits}), " circuits.\n";

    while ($all_circuits) {
        if (scalar @{$all_circuits} > $num_per_child) {
            push @{$circuits}, [splice @{$all_circuits}, 0, $num_per_child];
        }
        else {
            push @{$circuits}, $all_circuits;
            last;
        }
    }
    undef $all_circuits;

    open PID, ">/opt/access/scripts/newt3util.pid";
    print PID "$$\n";

    my $connid = 9;
    for my $batch (@{$circuits}) {
        my $pid;
        $connid++;
        if ($connid > 28) {
            $util->logIt("Out of valid connection IDs. Existing processes will " .
                         "continue but no more circuits can be processed. Run in " .
                         "non-init mode (no -i) when this is finished to get the rest.\n");
            last;
        }
        if ($pid = fork) {
            print PID "$pid\n";
            next;
        }
        else {
            $util->logIt("Starting child process on id $connid with " .
                         scalar(@{$batch}) . " circuits.\n");
            $util->{userid} = sprintf '*******%03d', $connid;
            $util->setupConn; #
            $util->get_circuit_xx($batch, \&store_line);
            $util->logIt("Finished child process $connid.\n");
            last;
        }
    }


Note there are some slight changes I had to make, i.e. some of the
subroutines and warnings contained the name of the screen based system
to connect to, and to post this on the net I did have to censor that,
replacing the name with 'conn' where applicable, and a few tiny other
similar things. Work rules, had to. Same with the starred out userid.
 
el.dodgero

I just posted the code, with tiny alterations just to hide proprietary
info, usernames, etc. It's in my reply to Mark Clements' post above.
 
Manni Heumann

> ....
> These pids are coming from the PID file which is written to by the
> parent that kicks everything off:
> ....
>
> for my $batch (@{$circuits}) {
>     my $pid; ....
>
>     if ($pid = fork) {
>         next; ....
>     }
>     else { ....
>         $util->logIt("Finished child process $connid.\n");
>         last;
>     }
> }

I guess the (or one) critical question is what will happen when the child
lasts out of that loop? If the 'last' is effectively an 'exit', then I'm
looking at the wrong lines, but if something else is going to happen, that
something might give you trouble.

I suggest you replace 'last' with 'exit' anyway.
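
To make the difference concrete, here is a small hedged demo (count_after_loop is a made-up name, not part of the posted script). Every process that gets past the loop appends its PID to a scratch file, so the line count shows how many processes ran the code after the loop.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

$| = 1;    # flush output so children do not inherit buffered text

# Hypothetical demo: forks $count children, each leaving the loop via
# 'last' or 'exit', and returns how many processes ran the code after the loop.
sub count_after_loop {
    my ($count, $use_exit) = @_;
    my ($fh, $file) = tempfile();
    close $fh;
    my $parent = $$;

    for my $n (1 .. $count) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        next if $pid;           # parent keeps forking
        exit 0 if $use_exit;    # child stops here
        last;                   # child falls out of the loop instead
    }

    # Everything from here on runs in every process that got past the loop.
    open my $out, '>>', $file or die "Can't append to $file: $!";
    print {$out} "$$\n";
    close $out;
    exit 0 if $$ != $parent;    # children that used 'last' end here

    1 while wait() != -1;       # parent reaps all children
    open my $in, '<', $file or die "Can't read $file: $!";
    my @lines = <$in>;
    close $in;
    unlink $file;
    return scalar @lines;
}

print "with last: ", count_after_loop(3, 0), "\n";    # prints "with last: 4"
print "with exit: ", count_after_loop(3, 1), "\n";    # prints "with exit: 1"
```

With 'last', the three children and the parent all reach the code below the loop; with 'exit', only the parent does.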


Manni
 
xhoster

> open PID, ">/opt/access/scripts/newt3util.pid";
> print PID "$$\n";

I really don't see why you are doing it in this convoluted way in
the first place, but I'll assume you have your reasons.

> my $connid = 9;
> for my $batch (@{$circuits}) {
>     my $pid;
>     $connid++;
>     if ($connid > 28) {
>         $util->logIt("Out of valid connection IDs. Existing processes will " .
>                      "continue but no more circuits can be processed. Run in " .
>                      "non-init mode (no -i) when this is finished to get the rest.\n");
>         last;
>     }
>     if ($pid = fork) {
>         print PID "$pid\n";
>         next;
>     }
>     else {

        die "failed to fork $!" unless defined $pid;
        my $oldpid=$$;

>         $util->logIt("Starting child process on id $connid with " .
>                      scalar(@{$batch}) . " circuits.\n");

        # log the pid here, too, so you can compare with the one printed to PID.
        $util->logIt("Starting child process $$ on id $connid with " .
                     scalar(@{$batch}) . " circuits.\n");

>         $util->{userid} = sprintf '*******%03d', $connid;
>         $util->setupConn; #

        die unless $$==$oldpid;

>         $util->get_circuit_xx($batch, \&store_line);

        die unless $$==$oldpid;

>         $util->logIt("Finished child process $connid.\n");

        die unless $$==$oldpid;

>         last;
>     }
> }

> Note there are some slight changes I had to make, i.e. some of the
> subroutines and warnings contained the name of the screen based system
> to connect to, and to post this on the net I did have to censor that,
> replacing the name with 'conn' where applicable, and a few tiny other
> similar things.

Did you re-run it after these changes to make sure it still exhibited
similar behavior?

Xho
 
el.dodgero

It doesn't (or rather didn't) matter, because there was no code below
the loop except subs.

It's exit now, because I added into this script a loop over wait that
stops when the wait result is -1. From what I can tell this should
work. I'm still writing out the pid file if I need to kill things, but
I don't trust it.
 
el.dodgero

> Did you re-run it after these changes to make sure it still exhibited
> similar behavior?

No, of course not. I just changed those as I pasted the code in here to
hide the stuff I have to hide. The logic is no different, just the
names are changed so I'm not potentially disclosing anything.

Anyway, it wouldn't work with all the changes, because one of the
changes I made was to star out the password, and some of the method
names have the name of the screen-based system in them, and I fudged
those method names to avoid revealing it lest my employer feel I'm
being a security risk.

I've changed the parent script to call exit rather than last in the
children, and then, after the spawning loop, to loop over wait until it
returns -1 or until all the child processes I stashed in a hash have
been returned, either case indicating that everything has finished.

Thing is, I really wanted to figure out why fork() was lying to me. I
mean, I know I can work around it like this, it's just... I don't like
fork() lying.
 
