question about forked processes writing to the same file

  • Thread starter it_says_BALLS_on_your_forehead

xhoster

thanks, i'll work on improving this to help you help me more :)

And even better, trimmed down code is good for benchmarking and other
experimentation.
they were initially there because we have an ETL (Extract, Transform,
Load) tool that picks them up, and it was determined that this tool
could optimally use 10 threads to gather these processed records. it
would be best if these 10 files were the same size (or near enough).
i'm working on a 16 CPU machine, so that's why i created 16 forked
processes.

Since your processes have a mixed work load, needing to do both Net::FTP
(probably I/O bound) and per line processing (probably CPU bound), it might
make sense to use more than 16 of them.

....
So I was thinking of
creating a counter, and incrementing it in the loop that spawns the
processes, and mod this counter by 16. the mod'd counter will be a
value that is passed to the betaParse script/function and the parsing
will use this value to choose which filehandle it writes to.

Don't do that. Let's say you start children 0 through 15, writing to
files 0 through 15. Child 8 finishes first. So ForkManager starts child
16, which tries to write to file 16%16 i.e. 0. But of course file 0 is
still being used by child 0. If you wish to avoid doing a flock for every
row, you need to mandate that no two children can be using the same file at
the same time.

I see two good ways to accomplish that. The first is simply to have each
child, as one of the first things it does, loop through a list of
filenames. For each, it opens it and attempts a nonblocking flock. Once it
finds a file it can successfully flock, it keeps that lock for the rest of
its life, and uses that filehandle for output.
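A minimal sketch of that first approach, assuming twenty candidate files
named out01.log .. out20.log (the filenames and the claim_output_file
helper are illustrative, not from the thread):

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Scan a fixed list of output files and claim the first one whose
# non-blocking flock succeeds.  The lock is held for the life of the
# child; the filehandle is returned for all subsequent output.
sub claim_output_file {
    my @candidates = map { sprintf "out%02d.log", $_ } 1 .. 20;
    for my $name (@candidates) {
        open my $fh, ">>", $name or die "open $name: $!";
        if ( flock $fh, LOCK_EX | LOCK_NB ) {
            return ( $name, $fh );    # lock held until $fh is closed
        }
        close $fh;                    # someone else has it; try the next
    }
    die "no free output file among ", scalar @candidates, " candidates";
}

my ( $name, $out ) = claim_output_file();
print $out "child $$ owns $name\n";
close $out;
```

Since each child tries the files in the same order, a free slot is found
quickly, and the flock itself guarantees no two children ever share one.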

The other way is to have ForkManager (in the parent) manage the files that
the children will write to. This has the advantage that, as long as there
is only one parent process running at once, you don't actually need to do
any flocking in the children, as the parent ensures they don't interfere
(but I do the locking anyway if I'm on a system that supports it. Better
safe than sorry):

use Fcntl qw(:flock);
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(16);

#tokens for the output files.
my @outputID = ("file01" .. "file20"); # needs to be >= 16, of course

#put the token back into the queue once the child is done.
$pm->run_on_finish( sub { push @outputID, $_[2] } );

#...
foreach my $whatever (@whatever) {
    #get the next available token for output
    my $oid = shift @outputID or die "no output token available";
    $pm->start($oid) and next;
    open my $fh, ">>", "/tmp/$oid" or die $!;
    flock $fh, LOCK_EX|LOCK_NB or die "Hey, someone is using my file!";
    #hold the lock for life
    #...
    while (<$in>) {
        #...
        print $fh $stuff_to_print;
    }
    close $fh or die $!;
    $pm->finish();
}
$pm->wait_all_children;



yeah, i never thought it would be easy :)

Theme song from grad school days: "No one said it would be easy, but no one
said it would be this hard."

Xho

it_says_BALLS_on_your_forehead

The other way is to have ForkManager (in the parent) manage the files that
the children will write to. This has the advantage that, as long as there
is only one parent process running at once, you don't actually need to do
any flocking in the children, as the parent ensures they don't interfere
(but I do the locking anyway if I'm on a system that supports it. Better
safe than sorry):

use Fcntl qw(:flock);
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(16);

#tokens for the output files.
my @outputID = ("file01" .. "file20"); # needs to be >= 16, of course

#put the token back into the queue once the child is done.
$pm->run_on_finish( sub { push @outputID, $_[2] } );

#...
foreach my $whatever (@whatever) {
    #get the next available token for output
    my $oid = shift @outputID or die "no output token available";
    $pm->start($oid) and next;
    open my $fh, ">>", "/tmp/$oid" or die $!;
    flock $fh, LOCK_EX|LOCK_NB or die "Hey, someone is using my file!";
    #hold the lock for life
    #...
    while (<$in>) {
        #...
        print $fh $stuff_to_print;
    }
    close $fh or die $!;
    $pm->finish();
}
$pm->wait_all_children;

ahh, similar to the callback example on the CPAN site. brilliant! :-D

it_says_BALLS_on_your_forehead

Since your processes have a mixed work load, needing to do both Net::FTP
(probably I/O bound) and per line processing (probably CPU bound), it might
make sense to use more than 16 of them.

...

i'm not sure i see how you derive your conclusion here? i don't know
too much about resource utilization/consumption. i just heard that it
was optimal to use as many processes as there are CPUs. i can see why
more processes would be better for more concurrent FTP gets, but
wouldn't the subsequent processing of > 16 files make the machine
sluggish since there are only 16 CPUs? i would say overall the entire
'thing' (hesitant to use the word process) is more CPU bound since the
file transfer comprises a significantly smaller percentage of the time
relative to the actual processing.

is it actually better to use more if i have a mixed work load? why
exactly? how does the machine allocate its resources?

xhoster

it_says_BALLS_on_your_forehead said:
i'm not sure i see how you derive your conclusion here?

Doing an FTP is generally limited by IO (or, at least, it is reasonable to
assume so until specific data on it is gathered). It would be nice if your
CPUs had something to do while waiting for this IO, other than just sit and
wait for IO.
i don't know
too much about resource utilization/consumption. i just heard that it
was optimal to use as many processes as there are CPUs.

This is true if your processes are CPU bound.

i can see why
more processes would be better for more concurrent FTP gets, but
wouldn't the subsequent processing of > 16 files

But you are not running synchronously--each child proceeds to the per-line
processing as soon as its own data is loaded, regardless of what the other
children are doing. If you run 20 children and the FTP part takes 20% of
the time, then at any given moment you would expect about 4 processes to
be FTPing, using little CPU, and 16 processes to be grinding on the CPUs.
(If you use 16 children and FTPing is 20% of the time, then you would
expect 3.2 to be FTPing and 12.8 to be crunching at any one time, leaving
almost 3.2 CPUs idle)
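The arithmetic behind those numbers can be sketched for a few child
counts (the 20% FTP fraction and 16 CPUs are the figures from this
exchange; the script itself is just illustration):

```perl
use strict;
use warnings;

# With N children each spending a fraction $ftp_fraction of its life
# doing FTP, expect N * $ftp_fraction children in I/O wait and the rest
# on the CPUs at any given moment.  CPUs beyond the crunching count idle.
my $ftp_fraction = 0.20;
my $cpus         = 16;

for my $children ( 16, 20, 24 ) {
    my $ftping    = $children * $ftp_fraction;
    my $crunching = $children - $ftping;
    my $idle_cpus = $cpus - $crunching;
    $idle_cpus = 0 if $idle_cpus < 0;    # oversubscribed: no CPUs idle
    printf "%2d children: %4.1f FTPing, %4.1f crunching, %4.1f CPUs idle\n",
        $children, $ftping, $crunching, $idle_cpus;
}
```

At 20 children the expected crunching count just matches the 16 CPUs; at
24 the box is oversubscribed and the excess children simply wait their
turn on a run queue.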

Of course, one thing you have to watch out for is the distribution of the
processing. Just because programs *can* be arranged in a way that uses
both IO and CPU most efficiently doesn't mean they automatically will
arrange themselves that way. But with each task being of a more-or-less
random length, as you have, they shouldn't do too badly arranged just by
chance (other than right after the parent starts, when they will all be
doing FTP at the same time.)
make the machine
sluggish since there are only 16 CPUs?

"sluggish" is a term usually used for interactive programs, while yours
seems to be batch, and so sluggishness is probably of minimal concern.
i would say overall the entire
'thing' (hesitant to use the word process) is more CPU bound since the
file transfer comprises a significantly smaller percentage of the time
relative to the actual processing.

Yeah, in that case there is no compelling reason to go above 16. Which is
nice, as it is one less thing to need to consider.
is it actually better to use more if i have a mixed work load? why
exactly? how does the machine allocate its resources?

The FTP asks the remote computer to send it bunches of data. While it is
waiting for those bunches of data to appear, the CPU is able to do other
things, provided there are other things to do. So, it is often worthwhile
to ensure that there *are* other things to do. But obviously if the FTP
time is a small percentage of overall time, then this makes little
difference in your case.

Also, it is possible, if your remote machines, ethernet, etc. are fast
enough compared to your CPUs, that the FTP itself is CPU rather than IO
bound, or nearly so. It's also possible that the local hard-drive that the
FTPed files are being written to is the limiting factor, or that the
remote machine(s) you are fetching from are the limiting factors.

But it occurs to me that I might be advocating premature optimization here.
Get it working first and see if it is "fast enough". If you can process an
hour's worth of logs in 5 minutes, there is probably no point in making it
faster. If it takes 65 minutes to process an hour's worth of logs, that
is probably the time to worry about 16 versus 20 versus 24 parallel
children.

Xho

axel

yeah, i don't think it's an issue when you're appending small strings
to a file, but they should make file locking default, don't you think?!
:)
when *wouldn't* you want to lock a file when writing?

Simple... when you have several Apache processes dealing with
web requests more or less at the same time and wanting to just
append to a log file. I was a bit worried about this at one
time using Solaris but I investigated and found it not to
be a problem. I suspect that BSD and Linux work in a similar
manner... I cannot answer for any other operating systems.

Axel

it_says_BALLS_on_your_forehead

Simple... when you have several Apache processes dealing with
web requests more or less at the same time and wanting to just
append to a log file. I was a bit worried about this at one
time using Solaris but I investigated and found it not to
be a problem. I suspect that BSD and Linux work in a similar
manner... I cannot answer for any other operating systems.

actually, i had several processes appending to the same log file and
came across records that had spliced entries. this was the very problem
of which i spoke, and why i ended up needing to explicitly flock.
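The fix described here can be sketched as follows: wrap each append in
an exclusive flock so concurrent writers cannot interleave mid-record
(the filename and the append_record helper are illustrative, not from
the thread):

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Append one record to a shared log, holding an exclusive lock for the
# duration of the write so concurrent processes cannot splice records.
sub append_record {
    my ( $logfile, $record ) = @_;
    open my $fh, ">>", $logfile or die "open $logfile: $!";
    flock $fh, LOCK_EX or die "flock $logfile: $!";
    seek $fh, 0, 2;    # re-seek to end-of-file after acquiring the lock
    print $fh $record, "\n";
    close $fh or die "close $logfile: $!";    # close releases the lock
}

append_record( "shared.log", "pid $$ was here" );
```

The re-seek after the flock matters: another process may have appended
between our open and our acquiring of the lock.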

axel

actually, i had several processes appending to the same log file and
came across records that had spliced entries. this was the very problem
of which i spoke, and why i ended up needing to explicitly flock.

Which operating system were you using? I never ran across this problem
using Solaris.

Axel
 
