optimize log parsing

  • Thread starter it_says_BALLS_on_your forehead

it_says_BALLS_on_your forehead

i have roughly 400 logs to parse, with the size of each log ranging
from 200 bytes to 263,000,000 bytes.

i have 20 processes that i can allocate to log parsing.

i have a hash that has the log file name as the key, and its size in
bytes as the value. ( i can sort the entries by size, or use an array
if that's better suited to the solution. )

$file{'/var/log/log1'} => 200
$file{'/var/log/log2'} => 210
....
$file{'/var/log/log400'} => 262000000
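For reference, a minimal sketch of building such a hash from disk (assuming the logs all match /var/log/log*; -s returns the size in bytes):

# build %file from the files on disk
my %file = map { $_ => -s $_ } glob '/var/log/log*';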

i would like to feed each process roughly the same amount of data (i.e.
a certain queue of log files to parse, such that the sum of the bytes
of the log files for each process are roughly equivalent).

so let's just say there are 20 buckets, if that will make things more
comprehensible.

i calculate the sum of the sizes of all the logs, and it was about 7.6
GB. divided among 20 buckets, that's roughly 380 MB per bucket.

one caveat is that i can't break up files. these logs are gzipped, and
i can't split them while zipped. i tried unzipping the 262 MB one (that
took about 2 min 40 sec) and then splitting it into files of 600,000
lines each by running (in Unix) split -l 600000 file small_file. after
waiting over 10 minutes i canceled, and only 3 files had been written so
far! it probably would have taken 20 minutes or more to finish, and
nothing was even being parsed yet. unacceptable.

basically this problem can be reduced to figuring out a way to fit 400
discrete numbers into 20 buckets where the count of the numbers does
not matter, but the sum of the numbers should be roughly equal.

one way of accomplishing this is to sort descending, and loop through
the 400 files, putting the largest one in the first bucket, and going
down the line, putting the largest one that will fit into the current
bucket until you get to the end. granted, this is not perfect--perhaps
if you had waited, 2 smaller ones would have brought you closer to your
bucket-size goal. then you go to the next bucket, and do the same
thing. this is an O(n^2) solution though. i was hoping that there was a
more elegant, less brute-force method. perhaps the most perfect
solution would be to take combinations and permutations of all the 400
logs and finding which combinations of groups would result in the most
equal 20 buckets, but that would take way too much computation time.
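A minimal sketch of a greedy split along these lines (a slight variation: each file, largest first, goes into whichever bucket currently has the smallest total, rather than filling one bucket at a time):

use strict;
use warnings;

# assumed: %file maps log name => size in bytes, as above
my %file = map { $_ => -s $_ } glob '/var/log/log*';
my $n_buckets = 20;

my @buckets = map { +{ total => 0, logs => [] } } 1 .. $n_buckets;

for my $log (sort { $file{$b} <=> $file{$a} } keys %file) {
    # drop the next-largest log into the currently lightest bucket
    my ($lightest) = sort { $a->{total} <=> $b->{total} } @buckets;
    push @{ $lightest->{logs} }, $log;
    $lightest->{total} += $file{$log};
}

# each bucket's {logs} list is now one process's work queue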

using the method i outlined (not the combinations one, the one before
that), i would have to go through the list of 400 logs for the 1st
bucket, then let's say 395 logs for the 2nd bucket, etc. until the 20th
bucket. this really isn't that much, and it's unlikely that the number
of logs and processes will expand by orders of magnitude, so is it even
worth it to expend the effort finding this algorithm? any help would be
appreciated :)
 

xhoster

it_says_BALLS_on_your forehead said:
i have roughly 400 logs to parse, with the size of each log ranging
from 200 bytes to 263,000,000 bytes.

i have 20 processes that i can allocate to log parsing.

i have a hash that has the log file name as the key, and its size in
bytes as the value. ( i can sort the entries by size, or use an array
if that's better suited to the solution. )

$file{'/var/log/log1'} => 200
$file{'/var/log/log2'} => 210
...
$file{'/var/log/log400'} => 262000000

i would like to feed each process roughly the same amount of data (i.e.
a certain queue of log files to parse, such that the sum of the bytes
of the log files for each process are roughly equivalent).

so let's just say there are 20 buckets, if that will make things more
comprehensible.

i calculate the sum of the sizes of all the logs, and it was about 7.6
GB. divided among 20 buckets, that's roughly 380 MB per bucket.

I wouldn't bother with this 'bucket' stuff at all. Just do it on the fly.
By addressing the files in the appropriate order (from most work to least
work) you ensure nearly optimal processing. In fact, because there is no
guarantee that the actual time for a file to be processed is exactly
proportional to the file size, balancing on the fly is almost surely going
to be better than some precomputed balancing based on the assumption that
size = time.

use Parallel::ForkManager;

my $pm = new Parallel::ForkManager(20);

foreach my $file (sort { $files{$b} <=> $files{$a} } keys %files) {
    my $pid = $pm->start and next;
    ## Process the $file
    $pm->finish; # Terminates the child process
}
$pm->wait_all_children;

....
basically this problem can be reduced to figuring out a way to fit 400
discrete numbers into 20 buckets where the count of the numbers does
not matter, but the sum of the numbers should be roughly equal.

one way of accomplishing this is to sort descending, and loop through
the 400 files, putting the largest one in the first bucket, and going
down the line, putting the largest one that will fit into the current
bucket until you get to the end. granted, this is not perfect--perhaps
if you had waited, 2 smaller ones would have brought you closer to your
bucket-size goal. then you go to the next bucket, and do the same
thing. this is an O(n^2) solution though.

How so? N ln N to sort the files, which only has to be done once. Finding the
biggest file that fits the current bucket is ln N using a binary search
into the sorted files. You have one N ln N operation, followed by N
operations that are each ln N. That comes out to N ln N overall.
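For illustration, that binary-search step could look roughly like this (a sketch: @$sizes is assumed sorted ascending, and removing the chosen file from the list afterwards is left out):

# return the index of the largest size that still fits in $room,
# or -1 if nothing fits; @$sizes must be sorted ascending
sub largest_that_fits {
    my ($sizes, $room) = @_;
    my ($lo, $hi, $best) = (0, $#$sizes, -1);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($sizes->[$mid] <= $room) {
            $best = $mid;   # fits; see if something larger also fits
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1; # too big; look in the lower half
        }
    }
    return $best;
}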

But anyway, why try to preordain the order in which the buckets become
empty or full? Start processing the 20 biggest files. When one of them
finishes (regardless of which one it is), start the next file.

Xho
 

it_says_BALLS_on_your forehead

I wouldn't bother with this 'bucket' stuff at all. Just do it on the fly.
By addressing the files in the appropriate order (from most work to least
work) you ensure nearly optimal processing. In fact, because there is no
guarantee that the actual time for a file to be processed is exactly
proportional to the file size,

you're right here, although i had the idea that i could weight certain
parse methods by multiplying each log size by a coefficient. the
coefficient would be derived by dividing the average speed of a parse
method by the average speed of the slowest parse method.

balancing on the fly is almost surely going
to be better than some precomputed balancing based on the assumption that
size = time.

$pm = new Parallel::ForkManager(20);

foreach $file (sort {$files{$b}<=>$files{$a}} keys %files) {
my $pid = $pm->start and next;
##Process the $file
$pm->finish; # Terminates the child process
}
$pm->wait_all_children;

...

i admit i'm not too familiar with Threads/Forks (the only fork i use is
the one called from system() ). also, i've read that Perl threading
isn't too stable. i've looked on the web a little, but have not found
anything that describes how to do all of the following:

1) instantiate N processes (or threads)
2) start each process parsing a log file
3) the first process that is done looks at a shared or global queue and
pulls the next log file from that and processes until the queue is
empty.


....the current architecture of my log processing is:
1) set a number of processes (e.g. 20)
2) in a loop for the number of processes:

my @rc;
for my $i (1..$num_processes) {
    my $command = "parseLog.pl $i";
    $rc[$i] = system($command);
}

(there is a conf file that has an entry for each log, along with a
number in the next field--the number represents the process_id, which
can be 1 thru 20)
3) in a loop over all the logs, push logs into arrays if the process_id
== the $num_process that was passed along, so each process has an array
of files to process/parse; each process parses each file in its array of
files (a sketch of this selection step follows below). the problem with
this is that while each process may end up with a similar number of logs
to process (the process_id just increments for each line, then wraps
around once it reaches the max number of processes i defined), some
could be huge while others are small, so it's not very optimal. one
process could have 20 files of 200 bytes each, while another could have
20 files of 230 MB each.
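A sketch of that selection step as it might appear inside parseLog.pl (the conf path and the one-"logfile process_id"-pair-per-line format are assumptions for illustration):

#!/apps/webstats/bin/perl
use strict;
use warnings;

# parseLog.pl <process_id> -- sketch only
my $my_id = shift or die "usage: parseLog.pl <process_id>\n";

my @my_logs;
open my $conf, '<', '/path/to/logs.conf' or die "cannot open conf: $!";
while (<$conf>) {
    my ($log, $process_id) = split;
    push @my_logs, $log if defined $process_id && $process_id == $my_id;
}
close $conf;

# parse each file in this process's array of files
# parse_log($_) for @my_logs;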

since using the system() approach is all i know, the only scenarios i
considered were those that dealt with providing each process with a
balanced amount of data.

if i can get the on-the-fly thing working, that would be preferable.
then sorting would not even be helpful, would it?
How so? N ln N to sort the files, only has to be done once. Finding the
biggest file that fits the current bucket is ln N using a binary search
into the sorted files. You have 1 N ln N operation, followed by N
operations that are lnN. That comes out to N ln N overall.

you would be roughly doing m x n iterations, which i believe amounts to
the same thing as big O of n^2. (i AM a bit rusty at algorithms). i
didn't even count the sort, since that's only done once, and i'm not
familiar with the under-the-hood mechanics of Perl's hash value sort.

But anyway, why try to preordain the order in which the buckets become
empty or full?

that's the only way i know how for now.

Start processing the 20 biggest files. When one of them
finishes (regardless of which one it is), start the next file.

if i can do the fork thing, why start with the biggest?
 

Tassilo v. Parseval

Also sprach it_says_BALLS_on_your forehead:
you're right here, although i had the idea that i could weight certain
parse methods by multiplying each log size by a coefficient. the
coefficient would be derived by dividing the average speed of a parse
method by the average speed of the slowest parse method.

balancing on the fly is almost surely going

i admit i'm not too familiar with Threads/Forks (the only fork i use is
the one called from system() ). also, i've read that Perl threading
isn't too stable.

It arguably still has its flaws, but for a task as easy as yours they're
perfectly usable.
i've looked on the web a little, but have not found
anything that describes how to do all of the following:

1) instantiate N processes (or threads)
2) start each process parsing a log file
3) the first process that is done looks at a shared or global queue and
pulls the next log file from that and processes until the queue is
empty.

Extremely easy with threads. Here's a complete example of a program that
spawns off a number of threads where each thread pulls data from a
global queue until it is empty:

#!/usr/bin/perl -w

use strict;

use threads;
use threads::shared;

use constant NUM_THREADS => 10;

# shared queue visible to every thread
my @queue : shared = 1 .. 30;

# create threads
my @threads;
push @threads, threads->new("run") for 1 .. NUM_THREADS;

# wait for all threads to finish
$_->join for @threads;

# code executed by each thread
sub run {
    while (defined(my $element = pop @queue)) {
        printf "thread %i: working with %i\n", threads->tid, $element;
        # make runtime vary a little for
        # demonstration purpose
        select undef, undef, undef, rand;
    }
}

...the current architecture of my log processing is:
1) set a number of processes (e.g. 20)
2) in a loop for the number of processes:
my @rc;
for my $i (1..$num_processes) {
my $command = 'parseLog.pl $i';
$rc[$i] = system($command);
}
# there is a conf file that has an entry for each log, along with a
number in the next field--the number represents the process_id (can be
1 thru 20)
3) in a loop of all the logs, push logs into arrays if the process_id
== the $num_process that was passed along, so each process has an array
of files to process/parse. each process parses each file in its array
of files. problem with this is that maybe each process has a similar
number of logs to process (the process_id just increments for each
line, then wraps around once it reaches the max number of processes i
defined), but some could be huge while others are small, so not very
optimal. one process could have 20 files of 200 bytes each, while the
other could have 20 files of 230 MB each.

Don't think in terms of processes. If you're using processes for that
kind of thing you'll need to find a way for them to communicate
(possibly pipes, or maybe shared memory). Threads take this work off
your shoulders as they can share data in a simple and secure manner.
since using the system() approach is all i know, the only scenarios i
considered were those that dealt with providing each process with a
balanced amount of data.

Bad idea. Each piece of data may take a different amount of time. The real
way is to store the work in one central repository and have each thread
retrieve its working set from there. When it is done, it fetches the
next unless the central pool is empty.
if i can get the on-the-fly thing working, that would be preferable.
then sorting would not even be helpful, would it?

It is not helpful.
Start processing the 20 biggest files. When one of them

if i can do the fork thing, why start with the biggest?

Don't even worry about sorting. Use threads and have each thread do the
parsing of the files in any order. It makes no difference since it's
truly parallel and asynchronous.
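Adapted to the log files themselves, the same pattern might look roughly like this (a sketch: %files is assumed to be populated as in the original post, and parse_log() stands in for whatever per-file work is actually needed):

#!/usr/bin/perl -w
use strict;

use threads;
use threads::shared;

use constant NUM_THREADS => 20;

# assumed: %files maps log name => size, as in the original post
my %files = map { $_ => -s $_ } glob '/var/log/log*';

# shared queue of file names
my @queue : shared = keys %files;

my @threads;
push @threads, threads->new(\&run) for 1 .. NUM_THREADS;
$_->join for @threads;

sub run {
    while (defined(my $log = shift @queue)) {
        parse_log($log); # hypothetical per-file parser
    }
}

sub parse_log {
    my ($log) = @_;
    # ... open, read and parse $log here ...
}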

Tassilo
 

it_says_BALLS_on_your forehead

Extremely easy with threads. Here's a complete example of a program that
spawns off a number of threads where each thread pulls data from a
global queue until it is empty:

#!/usr/bin/perl -w

use strict;

use threads;
use threads::shared;

use constant NUM_THREADS => 10;

# shared queue visible to every thread
my @queue : shared = 1 .. 30;

# create threads
my @threads;
push @threads, threads->new("run") for 1 .. NUM_THREADS;

# wait for all threads to finish
$_->join for @threads;

# code executed by each thread
sub run {
while (defined(my $element = pop @queue)) {
printf "thread %i: working with %i\n", threads->tid, $element;
# make runtime vary a little for
# demonstration purpose
select undef, undef, undef, rand;
}
}



Don't think in terms of processes. If you're using processes for that
kind of thing you'll need to find a way for them to communicate
(possibly pipes, or maybe shared memory). Threads takes this work off
your shoulders as they can share data in a simple and secure manner.

Tassilo, thank you very much for your help. If i could trouble you once
more for your insight...

What benefit does the thread model have that the following does not?
What drawbacks?

use Parallel::ForkManager;
$pm = new Parallel::ForkManager($MAX_PROCESSES);
foreach $data (@all_data) {
# Forks and returns the pid for the child:
my $pid = $pm->start and next;
... do some work with $data in the child process ...
$pm->finish; # Terminates the child process
}
 

it_says_BALLS_on_your forehead

Extremely easy with threads. Here's a complete example of a program that
spawns off a number of threads where each thread pulls data from a
global queue until it is empty:

i'm running Perl 5.6; are threads and threads::shared available/stable
in this version?

#!/usr/bin/perl -w

use strict;

use threads;
use threads::shared;

use constant NUM_THREADS => 10;

# shared queue visible to every thread
my @queue : shared = 1 .. 30;

# create threads
my @threads;
push @threads, threads->new("run") for 1 .. NUM_THREADS;

# wait for all threads to finish
$_->join for @threads;

# code executed by each thread
sub run {
while (defined(my $element = pop @queue)) {
printf "thread %i: working with %i\n", threads->tid, $element;

^ what is this?
 

Paul Lalli

it_says_BALLS_on_your forehead said:
^ what is this?

When you don't understand what a function is doing, the best way to
figure it out is to read the documentation. (And I admit, in this
case, the documentation will lead you on a trail, so let's follow it:)

perldoc -f printf

    printf FILEHANDLE FORMAT, LIST
    printf FORMAT, LIST
        Equivalent to "print FILEHANDLE sprintf(FORMAT, LIST)",

Okay, so we need to figure out what sprintf does instead:

perldoc -f sprintf

    sprintf FORMAT, LIST
        Returns a string formatted by the usual "printf"
        conventions of the C library function "sprintf".
        See below for more details
        . . .
        Finally, for backward (and we do mean "backward")
        compatibility, Perl permits these unnecessary but
        widely-supported conversions:

            %i a synonym for %d

which, if we look back a bit, we see:

        Perl's "sprintf" permits the following universally-
        known conversions:
        . . .
            %d a signed integer, in decimal



So, the original code took the two arguments, threads->tid and
$element, converted them to integers (if necessary), and placed them at
the corresponding %i markers.
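In other words, %i behaves exactly like %d; for example:

printf "thread %i: working with %i\n", 7, 42; # prints: thread 7: working with 42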

Paul Lalli
 

it_says_BALLS_on_your forehead

So, the original code took the two arguments, threads->tid and
$element, converted them to integers (if necessary), and placed them at
the corresponding %i markers.

Paul Lalli

Ahhh...thanks Paul :)
 

Tassilo v. Parseval

Also sprach it_says_BALLS_on_your forehead:
Tassilo, thank you very much for your help. If i could trouble you once
more for your insight...

What benefit does the thread model have that the following does not?
What drawbacks?

use Parallel::ForkManager;
$pm = new Parallel::ForkManager($MAX_PROCESSES);
foreach $data (@all_data) {
# Forks and returns the pid for the child:
my $pid = $pm->start and next;
... do some work with $data in the child process ...
$pm->finish; # Terminates the child process
}

Perl threads are said to be inefficient in that a Perl interpreter is
cloned for each thread. However, the Parallel::ForkManager approach
requires a new definition of efficiency. :)

Consider what happens in code like this:

foreach my $data (1 .. 100_000) {
    $pm->start and next;
    do_something($data);
    $pm->finish;
}

For each of the 100,000 items a new process is spawned off. So unlike
with threads, where a thread handles more than one item, here each process
is essentially a once-and-throw-away thing. This is because
Parallel::ForkManager provides no infrastructure for sharing data
between these processes.

Just try this little program yourself and see how poorly it performs:

use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(30);
my @data = 1 .. shift;

foreach (@data) {
    my $pid = $pm->start and next;
    print "$_\n";
    $pm->finish;
}
$pm->wait_all_children;

and compare it to an equivalent threads-implementation where the items
are stored in a shared array:

use threads;
use threads::shared;

use constant NUM_THREADS => 30;

my @queue : shared = 1 .. shift;
my @threads;

push @threads, threads->new("run") for 1 .. NUM_THREADS;
$_->join for @threads;

sub run {
    while (defined(my $element = shift @queue)) {
        print "$element\n";
    }
}

On my machine I get:

ethan@ethan:~$ time perl procs.pl 2000
[...]
real 0m7.002s
user 0m2.500s
sys 0m4.500s

ethan@ethan:~$ time perl thread.pl 2000
[...]
real 0m3.141s
user 0m1.540s
sys 0m1.590s

If you increase the number further to, say, 10000, it's already

[processes]
real 0m45.605s
user 0m24.320s
sys 0m21.130s

[threads]
real 0m8.671s
user 0m1.090s
sys 0m7.580s

So apparently threads scale much better than processes, which is no
wonder because threads only require a one-time initialization to
create the threads, whereas Parallel::ForkManager constantly has to
create and terminate new processes.

Besides, I find the code with threads much easier to understand: Each
thread has a function that it executes. With fork it's more tricky.
You first have to find the call to fork() and then one of the code
branches after that is for the child, the other one for the parent.
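For comparison, a bare-bones fork looks roughly like this (a sketch, just to show the two branches):

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # child branch: fork() returned 0
    # ... do the work ...
    exit 0;
}
else {
    # parent branch: fork() returned the child's pid
    waitpid($pid, 0);
}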

Tassilo
 

xhoster

i admit i'm not too familiar with Threads/Forks (the only fork i use is
the one called from system() ).

One advantage of Parallel::ForkManager is that you don't need to be all
that familiar with fork. It handles most of it for you, as long as you
follow the example of having a "$pm->start and next" near the top of the
loop and a "$pm->finish" at the end of the loop. ($pm->finish actually
calls exit in the child process, so anything between the finish and the end
of the loop is not executed.)

also, i've read that Perl threading
isn't too stable.

Forking on Linux is rock stable. Forking on Windows is emulated using
threads, but I think it is stable enough for what you are doing.
i've looked on the web a little, but have not found
anything that describes how to do all of the following:

1) instantiate N processes (or threads)

$pm = new Parallel::ForkManager($N);
(This doesn't actually instantiate them, but declares how many you want
instantiated, once you get around to instantiating them.)
2) start each process parsing a log file

That is what the "foreach...$pm->start and next" does. It starts a process
on the next log file, unless there are already 20 (or $N) outstanding
processes. In that case, it waits for one of those outstanding processes to
end, then starts a process on the next log file.
3) the first process that is done looks at a shared or global queue and
pulls the next log file from that and processes until the queue is
empty.

ForkManager uses inversion of control (or at least something like it). The
first slave process that is done finishes. As part of finishing, it
notifies the master process. The master process keeps the queue, and uses
it to start the next process, to replace the one that finished.

....
if i can get the on-the-fly thing working, that would be preferable.
then sorting would not even be helpful, would it?

I find that it is helpful, especially when the lengths of the various
tasks vary by orders of magnitude.

Let's say your largest task will take 20 minutes for 1 process/CPU to
process, and all the rest of your tasks combined will take 20 minutes for
the other 19 CPUs to process. If you start the largest task first, then in
20 minutes you are done. If you start the largest task last, then say it
takes 15 minutes before it gets started[1], and then 20 minutes for it to
run, so the time to completion is 35 minutes.

By starting the tasks in reverse order of run time, you let the shorter
tasks pack around the longer ones in an efficient way.

(I just did a test on a uniform distribution of run-lengths[2], and
"processing" from long to short took 8:25 while short to long took 9:10. I
think the difference can be larger if the dispersion in runtimes is
greater)

Xho



[1] Since all-but-the-longest take 20 minutes to finish on 19 CPUs, they
will take ~19 minutes to finish on 20 CPUs (since we haven't yet started
the longest task, the shorter ones will have 20 CPUs to use, not 19).
However, the longest one doesn't need to wait for all of the shorter ones
to finish before it starts, it only needs to wait for 381 out of the 400 of
the shorter ones to finish. So I just pulled 15 minutes out of my ass, as
a guess of how long it will take for 381 of them to finish.

[2]
use strict;

use Parallel::ForkManager;

my $pm = new Parallel::ForkManager(10);

## do it with and without the "reverse"
foreach my $file (reverse 1..100) {
my $pid = $pm->start and next;
sleep $file;
$pm->finish; # Terminates the child process
}
$pm->wait_all_children;
 

Tassilo v. Parseval

Also sprach it_says_BALLS_on_your forehead:
i'm running Perl 5.6, are threads and threads::shared available/stable
in this version?

Oh, 5.6. In this case don't even bother using threads. The threads and
threads::shared modules aren't yet available. What is available in 5.6 is
an older threads model that was later replaced by the so-called
"interpreter threads", ithreads for short, which I was talking about.

If you're curious, see 'perldoc perlthrtut' for your version of Perl.
These older threads were a pain, especially since you had to write your
own code for safely sharing data between threads using lock(). It feels
more like programming with System V semaphores and other obscenities.

Tassilo
 

xhoster

it_says_BALLS_on_your forehead said:
Tassilo, thank you very much for your help. If i could trouble you once
more for your insight...

What benefit does the thread model have that the following does not?
What drawbacks?

use Parallel::ForkManager;
$pm = new Parallel::ForkManager($MAX_PROCESSES);
foreach $data (@all_data) {
# Forks and returns the pid for the child:
my $pid = $pm->start and next;
... do some work with $data in the child process ...
$pm->finish; # Terminates the child process
}


The ForkManager method does not allow the children to report back
to the parent in a meaningful way (i.e. beyond exit status). Threads do
readily allow that (although Tassilo's specific example doesn't).
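For example, a thread can hand a result back to the parent through join (a sketch; count_lines() and the one-thread-per-file spawning are purely for illustration):

use strict;
use warnings;
use threads;

my @some_files = glob '/var/log/log*'; # assumed input

# each thread returns a value, which the parent collects via join
my @workers = map { threads->new(\&count_lines, $_) } @some_files;
my @counts  = map { $_->join } @workers;

sub count_lines {
    my ($file) = @_;
    open my $fh, '<', $file or return 0;
    my $n = 0;
    $n++ while <$fh>;
    return $n;
}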

There is probably more overhead in forking a process for each task than
there is in popping from a shared queue. Unless you have many thousands
of very very short tasks, this difference is probably negligible.

Xho
 

it_says_BALLS_on_your forehead

Tassilo,
I would like to upgrade to Perl 5.8 so I can use threads (as well as
for other reasons), but for now I think I'm stuck with a fork.

Xho,
I actually rummaged around Google Groups and saw some postings you
sent around 2002 about Parallel::ForkManager, so you must be an expert by
now :). I'll try implementing this and see how it goes.

Xho and Tassilo--
Thank you both very much for your explanations :).
 

it_says_BALLS_on_your forehead

Hey Xho, I tried this:
----
#!/apps/webstats/bin/perl

use File::Copy;
use Parallel::ForkManager;


my $pm = Parallel::ForkManager->new(5);

$pm->run_on_start(
    sub {
        my ($pid, $ident) = @_;
        print "** $ident started, pid: $pid\n";
    }
);

my @data = 1 .. shift;
for (@data) {
    my $pid = $pm->start and next;
    print "$pid: $_\n";
    $pm->finish;
}

$pm->wait_all_children;
------------
and got this:
#####
[smro180 123] ~/simon/1-perl > tryFork.pl 10
** started, pid: 16208
0: 1
** started, pid: 16209
0: 2
** started, pid: 16210
0: 3
** started, pid: 16211
0: 4
** started, pid: 16212
0: 5
** started, pid: 16213
0: 6
** started, pid: 16214
0: 7
** started, pid: 16215
0: 8
** started, pid: 16216
0: 9
** started, pid: 16217
0: 10
[smro180 124] ~/simon/1-perl >
####

....I read this:
start [ $process_identifier ]
This method does the fork. It returns the pid of the child process for
the parent, and 0 for the child process. If the $processes parameter
for the constructor is 0 then, assuming you're in the child process,
$pm->start simply returns 0.

An optional $process_identifier can be provided to this method... It is
used by the "run_on_finish" callback (see CALLBACKS) for identifying
the finished process.

and this:
run_on_start $code
You can define a subroutine which is called when a child is started. It
is called after the successful startup of a child in the parent process.

The parameters of the $code are the following:

- pid of the process which has been started
- identification of the process (if provided in the "start" method)



....but I don't understand why in my: print "$pid: $_\n";
line, i'm getting 0 as the pid. I know the documentation said i should
get 0 for the child process and the child pid for the parent, but
aren't i calling start on the parent? and in the callback, why are
there more than 5 different process ids?
 

it_says_BALLS_on_your forehead

and in the callback, why are
there more than 5 different process ids?

....i had initially assumed that i was creating 5 processes, and each
one would finish a job, and go back to the 'queue' (in this case, the
array). so i would see the same 5 pids over and over again. is this not
the case? does the number 5 here represent the number of processes that
are active or in existence *at one time*? so that there will be a total
of
scalar(@data) processes created overall, and once one is done, it will
die (not be re-used) and a new one will be created to replace it?
 

xhoster

Tassilo v. Parseval said:
Perl threads are said to be inefficient in that a Perl interpreter is
cloned for each thread. However, the Parallel::ForkManager approach
requires a new definition of efficiency. :)

Consider what happens in code like this:

foreach my $data (1 .. 100_000) {
$pm->start and next;
do_something($data);
$pm->finish;
}

For each of the 100.000 items a new process is spawned off. So unlike
with threads where a thread handles more than one item here each process
is essentially a once-and-throw-away thing. This is because
Parallel::ForkManager provides no infrastructure for sharing data
between these processes.

Just try this little program yourself and see how poorly it performs:

use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(30);
my @data = 1 ... shift;
foreach(@data) {
my $pid = $pm->start and next;
print "$_\n";
$pm->finish;
}
$pm->wait_all_children;

I agree that if you are going to do something incredibly silly, like
parallelizing something that has no legitimate need to be parallelized,
then using ForkManager is probably not the best choice. But then again,
doing silly things is generally not the best choice in the first place,
unless your goal is to be silly (which is a noble goal in itself,
sometimes).

and compare it to an equivalent threads-implementation where the items
are stored in a shared array:

Hardly equivalent. The equivalent thread implementation to the above fork
code would be to spawn one thread per item. The equivalent fork
implementation to the below threaded code would be to use forks::shared
(which, BTW, is even worse than the ForkManager way).

use threads;
use threads::shared;

use constant NUM_THREADS => 30;

my @queue : shared = 1 .. shift;
my @threads;

push @threads, threads->new("run") for 1 .. NUM_THREADS;
$_->join for @threads;

sub run {
while (defined(my $element = shift @queue)) {
print "$element\n";
}
}

On my machine I get: ....
If you increase the number further to, say, 10000, it's already

[processes]
real 0m45.605s
user 0m24.320s
sys 0m21.130s

[threads]
real 0m8.671s
user 0m1.090s
sys 0m7.580s

Parallelization is inherently an optimization step. As such, there is
no general solution and the appropriate way to parallelize is highly
dependent on the details of what is to be parallelized. If I wanted to
parallelize something like your test case, consisting of a large number of
very very fast operations, I would use the unpublished "Parallel_Proc"
module, which divides the work into chunks up front.

use Parallel_Proc;

my $pm = Parallel_Proc->new();
my @data = 1 .. shift;

my ($l, $r) = $pm->spawn(30, scalar @data);
foreach (@data[$l..$r]) {
    print "$_\n";
}
$pm->harvest(sub { print $_[0] }); # pass the child's output out the parent's STDOUT
$pm->Done();

__END__

On my machine:

~/perl_misc]$ time perl thread.pl 500000 > /dev/null
5.740u 0.320s 0:06.16 98.3% 0+0k 0+0io 447pf+0w

~/perl_misc]$ time perl parallel_proc.pl 500000 > /dev/null
2.320u 0.410s 0:02.75 99.2% 0+0k 0+0io 454pf+0w

That was on a single CPU machine. On a four CPU machine:

$ time /usr/bin/perl thread.pl 500000 > /dev/null
11.212u 23.806s 0:18.00 194.5% 0+0k 0+0io 0pf+0w

$ time /usr/bin/perl parallel_proc.pl 500000 > /dev/null
1.631u 0.572s 0:01.65 133.3% 0+0k 0+0io 0pf+0w

So apparently threads scale much better than processes which is no
wonder because threads only require a one-time initialization for
creating the threads whereas Parallel::ForkManager constantly has to
create and terminate new processes.

While this is true, it is not particularly relevant to the poster's
problem. There are cases where threading wins hands down. This original
problem is not one of them.
Besides, I find the code with threads much easier to understand: Each
thread has a function that it executes. With fork it's more tricky.
You first have to find the call to fork() and then one of the code
branches after that is for the child, the other one for the parent.

I somewhat agree. Parallel::ForkManager was (apparently) designed so that
you can usually take code originally written to be serial, and make it
parallel by simply adding 2 carefully-placed lines (plus 3 house-keeping
lines). The threaded code is pretty much written from the ground up to be
threaded. The threaded code structure tends to be dominated by the
threading, while the ForkManager code tends to be dominated by whatever you
are fundamentally trying to do, with just a few lines making a nod to the
parallelization. This makes it easier to thoughtlessly add code that breaks
parallelization under ForkManager. So when I substantially refactor code
that uses ForkManager, I simply remove the parallelization, refactor the
code as serial code, then add ForkManager back in at the end.

Oh, one more thing I discovered. Threaded code with a shared queue is
tricky to do if the queue holds references or objects.

If you change the queue to:

my @queue : shared = map {[$_]} 1 .. shift;

Then it dies with "Invalid value for shared scalar". Since the forking
code doesn't use shared values, it doesn't have this particular problem.

You can circumvent this with the rather ugly:

my @queue : shared = map {my $x =[$_]; share $x; $x} 1 .. shift;

With blessed references, this doesn't work.
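For the original task, though, the queue only needs plain file-name strings, which share without any workaround (a sketch, with %files populated as in the original post):

use threads;
use threads::shared;

# assumed: %files maps log name => size in bytes
my %files = map { $_ => -s $_ } glob '/var/log/log*';

# plain strings go into a shared array with no trouble
my @queue : shared = sort { $files{$b} <=> $files{$a} } keys %files;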

(Yes, I know that this is rather far afield of the original question, but
while I'm dissertating on the subject I might as well go whole hog.)

Xho
 

xhoster

it_says_BALLS_on_your forehead said:
...i had initially assumed that i was creating 5 processes, and each
one would finish a job, and go back to the 'queue' (in this case, the
array). so i would see the same 5 pids over and over again. is this not
the case? does the number 5 here represent the number of processes that
are active or in existence *at one time*? so that there will be a total
of
scalar(@data) processes created overall, and once one is done, it will
die (not be re-used) and a new one will be created to replace it?

Correct.

Xho
 

xhoster

it_says_BALLS_on_your forehead said:
Hey Xho, I tried this:
----
#!/apps/webstats/bin/perl

use File::Copy;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(5);

$pm->run_on_start(
sub { my ($pid,$ident)=@_;
print "** $ident started, pid: $pid\n";
}
);

my @data = 1 ... shift;
for (@data) {
my $pid = $pm->start and next;
print "$pid: $_\n";
$pm->finish;
}

$pm->wait_all_children;
------------
and got this:
#####
[smro180 123] ~/simon/1-perl > tryFork.pl 10
** started, pid: 16208
0: 1
** started, pid: 16209
0: 2
** started, pid: 16210 ....

...I read this:
start [ $process_identifier ]
This method does the fork. It returns the pid of the child process for
the parent, and 0 for the child process. If the $processes parameter
for the constructor is 0 then, assuming you're in the child process,
$pm->start simply returns 0.

An optional $process_identifier can be provided to this method... It is
used by the "run_on_finish" callback (see CALLBACKS) for identifying
the finished process.

and this:
run_on_start $code
You can define a subroutine which is called when a child is started. It
called after the successful startup of a child in the parent process.

The parameters of the $code are the following:

- pid of the process which has been started
- identification of the process (if provided in the "start" method)

...but I don't understand why in my: print "$pid: $_\n";
line, i'm getting 0 as the pid. I know the documentation said i should
get 0 for the child process and the child pid for the parent, but
aren't i calling start on the parent?

You are calling "start" *in* the parent, but is returning in both the
parent and child process. Inside, "start" does a fork, so when "start"
ends there are two processes. The parent process gets the child's pid,
which means the "and next" is activated. The child gets zero, so the "and
next" is not activated. This means everything between the start and the
finish statements are done in one of the children, not in the parent.
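In other words (a sketch annotating the pattern; process_file() is hypothetical):

use strict;
use Parallel::ForkManager;

my $pm    = Parallel::ForkManager->new(20);
my @files = glob '/var/log/log*'; # assumed input

foreach my $file (@files) {
    my $pid = $pm->start and next;
    # parent: start() returned the child's pid (true), so "and next" fires
    #         and the parent moves straight on to the next $file.
    # child:  start() returned 0 (false), so it falls through to here.
    process_file($file); # hypothetical per-file work
    $pm->finish;         # child exits here; the parent never runs this line
}
$pm->wait_all_children;

sub process_file { my ($file) = @_; } # stub so the sketch compiles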

The example I posted was just copied and modified from perldoc, and for
some reason they do capture the pid. In practice I almost never capture
it:

$pm->start and next;

If the child needs its own pid, it gets it from $$. Why would I need
the parent to know the child's pid? Usually I don't, because the module
itself takes care of all the waiting and stuff for me.

I rarely use anything except new, start, finish, and wait_all_children,
except to goof around with. Once your needs get more complicated than
those simple methods, I find that things get hairy real quick.

BTW, I'm curious about the bottleneck in your code. If your code is
CPU-bound, then parallelization to 20 processes won't help much unless you
have 20 CPUs. If it is disk-drive bound, then parallelization won't help
unless your files are on different disks (and probably on different
controllers.)

Xho
 

it_says_BALLS_on_your forehead

it_says_BALLS_on_your forehead said:
Hey Xho, I tried this:
----
#!/apps/webstats/bin/perl

use File::Copy;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(5);

$pm->run_on_start(
sub { my ($pid,$ident)=@_;
print "** $ident started, pid: $pid\n";
}
);

my @data = 1 ... shift;
for (@data) {
my $pid = $pm->start and next;
print "$pid: $_\n";
$pm->finish;
}

$pm->wait_all_children;
------------
and got this:
#####
[smro180 123] ~/simon/1-perl > tryFork.pl 10
** started, pid: 16208
0: 1
** started, pid: 16209
0: 2
** started, pid: 16210 ...

...I read this:
start [ $process_identifier ]
This method does the fork. It returns the pid of the child process for
the parent, and 0 for the child process. If the $processes parameter
for the constructor is 0 then, assuming you're in the child process,
$pm->start simply returns 0.

An optional $process_identifier can be provided to this method... It is
used by the "run_on_finish" callback (see CALLBACKS) for identifying
the finished process.

and this:
run_on_start $code
You can define a subroutine which is called when a child is started. It
called after the successful startup of a child in the parent process.

The parameters of the $code are the following:

- pid of the process which has been started
- identification of the process (if provided in the "start" method)

...but I don't understand why in my: print "$pid: $_\n";
line, i'm getting 0 as the pid. I know the documentation said i should
get 0 for the child process and the child pid for the parent, but
aren't i calling start on the parent?

You are calling "start" *in* the parent, but is returning in both the
parent and child process. Inside, "start" does a fork, so when "start"
ends there are two processes. The parent process gets the child's pid,
which means the "and next" is activated. The child gets zero, so the "and
next" is not activated. This means everything between the start and the
finish statements are done in one of the children, not in the parent.

The example I posted was just copied and modified from perldoc, and for
some reason they do capture the pid. In practise I almost never capture
it:

$pm->start and next;

If the child needs it's own pid, it gets it from $$. Why do I need
the parent to know the child's pid? Usually I don't, because the module
itself takes care of all the waiting and stuff for me.

I rarely use anything except new, start, finish, and wait_all_children,
except to goof around with. Once your needs get more complicated than
those simple methods, I find that things get hairy real quick.

BTW, I'm curious about the bottleneck in your code. If your code is
CPU-bound, then parallelization to 20 processes won't help much unless you
have 20 CPUs. If it is disk-drive bound, then parallelization won't help
unless your files are on different disks (and probably on different
controllers.)

Xho

ahh, that makes sense, thanks!

to answer your question, i'm working on a box with 16 CPUs. the number
20 is from code that i inherited from a predecessor. there used to be
10 processes, and he changed it to 20, and it went faster, so 20 it
stayed. should i change it to 16?

also, what's the difference between using Parallel::ForkManager to do
20 tasks, and looping through system('script.pl &') 20 times? i mean, i
see an advantage in that with ForkManager, when one process dies,
another takes its place, so you don't need to pre-ordain which process
does which work. but let's assume that each process has exactly the
same amount of work and all processes work at the same speed. would
ForkManager be faster? is there ever a case where multiple system()
calls are the answer?

 

xhoster

it_says_BALLS_on_your forehead said:
ahh, that makes sense, thanks!

to answer your question, i'm working on a box with 16 CPUs. the number
20 is from code that i inherited from a predecessor. there used to be
10 processes, and he changed it to 20, and it went faster, so 20 it
stayed. should i change it to 16?

There is no real reason to use a max of 20 (or anything else greater than
16) if you only have 16 CPUs, but using 20 shouldn't hurt, either. There
may be slightly more kernel overhead with 20 than 16, but on modern OSes
the difference will be negligible.

One way it could matter is if other process are also using the machine.
For example, if there are 8 other CPU-bound processes (all of same nice
level as yours), then if you use max of 16 you will get ~16/24 of the CPU
time and the other programs will get ~8/24, while if you use a max of 20,
then you will get ~20/28 of the time and they will get ~8/28.

also, what's the difference between using Parallel::ForkManager to do
20 tasks, and looping through system('script.pl &') 20 times?

In that case (when the number of tasks to do is less than or equal to the
maximum number of processes you want running at one time) the biggest
difference is that ForkManager forks a copy of the currently running
perl interpreter, which is very fast, while system("foo.pl&") will cause a
brand new interpreter to start up, which is a lot slower than forking a
copy.[1] (Although both of these are fast enough it doesn't really matter
if you are only doing it 20 times. (and if you are doing it far more than
20 times, then the initial assumption of this paragraph is not met))

Of course, this also means that the new interpreter started with "system"
is independent of the old one. This could be good (i.e. its name space
is not contaminated, it doesn't corrupt the parent's database handles, its
file locks are independent from the parent's etc.) but is at least as
likely to be bad (it has to redo the "use"s and reparse any input that
might already have been "use"d or parsed in the original interpreter, it
doesn't readily inherit its parent's file locks, etc.)
i mean, i
see an advantage in that with ForkManager, when one processes dies,
another takes its place so you don't need to pre-ordain which process
does which work.

But of course that only matters if there are more tasks to run than there
are simultaneous processes you wish to allow. ForkManager is great if you
want to run 400 tasks, but want to have only 20 going at any one time. If
you only want to run 20 in the first place and will do them all at the same
time, that removes a lot (but not all) of the benefit of using ForkManager.
but let's assume that each process has exactly the
same amount of work and processes that work with the same speed. would
ForkManager be faster?

Probably, but only because simple forking is faster than "system".
Is there ever a case where multiple system()
calls is the answer?

If the child processes spawned by ForkManager have to call "system" or the
moral equivalent anyway (for example, because the task you need to run is
an external executable, not Perl code. Or it is Perl code but you need to
run it in a standalone interpreter for some reason) then there is no reason
to combine both ForkManager and system, just go with system. But again,
this only applies if you don't need to throttle the number of simultaneous
processes.

If the tasks you need to do are very short and you need to do a very large
number of them (400 is *not* a very large number), you should try to batch
them up so that each process/thread does a bunch of them. This is a basic
matter of overhead, and is largely true regardless of whether you use
ForkManager, system, threads or something else.
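A sketch of that batching with ForkManager (the chunk size and do_task() are placeholders):

use strict;
use Parallel::ForkManager;

my $pm        = Parallel::ForkManager->new(20);
my @all_tasks = 1 .. 100_000; # assumed: many very short tasks
my $chunk     = 500;          # how many tasks each child handles

while (my @batch = splice(@all_tasks, 0, $chunk)) {
    my $pid = $pm->start and next;
    do_task($_) for @batch;   # hypothetical per-task work
    $pm->finish;
}
$pm->wait_all_children;

sub do_task { my ($task) = @_; } # stub so the sketch compiles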


Xho


[1] Actually, "system" will internally first fork a copy of the current
interpreter, and then do an exec, which discards this forked copy and
causes a new interpreter to be started. At least on Linux/Unix, which I
suspect you are using as I've never seen a 16 CPU Windows machine.
 
