I am analyzing some network log files. There are around 200-300 files and
each file has more than 2 million entries in it.
Currently my script is reading each file line by line, so it takes a
lot of time to process all the files.
Is there any efficient way to do it?
Maybe multiprocessing or multithreading?
Here is one way to do it using Parallel::ForkManager.
If your system is somewhat typical, you'll probably run into an I/O
bottleneck before you run into a CPU bottleneck.
For example:
C:\DOCUME~1\asu1\LOCALS~1\Temp\large> cat create.pl
#!/usr/bin/perl

use strict;
use warnings;

# one tab-separated record of 20 fields, reused for every line
my $line = join("\t", qw( 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 )) . "\n";

my $fn_tmpl = 'data_%2.2d.txt';
my $fn = sprintf $fn_tmpl, 0;

open my $out, '>', $fn
    or die "Cannot open '$fn': $!";

for (1 .. 100_000) {
    print $out $line
        or die "Cannot write to '$fn': $!";
}

close $out
    or die "Cannot close '$fn': $!";

# create 19 more identical input files by copying the first one
for (1 .. 19) {
    system copy => $fn, sprintf($fn_tmpl, $_);
}
C:\DOCUME~1\asu1\LOCALS~1\Temp\large> timethis create
....
TimeThis : Command Line : create
TimeThis : Start Time : Thu Dec 11 18:14:12 2008
TimeThis : End Time : Thu Dec 11 18:14:16 2008
TimeThis : Elapsed Time : 00:00:03.468
Now, you have 20 input files with 100_000 lines each:
C:\DOCUME~1\asu1\LOCALS~1\Temp\large> dir
....
2008/12/11 06:14 PM 4,100,000 data_00.txt
2008/12/11 06:14 PM 4,100,000 data_01.txt
2008/12/11 06:14 PM 4,100,000 data_02.txt
2008/12/11 06:14 PM 4,100,000 data_03.txt
2008/12/11 06:14 PM 4,100,000 data_04.txt
2008/12/11 06:14 PM 4,100,000 data_05.txt
2008/12/11 06:14 PM 4,100,000 data_06.txt
2008/12/11 06:14 PM 4,100,000 data_07.txt
2008/12/11 06:14 PM 4,100,000 data_08.txt
2008/12/11 06:14 PM 4,100,000 data_09.txt
2008/12/11 06:14 PM 4,100,000 data_10.txt
2008/12/11 06:14 PM 4,100,000 data_11.txt
2008/12/11 06:14 PM 4,100,000 data_12.txt
2008/12/11 06:14 PM 4,100,000 data_13.txt
2008/12/11 06:14 PM 4,100,000 data_14.txt
2008/12/11 06:14 PM 4,100,000 data_15.txt
2008/12/11 06:14 PM 4,100,000 data_16.txt
2008/12/11 06:14 PM 4,100,000 data_17.txt
2008/12/11 06:14 PM 4,100,000 data_18.txt
2008/12/11 06:14 PM 4,100,000 data_19.txt
Here is a simple program to process the data:
C:\DOCUME~1\asu1\LOCALS~1\Temp\large> cat process.pl
#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;

# maximum number of concurrent child processes, taken from the command line
my ($instances) = @ARGV;

my $fn_tmpl = 'data_%2.2d.txt';

my $pm = Parallel::ForkManager->new($instances);

for my $i (0 .. 19) {
    # in the parent, start() returns the child's PID, so the parent
    # skips to the next file; the child falls through and does the work
    $pm->start and next;

    my $input = sprintf $fn_tmpl, $i;
    eval {
        open my $in, '<', $input
            or die "Cannot open '$input': $!";
        while ( my $line = <$in> ) {
            my @data = split /\t/, $line;
            # replace with your own processing code
            # don't try to keep all your data in memory
        }
        close $in
            or die "Cannot close '$input': $!";
    };
    warn $@ if $@;

    $pm->finish;
}

$pm->wait_all_children;

__END__
First, try without forking to establish a baseline (a maximum of 0
tells Parallel::ForkManager not to fork at all):
C:\DOCUME~1\asu1\LOCALS~1\Temp\large> timethis process 0
TimeThis : Command Line : process 0
TimeThis : Start Time : Thu Dec 11 18:31:50 2008
TimeThis : End Time : Thu Dec 11 18:32:41 2008
TimeThis : Elapsed Time : 00:00:51.156
Let's try a few more:
TimeThis : Command Line : process 2
TimeThis : Start Time : Thu Dec 11 18:35:15 2008
TimeThis : End Time : Thu Dec 11 18:35:58 2008
TimeThis : Elapsed Time : 00:00:43.578
TimeThis : Command Line : process 4
TimeThis : Start Time : Thu Dec 11 18:36:17 2008
TimeThis : End Time : Thu Dec 11 18:36:59 2008
TimeThis : Elapsed Time : 00:00:41.921
TimeThis : Command Line : process 8
TimeThis : Start Time : Thu Dec 11 18:37:18 2008
TimeThis : End Time : Thu Dec 11 18:38:00 2008
TimeThis : Elapsed Time : 00:00:41.328
TimeThis : Command Line : process 16
TimeThis : Start Time : Thu Dec 11 18:38:18 2008
TimeThis : End Time : Thu Dec 11 18:38:58 2008
TimeThis : Elapsed Time : 00:00:40.734
TimeThis : Command Line : process 20
TimeThis : Start Time : Thu Dec 11 18:39:17 2008
TimeThis : End Time : Thu Dec 11 18:39:58 2008
TimeThis : Elapsed Time : 00:00:40.578
Not very impressive: between no forking and a maximum of 20 instances,
the time required to process the files dropped by only about 20%, with
most of the gain coming from running just 2 instances. That probably
has more to do with the implementation of fork on Windows (where fork
is emulated using threads) than anything else.
In fact, I should probably have used threads on Windows. Anyway, I'll
boot into Linux and see if the returns there are greater.
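For the curious, a threaded version might look roughly like this. This
is an untested sketch: the worker-pool-plus-queue arrangement and the
fallback to one worker are my own choices, not code from the benchmark
above.

#!/usr/bin/perl

use strict;
use warnings;

use threads;
use Thread::Queue;

my ($instances) = @ARGV;
$instances ||= 1;    # need at least one worker thread

my $fn_tmpl = 'data_%2.2d.txt';

# queue of file names; idle workers pull the next one as they finish,
# so files are handed out on demand rather than pre-assigned
my $queue = Thread::Queue->new;
$queue->enqueue( map { sprintf $fn_tmpl, $_ } 0 .. 19 );

# one undef per worker acts as an end-of-work marker
$queue->enqueue( (undef) x $instances );

my @workers = map { threads->create(\&process_files) } 1 .. $instances;
$_->join for @workers;

sub process_files {
    while ( defined( my $input = $queue->dequeue ) ) {
        open my $in, '<', $input
            or do { warn "Cannot open '$input': $!"; next };
        while ( my $line = <$in> ) {
            my @data = split /\t/, $line;
            # replace with your own processing code
        }
        close $in
            or warn "Cannot close '$input': $!";
    }
    return;
}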
Try this simple experiment on your system and see how many instances
give you the best bang for the buck.
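Note that process.pl hard-codes 20 generated file names. For your
200-300 real logs, a glob can supply the file list instead; a sketch,
assuming the logs match '*.log' in the current directory (adjust the
pattern to suit your files):

#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;

my ($instances) = @ARGV;

# glob the real log files instead of generating names from a template
my @files = glob '*.log';

my $pm = Parallel::ForkManager->new($instances);

for my $input (@files) {
    $pm->start and next;

    eval {
        open my $in, '<', $input
            or die "Cannot open '$input': $!";
        while ( my $line = <$in> ) {
            my @data = split /\t/, $line;
            # replace with your own processing code
        }
        close $in
            or die "Cannot close '$input': $!";
    };
    warn $@ if $@;

    $pm->finish;
}

$pm->wait_all_children;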
Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)
comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/