Can this be more efficient?


Colin chaplin

Hi All

I've got a little quick and dirty perl program that scans through logfiles
for certain words and counts them up. Trouble is, it's quite slow and I was
wondering if you smart chaps could suggest a way of speeding it up. The
logfiles are about 50MB in size, and I'm running it on a reasonably powerful
PC.

The pseudocode is:

For each logfile in the directory:
read into an @rray
close file
for each line of array
do text processing stuff

print results on screen

I've played about with not reading into an array, just reading the file line
by line; no obvious change either way.

As I said, it's written using basic perl data types, so I was wondering if
there is a quicker way to store and process this data?
 

Paul Lalli

Colin chaplin said:
Hi All

I've got a little quick and dirty perl program that scans through logfiles
for certain words and counts them up. Trouble is, it's quite slow and I was
wondering if you smart chaps could suggest a way of speeding it up. The
logfiles are about 50MB in size, and I'm running it on a reasonably powerful
PC.

The pseudocode is:

For each logfile in the directory:
read into an @rray
close file
for each line of array
do text processing stuff

print results on screen

I've played about with not reading into an array, just reading the file line
by line; no obvious change either way.

I don't know about speed efficiency, but this is certainly more
efficient space-wise.

How are you determining "no obvious change"? Have you used the
Benchmark module?
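
If not, something like this should give you some real numbers for the two
approaches (untested; 'some.log' stands in for one of your real logfiles,
and /\bsearchedWord\b/ for whatever you're actually counting):

use Benchmark qw(cmpthese);

my $file = 'some.log';    # substitute one of your real logfiles

cmpthese( 5, {
    slurp_to_array => sub {
        open my $fh, '<', $file or die $!;
        my @lines = <$fh>;                          # whole file into an array first
        my $count = grep /\bsearchedWord\b/, @lines;
    },
    line_by_line   => sub {
        open my $fh, '<', $file or die $!;
        my $count = 0;
        while ( <$fh> ) { $count++ if /\bsearchedWord\b/ }
    },
} );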

Also, is there any reason you can't just use grep, rather than an entire
perl script?

If you really want Perl...

perl -ne '$words++ if /\bsearchedWord\b/; END { print "$words words found\n" }' *.log

(Note that this assumes the word will only appear a maximum of once per
line.)
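
If it can appear more than once per line, counting every match instead
needs only a small change, something like (untested):

perl -ne '$words += () = /\bsearchedWord\b/g; END { print "$words words found\n" }' *.log

The =()= forces list context on the /g match, so $words gets the number of
matches on each line rather than just 0 or 1.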

Paul Lalli
 

Colin chaplin

I don't know about speed efficiency, but this is certainly more
efficient space-wise.

How are you determining "no obvious change"? Have you used the
Benchmark module?
By the very scientific "gosh this is taking a long time" make-a-cup-of-tea,
tidy-up, "oh it's still going" approach. I was looking for a magic bullet,
perhaps, that'd make it work in a matter of seconds. Not so much to speed
this up, as it's only going to run a few times, but so that in future when I
do something similar it's better.

Also, is there any reason you can't just use grep, rather than an entire
perl script?

I was being a little brief with my description; there's quite a lot of
pattern matching going on there that I use a function for. Almost do-able in
grep I guess, but apart from speed the program works pretty well so I'm happy
to keep with it. I'd post it here but I'd probably only get slated for coding
style >:)

If you really want Perl...

perl -ne '$words++ if /\bsearchedWord\b/; END { print "$words words found\n" }' *.log

I can't get that syntax to run from the command line, not sure why (NT
platform "can't find string terminator ' before EOF"), but if I understand
the -ne switch it's effectively opening every file, going through it line
by line, which is pretty much what I'm doing now?

I did just have a look at $/, but I'm not sure that would help me out.
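
As far as I can tell, all $/ would really let me do is slurp each file into
one big string, something like

my $text = do {
    local $/;                                  # undef $/ so <> reads everything
    open my $fh, '<', $filename or die $!;     # $filename = one of my logs
    <$fh>;                                     # whole file in one scalar
};

and I'm not sure that gets me anywhere.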

Thanks for your time!

Colin
 

Tad McClellan

Colin chaplin said:
for certain words and counts them up. Trouble is, it's quite slow

The pseudocode is:

For each logfile in the directory:
read into an @rray
close file
for each line of array
do text processing stuff

print results on screen


The pseudo answer is: improve the "do text processing stuff" part.

Of course we can't help you with that, since we cannot see that part...
 

Paul Lalli

Colin chaplin said:
Paul Lalli wrote


I was being a little brief with my description,

You should always make your description as complete as possible.
How else can you expect people to help you do what you want to do?
there's quite a lot of
pattern matching going on there that I use a function for. Almost do-able in
grep I guess, but apart from speed the program works pretty well so I'm happy
to keep with it. I'd post it here but I'd probably only get slated for coding
style >:)

You say that as though it's a bad thing...
I can't get that syntax to run from the command line, not sure why (NT
platform "can't find string terminator ' before EOF")

Windows requires args to be delimited by " instead of '. Which means
you'll have to backslash all the " in the code.
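
So on NT it would look something like this (untested on Windows; note that
cmd.exe won't expand *.log for you, so depending on your perl build you may
need to glob the filenames yourself):

perl -ne "$words++ if /\bsearchedWord\b/; END { print \"$words words found\n\" }" *.log
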
but if I understand
the -ne switch it's effectively opening every file, going through it line
by line, which is pretty much what I'm doing now?

No. What you said you were currently doing is reading the file line by
line, storing each line onto an array, and then going through the array
line by line. Quite different.


Paul Lalli
 

xhoster

Colin chaplin said:
Hi All

I've got a little quick and dirty perl program that scans through
logfiles for certain words and counts them up. Trouble is, it's quite
slow and I was wondering if you smart chaps could suggest a way of
speeding it up. The logfiles are about 50MB in size, and I'm running it
on a reasonably powerful PC.

The pseudocode is:

For each logfile in the directory:
read into an @rray
close file
for each line of array
do text processing stuff

print results on screen

So, how about not doing the text processing stuff? If it is still slow,
then you know the slow step is one of the earlier steps. If not still
slow, you know it is the text processing stuff. Then you can show us the
real code used to do the part that is slow. And give us real numbers, so we
can try to replicate the problem.
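
Even something as crude as this (untested) would tell you where the time is
actually going; process_line() is just a stand-in for your real text
processing:

use Time::HiRes qw(time);

my $logfile = shift @ARGV or die "usage: $0 logfile\n";

my $t0 = time;
open my $fh, '<', $logfile or die "can't open $logfile: $!";
my @lines = <$fh>;
close $fh;
printf "reading:    %.2f seconds\n", time - $t0;

$t0 = time;
# process_line($_) for @lines;    # put your real text processing here
printf "processing: %.2f seconds\n", time - $t0;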

Xho
 

Ala Qumsieh

Colin said:
I'd post it here but I'd probably only get slated for coding
style >:)

What you're saying is akin to me going to a car mechanic and telling her
"my car is broken. how do I fix it?" without me allowing her to see the car.

You won't get any useful replies like that. Plus, a little critique of
your coding style should be a Good Thing (tm).

--Ala
 

Colin chaplin

Ala Qumsieh said:
What you're saying is akin to me going to a car mechanic and telling her
"my car is broken. how do I fix it?" without me allowing her to see the car.

You won't get any useful replies like that. Plus, a little critique of
your coding style should be a Good Thing (tm).


Ok, here goes, warts and all (I removed names to protect the guilty). The
program reads through Mailsweeper log files and spits out how many emails
for certain domains are recorded in the logs, and also a list of recipients
and number of emails they receive. The log is a trimmed version of an SMTP
communication.


Please note:
* I'm not a programmer, be gentle
* This isn't a production program
* Clever regular expressions make my head explode when I look at them a
while after writing the code
* I'm barely a techy these days
* Don't run with scissors

In answer to another post, I had the program set up so that it read line by
line, and changed it to gobble the entire file at once to see if it would have
a performance impact.

Thanks to all for taking the time!

&opendir (".");
$DOMS{'thisdom'}=12;

sub opendir
{
# Recursively goes through directories and process all mailsweeper files
my $dir=$_[0];
my $name;
opendir(DIRHANDLE, $dir) || die "Cannot opendir $dir: $!";
foreach $name (sort readdir(DIRHANDLE))
{
if (($name eq '.') || ($name eq '..') || (lc($name) eq lc($currentlog)))
{



}
else
{

# Only open the file if it looks like a MSW Logfile
if ($name =~ /\Aopr[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\.log/i)
{

&processFile($dir . "\\" . $name)

}
if (-d $dir . "\\" . $name)
{
print "opening " .$dir . "\\" . $name ."\n";
&opendir($dir . "\\" . $name);
}
}
}

closedir(DIRHANDLE);
}



sub processFile
{
my $filename=$_[0];
my @filetext;

print " Opening $filename\n";
open (INFILE,$filename) || die ("EK $!: $filename");
while (<INFILE>)
{
push(@filetext,$_);
}
close (INFILE);
foreach (@filetext)
{
if (/RCPT\sTO\:/i)
{
#print "[$_]\n";

if (/(\<.*?\>)/)
{

$bits=$1;
$bits =~ s/\<//g;
$bits =~ s/\>//g;
$bits =~ s/RCPT TO\://g;
$bits =~ s/\s//g;
$bits = lc ($bits);
if (&valid_domain($bits))
{
$DOMS{$bits}++;
}
# print $bits;
}
}
# else {print "NO: $_\n";}
}

}
$DOMMER{'testdom'}=5;

open (OUTFILE,">allemails.csv") || die ("eeK:$!");
open (OUTF,">alldoms.csv") || die ("yikes:$!");
foreach $key ( sort by_this keys %DOMS )
{

($pre,$aft) = split(/\@/,$key);
$DOMMER{$aft}=$DOMMER{$aft}+1;
print OUTFILE "\"$key\",\"$DOMS{$key}\"\n";

}

foreach $key ( sort by_this keys %DOMMER )
{
print OUTF "\"$key\",\"$DOMMER{$key}\"\n";

}


sub by_this
{
($tat,$abit)=split (/\@/,$a);
($tat,$bbit)=split (/\@/,$b);
($abit cmp $bbit) || ($a cmp $b);
}

sub valid_domain
{
#decide if this is a xxxxx doman

my $dom=$_[0];
($tat,$dom)=split(/\@/,$dom);
if ($dom =~ (/thisdomain\.co\.uk/)) {return 1;}
if ($dom =~(/thatdomain.co\.uk/)) {return 1;}
if ($dom =~(/otherdomain\.com/)) {return 1;}

return (0);
}
 

Martin Kissner

Colin chaplin wrote :
I was being a little brief with my description, there's quite a lot of
pattern matching going on there that I use a function for.

The regexes might be the reason why the script takes such a long time
to run.
I am pretty new to perl and I am reading the Camel Book.
In Chapter 5, Pattern Matching, there are examples which are said to
take years (actually millions of years) to run because of backtracking.

So one approach to improve the performance of your script might be the
optimization of your regexes.
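
The classic toy example is a pattern with nested quantifiers that has to
fail: the engine then tries an enormous number of ways to carve up the
string before giving up. Something like

my $string = ( 'a' x 30 ) . '!';
print "matched\n" if $string =~ /^(a+)+$/;    # never matches, but backtracks for ages

can take a very long time, because each extra 'a' roughly doubles the work
even though the string is tiny.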

HTH
Martin
 

Fabian Pilkowski

* Colin chaplin said:
Please note:
* I'm not a programmer, be gentle

.... but learning something new can never be wrong, can it?
* This isn't a production program

.... but speeding up a program is usually interesting.
* Clever regular expressions make my head explode when I look at them a
while after writing the code

.... when clever regexes are necessary, I use them (and write a short
comment beside them to remind myself later).
* I'm barely a techy these days
* Don't run with scissors

In answer to another post, I had the program set up so that it read line by
line, and changed it to gobble the entire file at once to see if it would have
a performance impact.

Thanks to all for taking the time!

&opendir (".");
$DOMS{'thisdom'}=12;

sub opendir {
# Recursively goes through directories and process all mailsweeper files
[...]

}

To me it looks smarter to do such a recursive directory traversal with
a standard perl module like File::Find. Your sub could look like this:

use File::Find;

sub my_opendir {
    my $dir = shift;
    find( sub {
        # $_ contains the current filename only, while
        # $File::Find::name contains the complete pathname
        if ( $_ =~ /opr\d{8}\.log$/i ) {
            processFile( $File::Find::name );
        }
    }, $dir );
}

Btw, it's always a bad idea to give your own functions the same names as
perl built-ins. Hence I named your sub »my_opendir« instead of »opendir«.
sub processFile {
my $filename=$_[0];
my @filetext;
print " Opening $filename\n";
open (INFILE,$filename) || die ("EK $!: $filename");
while (<INFILE>){
push(@filetext,$_);
}
close (INFILE);

Eh, this will read the complete file into the array @filetext. That can
eat a lot of memory with a 50MB logfile, and it isn't needed here (see my
version below).
foreach (@filetext){
if (/RCPT\sTO\:/i){
#print "[$_]\n";
if (/(\<.*?\>)/){
$bits=$1;
$bits =~ s/\<//g;
$bits =~ s/\>//g;

If you don't capture those angle brackets into $1, you don't have to delete
them afterwards.
$bits =~ s/RCPT TO\://g;
$bits =~ s/\s//g;
$bits = lc ($bits);

The search pattern above is modified by »i«, i.e. it matches case-insensitively.
Here you delete only the uppercase occurrences of "RCPT TO:" but not the
lowercase ones. Perhaps you should do the lc() first.
if (&valid_domain($bits)){
$DOMS{$bits}++;
}
# print $bits;
}
}
# else {print "NO: $_\n";}
}
}

Using a while loop to read in a file line by line:

sub processFile {
    my $filename = shift;
    print " Opening $filename\n";
    open INFILE, $filename or die "EK $!: $filename";
    while ( <INFILE> ) {
        if ( /RCPT\sTO\:/i and /<(.*?)>/ ) {
            my $bits = lc $1;
            $bits =~ s/RCPT TO\://g;
            $bits =~ s/\s//g;
            $DOMS{$bits}++ if valid_domain($bits);
        }
    }
    close INFILE;
}
$DOMMER{'testdom'}=5;

open (OUTFILE,">allemails.csv") || die ("eeK:$!");
open (OUTF,">alldoms.csv") || die ("yikes:$!");
foreach $key ( sort by_this keys %DOMS )
{

($pre,$aft) = split(/\@/,$key);
$DOMMER{$aft}=$DOMMER{$aft}+1;
print OUTFILE "\"$key\",\"$DOMS{$key}\"\n";

Before using such backslashed quotes, learn more about perl's quoting
operators like q// and qq//.

print OUTFILE qq{"$key","$DOMS{$key}"\n};
}

foreach $key ( sort by_this keys %DOMMER )
{
print OUTF "\"$key\",\"$DOMMER{$key}\"\n";

}


sub by_this
{
($tat,$abit)=split (/\@/,$a);
($tat,$bbit)=split (/\@/,$b);
($abit cmp $bbit) || ($a cmp $b);
}

If a function returns values which aren't interesting at the moment, just
throw them away:

( undef, $abit ) = split /@/, $a;
sub valid_domain
{
#decide if this is a xxxxx doman

my $dom=$_[0];
($tat,$dom)=split(/\@/,$dom);

See above and throw $tat away ;-)
if ($dom =~ (/thisdomain\.co\.uk/)) {return 1;}
if ($dom =~(/thatdomain.co\.uk/)) {return 1;}
if ($dom =~(/otherdomain\.com/)) {return 1;}

In Perl you can write the if as a statement modifier at the end. In cases
like this I think it's more readable due to fewer parentheses ;-)

return 1 if $dom =~ /thisdomain\.co\.uk/;
return 1 if $dom =~ /thatdomain\.co\.uk/;
return 1 if $dom =~ /otherdomain\.com/;

You could do it with »or« also:

return 1 if $dom =~ /thisdomain\.co\.uk/
or $dom =~ /thatdomain\.co\.uk/
or $dom =~ /otherdomain\.com/;
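
Or put all three into one pattern with alternation, so the regex engine is
started only once per call (untested, but it should accept the same strings
as the three separate matches):

return 1 if $dom =~ /thisdomain\.co\.uk|thatdomain\.co\.uk|otherdomain\.com/;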

I hope you like my suggestions.

regards,
fabian
 

Ala Qumsieh

Colin said:
&opendir (".");

In general, it's not a good idea to use & when calling a subroutine.
This was necessary for Perl4 and before, but that is at least 10 years
old. Now, they are not needed, and using them has some side effects
which are described in perlsub.
$DOMS{'thisdom'}=12;

sub opendir
{
# Recursively goes through directories and process all mailsweeper files

As suggested by another poster, it's better to use File::Find in this case.
my $dir=$_[0];
my $name;
opendir(DIRHANDLE, $dir) || die "Cannot opendir $dir: $!";
foreach $name (sort readdir(DIRHANDLE))
{
if (($name eq '.') || ($name eq '..') || (lc($name) eq lc($currentlog)))
{



}

An empty if() body is never a nice thing to look at. I would change this to:

next if $name eq '.'
|| $name eq '..'
|| lc($name) eq lc($currentlog);

This also allows you to save one level of indentation, which can make
the code a bit more aesthetically pleasing. But that's subjective.
else
{

# Only open the file if it looks like a MSW Logfile
if ($name =~ /\Aopr[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\.log/i)

[0-9] is identical to \d. And you can match a specific number of \d's by
using curly braces:

if ($name =~ /\Aopr\d{8}\.log/i) {
{

&processFile($dir . "\\" . $name)

Perl allows you to use unix-style dir hierarchy delimiters, even on
Windows. This is more readable (IMO):

processFile("$dir/$name");
}
if (-d $dir . "\\" . $name)

That should've been an 'elsif', but I don't think it will improve much
in your case.

[snip]
open (INFILE,$filename) || die ("EK $!: $filename");
while (<INFILE>)
{
push(@filetext,$_);
}

The following is more efficient:
@filetext = <INFILE>;

But you don't need to do that ... see below.
close (INFILE);
foreach (@filetext)

Why not execute this loop while reading the file in the first place?

while (<INFILE>) {
if (/RCPT\sTO\:/i)
{
#print "[$_]\n";

if (/(\<.*?\>)/)
{

$bits=$1;
$bits =~ s/\<//g;
$bits =~ s/\>//g;
$bits =~ s/RCPT TO\://g;
$bits =~ s/\s//g;
$bits = lc ($bits);

All of the above can probably be optimized to this:

if (/RCPT TO:\s*<(.*?)\>/) {
my $bits = lc $1;

[more snips]
sub valid_domain
{
#decide if this is a xxxxx doman

my $dom=$_[0];
($tat,$dom)=split(/\@/,$dom);
if ($dom =~ (/thisdomain\.co\.uk/)) {return 1;}
if ($dom =~(/thatdomain.co\.uk/)) {return 1;}
if ($dom =~(/otherdomain\.com/)) {return 1;}

If you are matching literal strings, using 'eq' is faster than a regexp:

return 1 if $dom eq 'thisdomain.co.uk'
         || $dom eq 'thatdomain.co.uk'
         || $dom eq 'otherdomain.com';
return (0);
}
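
For what it's worth, if that list of domains ever grows, a hash lookup is
another option; checking stays a single operation no matter how many domains
you add. A rough (untested) sketch:

my %valid = map { $_ => 1 }
    qw( thisdomain.co.uk thatdomain.co.uk otherdomain.com );

sub valid_domain {
    my $dom = ( split /\@/, $_[0] )[1];    # the part after the @
    return $valid{$dom} ? 1 : 0;
}

Note that, like 'eq', this matches the domain exactly, whereas your original
regexes would also match anything that merely contains one of those strings.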

--Ala
 

Tad McClellan

Colin chaplin said:
Please note:
* I'm not a programmer, be gentle
* This isn't a production program


Then there is no need to optimize it. :)

* Clever regular expressions make my head explode when I look at them a
while after writing the code


Join the club.

sub opendir
[snip]

if (($name eq '.') || ($name eq '..') || (lc($name) eq lc($currentlog)))
{



}


An empty block should be a red flag.

You can use next to eliminate an entire level of indent:

next if $name eq '.' or $name eq '..' or lc $name eq lc $currentlog;

if ($name =~ /\Aopr[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\.log/i)


if ( $name =~ /\Aopr\d{8}\.log/i)

Now we can see that it is expecting 8 digit chars without getting
fingerprints on the screen from counting them.

sub processFile
[snip]

while (<INFILE>)
{
push(@filetext,$_);
}


You can replace all of that with:

@filetext = <INFILE>;

if (/(\<.*?\>)/)

Angle brackets are not special in regexes, there is no need to backslash them:

if (/(<.*?>)/)

$bits =~ s/\<//g;
$bits =~ s/\>//g;


tr/// will always be faster than s///g, so:

$bits =~ tr/<>//d;

sub by_this


That is a horridly poor choice of sub name...

{
($tat,$abit)=split (/\@/,$a);
($tat,$bbit)=split (/\@/,$b);
($abit cmp $bbit) || ($a cmp $b);
}


That can be sped up with a Schwartzian Transform, as described
in one of the sorting FAQs:

perldoc -q sort
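
For this particular sort it would look roughly like this (untested):

my @sorted =
    map  { $_->[0] }                                    # recover the original key
    sort { $a->[1] cmp $b->[1] || $a->[0] cmp $b->[0] }
    map  { [ $_, ( split /\@/ )[1] ] }                  # split each key only once
    keys %DOMS;

so each key is split once, instead of once per comparison.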

if ($dom =~ (/thisdomain\.co\.uk/)) {return 1;}


Pattern matching is not the Right Tool when what you want to match
is a constant string rather than a pattern:

if ( index( $dom, 'thisdomain.co.uk' ) >= 0 ) { return 1 }

index() will be faster than m// too.
 

John W. Krahn

Colin said:
What you're saying is akin to me going to a car mechanic and telling her
"my car is broken. how do I fix it?" without me allowing her to see the
car.

You won't get any useful replies like that. Plus, a little critique of
your coding style should be a Good Thing (tm).

Ok, here goes, warts and all (I removed names to protect the guilty). The
program reads through Mailsweeper log files and spits out how many emails
for certain domains are recorded in the logs, and also a list of recipients
and number of emails they receive. The log is a trimmed version of an SMTP
communication.


Please note:
* I'm not a programmer, be gentle
* This isn't a production program
* Clever regular expressions make my head explode when I look at them a
while after writing the code
* I'm barely a techy these days
* Don't run with scissors

In answer to another post, I had the program set up so that it read line by
line, and changed it to gobble the entire file at once to see if it would have
a performance impact.

Thanks to all for taking the time!

[snip code]

Perhaps this will work better: (UNTESTED)

use warnings;
use strict;
use File::Find;

my @valid_domains = qw(
    thisdomain.co.uk
    thatdomain.co.uk
    otherdomain.com
);

my %DOMS = ( thisdom => 12 );

find( sub {
    return unless /\Aopr\d{8}\.log\z/i;
    print " Opening $_\n";
    open INFILE, '<', $_ or die "EK $!: $_";
    while ( my $line = <INFILE> ) {
        next unless $line =~ /RCPT\sTO:/i and $line =~ /</ and $line =~ />/;
        for my $valid_domain ( @valid_domains ) {
            if ( $line =~ /<(\S+\@\S*\Q$valid_domain\E)>/ ) {
                $DOMS{ lc $1 }++;
            }
        }
    }
    close INFILE;
}, '.' );


sub by_this {
    ( split /\@/, $a )[ 1 ] cmp ( split /\@/, $b )[ 1 ] || $a cmp $b
}

open OUTFILE, '>', 'allemails.csv' or die "eeK:$!";
open OUTF, '>', 'alldoms.csv' or die "yikes:$!";

my %DOMMER = ( testdom => 5 );

for my $key ( sort by_this keys %DOMS ) {
    $DOMMER{ ( split /\@/, $key )[ 1 ] }++;
    print OUTFILE qq("$key","$DOMS{$key}"\n);
}

for my $key ( sort by_this keys %DOMMER ) {
    print OUTF qq("$key","$DOMMER{$key}"\n);
}

__END__




John
 

Tad McClellan

Ala Qumsieh said:
In general, it's not a good idea to use & when calling a subroutine.
This was necessary for Perl4 and before, but that is at least 10 years
old. Now, they are not needed, and using them has some side effects
which are described in perlsub.


Yes, but this happens to be the one case where using ampersand _is_
a good idea, namely when you have a collision between a user-defined
sub and a builtin and you want to call the user-defined one.

Perl allows you to use unix-style dir hierarchy delimiters, even on
Windows.


It isn't Perl that is OK with forward slashes, it is the Operating
System (apart from the command interpreter) that is OK with them.

You can use forward slashes on Windows even if you are not in
Perl (provided you are not in the command interpreter).
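
So from inside Perl, something like

open my $fh, '<', 'C:/logs/opr20050101.log' or die $!;    # made-up path

works fine on Windows; it is only when the path is handed to cmd.exe itself
that forward slashes tend to cause trouble.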
 

Thomas Kratz

Tad said:
You can use forward slashes on Windows even if you are not in
Perl (provided you are not in the command interpreter).

Even cmd.exe can handle forward slashes, provided you enclose the path or
filename in double quotes.

Thomas

--
$/=$,,$_=<DATA>,s,(.*),$1,see;__END__
s,^(.*\043),,mg,@_=map{[split'']}split;{#>J~.>_an~>>e~......>r~
$_=$_[$%][$"];y,<~>^,-++-,?{$/=--$|?'"':#..u.t.^.o.P.r.>ha~.e..
'%',s,(.),\$$/$1=1,,$;=$_}:/\w/?{y,_, ,,#..>s^~ht<._..._..c....
print}:y,.,,||last,,,,,,$_=$;;eval,redo}#.....>.e.r^.>l^..>k^.-
 
