speed problems

Axel Scheepers

Hi group,

I became interested in Python a while ago and have just converted a simple
Perl script to Python. The script is very simple: it generates a list of
viruses found in some maillog files, for further processing.
I've found that there's a huge difference in execution time between the
scripts, in favor of Perl, and I can't pinpoint what's going wrong.
Perl runs:
0.07 real 0.05 user 0.01 sys
0.07 real 0.05 user 0.01 sys
0.07 real 0.04 user 0.02 sys
Python runs:
0.27 real 0.23 user 0.03 sys
0.28 real 0.21 user 0.05 sys
0.27 real 0.19 user 0.06 sys

This was measured with a small uncompressed logfile (1.4M). The difference
grows much bigger whenever it needs to uncompress things.

Here are both scripts, could you please have a look and tell me where I
should look for optimizations?

perl:
my (@maillogs) = (
    "/home/logs/maillog",      "/home/logs/maillog.0.gz",
    "/home/logs/maillog.1.gz", "/home/logs/maillog.2.gz",
    "/home/logs/maillog.3.gz",
);

my ($gzip)  = "/usr/bin/gzip";
my ($bzip2) = "/usr/bin/bzip2";

my ($total) = 0.0;
my (%virstat);

foreach my $logfile (@maillogs)
{
    if ( -f $logfile )
    {
        # is it compressed?
        if ( $logfile =~ /\.[bg]z2?$/ )
        {
            if ( !open LF, "$gzip -cd $logfile|" )
            {
                open LF, "$bzip2 -cd $logfile|" or
                    die "unable to uncompress '$logfile'\n";
            }
        }
        else
        {
            open LF, "<$logfile" or die "couldn't open '$logfile'\n";
        }

        while (<LF>)
        {
            if (/INFECTED/)
            {
                # we need only the virus name
                $_ =~ s/.*INFECTED.*\((.*)\).*/$1/g;
                # if multiple viruses found
                if (/, /)
                {
                    # split them
                    my (@vir) = split /, /, $_;
                    foreach my $v (@vir)
                    {
                        chomp $v;
                        $virstat{$v}++;
                        $total++;
                    }
                }
                else
                {
                    chomp;
                    $virstat{$_}++;
                    $total++;
                }
            }
        }
        close LF;
    }
#   else
#   {
#       print STDERR "'$logfile' doesn't exist, skipping it.\n";
#   }
}

foreach my $v (sort keys %virstat)
{
    my $p = ($virstat{$v}/$total)*100;
    $p = sprintf "%s:\t%5.2f%%", $v, $p;
    print "$p\n";
}
#--- end of perl script ---

python:
import os
import string
import re

maillogs = [
    "/home/logs/maillog", "/home/logs/maillog.0.gz",
    "/home/logs/maillog.1.gz", "/home/logs/maillog.2.gz",
    "/home/logs/maillog.3.gz"
]
virstat = {}
total = 0.0  # keep this float

for logfile in maillogs:
    if os.path.isfile( logfile ):
        # is it compressed?
        if logfile[-3:] == '.gz':
            import gzip
            lf = gzip.GzipFile( logfile, "r" )
        else:
            if logfile[-4:] == '.bz2':
                import bz2
                lf = bz2.BZ2File( logfile, "r" )
            else:
                # uncompressed
                lf = open( logfile, "r" )

        for line in lf.readlines():
            if string.count( line, "INFECTED" ):
                vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)
                if string.count( vname, ", " ):
                    for vnam in string.split( vname, ", " ):
                        if vnam not in virstat:
                            virstat[vnam] = 1
                        else:
                            virstat[vnam] += 1
                        total += 1
                else:
                    if vname not in virstat:
                        virstat[vname] = 1
                    else:
                        virstat[vname] += 1
                    total += 1
        lf.close()
    # else:
    #     print "logfile '%s' doesn't exist, skipping it." % logfile

for vname in virstat.keys():
    p = (virstat[vname]/total)*100
    print "%s: %5.2f%%" % (vname, p)
#--- End of python script ---


Thanks for any help you can provide,
Kind regards,

Axel
 

Steve Lamb

Here are both scripts, could you please have a look and tell me where I
should look for optimizations?

Well, I see one major difference and one place I'd do something
differently.

Perl:
my ($gzip) = "/usr/bin/gzip";
my ($bzip2)= "/usr/bin/bzip2";

First off, you're using external programs here for decompression. This is a
trade-off between making a system call and using an internal implementation.
Maybe Python's implementation is slower? I don't know, I'm just pointing out
that it is a difference. Personally, when programming tools like this, I try
to keep everything internal because I've had endless system calls kill the
run-time. However, with the few files you're iterating over, the cost might
be the other way 'round. :)

Python:
for line in lf.readlines():
    if string.count( line, "INFECTED" ):
        vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)

If I read this correctly, you're compiling this regex every time you go
through the for loop. So on every line the regex is compiled again. You
might want to compile the regex outside the loop and only use the compiled
version inside the loop.
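
Something like this, roughly (a sketch; infected_pat is just an illustrative
name, not from Axel's script):

import re
import string

infected_pat = re.compile( "INFECTED \((.*)\)" )   # compiled once, up front
lf = open( "/home/logs/maillog", "r" )             # path from Axel's script
for line in lf.readlines():
    if string.count( line, "INFECTED" ):
        # reuse the compiled pattern; no per-line compile
        vname = infected_pat.search( line ).group(1)
lf.close()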

I *think* Perl caches compiled regexes, which is why it doesn't have two
different ways of calling a regex, while Python, in giving two different
calls to the regex, will compile it every time if you expressly call for a
compile. Again, just a guess based on how I presume the languages work and
how I'd write them differently.
 

Jacek Generowicz

Axel Scheepers said:
could you please [...] tell me where I should look for
optimizations?

import profile
help(profile)

import hotshot
help(hotshot)

(Teach a man to fish ... and all that :)
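
For instance, a minimal sketch (assuming the script body has been wrapped
in a function, hypothetically called main() here):

import profile

def main():
    # hypothetical stand-in for the whole converted script body
    pass

profile.run('main()')   # executes main() and prints per-function call counts and times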
 

Jeff Epler

In addition to the items Steve Lamb noted, I have a few suggestions:

Place the whole script in a function and call it. This will give you an
immediate speedup of a few percent, because looking up names that are
local to a function is faster than looking up names at the module level.
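
A minimal sketch of what I mean (main() is a hypothetical name, not from
the original script):

def main():
    virstat = {}   # locals now: function locals are looked up faster than globals
    total = 0.0
    # ... rest of the original script body, indented one level ...

main()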
for line in lf.readlines():
Unless the bzip2 or gzip modules don't support it, you should write
for line in lf:
instead. This is likely to improve memory consumption, and may improve
the program speed too.
if string.count( line, "INFECTED" ):
    vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)

Unless you arrived at this two-step process through profiling, it's
probably better to write

m = infected_rx.search(line)
if m:
    vname = m.group(1)
    ...
if string.count( vname, ", " ):
    for vnam in string.split( vname, ", " ): [...]
else:
else:

If there is no ", " in vname, the split will produce a single-item list.
Also, there's no reason to use the "string" module anymore, as opposed to
string methods. Finally, splitting on single characters is likely to be
optimized, but I'm not sure.

I'd just use

for vnam in vname.split(","):
    vnam = vnam.strip()
    if vnam not in virstat:
        virstat[vnam] = 1
    else:
        virstat[vnam] += 1

You have several alternatives here:

try:
    virstat[vnam] += 1
except KeyError:
    virstat[vnam] = 1

or

virstat[vnam] = virstat.get(vnam, 0) + 1

Jeff

 

Axel Scheepers

First off, you're using external programs here for decompression. This is a
trade-off between making a system call and using an internal implementation.
Maybe Python's implementation is slower? I don't know, I'm just pointing out
that it is a difference. Personally, when programming tools like this, I try
to keep everything internal because I've had endless system calls kill the
run-time. However, with the few files you're iterating over, the cost might
be the other way 'round. :)

I'll be looping over these files only, but I thought using Python's gzip
module would be faster than spawning gzip itself the way I did in the Perl
script.

Python:
vname = re.compile( "INFECTED \((.*)\)" ).search( line ).group(1)

If I read this correctly, you're compiling this regex every time you go
through the for loop. So on every line the regex is compiled again. You
might want to compile the regex outside the loop and only use the compiled
version inside the loop.

Well, only for lines containing 'INFECTED', then. Good point. (I suddenly
remember some C code in which it made a huge difference.) I've placed it
outside the loop now, but the times are still the same.

Another difference might be while( <filehandle> ) versus line in
lf.readlines(). The latter reads the whole file into memory if I'm not
mistaken, whereas the former reads the file line by line. Why that could
make such a difference I don't know.

Thanks for your quick reply,
Kind regards,

Axel
 

Heiko Wundram

On Thursday, 3 June 2004, at 17:04, Axel Scheepers wrote:
Another difference might be while( <filehandle>) and line in
lf.readlines().

In Python, while( <FH> ) becomes:

for line in fh:
    <something>

This will truly iterate over the lines of the file, not preload anything into
memory.

HTH!

Heiko.
 

Axel Scheepers

Jacek Generowicz said:
Axel Scheepers said:
could you please [...] tell me where I should look for
optimizations?

import profile
help(profile)

import hotshot
help(hotshot)

(Teach a man to fish ... and all that :)

:)

That's kewl!
Thanks!

Regards,
Axel
 

Roel Schroeven

Axel said:
Another difference might be while( <filehandle> ) versus line in
lf.readlines(). The latter reads the whole file into memory if I'm not
mistaken, whereas the former reads the file line by line. Why that could
make such a difference I don't know.

line in lf also reads the file line by line, if you're using 2.3. In 2.1
or 2.2 you can use xreadlines for that. I don't know if it makes any
difference in performance, though, for small files.
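
For example (a sketch for the older Pythons; whether the gzip/bz2 file
objects support it is another matter):

lf = open( "/home/logs/maillog", "r" )
for line in lf.xreadlines():   # lazy, line-by-line; no full read into memory
    pass                       # handle the line
lf.close()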
 

Grant Edwards

line in lf also reads the file line by line, if you're using 2.3. In 2.1
or 2.2 you can use xreadlines for that. I don't know if it makes any
difference in performance, though, for small files.

I would suspect that reading the entire file at once would yield
slightly better performance for non-huge files.
 

Duncan Booth

Python:

If I read this correctly, you're compiling this regex every time you go
through the for loop. So on every line the regex is compiled again. You
might want to compile the regex outside the loop and only use the
compiled version inside the loop.

I *think* Perl caches compiled regexes, which is why it doesn't have two
different ways of calling a regex, while Python, in giving two different
calls to the regex, will compile it every time if you expressly call for
a compile. Again, just a guess based on how I presume the languages work
and how I'd write them differently.

No, Python caches the calls that compile the regex, so you won't get much
speed difference unless you have enough different regexes to overflow the
cache. Pulling the compile out of the loop is a good idea on general
principles, though.
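
To illustrate the two call styles (a sketch; the sample line is made up):

import re

line = "... INFECTED (Worm.SomeFool.P) ..."   # made-up sample line

pat = re.compile("INFECTED")          # explicit: compiled exactly once
print pat.search(line) is not None

# module-level form: compiled on first use, then served from re's
# internal cache on subsequent calls with the same pattern
print re.search("INFECTED", line) is not None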

The code you quoted does have one place to optimise: using readlines,
especially on a large file will be *much* slower than just iterating over
the file object directly.

i.e. use

for line in lf:
    ... whatever ...

Some other things that could be improved (although I suspect the real
problem was calling readlines):

The original code posted uses functions from the string module. Using
string methods instead ought to be faster, e.g. line.count("INFECTED")
instead of string.count(line, "INFECTED")

Use
if logfile.endswith('.gz'):
instead of:
if logfile[-3:] == '.gz':

Use:
if "INFECTED" in line:
instead of calling line.count

I don't understand why the inner loop needs two cases, one for vname
containing a ', ' and one where it doesn't. It looks to me as though the
code could just split whether or not there is a comma. If there isn't one,
split just returns a one-element list containing the original string.
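
A quick interactive check (virus names made up):

>>> "Worm.X".split(", ")          # separator absent: one-element list
['Worm.X']
>>> "Worm.X, Worm.Y".split(", ")
['Worm.X', 'Worm.Y']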

Untested revised code:

INFECTEDPAT = re.compile( "INFECTED \((.*)\)" )
for line in lf:
    if "INFECTED" in line:
        vname = INFECTEDPAT.search(line).group(1)
        for vnam in vname.split(", "):
            if vnam not in virstat:
                virstat[vnam] = 1
            else:
                virstat[vnam] += 1
            total += 1
lf.close()
 

Dennis Lee Bieber

for logfile in maillogs:
    if os.path.isfile( logfile ):
        # is it compressed?
        if logfile[-3:] == '.gz':
            import gzip

It's probably only costing a few milliseconds (vs a reload() ),
but why not move the import (and the next one too) to the top of the
program, rather than having the interpreter do the internal lookup only
to discover that it may already have imported the module...
            lf = gzip.GzipFile( logfile, "r" )
        else:
            if logfile[-4:] == '.bz2':
                import bz2

 

Antonio Cavallo

I became interested in Python a while ago and have just converted a simple
Perl script to Python.
I've found that there's a huge difference in execution time between the
scripts, in favor of Perl, and I can't pinpoint what's going wrong.

I had the same problem dealing with a large (compressed) file using
Python vs C++ (using the gzip library to open/read a file): the
results were in favour of Python against C++ this time ;)

I think the problem is in the flow:

gzip -> file/pipe -> perl
file -> zlib -> python

Decompression through zlib is far slower because it is carried out in
data chunks: there is no way to control where a chunk will terminate
(such as at an EOL).
Try using: zcat <myfile> | myprogram.py
In my case it solved all the puzzling speed problems: zcat unconditionally
dumps the data regardless of where the EOL happens, so the upper layer
does not have to wait for the next chunk to be decompressed.
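
In script terms, the pipe approach looks roughly like this (a sketch):

#!/usr/bin/python
# run as:  zcat maillog.0.gz | ./myprogram.py
# zcat does the decompression; the script just reads plain lines
import sys

for line in sys.stdin:
    if "INFECTED" in line:
        pass  # extract the virus name as before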
regards,
antonio cavallo
 

Neal Holtz

Axel Scheepers said:
Hi group,

I've become interested in Python a while ago and just converted a simple
perl script to python. The script is very simple, it generates a list of
found virusses from some maillog files for further processing.
I've found that there's a huge difference in execution time for the scripts,
in favor of perl and I can't pinpoint what's going wrong;
. . .
Thanks for any help you can provide,
Kind regards,

Axel

I've halved the Python time on my test by changing the entire inner loop
to:

pat = re.compile( "MALWARE:\s+(.*)" )
for line in lf:
    mo = pat.search( line )
    if mo:
        for vnam in mo.group(1).split( ", "):
            virstat[vnam] = virstat.get(vnam,0) + 1
            total += 1
lf.close()

(with changes for my logfile format)

Of course, it's no longer the same as the Perl version, and
the Perl version would also probably benefit from something
similar.
 

Roy Smith

I've halved the Python time on my test by changing the entire inner loop
to:

pat = re.compile( "MALWARE:\s+(.*)" )
for line in lf:
    mo = pat.search( line )
    if mo:
        for vnam in mo.group(1).split( ", "):
            virstat[vnam] = virstat.get(vnam,0) + 1
            total += 1
lf.close()

A few random thoughts...

1) How often is mo true? In other words, what percentage of the lines
in the file match the pattern? If it's very high or very low, that
might give you some ideas about where to look. Running the profiler will
help a lot too!

2) It's probably a minor tweak, but you could factor out the
"pat.search" name lookup cost by doing something like:

pat = re.compile( "MALWARE:\s+(.*)" )
patSearch = pat.search
for line in lf:
    mo = patSearch(line)
    ....

3) There's a certain amount of duplication of effort going on between
the regex search and the split; two passes over the same data. Is it
possible to write a regex which parses it all in one pass? Of course,
if the answer to question #1 is "very low percentage", then this is
probably not the place to be looking.

4) What does virstat.get do?

Lastly, a general programming style comment. I'm a fan of longish
variable names which describe what they're doing. I had to think a bit
to figure out that vnam probably stands for "virus name". It would have
been easier for me to figure out if you named the variable something
like virusName (or even virus_name, if you're an underscore fan). Same
with the other variable names.
 

Hans-Peter Jansen

Hi Axel & Pythoneers,

I played around with your scripts, and my winner got a bit longer than
usual. I hope the important part didn't suffer too much, but thanks to a
cheat, this one is faster than your original Perl script, even with
profiling and annotations enabled! Talking 'bout the latter: it's
shamelessly stolen from a Zope check-in by our master, because I couldn't
get hotshot to produce useful per-line statistics out of the box.
http://mail.zope.org/pipermail/zope-cvs/2002-May/001035.html

Well, although I have to admit that Perl seems faster on this specific
task (since the grep cheat would work for Perl too), I would never
consider such a move; just try to do this with Perl:

---8<--- [virstat.py] ---8<---
#!/usr/bin/python

import os
import re

maillogs = [
    "mail",
    "mail-20040600.gz",
    "mail-20040604.gz",
    "mail-20040607.gz",
    "mail-20040610.gz"
]

#gzip = "/usr/bin/gzip -dc"
#bzip2 = "/usr/bin/bzip2 -dc"
gzip = "/usr/bin/zcat"
bzip2 = "/usr/bin/bzcat"
virstat = {}
total = 0
doprof = 1
pat = re.compile( "INFECTED \((.*)\)" )

def dovirstat():
    global virstat, total
    for logfile in maillogs:
        if os.path.isfile(logfile):
            # is it compressed?
            if logfile.endswith('.gz'):
                #ifd, lfd = os.popen2("%s %s" % (gzip, logfile))
                #XXX: cheating
                ifd, lfd = os.popen2("%s %s | grep INFECTED" % (gzip, logfile))
            elif logfile.endswith('.bz2'):
                #ifd, lfd = os.popen2("%s %s" % (bzip2, logfile))
                #XXX: cheating
                ifd, lfd = os.popen2("%s %s | grep INFECTED" % (bzip2, logfile))
            else:
                # uncompressed
                lfd = open(logfile, "r")

            # hot loop
            for line in lfd:
                mo = pat.search(line)
                if mo:
                    for vnam in mo.group(1).split( ", "):
                        virstat[vnam] = virstat.get(vnam, 0) + 1
                        total += 1
            lfd.close()
        # else:
        #     print "logfile '%s' doesn't exist, skipping it." % logfile

def load_line_info(log):
    byline = {}
    prevloc = None
    for what, place, tdelta in log:
        if tdelta > 0:
            t, nhits = byline.get(prevloc, (0, 0))
            byline[prevloc] = (tdelta + t), (nhits + 1)
        prevloc = place
    return byline

def basename(path, cache={}):
    try:
        return cache[path]
    except KeyError:
        fn = os.path.split(path)[1]
        cache[path] = fn
        return fn

def print_results(results):
    for info, place in results:
        if not place:
            print 'Bad unpack:', info, place
            continue
        filename, line, funcname = place
        print '%8d %8d' % info, basename(filename), line

def annotate_results(results):
    files = {}
    for stats, place in results:
        if not place:
            continue
        time, hits = stats
        file, line, func = place
        l = files.get(file)
        if l is None:
            l = files[file] = []
        l.append((line, hits, time))
    order = files.keys()
    order.sort()
    for k in order:
        if os.path.exists(k):
            v = files[k]
            v.sort()
            annotate(k, v)

def annotate(file, lines):
    print "-" * 60
    print file
    print "-" * 60
    f = open(file)
    i = 1
    match = lines[0][0]
    for line in f:
        if match == i:
            print "%6d %8d " % lines[0][1:], line,
            del lines[0]
            if lines:
                match = lines[0][0]
            else:
                match = None
        else:
            print " " * 16, line,
        i += 1
    print

if not doprof:
    dovirstat()
else:
    import hotshot
    prof = hotshot.Profile("virstat.prof", lineevents=1)
    prof.runcall(dovirstat)
    prof.close()

vlist = virstat.keys()
vlist.sort()
for vname in vlist:
    p = (virstat[vname] / float(total)) * 100
    print "%-30s %5.2f%%" % (vname, p)
print

if doprof:
    from hotshot.log import LogReader

    log = LogReader("virstat.prof")
    byline = load_line_info(log)
    results = [(v, k) for k, v in byline.items() if k and k[0] == 'virstat.py']
    results.sort()
    #print_results(results)
    annotate_results(results)
--->8---

Python programming is not only an easy way to get necessary work done;
at its best it combines art and science in an aesthetic manner.

Pete
 

Martin Maney

Hans-Peter Jansen said:
if logfile.endswith('.gz'):
    #ifd, lfd = os.popen2("%s %s" % (gzip, logfile))
    #XXX: cheating
    ifd, lfd = os.popen2("%s %s | grep INFECTED" % (gzip, logfile))
elif logfile.endswith('.bz2'):
    #ifd, lfd = os.popen2("%s %s" % (bzip2, logfile))
    #XXX: cheating
    ifd, lfd = os.popen2("%s %s | grep INFECTED" % (bzip2, logfile))
else:
    # uncompressed
    lfd = open(logfile, "r")

Why stop there? You've left it on the verge of collapsing into the fully
reduced (and regularized) form:

if logfile.endswith('.gz'):
    cat_command = 'zcat'
elif logfile.endswith('.bz2'):
    cat_command = 'bzcat'
else:
    cat_command = 'cat'
ifd, lfd = os.popen2("%s %s | grep INFECTED" % (cat_command, logfile))

(for that matter, is there some reason to use popen2 and the
unnecessary ifd?)
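
Something like this, I mean (a sketch; the path and command are just
examples from the thread):

import os

logfile = "/home/logs/maillog.0.gz"
cat_command = "zcat"
# os.popen returns just the child's stdout, so the unused ifd disappears
lfd = os.popen("%s %s | grep INFECTED" % (cat_command, logfile))
for line in lfd:
    pass  # process the line
lfd.close()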

I've found it advantageous to preprocess large inputs with grep - the
tens of MB of squid logs that are skimmed by a useful little CGI script
really benefited from that! Python's raw I/O may be as good as
anything, but for line by line parsing where a majority of the (many)
lines are discarded, a grep prefilter is a big win.

Which may or may not bring us back to... well, it's not a corollary to
Steve Lamb's guideline for using shell script, though it's clearly
related. Maybe it's the contrapositive. Never get so wrapped up in
using that Python hammer that you forget about those carefully honed
specialized tools. Use that one line of shell to the best effect!
<grin>
 
