Speed ain't bad

Bulba!

One of the posters inspired me to do some profiling on my newbie script
(pasted below). After taking measurements I found that the speed of
Python, at least in the area where my script works, is surprisingly
high.

This is the experiment: the script recreates a folder hierarchy
somewhere else and stores compressed versions of the files from the
source hierarchy there (the script makes additional backups of a file
server's disk at the company where I work onto other disks, with
compression to save space). The data was:

468 MB, 15057 files, 1568 folders
(machine: win2k, python v2.3.3)

The time that WinRAR v3.20 (with ZIP format and normal compression
set) needed to compress all that was 119 seconds.

The Python script time (running under profiler) was, drumroll...

198 seconds.

Note that the Python script had to laboriously recreate the tree of
1568 folders and create over 15 thousand compressed files, so it
actually had more work to do than WinRAR did. The size of the
compressed data was basically the same, about 207 MB.

I find it very encouraging that in a real-world area of application a
newbie script written in a very high-level language can have
performance not that far from that of a "shrinkwrap" pro archiver
(WinRAR is an excellent archiver, both in compression and in speed).
I do realize that this is mainly the result of all the "underlying
infrastructure" of Python. Great work, guys. Congrats.

The only thing I'm missing in this picture is knowing whether my script
could be further optimised (not that I actually need better
performance; I'm just curious what the possible solutions could be).

Any takers among the experienced guys?



Profiling results:
Fri Dec 31 01:04:14 2004 p3.tmp

580543 function calls (568607 primitive calls) in 198.124 CPU seconds

Ordered by: cumulative time
List reduced from 69 to 40 due to restriction <40>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.013    0.013  198.124  198.124 profile:0(z3())
         1    0.000    0.000  198.110  198.110 <string>:1(?)
         1    0.000    0.000  198.110  198.110 <interactive input>:1(z3)
         1    1.513    1.513  198.110  198.110 zmtree3.py:26(zmtree)
     15057   14.504    0.001  186.961    0.012 zmtree3.py:7(zf)
     15057  147.582    0.010  148.778    0.010 C:\Python23\lib\zipfile.py:388(write)
     15057   12.156    0.001   12.156    0.001 C:\Python23\lib\zipfile.py:182(__init__)
     32002    7.957    0.000    8.542    0.000 C:\PYTHON23\Lib\ntpath.py:266(isdir)
13826/1890    2.550    0.000    8.143    0.004 C:\Python23\lib\os.py:206(walk)
     30114    3.164    0.000    3.164    0.000 C:\Python23\lib\zipfile.py:483(close)
     60228    1.753    0.000    2.149    0.000 C:\PYTHON23\Lib\ntpath.py:157(split)
     45171    0.538    0.000    2.116    0.000 C:\PYTHON23\Lib\ntpath.py:197(basename)
     15057    1.285    0.000    1.917    0.000 C:\PYTHON23\Lib\ntpath.py:467(abspath)
     33890    0.688    0.000    1.419    0.000 C:\PYTHON23\Lib\ntpath.py:58(join)
    109175    0.783    0.000    0.783    0.000 C:\PYTHON23\Lib\ntpath.py:115(splitdrive)
     15057    0.196    0.000    0.768    0.000 C:\PYTHON23\Lib\ntpath.py:204(dirname)
     33890    0.433    0.000    0.731    0.000 C:\PYTHON23\Lib\ntpath.py:50(isabs)
     15057    0.544    0.000    0.632    0.000 C:\PYTHON23\Lib\ntpath.py:438(normpath)
     32002    0.431    0.000    0.585    0.000 C:\PYTHON23\Lib\stat.py:45(S_ISDIR)
     15057    0.555    0.000    0.555    0.000 C:\Python23\lib\zipfile.py:149(FileHeader)
     15057    0.483    0.000    0.483    0.000 C:\Python23\lib\zipfile.py:116(__init__)
       151    0.002    0.000    0.435    0.003 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\framework\winout.py:171(write)
       151    0.002    0.000    0.432    0.003 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\framework\winout.py:489(write)
       151    0.013    0.000    0.430    0.003 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\framework\winout.py:461(HandleOutput)
        76    0.087    0.001    0.405    0.005 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\framework\winout.py:430(QueueFlush)
     15057    0.239    0.000    0.340    0.000 C:\Python23\lib\zipfile.py:479(__del__)
     15057    0.157    0.000    0.157    0.000 C:\Python23\lib\zipfile.py:371(_writecheck)
     32002    0.154    0.000    0.154    0.000 C:\PYTHON23\Lib\stat.py:29(S_IFMT)
        76    0.007    0.000    0.146    0.002 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\framework\winout.py:262(dowrite)
        76    0.007    0.000    0.137    0.002 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\scintilla\formatter.py:221(OnStyleNeeded)
        76    0.011    0.000    0.118    0.002 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\framework\interact.py:197(Colorize)
        76    0.110    0.001    0.112    0.001 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\scintilla\control.py:69(SCIInsertText)
        76    0.079    0.001    0.081    0.001 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\scintilla\control.py:333(GetTextRange)
        76    0.018    0.000    0.020    0.000 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\scintilla\control.py:296(SetSel)
        76    0.006    0.000    0.018    0.000 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\scintilla\document.py:149(__call__)
       227    0.003    0.000    0.012    0.000 C:\Python23\lib\Queue.py:172(get_nowait)
        76    0.007    0.000    0.011    0.000 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\framework\interact.py:114(ColorizeInteractiveCode)
       532    0.011    0.000    0.011    0.000 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\scintilla\control.py:330(GetTextLength)
        76    0.001    0.000    0.010    0.000 C:\PYTHON23\lib\site-packages\Pythonwin\pywin\scintilla\view.py:256(OnBraceMatch)
      1888    0.009    0.000    0.009    0.000 C:\PYTHON23\Lib\ntpath.py:245(islink)


---
Script:

#!/usr/bin/python

import os
import sys
from zipfile import ZipFile, ZIP_DEFLATED

def zf(sfpath, targetdir):
    # Map the source path into the target tree; on Windows, strip the
    # drive letter (e.g. 'C:') first.
    if sys.platform[:3] == 'win':
        tgfpath = sfpath[2:]
    else:
        tgfpath = sfpath
    zfdir = os.path.dirname(os.path.abspath(targetdir) + tgfpath)
    zfpath = zfdir + os.path.sep + os.path.basename(tgfpath) + '.zip'
    if not os.path.isdir(zfdir):
        os.makedirs(zfdir)
    archive = ZipFile(zfpath, 'w', ZIP_DEFLATED)
    sfile = open(sfpath, 'rb')   # unused leftover; see the follow-ups below
    zfname = os.path.basename(tgfpath)
    archive.write(sfpath, os.path.basename(zfpath), ZIP_DEFLATED)
    archive.close()
    ssize = os.stat(sfpath).st_size
    zsize = os.stat(zfpath).st_size
    return (ssize, zsize)


def zmtree(sdir, tdir):
    # Walk the source tree, zipping every file into the mirrored target tree.
    n = 0
    ssize = 0
    zsize = 0
    sys.stdout.write('\n ')
    for root, dirs, files in os.walk(sdir):
        for file in files:
            res = zf(os.path.join(root, file), tdir)
            ssize += res[0]
            zsize += res[1]
            n = n + 1
            #sys.stdout.write('.')
            if n % 200 == 0:
                print " %.2fM (%.2fM)" % (ssize / 1048576.0, zsize / 1048576.0)
                #sys.stdout.write(' ')
    return (n, ssize, zsize)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        if os.path.isdir(sys.argv[1]) and os.path.isdir(sys.argv[2]):
            (n, ssize, zsize) = zmtree(os.path.abspath(sys.argv[1]),
                                       os.path.abspath(sys.argv[2]))
            print "\n\n Summary:" \
                  "\n Number of files compressed: %d" \
                  "\n Total size of original files: %.2fM" \
                  "\n Total size of compressed files: %.2fM" \
                  % (n, ssize / 1048576.0, zsize / 1048576.0)
            sys.exit(0)
        else:
            print "Incorrect arguments."
            if not os.path.isdir(sys.argv[1]):
                print sys.argv[1] + " is not a directory."
            if not os.path.isdir(sys.argv[2]):
                print sys.argv[2] + " is not a directory."

    print "\n Usage:\n " + sys.argv[0] + " source-directory target-directory"
 
Jeremy Bowers

Bulba! said:
One of the posters inspired me to do some profiling on my newbie script
(pasted below). After taking measurements I found that the speed of
Python, at least in the area where my script works, is surprisingly high.

This is the experiment: the script recreates a folder hierarchy somewhere
else and stores compressed versions of the files from the source
hierarchy there (the script makes additional backups of a file server's
disk at the company where I work onto other disks, with compression to
save space).

I did not study your script but odds are it is strongly disk bound.

This means that the disk access time is so large that it completely swamps
almost everything else.
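
One quick way to check that guess (a sketch of mine, not from the
thread): time a bare read pass over the same tree and compare it with
the 198-second total. Bear in mind that the profile above attributes
most of the time to zipfile's write(), which does both the compression
and the disk writes.

    # Sketch: how long does just reading every file take?
    # If this is close to the script's total runtime, it is disk bound.
    import os
    import time

    def read_pass(sdir):
        t0 = time.time()
        nbytes = 0
        for root, dirs, files in os.walk(sdir):
            for name in files:
                f = open(os.path.join(root, name), 'rb')
                try:
                    while True:
                        chunk = f.read(65536)   # read in 64 KB chunks
                        if not chunk:
                            break
                        nbytes += len(chunk)
                finally:
                    f.close()
        return nbytes, time.time() - t0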

I would point out a couple of other ideas, though you may be aware of
them: compressing all the files separately, if they are small, may
greatly reduce the overall compression, since similarities between the
files cannot be exploited. You may not care. Also, the "zip" format can
be updated on a file-by-file basis; it may do all by itself what you
are trying to do, with just a single command line. Just a thought.
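
For what it's worth, here is a minimal sketch of that file-by-file
update idea using the stdlib (my illustration, not part of the original
post; the archive name and file names are made up):

    from zipfile import ZipFile, ZIP_DEFLATED

    # Create the archive once ('backup.zip' is a hypothetical example).
    archive = ZipFile('backup.zip', 'w', ZIP_DEFLATED)
    archive.close()

    # Later runs can add members one at a time with append mode.
    archive = ZipFile('backup.zip', 'a', ZIP_DEFLATED)
    try:
        for path in ['report.txt', 'data.csv']:
            archive.write(path)   # each member is compressed independently
    finally:
        archive.close()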
 
Craig Ringer

Jeremy said:
I would point out a couple of other ideas, though you may be aware of
them: compressing all the files separately, if they are small, may
greatly reduce the overall compression, since similarities between the
files cannot be exploited.

True; however, it's my understanding that compressing individual files
also means that in the case of damage to the archive it is possible to
recover the files after the damaged file. This cannot be guaranteed when
the archive is compressed as a single stream.
 
Reinhold Birkenfeld

Craig said:
True; however, it's my understanding that compressing individual files
also means that in the case of damage to the archive it is possible to
recover the files after the damaged file. This cannot be guaranteed when
the archive is compressed as a single stream.

With gzip, you can forget the entire rest of the stream; with bzip2,
there is a good chance that nothing more than one block (100-900k) is lost.

regards,
Reinhold
 
Bulba!

Reinhold said:
With gzip, you can forget the entire rest of the stream; with bzip2,
there is a good chance that nothing more than one block (100-900k) is lost.

I actually wrote a version of that script with bzip2, but it was so
horribly slow that I chose the zip version.
 
Bulba!

Jeremy said:
I would point out a couple of other ideas, though you may be aware of
them: compressing all the files separately, if they are small, may
greatly reduce the overall compression, since similarities between the
files cannot be exploited. You may not care.

The problem is about easy recovery of individual files, plus storing
and not deleting the older versions of files for some time (users of
the file servers tend to come around crying "I accidentally deleted
this important file I created a week ago, where can I find it?").

The way it is done, I can expose the directory hierarchy as read-only
to users and they can get the damn file themselves; they just need to
unzip it. If they had to search through one huge zipfile to find it,
that could be a problem for them.
 
Bulba!

Reinhold said:
With gzip, you can forget the entire rest of the stream; with bzip2,
there is a good chance that nothing more than one block (100-900k) is lost.

A "good chance" sometimes is unacceptable -- I have to have a
guarantee that, as long as the hardware isn't broken, a user can
recover that old file. We've even thought about storing uncompressed
directory trees, but holding them would consume too much disk space.
Hence compression had to be used.

(initially, that was just a shell script, but the whitespace and
strange characters that users love to enter into filenames break
just too many shell tools)
 
Paul Rubin

Bulba! said:
The only thing I'm missing in this picture is knowing whether my script
could be further optimised (not that I actually need better
performance; I'm just curious what the possible solutions could be).

Any takers among the experienced guys?

There's another compression program called LHN which is supposed to be
quite a bit faster than gzip, though with somewhat worse compression.
I haven't gotten around to trying it.
 
Paul Rubin

Bulba! said:
A "good chance" sometimes is unacceptable -- I have to have a
guarantee that, as long as the hardware isn't broken, a user can
recover that old file.

Well, we're talking about an archive that's been damaged, whether by
software or hardware. That damage isn't supposed to happen, but
sometimes it does anyway.

Bulba! said:
We've even thought about storing uncompressed directory trees, but
holding them would consume too much disk space. Hence compression
had to be used.

If these are typical files, compression gets you maybe 2:1 shrinkage,
much less on larger files (e.g. multimedia files), which tend to be
incompressible. Disk space is cheap these days; buy more drives.

Bulba! said:
(initially, that was just a shell script, but the whitespace and
strange characters that users love to enter into filenames break
just too many shell tools)

I didn't look at your script, but why not just use Info-ZIP?
 
Bulba!

Paul said:
I didn't look at your script, but why not just use Info-ZIP?

Because I need the users to be able to access the folder tree with the
old versions of files, not one big zipfile -- which they can't search
for their old files using the standard Windows Explorer, for instance.
 
Anders J. Munch

Bulba! said:
One of the posters inspired me to do some profiling on my newbie script
(pasted below). After taking measurements I found that the speed of
Python, at least in the area where my script works, is surprisingly
high.

Pretty good code for someone who calls himself a newbie.

One line that puzzles me:

    sfile = open(sfpath, 'rb')

You never use sfile again.
In any case, you should explicitly close all files that you open, even
if there's an exception:

    sfile = open(sfpath, 'rb')
    try:
        <stuff to do with the open file>
    finally:
        sfile.close()
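
Applied to the script above, the ZipFile deserves the same treatment
(a minimal sketch of mine reusing the names from zf(); it is not code
from the original post):

    # Guarantee the archive is closed even if write() raises.
    archive = ZipFile(zfpath, 'w', ZIP_DEFLATED)
    try:
        archive.write(sfpath, os.path.basename(zfpath), ZIP_DEFLATED)
    finally:
        archive.close()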
Bulba! said:
The only thing I'm missing in this picture is knowing whether my script
could be further optimised (not that I actually need better
performance; I'm just curious what the possible solutions could be).

Any takers among the experienced guys?

Basically the way to optimise these things is to cut down on anything
that does I/O: use as few calls to os.path.is{dir,file}, os.stat, open
and such as you can get away with.

One way to do that is caching; e.g. storing the names of known
directories in a set (sets.Set()) and checking that set before calling
os.path.isdir. I haven't spotted any obvious opportunities for that
in your script, though.
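
To illustrate the caching idea anyway (my sketch, not Anders' code; the
helper name is made up):

    import os
    import sets   # Python 2.3-era module; later Pythons have the builtin set()

    _known_dirs = sets.Set()

    def isdir_cached(path):
        # Consult the in-memory set first, so each directory costs at
        # most one real os.path.isdir() call over the whole run.
        if path in _known_dirs:
            return True
        if os.path.isdir(path):
            _known_dirs.add(path)
            return True
        return False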

Another way is the strategy of "it's easier to ask forgiveness than to
ask permission".
If you replace:

    if not os.path.isdir(zfdir):
        os.makedirs(zfdir)

with:

    try:
        os.makedirs(zfdir)
    except EnvironmentError:
        pass

then not only will your script become a micron more robust, but
assuming zfdir typically does not exist, you will have saved the call
to os.path.isdir.

- Anders
 
Bulba!

Anders said:
Pretty good code for someone who calls himself a newbie.

One line that puzzles me:

    sfile = open(sfpath, 'rb')

You never use sfile again.

Right! It's a leftover from a previous implementation (that
used bzip2). Forgot to delete it, thanks.
Anders said:
Another way is the strategy of "it's easier to ask forgiveness than to
ask permission".
If you replace:

    if not os.path.isdir(zfdir):
        os.makedirs(zfdir)

with:

    try:
        os.makedirs(zfdir)
    except EnvironmentError:
        pass

then not only will your script become a micron more robust, but
assuming zfdir typically does not exist, you will have saved the call
to os.path.isdir.

Yes, this is the kind of habit that low-level languages like C,
missing features like exceptions, ingrain in the mind of a
programmer...

Getting out of this straitjacket is kind of hard -- it would not have
crossed my mind to try something like what you showed me, thanks!

Exceptions in Python are a GODSEND. I strongly recommend that any
former C programmer wanting to get rid of the "straitjacket" read the
following to get an idea of how not to write C code in Python and
instead exploit the better side of a VHLL:

http://gnosis.cx/TPiP/appendix_a.txt
 
Jeff Shannon

Anders said:
Another way is the strategy of "it's easier to ask forgiveness than to
ask permission".
If you replace:

    if not os.path.isdir(zfdir):
        os.makedirs(zfdir)

with:

    try:
        os.makedirs(zfdir)
    except EnvironmentError:
        pass

then not only will your script become a micron more robust, but
assuming zfdir typically does not exist, you will have saved the call
to os.path.isdir.

... at the cost of an exception frame setup and an incomplete call to
os.makedirs(). It's an open question whether the exception setup and
recovery take less time than the call to isdir(), though I'd expect
probably not. The exception route definitely makes more sense if the
makedirs() call is likely to succeed; if it's likely to fail, then
things are murkier.

Since isdir() *is* a disk I/O operation, the exception route is
probably preferable here anyhow. In either case one must touch the
disk; in the exception case there will only ever be one disk access
(which either succeeds or fails), while in the other case there may be
two. If it weren't for the extra disk I/O, though, the 'if ...'
version might be slightly faster, even though the exception-based
route is more Pythonic.

Jeff Shannon
Technician/Programmer
Credit International
 
John Machin

Anders said:
Another way is the strategy of "it's easier to ask forgiveness than to
ask permission".
If you replace:

    if not os.path.isdir(zfdir):
        os.makedirs(zfdir)

with:

    try:
        os.makedirs(zfdir)
    except EnvironmentError:
        pass

then not only will your script become a micron more robust, but
assuming zfdir typically does not exist, you will have saved the call
to os.path.isdir.

1. Robustness: Both versions will "crash" (in the sense of an unhandled
exception) in the situation where zfdir exists but is not a directory.
The revised version just crashes later than the OP's version :-(
Trapping EnvironmentError seems not very useful -- the result will not
distinguish (on Windows 2000 at least) between the 'existing dir' and
'existing non-directory' cases.


Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
>>> os.makedirs('fubar_not_dir')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "c:\Python24\lib\os.py", line 159, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: 'fubar_not_dir'
>>> try:
...     os.mkdir('fubar_not_dir')
... except EnvironmentError:
...     print 'trapped env err'
...
trapped env err

2. Efficiency: I don't see the disk I/O inefficiency in calling
os.path.isdir() before os.makedirs() -- if the relevant part of the
filesystem wasn't already in memory, the isdir() call would make it so,
and makedirs() would get a free ride, yes/no?
 
Anders J. Munch

John Machin said:
1. Robustness: Both versions will "crash" (in the sense of an unhandled [...]
2. Efficiency: I don't see the disk I/O inefficiency in calling [...]

3. Don't itemise perceived flaws in other people's postings. It may
give off a hostile impression.
John Machin said:
1. Robustness: Both versions will "crash" (in the sense of an unhandled
exception) in the situation where zfdir exists but is not a directory.
The revised version just crashes later than the OP's version :-(
Trapping EnvironmentError seems not very useful -- the result will not
distinguish (on Windows 2000 at least) between the 'existing dir' and
'existing non-directory' cases.

Good point; my version has room for improvement. But at least it fixes
the race condition between isdir and makedirs.

What I like about EnvironmentError is that it's easier to use than
figuring out which of IOError or OSError applies (and whether that
can be relied on cross-platform).
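
One way to improve it on both counts (a sketch of my own, with a
hypothetical helper name; this is not code that appeared earlier in
the thread):

    import errno
    import os

    def ensure_dir(path):
        # EAFP without the isdir/makedirs race: trap only 'already
        # exists', then re-raise unless the path really is a directory.
        try:
            os.makedirs(path)
        except OSError, e:
            if e.errno != errno.EEXIST or not os.path.isdir(path):
                raise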
John Machin said:
2. Efficiency: I don't see the disk I/O inefficiency in calling
os.path.isdir() before os.makedirs() -- if the relevant part of the
filesystem wasn't already in memory, the isdir() call would make it
so, and makedirs() would get a free ride, yes/no?

Perhaps. Looking stuff up in operating system tables and buffers takes
time too. And then there's network latency; how much local caching do
you get for an NFS mount or SMB share?

If you really want to know, measure.
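
For instance, something along these lines with the stdlib timeit
module (my sketch; 'tmp_bench_dir' is a made-up path, and the numbers
will vary wildly between local disks and network mounts):

    # Rough micro-benchmark sketch: LBYL vs EAFP directory creation,
    # exercising the 'directory already exists' path after the first call.
    import errno
    import os
    import shutil
    import timeit

    def lbyl():
        # look before you leap: explicit isdir() check
        if not os.path.isdir('tmp_bench_dir'):
            os.makedirs('tmp_bench_dir')

    def eafp():
        # easier to ask forgiveness: just try it, trap 'already exists'
        try:
            os.makedirs('tmp_bench_dir')
        except OSError, e:
            if e.errno != errno.EEXIST:
                raise

    for name in ('lbyl', 'eafp'):
        timer = timeit.Timer('%s()' % name,
                             'from __main__ import %s' % name)
        print name, min(timer.repeat(3, 1000))

    shutil.rmtree('tmp_bench_dir')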

- Anders
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,581
Members
45,056
Latest member
GlycogenSupporthealth

Latest Threads

Top