pickle.load() extremely slow performance


Jim Garrison

I'm converting a Perl system to Python, and have run into a severe
performance problem with pickle.

One facet of the system involves scanning and loading into memory a
couple of parallel directory trees containing on the order of 10^4 files. The
trees don't change during development/testing and the scan takes 30-40
seconds, so to save time I cache the loaded tree structure to disk, in
Perl with module Storable, and in Python with pickle.

In Perl, the save operation produces a file of about 3MB, and both
save and restore take a second or two. In Python, pickle.dump()
produces a similar-size file but takes 20 seconds, and pickle.load()
takes 45 seconds, which is actually LONGER than the time required to
scan the directory trees.

Is there anything I can do to speed up pickle.load() to get
performance comparable to Perl's Storable?
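
(Roughly, the caching pattern is the one sketched below; build_tree, load_tree, and tree_cache.pkl are illustrative names, not the actual code.)

import os
import pickle

CACHE = 'tree_cache.pkl'   # made-up cache file name

def build_tree(root):
    """Stand-in for the real scan: map each directory to its file names."""
    return {dirpath: filenames for dirpath, _, filenames in os.walk(root)}

def load_tree(root):
    # Reuse the cached scan if present; otherwise scan once and cache it.
    if os.path.exists(CACHE):
        with open(CACHE, 'rb') as f:
            return pickle.load(f)
    tree = build_tree(root)
    with open(CACHE, 'wb') as f:
        pickle.dump(tree, f, protocol=pickle.HIGHEST_PROTOCOL)
    return tree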
 

John Machin

Jim said:
I'm converting a Perl system to Python, and have run into a severe
performance problem with pickle.

One facet of the system involves scanning and loading into memory a
couple of parallel directory trees containing on the order of 10^4 files.  The
trees don't change during development/testing and the scan takes 30-40
seconds, so to save time I cache the loaded tree structure to disk, in
Perl with module Storable, and in Python with pickle.

In Perl, the save operation produces a file of about 3MB, and both
save and restore take a second or two.  In Python, pickle.dump()
produces a similar-size file but takes 20 seconds, and pickle.load()
takes 45 seconds, which is actually LONGER than the time required to
scan the directory trees.

Is there anything I can do to speed up pickle.load() to get
performance comparable to Perl's Storable?

Have you read this:
http://www.python.org/doc/2.6/library/pickle.html
?
Have you considered using cPickle instead of pickle?
Have you considered using *ickle.dump(..., protocol=-1) ?
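
For example, combining those two suggestions under Python 2.x might look roughly like this (tree_cache.pkl and the sample data are made up):

import cPickle as pickle   # the C implementation of pickle in Python 2.x

tree = {'some': ['nested', 'directory', 'data']}   # stand-in for the scanned tree

# Passing -1 as the protocol selects the highest available (binary) protocol,
# which is both smaller on disk and faster to read back than protocol 0.
with open('tree_cache.pkl', 'wb') as f:
    pickle.dump(tree, f, -1)

with open('tree_cache.pkl', 'rb') as f:
    tree = pickle.load(f)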
 

Jim Garrison

John said:
Have you read this:
http://www.python.org/doc/2.6/library/pickle.html
?
Have you considered using cPickle instead of pickle?
Have you considered using *ickle.dump(..., protocol=-1) ?

I'm using Python 3 on Windows (Server 2003). According to the docs

"The pickle module has an transparent optimizer (_pickle) written
in C. It is used whenever available. Otherwise the pure Python
implementation is used."

How can I tell if _pickle is being used?
 

Jim Garrison

Jim said:
John Machin wrote: [snip]
Have you considered using cPickle instead of pickle?
Have you considered using *ickle.dump(..., protocol=-1) ?

I'm using Python 3 on Windows (Server 2003). According to the docs

"The pickle module has an transparent optimizer (_pickle) written
in C. It is used whenever available. Otherwise the pure Python
implementation is used."

How can I tell if _pickle is being used?

Answered my own question:

['PickleError', 'Pickler', 'PicklingError', 'Unpickler',
 'UnpicklingError', '__doc__', '__name__', '__package__']

['__class__', '__delattr__', '__doc__', '__eq__', '__format__',
 '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__',
 '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
 '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
 'bin', 'clear_memo', 'dump', 'fast', 'memo', 'persistent_id']

['__class__', '__delattr__', '__doc__', '__eq__', '__format__',
 '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__',
 '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
 '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
 'bin', 'clear_memo', 'dump', 'fast', 'memo', 'persistent_id']

_pickle seems to be there. Also, if I step into the load
call (pydev under Eclipse) it steps into pickle.load() but
won't step into the call to the Unpickler constructor. I
assume that means it's calling out to the C implementation.
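
Another quick check, assuming the accelerated classes report _pickle as their defining module (this may vary between 3.x releases):

import pickle

# If the C accelerator is in use, the Pickler/Unpickler names exposed by the
# pickle module are the extension types defined in _pickle; the pure-Python
# fallbacks report 'pickle' as their module instead.
print(pickle.Pickler.__module__)     # expected '_pickle' when the C version is active
print(pickle.Unpickler.__module__)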
 

Carl Banks

Jim said:
I'm using Python 3 on Windows (Server 2003).  According to the docs

   "The pickle module has a transparent optimizer (_pickle) written
   in C. It is used whenever available. Otherwise the pure Python
   implementation is used."

How can I tell if _pickle is being used?

The slow performance is most likely due to the poor performance of
Python 3's IO, which is caused by (among other things) bad buffering
strategy. It's a Python 3 growing pain, and is being rewritten.
Python 3.1 should be much faster but it's not been released yet.

As a workaround, mmap the file instead. For example (untested):


import mmap
import pickle

f = open('dirlisting.dat', 'rb')
try:
    f.seek(0, 2)              # seek to the end to find the file size
    size = f.tell()
    f.seek(0, 0)              # back to the start
    m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
    try:
        dir_listing = pickle.loads(m)
    finally:
        m.close()
finally:
    f.close()


Pickling the output left as an exercise.
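
The dump side might look something like this (also untested; the file name matches the example above and the sample data is a placeholder):

import pickle

dir_listing = {'example': ['directory', 'entries']}   # stand-in for the real tree

# Write with the highest available protocol; the binary protocols are much
# more compact and much faster to unpickle than the default.
f = open('dirlisting.dat', 'wb')
try:
    pickle.dump(dir_listing, f, protocol=pickle.HIGHEST_PROTOCOL)
finally:
    f.close()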


Carl Banks
 

bearophileHUGS

Carl Banks:
The slow performance is most likely due to the poor performance of
Python 3's IO, which is caused by [...]

My suggestion for the Original Poster is just to try using Python 2.x,
if possible :)

Bye,
bearophile
 

Terry Reedy

Carl said:
The slow performance is most likely due to the poor performance of
Python 3's IO, which is caused by (among other things) bad buffering
strategy. It's a Python 3 growing pain, and is being rewritten.
Python 3.1 should be much faster but it's not been released yet.

3.1a1 is out and I believe it has the io improvements.
 

Benjamin Peterson

Terry Reedy said:
3.1a1 is out and I believe it has the io improvements.

Massive ones, too. It'd be interesting to see your results on the alpha.
 

Jim Garrison

Benjamin said:
Massive ones, too. It'd be interesting to see your results on the alpha.

On 3.1a1 the unpickle step takes 2.4 seconds, a 1775% improvement.

Thanks.
 

Steve Holden

Jean-Paul Calderone said:
Surely you mean a 94.7% improvement?

Well, since it's now running almost twenty times faster, the speed has
increased by 1775%. Not sure what the mathematics of improvement are ...

regards
Steve
 

Jim Garrison

Steve said:
Well, since it's now running almost twenty times faster, the speed has
increased by 1775%. Not sure what the mathematics of improvement are ...

regards
Steve

The arithmetic depends on whether you're looking at time or
velocity, which are inverses of each other.

If you double your velocity (100% increase) the time required goes
down by 50%. A 1000% increase in velocity results in a 90% decrease
in time... etc. I guess I equate "performance" to velocity.
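
For the record, working through the thread's own figures (45 seconds before, 2.4 seconds on 3.1a1; the percentages in the comments are computed here, not quoted from above):

old_time = 45.0    # seconds for pickle.load() on 3.0
new_time = 2.4     # seconds on 3.1a1

speedup = old_time / new_time                        # 18.75x, "almost twenty times faster"
speed_increase = (speedup - 1) * 100                 # 1775% increase in velocity
time_reduction = (1 - new_time / old_time) * 100     # 94.7% decrease in elapsed time

print(speedup, speed_increase, time_reduction)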
 
