writing large dictionaries to file using cPickle


perfreem

hello all,

i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,

mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f': 'world'}, ...],
          key2: [...]}

in total there are about 10 to 15 million small dictionaries if we
concatenate together all the values of every key in 'mydict'. mydict is
a structure that represents data in a very large file (about 800
megabytes).

what is the fastest way to pickle 'mydict' into a file? right now i am
experiencing a lot of difficulties with cPickle when using it like
this:

import cPickle as pickle
pfile = open(my_file, 'w')
pickle.dump(mydict, pfile)
pfile.close()

this creates an extremely large file (~300 MB) and it does so
*extremely* slowly. it writes about 1 megabyte every 5 or 10 seconds,
and it gets slower and slower. it takes almost an hour, if not more, to
write this pickle object to file.

is there any way to speed this up? i don't mind the large file... after
all, the text file with the data used to make the dictionary was larger
(~800 MB) than the file it eventually creates, which is 300 MB. but
i do care about speed...

i have tried optimizing this by using this:

s = pickle.dumps(mydict, 2)
pfile.write(s)

but this takes just as long... any ideas ? is there a different module
i could use that's more suitable for large dictionaries ?
thank you very much.
 

python

Hi,

Change:

pickle.dump(mydict, pfile)

to:

pickle.dump(mydict, pfile, -1 )

I think you will see a big difference in performance and also a much
smaller file on disk.
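
Something like this, as an untested sketch ('mydict.pkl' is just a
placeholder name, and mydict is your dictionary); note that a binary
protocol wants the file opened in 'wb' rather than 'w':

import cPickle as pickle

# protocol -1 selects the highest (binary) pickle protocol available
pfile = open('mydict.pkl', 'wb')      # binary mode for a binary protocol
pickle.dump(mydict, pfile, -1)        # -1 == pickle.HIGHEST_PROTOCOL
pfile.close()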

BTW: What type of application are you developing that creates so many
dictionaries? Sounds interesting.

Malcolm
 

perfreem

Hi,

Change:

pickle.dump(mydict, pfile)

to:

pickle.dump(mydict, pfile, -1 )

I think you will see a big difference in performance and also a much
smaller file on disk.

BTW: What type of application are you developing that creates so many
dictionaries? Sounds interesting.

Malcolm

hi!

thank you for your reply. unfortunately i tried this but it doesn't
change the speed. it's still writing the file extremely slowly. i'm
not sure why?

thank you.
 

Aaron Brady

hello all,

i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
snip
but this takes just as long... any ideas ? is there a different module
i could use that's more suitable for large dictionaries ?
thank you very much.

There is the 'shelve' module. You could create one shelf that tells you
the filenames of the other ones. A million keys should be no
problem, I guess. (It's standard library.) All your keys have to be
strings, though, and all your values have to be pickleable. If that's
a problem, yes you will need ZODB or Django (I understand), or another
relational DB.
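
A rough sketch of what I mean (filenames invented; Python 2, since
you're on cPickle; mydict is your dictionary):

import shelve

# hypothetical layout: one shelf per top-level key, plus an index shelf
index = shelve.open('mydict_index.shelf')
for key, records in mydict.items():
    fname = 'mydict_%s.shelf' % key
    db = shelve.open(fname)
    db[str(key)] = records       # the list of small dicts just has to be pickleable
    db.close()
    index[str(key)] = fname      # shelf keys have to be strings
index.close()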
 

John Machin

hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
snip
what is the fastest way to pickle 'mydict' into a file?
snip

Pardon me if I'm asking the "bleedin' obvious", but have you checked
how much virtual memory this is taking up compared to how much real
memory you have? If the slowness is due to pagefile I/O, consider
doing "about 10" separate pickles (one for each key in your top-level
dictionary).
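
Splitting it up is only a few lines, something like this (untested
sketch; filenames invented):

import cPickle as pickle

# one pickle per top-level key, so each dump works on a much smaller object
for key, records in mydict.items():
    pfile = open('mydict_%s.pkl' % key, 'wb')
    pickle.dump(records, pfile, -1)
    pfile.close()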
 

perfreem

hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
snip
what is the fastest way to pickle 'mydict' into a file?
snip

Pardon me if I'm asking the "bleedin' obvious", but have you checked
how much virtual memory this is taking up compared to how much real
memory you have? If the slowness is due to pagefile I/O, consider
doing "about 10" separate pickles (one for each key in your top-level
dictionary).

the slowness is due to CPU when i profile my program using the unix
program 'top'... i think all the work is in the file I/O. the machine
i am using has several GB of ram, and memory is not heavily taxed at
all. do you know how file I/O can be sped up?

in reply to the other poster: i thought 'shelve' simply calls pickle.
if that's the case, it wouldn't be any faster, right?
 

Aaron Brady

On Jan 29, 3:13 am, (e-mail address removed) wrote:
hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f': 'world'}, ...],
          key2: [...]}
in total there are about 10 to 15 million lists if we concatenate
together all the values of every key in 'mydict'. mydict is a
structure that represents data in a very large file (about 800
megabytes).
snip

in reply to the other poster: i thought 'shelve' simply calls pickle.
if that's the case, it wouldn't be any faster, right?

Yes, but not all at once. It's a clear winner if you need to update
any of them later, but if it's just write-once, read-many, it's about
the same.

You said you have a million dictionaries. Even if each took only one
byte, you would still have a million bytes. Do you expect a faster I/O
time than the time it takes to write a million bytes?

I want to agree with John's worry about RAM, unless you have several+
GB, as you say. You are not dealing with small numbers.
 

John Machin

On Jan 29, 3:13 am, (e-mail address removed) wrote:
hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
snip
what is the fastest way to pickle 'mydict' into a file?
snip
Pardon me if I'm asking the "bleedin' obvious", but have you checked
how much virtual memory this is taking up compared to how much real
memory you have? If the slowness is due to pagefile I/O, consider
doing "about 10" separate pickles (one for each key in your top-level
dictionary).

the slowness is due to CPU when i profile my program using the unix
program 'top'... i think all the work is in the file I/O. the machine
i am using has several GB of ram, and memory is not heavily taxed at
all. do you know how file I/O can be sped up?

More quick silly questions:

(1) How long does it take to load that 300MB pickle back into memory
using:
(a) cPickle.load(f)
(b) f.read()
?

What else is happening on the machine while you are creating the
pickle?

(2) How does
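
For (1), a quick way to get both numbers (rough sketch; 'mydict.pkl'
stands in for whatever your pickle file is called):

import time
import cPickle

f = open('mydict.pkl', 'rb')

t0 = time.time()
raw = f.read()                       # (b) raw file read
print 'f.read():', time.time() - t0, 'seconds'

f.seek(0)
t0 = time.time()
obj = cPickle.load(f)                # (a) full unpickle
print 'cPickle.load():', time.time() - t0, 'seconds'

f.close()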
 

Gabriel Genellina

i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,

mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f': 'world'}, ...],
          key2: [...]}

[pickle] creates extremely large files (~ 300 MB) though it does so
*extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
it gets slower and slower. it takes almost an hour if not more to
write this pickle object to file.

There is an undocumented Pickler attribute, "fast". Usually, when the same
object is referenced more than once, only the first appearance is stored
in the pickled stream; later references just point to the original. This
requires the Pickler instance to remember every object pickled so far --
setting the "fast" attribute to a true value bypasses this check. Before
using this, you must be positively sure that your objects don't contain
circular references -- else pickling will never finish.

py> from cPickle import Pickler
py> from cStringIO import StringIO
py> s = StringIO()
py> p = Pickler(s, -1)
py> p.fast = 1
py> x = [1,2,3]
py> y = [x, x, x]
py> y
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y[0] is y[1]
True
py> p.dump(y)
<cPickle.Pickler object at 0x00BC0E48>
py> s.getvalue()
'\x80\x02](](K\x01K\x02K\x03e](K\x01K\x02K\x03e](K\x01K\x02K\x03ee.'

Note that, when unpickling, shared references are broken:

py> s.seek(0,0)
py> from cPickle import load
py> y2 = load(s)
py> y2
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y2[0] is y2[1]
False
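
Applied to your case it would look roughly like this (a sketch, only
safe if mydict really contains no shared or circular references;
'mydict.pkl' is just a placeholder name):

from cPickle import Pickler

pfile = open('mydict.pkl', 'wb')
p = Pickler(pfile, -1)
p.fast = 1          # don't keep the memo of already-pickled objects
p.dump(mydict)
pfile.close()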
 

perfreem

On Jan 29, 3:13 am, (e-mail address removed) wrote:
hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f': 'world'}, ...],
          key2: [...]}
in total there are about 10 to 15 million lists if we concatenate
together all the values of every key in 'mydict'. mydict is a
structure that represents data in a very large file (about 800
megabytes).
snip

in reply to the other poster: i thought 'shelve' simply calls pickle.
if that's the case, it wouldn't be any faster, right?

Yes, but not all at once.  It's a clear winner if you need to update
any of them later, but if it's just write-once, read-many, it's about
the same.

You said you have a million dictionaries.  Even if each took only one
byte, you would still have a million bytes.  Do you expect a faster I/O
time than the time it takes to write a million bytes?

I want to agree with John's worry about RAM, unless you have several+
GB, as you say.  You are not dealing with small numbers.

in my case, i just write the pickle file once and then read it in
later. in that case, cPickle and shelve would be identical, if i
understand correctly?

the file i'm reading in is an ~800 MB file, and the pickle file is
around 300 MB. even if it were 800 MB, it doesn't make sense to me that
python's i/o would be that slow... it takes roughly 5 seconds to write
one megabyte of a binary file (the pickled object in this case), which
just seems wrong. does anyone know anything about this? about how i/o
can be sped up, for example?

the dictionary might have a million keys, but each key's value is very
small. i tried the same example where the keys are short strings (and
there are about 10-15 million of them) and each value is an integer,
and it is still very slow. does anyone know how to test whether i/o is
the bottleneck, or whether it's something specific about pickle?
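
for example, would something like this be a reasonable test? (rough
sketch, timing the pickling and the raw write separately; the filename
is made up):

import time
import cPickle as pickle

t0 = time.time()
s = pickle.dumps(mydict, -1)            # serialization only, no file i/o
print 'dumps:', time.time() - t0, 'seconds,', len(s), 'bytes'

t0 = time.time()
f = open('mydict.pkl', 'wb')
f.write(s)                              # raw file i/o only
f.close()
print 'write:', time.time() - t0, 'seconds'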

thanks.
 

Aaron Brady

On Jan 28, 4:43 pm, (e-mail address removed) wrote:
On Jan 29, 3:13 am, (e-mail address removed) wrote:
hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
snip
in reply to the other poster: i thought 'shelve' simply calls pickle.
if that's the case, it wouldn't be any faster, right?
Yes, but not all at once.  It's a clear winner if you need to update
any of them later, but if it's just write-once, read-many, it's about
the same.
You said you have a million dictionaries.  Even if each took only one
byte, you would still have a million bytes.  Do you expect a faster I/O
time than the time it takes to write a million bytes?
I want to agree with John's worry about RAM, unless you have several+
GB, as you say.  You are not dealing with small numbers.

in my case, i just write the pickle file once and then read it in
later. in that case, cPickle and shelve would be identical, if i
understand correctly?

No, not identical. A shelf is not a dictionary; it's a database object
that implements the mapping protocol. 'isinstance( myshelf, dict )' is
False, for example (where myshelf is what shelve.open() returns).
the file i'm reading in is an ~800 MB file, and the pickle file is around
300 MB. even if it were 800 MB, it doesn't make sense to me that
python's i/o would be that slow... it takes roughly 5 seconds to write
one megabyte of a binary file (the pickled object in this case), which
just seems wrong. does anyone know anything about this? about how i/o
can be sped up for example?

You can try copying a 1-MB file. Or something like:

f = open( 'temp.temp', 'w' )
for x in range( 100000 ):
    f.write( '0' * 10 )
f.close()

You know how long it takes OSes to boot, right?
the dictionary might have a million keys, but each key's value is very
small. i tried the same example where the keys are short strings (and
there are about 10-15 million of them) and each value is an integer,
and it is still very slow. does anyone know how to test whether i/o is
the bottleneck, or whether it's something specific about pickle?

thanks.

You could fall back to storing a parallel list by hand, if you're just
using string and numeric primitives.
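
For the flat variant you described (short string keys, integer values),
that could be as simple as the following sketch; 'flatdict' stands for
that string-to-integer dictionary, and the filename is invented:

# one "key<TAB>value" line per record, written by hand instead of pickled
out = open( 'mydict.tab', 'w' )
for key, value in flatdict.iteritems():
    out.write( '%s\t%d\n' % ( key, value ) )
out.close()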
 

repi

Hello,

I'm having the same problem. I'm creating two dictionaries with about
700,000 to 1,600,000 keys each, each key pointing to a smaller structure of
embedded dictionaries. On an 8 x 3.0GHz Xeon Mac Pro with 17GB of RAM,
cPickle is excruciatingly slow. Moreover, it crashes with a memory error (12).
I've been looking for alternatives, even thinking of writing up my own file
format and read/write functions. As a last resort I tried dumping the two
dictionaries with the marshal module and it worked great. A 200MB and a
300MB file are created in a couple of seconds without a crash.

You should be warned that marshal is not meant as a general data-persistence
module. But as long as you have a rather simple data structure and you read
the file back with the same Python version that wrote it, it seems to run
circles around cPickle.
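
The switch is mechanical, roughly like this (untested sketch; the
filename is made up and 'mydict' stands for one of the dictionaries):

import marshal

# marshal only handles plain containers (dicts, lists, strings, numbers)
f = open('mydict.marshal', 'wb')
marshal.dump(mydict, f)
f.close()

# read it back with the same Python version that wrote it
f = open('mydict.marshal', 'rb')
mydict = marshal.load(f)
f.close()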

Hope it helps.
 
