writing large dictionaries to file using cPickle


perfreem

hello all,

i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,

mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f': 'world'}, ...],
          key2: [...]}

in total there are about 10 to 15 million small dictionaries if we
concatenate together all the values of every key in 'mydict'. mydict is
a structure that represents data in a very large file (about 800
megabytes).

what is the fastest way to pickle 'mydict' into a file? right now i am
experiencing a lot of difficulties with cPickle when using it like
this:

import cPickle as pickle
pfile = open(my_file, 'w')
pickle.dump(mydict, pfile)
pfile.close()

this creates an extremely large file (~300 MB) and it does so
*extremely* slowly. it writes about 1 megabyte every 5 or 10 seconds,
and it gets slower and slower. it takes almost an hour, if not more, to
write this pickle object to file.

is there any way to speed this up? i don't mind the large file... after
all, the text file with the data used to make the dictionary was larger
(~800 MB) than the file it eventually creates, which is 300 MB. but
i do care about speed...

i have tried optimizing this by using this:

s = pickle.dumps(mydict, 2)
pfile.write(s)

but this takes just as long... any ideas ? is there a different module
i could use that's more suitable for large dictionaries ?
thank you very much.
 

python

Hi,

Change:

pickle.dump(mydict, pfile)

to:

pickle.dump(mydict, pfile, -1 )

I think you will see a big difference in performance and also a much
smaller file on disk.
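
Something like this, as an untested sketch ('mydict.pkl' is just a
placeholder name, and mydict is your dictionary); note that a binary
protocol wants the file opened in 'wb' rather than 'w':

import cPickle as pickle

# protocol -1 selects the highest (binary) pickle protocol available
pfile = open('mydict.pkl', 'wb')      # binary mode for a binary protocol
pickle.dump(mydict, pfile, -1)        # -1 == pickle.HIGHEST_PROTOCOL
pfile.close()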

BTW: What type of application are you developing that creates so many
dictionaries? Sounds interesting.

Malcolm
 

perfreem

Hi,

Change:

pickle.dump(mydict, pfile)

to:

pickle.dump(mydict, pfile, -1 )

I think you will see a big difference in performance and also a much
smaller file on disk.

BTW: What type of application are you developing that creates so many
dictionaries? Sounds interesting.

Malcolm

hi!

thank you for your reply. unfortunately i tried this but it doesn't
change the speed. it's still writing the file extremely slowly. i'm
not sure why?

thank you.
 

Aaron Brady

hello all,

i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
snip
but this takes just as long... any ideas ? is there a different module
i could use that's more suitable for large dictionaries ?
thank you very much.

There is the 'shelve' module. You could create one shelf that tells you
the filenames of the other ones. A million keys should be no
problem, I guess. (It's standard library.) All your keys have to be
strings, though, and all your values have to be pickleable. If that's
a problem, yes you will need ZODB or Django (I understand), or another
relational DB.
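
A rough sketch of what I mean (filenames invented; Python 2, since
you're on cPickle; mydict is your dictionary):

import shelve

# hypothetical layout: one shelf per top-level key, plus an index shelf
index = shelve.open('mydict_index.shelf')
for key, records in mydict.items():
    fname = 'mydict_%s.shelf' % key
    db = shelve.open(fname)
    db[str(key)] = records       # the list of small dicts just has to be pickleable
    db.close()
    index[str(key)] = fname      # shelf keys have to be strings
index.close()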
 

John Machin

hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
snip
what is the fastest way to pickle 'mydict' into a file?
snip

Pardon me if I'm asking the "bleedin' obvious", but have you checked
how much virtual memory this is taking up compared to how much real
memory you have? If the slowness is due to pagefile I/O, consider
doing "about 10" separate pickles (one for each key in your top-level
dictionary).
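
Splitting it up is only a few lines, something like this (untested
sketch; filenames invented):

import cPickle as pickle

# one pickle per top-level key, so each dump works on a much smaller object
for key, records in mydict.items():
    pfile = open('mydict_%s.pkl' % key, 'wb')
    pickle.dump(records, pfile, -1)
    pfile.close()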
 

perfreem

hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
snip
what is the fastest way to pickle 'mydict' into a file?
snip

Pardon me if I'm asking the "bleedin' obvious", but have you checked
how much virtual memory this is taking up compared to how much real
memory you have? If the slowness is due to pagefile I/O, consider
doing "about 10" separate pickles (one for each key in your top-level
dictionary).

the slowness is due to CPU when i profile my program using the unix
program 'top'... i think all the work is in the file I/O. the machine
i am using has several GB of ram, and memory is not heavily taxed at
all. do you know how file I/O can be sped up?

in reply to the other poster: i thought 'shelve' simply calls pickle.
if that's the case, it wouldn't be any faster, right?
 

Aaron Brady

On Jan 29, 3:13 am, (e-mail address removed) wrote:
hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f': 'world'}, ...],
          key2: [...]}
in total there are about 10 to 15 million lists if we concatenate
together all the values of every key in 'mydict'. mydict is a
structure that represents data in a very large file (about 800
megabytes).
snip

in reply to the other poster: i thought 'shelve' simply calls pickle.
if that's the case, it wouldn't be any faster, right?

Yes, but not all at once. It's a clear winner if you need to update
any of them later, but if it's just write-once, read-many, it's about
the same.

You said you have a million dictionaries. Even if each took only one
byte, you would still have a million bytes. Do you expect a faster I/O
time than the time it takes to write a million bytes?

I want to agree with John's worry about RAM, unless you have several+
GB, as you say. You are not dealing with small numbers.
 

John Machin

On Jan 29, 3:13 am, (e-mail address removed) wrote:
hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
snip
what is the fastest way to pickle 'mydict' into a file?
snip
Pardon me if I'm asking the "bleedin' obvious", but have you checked
how much virtual memory this is taking up compared to how much real
memory you have? If the slowness is due to pagefile I/O, consider
doing "about 10" separate pickles (one for each key in your top-level
dictionary).

the slowness is due to CPU when i profile my program using the unix
program 'top'... i think all the work is in the file I/O. the machine
i am using has several GB of ram, and memory is not heavily taxed at
all. do you know how file I/O can be sped up?

More quick silly questions:

(1) How long does it take to load that 300MB pickle back into memory
using:
(a) cPickle.load(f)
(b) f.read()
?

What else is happening on the machine while you are creating the
pickle?

(2) How does
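
For (1), a quick way to get both numbers (rough sketch; 'mydict.pkl'
stands in for whatever your pickle file is called):

import time
import cPickle

f = open('mydict.pkl', 'rb')

t0 = time.time()
raw = f.read()                       # (b) raw file read
print 'f.read():', time.time() - t0, 'seconds'

f.seek(0)
t0 = time.time()
obj = cPickle.load(f)                # (a) full unpickle
print 'cPickle.load():', time.time() - t0, 'seconds'

f.close()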
 

Gabriel Genellina

i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,

mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f': 'world'}, ...],
          key2: [...]}

[pickle] creates extremely large files (~ 300 MB) though it does so
*extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
it gets slower and slower. it takes almost an hour if not more to
write this pickle object to file.

There is an undocumented Pickler attribute, "fast". Usually, when the same
object is referenced more than once, only the first appearance is stored
in the pickled stream; later references just point to the original. This
requires the Pickler instance to remember every object pickled so far --
setting the "fast" attribute to a true value bypasses this check. Before
using this, you must be positively sure that your objects don't contain
circular references -- else pickling will never finish.

py> from cPickle import Pickler
py> from cStringIO import StringIO
py> s = StringIO()
py> p = Pickler(s, -1)
py> p.fast = 1
py> x = [1,2,3]
py> y = [x, x, x]
py> y
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y[0] is y[1]
True
py> p.dump(y)
<cPickle.Pickler object at 0x00BC0E48>
py> s.getvalue()
'\x80\x02](](K\x01K\x02K\x03e](K\x01K\x02K\x03e](K\x01K\x02K\x03ee.'

Note that, when unpickling, shared references are broken:

py> s.seek(0,0)
py> from cPickle import load
py> y2 = load(s)
py> y2
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
py> y2[0] is y2[1]
False
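
Applied to your case it would look roughly like this (a sketch, only
safe if mydict really contains no shared or circular references;
'mydict.pkl' is just a placeholder name):

from cPickle import Pickler

pfile = open('mydict.pkl', 'wb')
p = Pickler(pfile, -1)
p.fast = 1          # don't keep the memo of already-pickled objects
p.dump(mydict)
pfile.close()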
 

perfreem

On Jan 29, 3:13 am, (e-mail address removed) wrote:
hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f': 'world'}, ...],
          key2: [...]}
in total there are about 10 to 15 million lists if we concatenate
together all the values of every key in 'mydict'. mydict is a
structure that represents data in a very large file (about 800
megabytes).
snip

in reply to the other poster: i thought 'shelve' simply calls pickle.
if that's the case, it wouldn't be any faster, right?

Yes, but not all at once.  It's a clear winner if you need to update
any of them later, but if it's just write-once, read-many, it's about
the same.

You said you have a million dictionaries.  Even if each took only one
byte, you would still have a million bytes.  Do you expect a faster I/O
time than the time it takes to write a million bytes?

I want to agree with John's worry about RAM, unless you have several+
GB, as you say.  You are not dealing with small numbers.

in my case, i just write the pickle file once and then read it in
later. in that case, cPickle and shelve would be identical, if i
understand correctly?

the file i'm reading in is an ~800 MB file, and the pickle file is
around 300 MB. even if it were 800 MB, it doesn't make sense to me that
python's i/o would be that slow... it takes roughly 5 seconds to write
one megabyte of a binary file (the pickled object in this case), which
just seems wrong. does anyone know anything about this? about how i/o
can be sped up, for example?

the dictionary might have a million keys, but each key's value is very
small. i tried the same example where the keys are short strings (and
there are about 10-15 million of them) and each value is an integer,
and it is still very slow. does anyone know how to test whether i/o is
the bottleneck, or whether it's something specific about pickle?
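
for example, would something like this be a reasonable test? (rough
sketch, timing the pickling and the raw write separately; the filename
is made up):

import time
import cPickle as pickle

t0 = time.time()
s = pickle.dumps(mydict, -1)            # serialization only, no file i/o
print 'dumps:', time.time() - t0, 'seconds,', len(s), 'bytes'

t0 = time.time()
f = open('mydict.pkl', 'wb')
f.write(s)                              # raw file i/o only
f.close()
print 'write:', time.time() - t0, 'seconds'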

thanks.
 

Aaron Brady

On Jan 28, 4:43 pm, (e-mail address removed) wrote:
On Jan 29, 3:13 am, (e-mail address removed) wrote:
hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
snip
in reply to the other poster: i thought 'shelve' simply calls pickle.
if that's the case, it wouldn't be any faster, right?
Yes, but not all at once.  It's a clear winner if you need to update
any of them later, but if it's just write-once, read-many, it's about
the same.
You said you have a million dictionaries.  Even if each took only one
byte, you would still have a million bytes.  Do you expect a faster I/O
time than the time it takes to write a million bytes?
I want to agree with John's worry about RAM, unless you have several+
GB, as you say.  You are not dealing with small numbers.

in my case, i just write the pickle file once and then read it in
later. in that case, cPickle and shelve would be identical, if i
understand correctly?

No, not identical. A shelf is not a dictionary; it's a database object
that implements the mapping protocol. 'isinstance( myshelf, dict )' is
False, for example (where myshelf is what shelve.open() returns).
the file i'm reading in is an ~800 MB file, and the pickle file is around
300 MB. even if it were 800 MB, it doesn't make sense to me that
python's i/o would be that slow... it takes roughly 5 seconds to write
one megabyte of a binary file (the pickled object in this case), which
just seems wrong. does anyone know anything about this? about how i/o
can be sped up for example?

You can try copying a 1-MB file. Or something like:

f = open( 'temp.temp', 'w' )
for x in range( 100000 ):
    f.write( '0' * 10 )
f.close()

You know how long it takes OSes to boot, right?
the dictionary might have a million keys, but each key's value is very
small. i tried the same example where the keys are short strings (and
there are about 10-15 million of them) and each value is an integer,
and it is still very slow. does anyone know how to test whether i/o is
the bottleneck, or whether it's something specific about pickle?

thanks.

You could fall back to storing a parallel list by hand, if you're just
using string and numeric primitives.
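
For the flat variant you described (short string keys, integer values),
that could be as simple as the following sketch; 'flatdict' stands for
that string-to-integer dictionary, and the filename is invented:

# one "key<TAB>value" line per record, written by hand instead of pickled
out = open( 'mydict.tab', 'w' )
for key, value in flatdict.iteritems():
    out.write( '%s\t%d\n' % ( key, value ) )
out.close()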
 

repi

Hello,

I'm having the same problem. I'm creating two dictionaries with about
700,000 to 1,600,000 keys each, each key pointing to a smaller structure of
embedded dictionaries. On an 8 x 3.0GHz Xeon Mac Pro with 17GB of RAM,
cPickle is excruciatingly slow. Moreover, it crashes with a memory error (12).
I've been looking for alternatives, even thinking of writing up my own file
format and read/write functions. As a last resort I tried dumping the two
dictionaries with the marshal module and it worked great. A 200MB and a
300MB file are created in a couple of seconds without a crash.

You should be warned that marshal is not meant as a general data-persistence
module. But as long as you have a rather simple data structure and you read
the file back with the same Python version that wrote it, it seems to run
circles around cPickle.
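
The switch is mechanical, roughly like this (untested sketch; the
filename is made up and 'mydict' stands for one of the dictionaries):

import marshal

# marshal only handles plain containers (dicts, lists, strings, numbers)
f = open('mydict.marshal', 'wb')
marshal.dump(mydict, f)
f.close()

# read it back with the same Python version that wrote it
f = open('mydict.marshal', 'rb')
mydict = marshal.load(f)
f.close()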

Hope it helps.
 
