my computer is allergic to pickles

Bob Fnord

I'm using python to do some log file analysis and I need to store
on disk a very large dict with tuples of strings as keys and
lists of strings and numbers as values.

I started by using cPickle to save the instance of the class that
contained this dict, but the pickling process started to write
the file but ate so much memory that my computer (4 GB RAM)
crashed so badly that I had to press the reset button. I've never
seen out-of-memory errors do this before. Is this normal?

(I know from the output that got written before the crash that my
program had finished building the dict and started the
pickle. When I tried running the other program that reads the
pickle and analyzes the data in it, it gave an error because the
file was incomplete. So I know where in my code the crash
happened.)

From searching the web, I get the impression that pickle uses a
lot of memory because it checks for recursion and other things
that could break other serialization methods. So I've switched to
using marshal to save the dict itself (the only persistent thing
in the class, which just has convenience methods for adding data
to the dict and searching it for the second stage of analysis).
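The marshal round trip I have in mind is roughly this (a minimal
sketch with a toy dict standing in for the real one; the filename
is made up):

```python
import marshal

# Toy stand-in for the real dict: tuple-of-strings keys,
# lists of strings and ints as values.
data = {("alpha", "beta"): ["x", 1, 2], ("gamma",): ["y", 3]}

# marshal handles dicts, tuples, lists, strings and ints directly.
with open("logdata.marshal", "wb") as f:
    marshal.dump(data, f)

with open("logdata.marshal", "rb") as f:
    restored = marshal.load(f)

assert restored == data
```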

I found some references to h5 tables for getting around the
pickling memory problem, but I got the impression they only work
with fixed columns, not a somewhat complex data structure like
mine.

Any comments, suggestions?
 

MRAB

Bob Fnord said:
I'm using python to do some log file analysis and I need to store
on disk a very large dict with tuples of strings as keys and
lists of strings and numbers as values.

I started by using cPickle to save the instance of the class that
contained this dict, but the pickling process started to write
the file but ate so much memory that my computer (4 GB RAM)
crashed so badly that I had to press the reset button. I've never
seen out-of-memory errors do this before. Is this normal?

(I know from the output that got written before the crash that my
program had finished building the dict and started the
pickle. When I tried running the other program that reads the
pickle and analyzes the data in it, it gave an error because the
file was incomplete. So I know where in my code the crash
happened.)

From searching the web, I get the impression that pickle uses a
lot of memory because it checks for recursion and other things
that could break other serialization methods. So I've switched to
using marshal to save the dict itself (the only persistent thing
in the class, which just has convenience methods for adding data
to the dict and searching it for the second stage of analysis).

I found some references to h5 tables for getting around the
pickling memory problem, but I got the impression they only work
with fixed columns, not a somewhat complex data structure like
mine.

Any comments, suggestions?

Would a database work?
 

Bob Fnord

MRAB said:
Would a database work?

I want a portable data file (can be moved around the filesystem
or copied to another machine and used), so I don't want to use
mysql or postgres. I guess the "sqlite" approach would work, but
I think it would be difficult to turn the tuples of strings and
lists of strings and numbers into database table lines.

Would a database in a file have any advantages over a file made
by marshal or shelve?

I'm more worried about the fact that a python program in user
space can bring down the computer!
 

Mel

Bob said:
I want a portable data file (can be moved around the filesystem
or copied to another machine and used), so I don't want to use
mysql or postgres. I guess the "sqlite" approach would work, but
I think it would be difficult to turn the tuples of strings and
lists of strings and numbers into database table lines.

This is as hairy as it's ever got for me (untested):

def inserter(db, table_name, names, values):
    query = 'INSERT INTO %s (%s) VALUES (%s)' % (
        table_name, ','.join(names), ','.join(['?'] * len(names)))
    cur = db.cursor()
    cur.execute(query, values)
    cur.close()
# ...
for v in all_value_triples:
    inserter(db, 'some_table', ['f1', 'f2', 'f3'], v)

(or even write a bulk_inserter that took all_value_triples as an argument
and moved the `for v in ...` inside the function.)
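Such a bulk_inserter might look like this (equally untested in
spirit; it leans on sqlite3's executemany to do the loop, and the
table name and columns are just placeholders):

```python
import sqlite3

def bulk_inserter(db, table_name, names, rows):
    # Same parameterized query as inserter, but executemany
    # runs it once per row in `rows`.
    query = 'INSERT INTO %s (%s) VALUES (%s)' % (
        table_name, ','.join(names), ','.join(['?'] * len(names)))
    cur = db.cursor()
    cur.executemany(query, rows)
    cur.close()

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE some_table (f1, f2, f3)')
bulk_inserter(db, 'some_table', ['f1', 'f2', 'f3'],
              [('a', 'b', 'c'), ('d', 'e', 'f')])
db.commit()
```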
Would a database in a file have any advantages over a file made
by marshal or shelve?

Depends. An sqlite3 database file is usable by programs not written in
Python.
I'm more worried about the fact that a python program in user
space can bring down the computer!

Never been a problem in the programs I've written.

Mel.
 

Martin P. Hellwig

On 05/03/2011 01:56, Bob Fnord wrote:
Any comments, suggestions?

No, but I have a bunch of pseudo-questions :)

What version of python are you using? How about your OS and bitspace
(32/64)? Have you also tried using the non-c pickle module? If the data
is very simple in structure, perhaps serializing to CSV might be an option?
 

Terry Reedy

Bob Fnord said:
I want a portable data file (can be moved around the filesystem
or copied to another machine and used),

Used only by Python or by other software?
Would a database in a file have any advantages over a file made
by marshal or shelve?

If you have read the initial paragraphs of the marshal doc and your
needs fit within its limitations, go ahead and use it. (Also note
that the marshal format could change in a future Python version.)

Keyed databases have the advantage that you can change the data file. If
you do not need to do that (as opposed to read in, do whatever, and
write out in entirety) then that is no advantage to you.

Similar to marshal is json, which is more limited but more
portable, because it is understood by other languages.
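One concrete limitation: json keys must be strings, so tuple keys
need an explicit encoding. A minimal sketch, assuming a separator
character that never occurs in the key strings:

```python
import json

data = {("alpha", "beta"): ["x", 1]}

# json object keys must be strings, so join the tuple with a
# separator ("\x00" here, assuming it never appears in the keys).
text = json.dumps({"\x00".join(k): v for k, v in data.items()})

# Invert the encoding on the way back in.
decoded = {tuple(k.split("\x00")): v
           for k, v in json.loads(text).items()}

assert decoded == data
```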
 

Bob Fnord

Miki Tebeka said:
Or, which situations does shelve suit better and which does
marshal suit better?
shelve's ease of use, and the fact that it uses the disk to store
objects, make it a good choice if you have a lot of objects, each
with a unique string key (and a tuple of strings can be converted
to and from a string).

db = shelve.open("/tmp/foo.db")
db["key1"] = (1, 2, 3)
...

Marshal is faster and IIRC more geared toward network operations. But I haven't used it that much ...

From looking at the shelve info in the library reference, I get
the impression it's tricky to change the values in the dict for
existing keys and be sure they get changed on disk. My dict has
lists of strings and integers as values, and the lists get changed
as the program analyzes the input files, then stored on disk in
their final form. I guess marshal is better for that.
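From the docs, shelve.open(..., writeback=True) seems meant for
exactly this case; a minimal sketch of what I understand (the
temp-file path is just illustrative):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo")  # illustrative path

# writeback=True caches accessed entries and flushes mutations on
# close, at the cost of extra memory for the cache.
db = shelve.open(path, writeback=True)
db["key"] = ["x", 1]
db["key"].append(2)   # without writeback, this change would be lost
db.close()

db = shelve.open(path)
restored = db["key"]
db.close()

assert restored == ["x", 1, 2]
```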

How can you convert a tuple of strings to a string and back in a
reliable deterministic way? The original strings may have ' " ,
in them.
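(One approach I've seen is repr plus ast.literal_eval, which takes
care of the quoting; a minimal sketch:)

```python
import ast

key = ("it's", 'a "quoted", string')
s = repr(key)                # deterministic text form of the tuple
back = ast.literal_eval(s)   # safe inverse of repr for literals
assert back == key
```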
 

Bob Fnord

Terry Reedy said:
Used only by Python or by other software?

Just Python.
If you have read the initial paragraphs of the marshal doc and your
needs fit within its limitations, go ahead and use it. (Also note
that the marshal format could change in a future Python version.)

OK, I think marshal is just what I need.
Keyed databases have the advantage that you can change the data file. If
you do not need to do that (as opposed to read in, do whatever, and
write out in entirety) then that is no advantage to you.

OK, thanks
 

Bob Fnord

Martin P. Hellwig said:
On 05/03/2011 01:56, Bob Fnord wrote:

No, but I have a bunch of pseudo-questions :)

What version of python are you using? How about your OS and bitspace
(32/64)? Have you also tried using the non-c pickle module? If the data
is very simple in structure, perhaps serializing to CSV might be an option?

python 2.6.6

ubuntu 64 bit

The library ref says cPickle is "optimized" and "up to 1000 times
faster than pickle" but of course doesn't mention memory.

The data to save (and load in another python program) is a dict
with keys = tuples of strings (including " ' , and other
troublesome characters) and values = lists of strings and
integers. As the 1st program runs, it adds new keys AND changes
the contents of the value lists. (The 2nd program only reads the
dict into memory, analyzes it, and prints to STDOUT.)
 

Peter Otten

Bob said:
I'm using python to do some log file analysis and I need to store
on disk a very large dict with tuples of strings as keys and
lists of strings and numbers as values.

I started by using cPickle to save the instance of the class that
contained this dict, but the pickling process started to write
the file but ate so much memory that my computer (4 GB RAM)
crashed so badly that I had to press the reset button. I've never
seen out-of-memory errors do this before. Is this normal?

(I know from the output that got written before the crash that my
program had finished building the dict and started the
pickle. When I tried running the other program that reads the
pickle and analyzes the data in it, it gave an error because the
file was incomplete. So I know where in my code the crash
happened.)

From searching the web, I get the impression that pickle uses a
lot of memory because it checks for recursion and other things
that could break other serialization methods. So I've switched to
using marshal to save the dict itself (the only persistent thing
in the class, which just has convenience methods for adding data
to the dict and searching it for the second stage of analysis).

I found some references to h5 tables for getting around the
pickling memory problem, but I got the impression they only work
with fixed columns, not a somewhat complex data structure like
mine.

Any comments, suggestions?

Have you seen this one?

http://mail.python.org/pipermail/python-list/2008-July/1139855.html
 
