my computer is allergic to pickles

Bob Fnord

I'm using python to do some log file analysis and I need to store
on disk a very large dict with tuples of strings as keys and
lists of strings and numbers as values.

I started by using cPickle to save the instance of the class that
contained this dict, but the pickling process started to write
the file but ate so much memory that my computer (4 GB RAM)
crashed so badly that I had to press the reset button. I've never
seen out-of-memory errors do this before. Is this normal?

(I know from the output that got written before the crash that my
program had finished building the dict and started the
pickle. When I tried running the other program that reads the
pickle and analyzes the data in it, it gave an error because the
file was incomplete. So I know where in my code the crash
happened.)

From searching the web, I get the impression that pickle uses a
lot of memory because it checks for recursion and other things
that could break other serialization methods. So I've switched to
using marshal to save the dict itself (the only persistent thing
in the class, which just has convenience methods for adding data
to the dict and searching it for the second stage of analysis).
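The marshal round trip I have in mind is roughly this (a minimal
sketch with a toy dict standing in for the real one; the filename
is made up):

```python
import marshal

# Toy stand-in for the real dict: tuple-of-strings keys,
# lists of strings and ints as values.
data = {("alpha", "beta"): ["x", 1, 2], ("gamma",): ["y", 3]}

# marshal handles dicts, tuples, lists, strings and ints directly.
with open("logdata.marshal", "wb") as f:
    marshal.dump(data, f)

with open("logdata.marshal", "rb") as f:
    restored = marshal.load(f)

assert restored == data
```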

I found some references to h5 tables for getting around the
pickling memory problem, but I got the impression they only work
with fixed columns, not a somewhat complex data structure like
mine.

Any comments, suggestions?
 

MRAB

Bob Fnord said:
I'm using python to do some log file analysis and I need to store
on disk a very large dict with tuples of strings as keys and
lists of strings and numbers as values.

I started by using cPickle to save the instance of the class that
contained this dict, but the pickling process started to write
the file but ate so much memory that my computer (4 GB RAM)
crashed so badly that I had to press the reset button. I've never
seen out-of-memory errors do this before. Is this normal?

(I know from the output that got written before the crash that my
program had finished building the dict and started the
pickle. When I tried running the other program that reads the
pickle and analyzes the data in it, it gave an error because the
file was incomplete. So I know where in my code the crash
happened.)

From searching the web, I get the impression that pickle uses a
lot of memory because it checks for recursion and other things
that could break other serialization methods. So I've switched to
using marshal to save the dict itself (the only persistent thing
in the class, which just has convenience methods for adding data
to the dict and searching it for the second stage of analysis).

I found some references to h5 tables for getting around the
pickling memory problem, but I got the impression they only work
with fixed columns, not a somewhat complex data structure like
mine.

Any comments, suggestions?

Would a database work?
 

Bob Fnord

MRAB said:
Would a database work?

I want a portable data file (can be moved around the filesystem
or copied to another machine and used), so I don't want to use
mysql or postgres. I guess the "sqlite" approach would work, but
I think it would be difficult to turn the tuples of strings and
lists of strings and numbers into database table lines.

Would a database in a file have any advantages over a file made
by marshal or shelve?

I'm more worried about the fact that a python program in user
space can bring down the computer!
 

Mel

Bob said:
I want a portable data file (can be moved around the filesystem
or copied to another machine and used), so I don't want to use
mysql or postgres. I guess the "sqlite" approach would work, but
I think it would be difficult to turn the tuples of strings and
lists of strings and numbers into database table lines.

This is as hairy as it's ever got for me (untested):

def inserter(db, table_name, names, values):
    query = 'INSERT INTO %s (%s) VALUES (%s)' % (
        table_name, ','.join(names), ','.join(['?'] * len(names)))
    cur = db.cursor()
    cur.execute(query, values)
    cur.close()
# ...
for v in all_value_triples:
    inserter(db, 'some_table', ['f1', 'f2', 'f3'], v)

(or even write a bulk_inserter that took all_value_triples as an argument
and moved the `for v in ...` inside the function.)
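Such a bulk_inserter might look like this (equally untested in
spirit; it leans on sqlite3's executemany to do the loop, and the
table name and columns are just placeholders):

```python
import sqlite3

def bulk_inserter(db, table_name, names, rows):
    # Same parameterized query as inserter, but executemany
    # runs it once per row in `rows`.
    query = 'INSERT INTO %s (%s) VALUES (%s)' % (
        table_name, ','.join(names), ','.join(['?'] * len(names)))
    cur = db.cursor()
    cur.executemany(query, rows)
    cur.close()

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE some_table (f1, f2, f3)')
bulk_inserter(db, 'some_table', ['f1', 'f2', 'f3'],
              [('a', 'b', 'c'), ('d', 'e', 'f')])
db.commit()
```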
Would a database in a file have any advantages over a file made
by marshal or shelve?

Depends. An sqlite3 database file is usable by programs not written in
Python.
I'm more worried about the fact that a python program in user
space can bring down the computer!

Never been a problem in the programs I've written.

Mel.
 

Martin P. Hellwig

On 05/03/2011 01:56, Bob Fnord wrote:
Any comments, suggestions?

No, but I have a bunch of pseudo-questions :)

What version of python are you using? How about your OS and bitspace
(32/64)? Have you also tried using the non-c pickle module? If the data
is very simple in structure, perhaps serializing to CSV might be an option?
 

Terry Reedy

Bob Fnord said:
I want a portable data file (can be moved around the filesystem
or copied to another machine and used),

Used only by Python or by other software?
Would a database in a file have any advantages over a file made
by marshal or shelve?

If you have read the initial paragraphs of the marshal doc and your
needs fit within its limitations, go ahead and use it. (Also note
that the marshal format could change in a future Python version.)

Keyed databases have the advantage that you can change the data file. If
you do not need to do that (as opposed to read in, do whatever, and
write out in entirety) then that is no advantage to you.

Similar to marshal is json, which is more limited but more
portable, because it is understood by other languages.
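One concrete limitation: json keys must be strings, so tuple keys
need an explicit encoding. A minimal sketch, assuming a separator
character that never occurs in the key strings:

```python
import json

data = {("alpha", "beta"): ["x", 1]}

# json object keys must be strings, so join the tuple with a
# separator ("\x00" here, assuming it never appears in the keys).
text = json.dumps({"\x00".join(k): v for k, v in data.items()})

# Invert the encoding on the way back in.
decoded = {tuple(k.split("\x00")): v
           for k, v in json.loads(text).items()}

assert decoded == data
```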
 

Bob Fnord

Miki Tebeka said:
Or, which situations does shelve suit better and which does
marshal suit better?
shelve's ease of use, and the fact that it uses the disk to store
objects, make it a good choice if you have a lot of objects, each
with a unique string key (and a tuple of strings can be converted
to and from a string).

db = shelve.open("/tmp/foo.db")
db["key1"] = (1, 2, 3)
...

Marshal is faster and IIRC more geared toward network operations. But I haven't used it that much ...

From looking at the shelve info in the library reference, I get
the impression it's tricky to change the values in the dict for
existing keys and be sure they get changed on disk. My dict has
lists of strings and integers as values, and the lists get changed
as the program analyzes the input files, then stored on disk in
their final form. I guess marshal is better for that.
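From the docs, shelve.open(..., writeback=True) seems meant for
exactly this case; a minimal sketch of what I understand (the
temp-file path is just illustrative):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo")  # illustrative path

# writeback=True caches accessed entries and flushes mutations on
# close, at the cost of extra memory for the cache.
db = shelve.open(path, writeback=True)
db["key"] = ["x", 1]
db["key"].append(2)   # without writeback, this change would be lost
db.close()

db = shelve.open(path)
restored = db["key"]
db.close()

assert restored == ["x", 1, 2]
```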

How can you convert a tuple of strings to a string and back in a
reliable deterministic way? The original strings may have ' " ,
in them.
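(One approach I've seen is repr plus ast.literal_eval, which takes
care of the quoting; a minimal sketch:)

```python
import ast

key = ("it's", 'a "quoted", string')
s = repr(key)                # deterministic text form of the tuple
back = ast.literal_eval(s)   # safe inverse of repr for literals
assert back == key
```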
 

Bob Fnord

Terry Reedy said:
Used only by Python or by other software?

Just Python.
If you have read the initial paragraphs of the marshal doc and your
needs fit within its limitations, go ahead and use it. (Also note
that the marshal format could change in a future Python version.)

OK, I think marshal is just what I need.
Keyed databases have the advantage that you can change the data file. If
you do not need to do that (as opposed to read in, do whatever, and
write out in entirety) then that is no advantage to you.

OK, thanks
 

Bob Fnord

Martin P. Hellwig said:
On 05/03/2011 01:56, Bob Fnord wrote:

No, but I have a bunch of pseudo-questions :)

What version of python are you using? How about your OS and bitspace
(32/64)? Have you also tried using the non-c pickle module? If the data
is very simple in structure, perhaps serializing to CSV might be an option?

python 2.6.6

ubuntu 64 bit

The library ref says cPickle is "optimized" and "up to 1000 times
faster than pickle" but of course doesn't mention memory.

The data to save (and load in another python program) is a dict
with keys = tuples of strings (including " ' , and other
troublesome characters) and values = lists of strings and
integers. As the 1st program runs, it adds new keys AND changes
the contents of the value lists. (The 2nd program only reads the
dict into memory, analyzes it, and prints to STDOUT.)
 

Peter Otten

Bob said:
I'm using python to do some log file analysis and I need to store
on disk a very large dict with tuples of strings as keys and
lists of strings and numbers as values.

I started by using cPickle to save the instance of the class that
contained this dict, but the pickling process started to write
the file but ate so much memory that my computer (4 GB RAM)
crashed so badly that I had to press the reset button. I've never
seen out-of-memory errors do this before. Is this normal?

(I know from the output that got written before the crash that my
program had finished building the dict and started the
pickle. When I tried running the other program that reads the
pickle and analyzes the data in it, it gave an error because the
file was incomplete. So I know where in my code the crash
happened.)

From searching the web, I get the impression that pickle uses a
lot of memory because it checks for recursion and other things
that could break other serialization methods. So I've switched to
using marshal to save the dict itself (the only persistent thing
in the class, which just has convenience methods for adding data
to the dict and searching it for the second stage of analysis).

I found some references to h5 tables for getting around the
pickling memory problem, but I got the impression they only work
with fixed columns, not a somewhat complex data structure like
mine.

Any comments, suggestions?

Have you seen this one?

http://mail.python.org/pipermail/python-list/2008-July/1139855.html
 
