marshal vs pickle

Evan Klitzke

The documentation for marshal makes it clear that there are no
guarantees about being able to correctly deserialize marshalled data
structures across Python releases. It also implies that marshal is not
a general "persistence" module. On the other hand, the documentation
seems to imply that marshalled objects act more or less like pickled
objects.

Can anyone elaborate on the difference between marshal and
pickle? Under what conditions would using marshal be unsafe? If one can
guarantee that the marshalled objects would be created and read by the
same version of Python, is that enough?
 
Bjoern Schliessmann

Evan said:
Can anyone elaborate on the difference between marshal and
pickle? Under what conditions would using marshal be unsafe? If one
can guarantee that the marshalled objects would be created and
read by the same version of Python, is that enough?

Just use pickle. From the docs:

| The marshal module exists mainly to support reading and writing
| the ``pseudo-compiled'' code for Python modules of .pyc files.
| Therefore, the Python maintainers reserve the right to modify the
| marshal format in backward incompatible ways should the need
| arise. If you're serializing and de-serializing Python objects,
| use the pickle module instead.
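
If you do rely on "same version on both ends", it may be worth making that assumption explicit in the stored data. A minimal sketch, assuming Python 3; `dump_tagged` and `load_tagged` are hypothetical helper names, not part of the marshal module:

```python
import marshal
import sys

def _version_tag():
    # e.g. b"3.11.4"; identifies the interpreter that wrote the data.
    return ("%d.%d.%d" % sys.version_info[:3]).encode("ascii")

def dump_tagged(value):
    # Hypothetical helper: plain-text version header, newline, marshal payload.
    return _version_tag() + b"\n" + marshal.dumps(value)

def load_tagged(data):
    # Hypothetical helper: check the header before touching marshal.loads.
    tag, _, payload = data.partition(b"\n")
    if tag != _version_tag():
        raise ValueError("data was written by Python %r" % tag)
    return marshal.loads(payload)

blob = dump_tagged({"a": [1, 2, 3]})
restored = load_tagged(blob)
```

The header lives outside the marshalled bytes on purpose, so the check does not depend on the marshal format itself staying stable.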

Regards,


Björn
 
Aaron Watters

Evan Klitzke said:
Can anyone elaborate on the difference between marshal and
pickle? Under what conditions would using marshal be unsafe? If one can
guarantee that the marshalled objects would be created and read by the
same version of Python, is that enough?

Yes, I think that's enough. I like to use
marshal a lot because it's the absolute fastest
way to store and load data to/from Python. Furthermore,
because marshal is "stupid" the programmer has complete
control. A lot of the overhead that makes pickle
generally much slower than marshal comes from the
cleverness by which pickle will recognize shared
objects and all that junk. When I
serialize, I generally don't need
that because I know what I'm doing.

For example both gadfly SQL

http://gadfly.sourceforge.net

and nucular full text/fielded search

http://nucular.sourceforge.net

use marshal as the underlying serializer. Using cPickle
would probably make serialization more than 2x slower.
This is one of the 2 or 3 key tricks which make these
packages as fast as they are.

-- Aaron Watters

===
http://www.xfeedme.com/nucular/gut.py/go?FREETEXT=halloween
 
Raymond Hettinger

Aaron Watters said:
I like to use
marshal a lot because it's the absolutely fastest
way to store and load data to/from Python. Furthermore
because marshal is "stupid" the programmer has complete
control. A lot of the overhead you get with the
pickles which make them generally much slower than
marshal come from the cleverness by which pickle will
recognize shared objects and all that junk. When I
serialize,

I believe this FUD is somewhat out-of-date. Marshalling
became smarter about repeated and shared objects. The
pickle module (using protocol 2) has a similar implementation
to marshal and both use the same tricks, but pickle is
much more flexible in the range of objects it can handle
(e.g. sets became marshalable only recently, while deques
can be pickled but not marshalled).
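
The deque case is easy to check. A small sketch, assuming Python 3 (where `marshal.dumps` raises `ValueError` for an object it cannot handle):

```python
import marshal
import pickle
from collections import deque

d = deque([1, 2, 3])

# pickle round-trips the deque, type and contents intact.
restored = pickle.loads(pickle.dumps(d, 2))

# marshal has no representation for deque and refuses it.
try:
    marshal.dumps(d)
    marshal_ok = True
except ValueError:
    marshal_ok = False
```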

For the most part, users are almost always better off
using pickle, which is version independent, fast, and
can handle many more types of objects than marshal.

Also FWIW, in most applications of pickling/marshaling,
the storage or transmission times dominate computation
time. I've gotten nice speed-ups by zipping the pickle
before storing, transmitting, or sharing (RPC apps
for example).
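
A minimal sketch of the compress-before-storing idea, assuming Python 3 and zlib as the compressor (the post does not say which tool was used); the `records` data is made up for illustration:

```python
import pickle
import zlib

# Made-up, fairly repetitive sample data.
records = [{"id": i, "name": "item-%d" % i} for i in range(1000)]

raw = pickle.dumps(records, 2)      # pickle protocol 2
packed = zlib.compress(raw)         # what would be stored or sent

# The receiving side reverses the two steps.
restored = pickle.loads(zlib.decompress(packed))
```

How much this helps depends on the data and on how slow the disk or network is compared with the CPU, so it is worth measuring on a real workload.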


Raymond
 
Aaron Watters

Raymond Hettinger said:
I believe this FUD is somewhat out-of-date. Marshalling
became smarter about repeated and shared objects. The
pickle module (using protocol 2) has a similar implementation
to marshal

Raymond: happy days! We are both right!
I just ran some tests from the test suite for
http://nucular.sourceforge.net with marshalling
and pickling switched in and out and to my
surprise I didn't find too much difference
on the "load" end (marshalling 10% faster),
but for the "bigLtreeTest.py" I found that
the build ("dump") process was about 1/3
slower with cPickle (protocol 2, Python 2.4). For
the more complex tests (mondial and gutenberg)
I found that the speed up for using marshal was
in the 1-2% range (and sometimes inverted
because of processor load I think, on a shared
hosting machine).

I'm pretty sure things were much worse for cPickle
many moons ago. Nice to see that some things
get better :). It makes sense that the
"dump" side would be slower because that's
where you need to remember all the objects
in case you see them again...
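
The "remembering" is pickle's memo of objects already written. A small sketch, assuming Python 3, of what that bookkeeping buys: shared and self-referencing objects keep their identity structure after a round trip.

```python
import pickle

# Two references to the same list.
shared = [1, 2, 3]
pair = [shared, shared]
pair_copy = pickle.loads(pickle.dumps(pair, 2))
still_shared = pair_copy[0] is pair_copy[1]

# A list that contains itself.
loop = []
loop.append(loop)
loop_copy = pickle.loads(pickle.dumps(loop, 2))
still_recursive = loop_copy[0] is loop_copy
```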

Anyway, since it's easy and makes sense, I think
the next version of nucular will have a
switchable option between marshal and cPickle
for persistent storage.

Thanks! -- Aaron Watters

===
The pursuit of hypothetical performance
improvements is the root of all evil.
-- Bill Tutt
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=tutt
 
Raymond Hettinger

Aaron Watters said:
Anyway since it's easy and makes sense I think
the next version of nucular will have a
switchable option between marshal and cPickle
for persistent storage.

Makes more sense to use cPickle and be done with it.

FWIW, I've updated the docs to be absolutely clear on the subject:

'''
This is not a general "persistence" module. For general persistence and
transfer of Python objects through RPC calls, see the modules :mod:`pickle`
and :mod:`shelve`. The :mod:`marshal` module exists mainly to support reading
and writing the "pseudo-compiled" code for Python modules of :file:`.pyc`
files. Therefore, the Python maintainers reserve the right to modify the
marshal format in backward incompatible ways should the need arise. If you're
serializing and de-serializing Python objects, use the :mod:`pickle` module
instead -- the performance is comparable, version independence is guaranteed,
and pickle supports a substantially wider range of objects than marshal.

.. warning::

   The :mod:`marshal` module is not intended to be secure against erroneous or
   maliciously constructed data. Never unmarshal data received from an
   untrusted or unauthenticated source.

Not all Python object types are supported; in general, only objects whose
value is independent from a particular invocation of Python can be written and
read by this module. The following types are supported: ``None``, integers,
long integers, floating point numbers, strings, Unicode objects, tuples,
lists, dictionaries, and code objects, where it should be understood that
tuples, lists and dictionaries are only supported as long as the values
contained therein are themselves supported; and recursive lists and
dictionaries should not be written (they will cause infinite loops).

.. warning::

   Some unsupported types such as subclasses of builtins will appear to
   marshal and unmarshal correctly, but in fact, their type will change and
   the additional subclass functionality and instance attributes will be lost.

.. warning::

   On machines where C's ``long int`` type has more than 32 bits (such as the
   DEC Alpha), it is possible to create plain Python integers that are longer
   than 32 bits. If such an integer is marshaled and read back in on a machine
   where C's ``long int`` type has only 32 bits, a Python long integer object
   is returned instead. While of a different type, the numeric value is the
   same. (This behavior is new in Python 2.2. In earlier versions, all but the
   least-significant 32 bits of the value were lost, and a warning message was
   printed.)
'''
 
Gabriel Genellina

Raymond Hettinger said:
FWIW, I've updated the docs to be absolutely clear on the subject:

While you are at it, the list of supported types should be updated too:

The following types are supported: ``None``, integers, long
integers, floating point numbers, strings, Unicode objects, tuples,
lists, dictionaries, and code objects,

boolean, complex, set and frozenset are missing.
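
A quick way to check the list against an interpreter is to round-trip one sample of each type. A sketch assuming Python 3, where the separate long integer and Unicode types from the quoted text no longer exist, so `int` and `str` stand in for them:

```python
import marshal

samples = [
    None, True, 42, 2 ** 70, 3.5, 1 + 2j,
    "text", b"bytes", (1, 2), [1, 2], {"k": "v"},
    {1, 2}, frozenset([1, 2]),
]

# Each entry: (original, restored, True if value and exact type survived).
results = []
for value in samples:
    restored = marshal.loads(marshal.dumps(value))
    same = restored == value and type(restored) is type(value)
    results.append((value, restored, same))

all_survived = all(same for _, _, same in results)
```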
 
Paul Rubin

Raymond Hettinger said:
''' This is not a general "persistence" module. For general
persistence and transfer of Python objects through RPC calls, see
the modules :mod:`pickle` and :mod:`shelve`.

That advice should be removed since Python currently does not have a
general persistence or transfer module in its stdlib. There's been an
open bug/RFE about it for something like 5 years. The issue is that
any sensible general purpose RPC mechanism MUST make reasonable
security assertions that nothing bad happens if you deserialize
untrusted data. The pickle module doesn't make such guarantees and in
fact its documentation explicitly warns against unpickling untrusted
data. Therefore pickle should not be used as a general RPC
mechanism.
 
Aaron Watters

Paul Rubin said:
That advice should be removed since Python currently does not have a
general persistence or transfer module in its stdlib. There's been an
open bug/RFE about it for something like 5 years. The issue is that
any sensible general purpose RPC mechanism MUST make reasonable
security assertions that nothing bad happens if you deserialize
untrusted data. The pickle module doesn't make such guarantees and in
fact its documentation explicitly warns against unpickling untrusted
data. Therefore pickle should not be used as a general RPC
mechanism.

This is absolutely correct. Marshal is more secure than pickle
because marshal *cannot* execute code automatically whereas pickle
does. The assertion that marshal is less secure than pickle is
absurd.

This is exactly why the gadfly server mode uses marshal and not
pickle.
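
For anyone who has not seen it, here is a deliberately harmless sketch (Python 3 assumed) of pickle calling something on load. The `Payload` class is made up for illustration; a hostile pickle would name a far less friendly callable than `len`.

```python
import pickle

class Payload(object):
    # __reduce__ tells pickle how to rebuild the object: call this
    # callable with these arguments at load time.
    def __reduce__(self):
        return (len, ("abc",))

blob = pickle.dumps(Payload())

# Unpickling calls len("abc"); no Payload instance comes back.
result = pickle.loads(blob)
```

Here `result` is the return value of the call, 3. The reader of the pickle never asked for anything to be called.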

-- Aaron Watters

===
why do you hang out with that sadist?
beats me! -- kliban
 
Aaron Watters

Raymond Hettinger said:
Makes more sense to use cPickle and be done with it.

FWIW, I've updated the docs to be absolutely clear on the subject:

'''
This is not a general "persistence" module. For general persistence
and...

Alright already. Here is the patched file you want

http://nucular.sourceforge.net/kisstree_pickle.py

This will make all your nucular indices portable across python
versions and machine architectures. I'll add this to the
next release with a bunch of other stuff too.

By the way there is another module that uses marshal for
strictly temporary storage in http://nucular.sourceforge.net
-- but if I change that one the build time for nucular indices
fully DOUBLES!! That's too much pain for me. Sorry.

Also, it's always been a mystery to me why Python can't
keep the marshal module backwards compatible and portable.
You folks seem like pretty smart programmers to me. If
you need help, let me know. It's a damn shame Python doesn't
have a serialization module with the safety, speed, and
simplicity of marshal and also the portability of pickle.
I guess I have to live with it :(.
-- Aaron Watters

===
Wow, do you play basketball?
No, do you play miniature golf?
-- seen in Newsweek years ago
 
Raymond Hettinger

Aaron Watters said:
Marshal is more secure than pickle

"More" or "less" make little sense in a security context which
typically is an all or nothing affair. Neither module is designed for
security. From the docs for marshal:

'''
Warning: The marshal module is not intended to be secure against
erroneous or maliciously constructed data. Never unmarshal data
received from an untrusted or unauthenticated source.
'''

If security is a focus, then use xmlrpc or some other tool that
doesn't construct arbitrary code objects.

I don't think you are doing the OP any favors by giving advice in
contravention of the docs and against the intended purpose of the two
modules. Bjoern's post covered the topic succinctly and accurately.


Raymond
 
Aaron Watters

Raymond Hettinger said:
"More" or "less" make little sense in a security context which
typically is an all or nothing affair. Neither module is designed for
security. From the docs for marshal:

'''
Warning: The marshal module is not intended to be secure against
erroneous or maliciously constructed data. Never unmarshal data
received from an untrusted or unauthenticated source.
'''

If security is a focus, then use xmlrpc or some other tool that
doesn't construct arbitrary code objects.

I disagree. Xmlrpc is insecure if you compile
and execute one of the strings
you get from it. Marshal is similarly insecure if you evaluate a code
object it hands you. If you aren't that dumb, then neither one
is a problem. As far as I'm concerned marshal.load is not any
more insecure than file.read.
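
A small sketch of the code-object point, assuming Python 3. `load_data_only` is a hypothetical helper, and it only checks the top-level object; it does nothing about malformed or hostile input to `marshal.loads` itself.

```python
import marshal
import types

# marshal can carry a compiled code object...
blob = marshal.dumps(compile("6 * 7", "<demo>", "eval"))
obj = marshal.loads(blob)

# ...but loading only hands it back; nothing runs until you eval/exec it.
is_code = isinstance(obj, types.CodeType)

def load_data_only(data):
    # Hypothetical helper: unmarshal, then refuse a top-level code object.
    value = marshal.loads(data)
    if isinstance(value, types.CodeType):
        raise TypeError("refusing to accept a code object")
    return value
```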

Pickle on the other hand can execute just about anything without
you knowing anything about it. It is a horrendous mistake
to suggest that anyone should implement RPC using pickle. If they
want it to be fast they can use marshal, except for that thing
about non-portability which was a design mistake, imho.

By the way: here is a test program which shows pickle running
4 times slower than marshal on my machine using python 2.5.1:

"""
import marshal
import cPickle
import time

def pdump(value, f):
    # Serialize in memory with pickle protocol 2; f is unused.
    #cPickle.dump(value, f, 2)
    return cPickle.dumps(value, 2)

def mdump(value, f):
    # Serialize in memory with marshal; f is unused.
    #marshal.dump(value, f)
    return marshal.dumps(value)

def test(dump, fn):
    now = time.time()
    #f = open(fn, "wb")
    f = None
    for i in range(3):
        # Build a 200000-entry dict of small tuples, strings and lists.
        D = {}
        for j in range(200000):
            k = (i*133+j*119)%151
            D[ (str(k),str(j)) ] = (str(i), [k, str(k)])
        dump(D.items(), f)
    #f.close()
    elapsed = time.time()-now
    print dump, elapsed

if __name__=="__main__":
    test(mdump, "mdump.dat")
    test(pdump, "ptemp.dat")
"""

-- Aaron Watters
===
If you think you are smart enough to write multi-threaded programs
you're not. -- Jim Ahlstrom's corollary to Murphy's Law.

http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=ahlstrom
 
Aaron Watters

I disagree. Xmlrpc is insecure if you compile
and execute one of the strings
you get from it. Marshal is similarly insecure if you evaluate a code
object it hands you. If you aren't that dumb, then neither one
is a problem. As far as I'm concerned marshal.load is not any
more insecure than file.read.

You're mistaken.

$ python
Python 2.4.3 (#2, Oct 6 2006, 07:52:30)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Segmentation fault

Plenty of other nasty stuff can happen when you call marshal.loads, too.

I'll grant you the above as a denial of service attack. You are right
that I was mistaken in that sense. (btw, it doesn't core dump for
2.5.1)

That is/was a bug in marshal. Someone should fix it. Properly
implemented, marshal is not fundamentally insecure. Can you give me
an example where someone can erase the filesystem using marshal.load?
I saw one for pickle.load once.

-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=chocolate
 
Paul Rubin

Aaron Watters said:
I'll grant you the above as a denial of service attack. ...
Can you give me an example
where someone can erase the filesystem using marshal.load?

You should always assume that if an attacker can induce a memory fault
(typically through a buffer overflow) then s/he can inject and run
arbitrary machine code and take over the process. It's not even worth
looking for a specific exploit--this type of thing MUST be fixed if
the function can be exposed to untrusted data. Yes it should be
possible to fix the segfault in marshal--but in principle pickle could
be locked down as well, at least from these code injection attacks.
It's just something the python stdlib doesn't currently address, for
whatever reason.

BTW, if denial of service counts, I think that you also have to check for
algorithmic complexity attacks against Python dictionary objects.
I.e. by constructing a serialized dictionary whose keys all hash to
the same number, you can possibly make the deserializer use quadratic
runtime, bringing the remote process to its knees with a dictionary of
a few million elements, a not-unreasonable size for applications like
database dumps. (I haven't checked yet what actually happens in
practice if you try this, given that the already-known problems with
pickle and marshal are even worse). This can't really be fixed in the
serialization format. Either the deserializer should run in a
controlled environment (enforced resource bounds) or (preferably) the
underlying dict implementation should change to resist this attack.
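
The quadratic behaviour can be seen without any serializer. A sketch assuming Python 3, using a made-up `Colliding` key class whose instances all share one hash value, as a stand-in for attacker-chosen keys:

```python
class Colliding(object):
    # Every instance hashes to the same value, so every insert collides.
    eq_calls = 0

    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return 1

    def __eq__(self, other):
        Colliding.eq_calls += 1
        return self.n == other.n

def comparisons(size):
    # Count the key comparisons needed to build a dict of `size` keys.
    Colliding.eq_calls = 0
    d = {}
    for i in range(size):
        d[Colliding(i)] = None
    return Colliding.eq_calls

small = comparisons(100)
large = comparisons(1000)
```

Ten times as many keys costs far more than ten times as many comparisons, which is the growth pattern described above.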

For more info, see: http://www.cs.rice.edu/~scrosby/hash/
 
Aaron Watters

Paul Rubin said:
You should always assume that if an attacker can induce a memory fault
(typically through a buffer overflow) then s/he can inject and run
arbitrary machine code ...

Yes yes yes, but this takes an extraordinary amount of skill
and criminal malice. With pickle an innocent person
on another continent could potentially delete all the files
on your computer by accident.

In summary my view is this.

- pickle is way too complicated and not worth the
extra overhead and danger in most cases.

- marshal is an excellent tool for getting
large amounts of data in and out of Python that
can be much faster than pickle and is always
much less dangerous than pickle. I think it's safe
enough for most RPC uses, for example.

- It's a damn shame that the Python developers
can't be bothered to make marshal portable across
platforms and versions. It's a silly mistake.

Sorry for all the fuss.

-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=limiting+perl
 
Aaron Watters

Aaron Watters said:
Alright already. Here is the patched file you want

http://nucular.sourceforge.net/kisstree_pickle.py

This file has been removed. After consideration,
I don't want to create the moral hazard that someone
might distribute automatically executed
malicious code pickled inside a nucular index.
If you grabbed it, please destroy it.
I'm going back to using marshal. I'd like to thank
Raymond and others for motivating me to think this over.
The possibilities for abuse are astounding.

Honestly, if you download any package containing a
pickle: delete it.

-- Aaron Watters

===
How many mice does it take to screw in a light bulb?
2.
 
