pickle alternative

Discussion in 'Python' started by simonwittber, May 31, 2005.

  1. simonwittber Guest

    I've written a simple module which serializes these Python types:

    IntType, TupleType, StringType, FloatType, LongType, ListType, DictType

    It's available for perusal here:

    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/415503

    It appears to work faster than pickle; however, the decode process is
    much slower (5x) than the encode process. Has anyone got any tips on
    ways I might speed this up?


    Sw.
     
    simonwittber, May 31, 2005
    #1

  2. Andrew Dalke Guest

    For simple data types consider "marshal" as an alternative to "pickle".

    def dec_int_type(data):
        value = int(unpack('!i', data.read(4))[0])
        return value

    That 'int' isn't needed -- unpack returns an int not a string
    representation of the int.
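
    In other words, the quoted function can be reduced to something like
    this (a minimal rewrite of the line above):

    def dec_int_type(data):
        return unpack('!i', data.read(4))[0]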

    BTW, your code won't work on 64 bit machines.

    def enc_long_type(obj):
        return "%s%s%s" % ("B", pack("!L", len(str(obj))), str(obj))

    There's no need to compute str(long) twice -- for large longs
    it takes a lot of work to convert to base 10. For that matter,
    it's faster to convert to hex, and the hex form is more compact.
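
    For example, a hex-based variant might look like this (a sketch only;
    the matching decoder would call long(s, 16), and negative longs would
    need extra handling with this scheme):

    def enc_long_type(obj):
        s = hex(obj)[2:-1]  # strip the leading "0x" and the trailing "L"
        return "B" + pack("!L", len(s)) + s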

    Every decode you do requires several function calls. While
    less elegant, you'll likely get better performance (test it!)
    if you minimize that; try something like this

    def decode(data):
        return _decode(StringIO(data).read)

    def _decode(read, unpack = struct.unpack):
        code = read(1)
        if not code:
            raise IOError("reached the end of the file")
        if code == "I":
            return unpack("!i", read(4))[0]
        if code == "F":
            return unpack("!f", read(4))[0]
        if code == "L":
            count = unpack("!i", read(4))[0]
            return [_decode(read) for i in range(count)]
        if code == "D":
            count = unpack("!i", read(4))[0]
            return dict([_decode(read) for i in range(count)])
        ...



    Andrew
     
    Andrew Dalke, May 31, 2005
    #2

  3. simonwittber Guest

    For simple data types consider "marshal" as an alternative to "pickle".

    From the marshal documentation: "Warning: The marshal module is not
    intended to be secure against erroneous or maliciously constructed data.
    Never unmarshal data received from an untrusted or unauthenticated
    source."
    Any idea how this might be solved? The number of bytes used has to be
    consistent across platforms. I guess this means I cannot use the struct
    module?
    Thanks for the tip.

    Sw.
     
    simonwittber, May 31, 2005
    #3
  4. Andrew Dalke Guest

    Ahh, I had forgotten that. Though I can't recall what an attack
    might be, I think it's because the C code hasn't been fully vetted
    for unexpected error conditions.
    How do you want to solve it? Should a 64 bit machine be able to read
    a data stream made on a 32 bit machine? What about vice versa? How
    are floats interconverted?

    You could preface the output stream with a description of the encoding
    used: version number, size of float, size of int (which should always
    be sizeof float these days, I think). Read these then use that
    information to figure out which decode/dispatch function to use.
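
    A rough sketch of that header idea (the magic string and function names
    here are made up for illustration, not part of the recipe):

    from struct import pack, unpack, calcsize

    FORMAT_VERSION = 1

    def write_header(write):
        write("SENC")                       # magic marker (made up for this sketch)
        write(pack("!B", FORMAT_VERSION))   # encoding version
        write(pack("!B", calcsize("f")))    # size of a C float on the writer
        write(pack("!B", calcsize("i")))    # size of a C int on the writer

    def read_header(read):
        if read(4) != "SENC":
            raise ValueError("not a recognised stream")
        version, float_size, int_size = unpack("!BBB", read(3))
        return version, float_size, int_size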

    Andrew
     
    Andrew Dalke, May 31, 2005
    #4
  5. simonwittber Guest

    I tried out the marshal module anyway.

    marshal can serialize small structures very quickly. However, using the
    test value below:

    value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]

    marshal took 7.90 seconds to serialize it into a string of length 5000061.
    Decoding took 0.08 seconds.

    The aforementioned recipe took 2.53 seconds to serialize it into a string
    of length 5000087. Decoding took 5.16 seconds, which is much longer
    than marshal!!

    Sw.
     
    simonwittber, May 31, 2005
    #5
  6. Andrew Dalke Guest

    Strange. Here's what I found:
    I can't reproduce your large times for marshal.dumps. Could you
    post your test code?

    Andrew
     
    Andrew Dalke, Jun 1, 2005
    #6
  7. simonwittber Guest

    I can't reproduce your large times for marshal.dumps. Could you
    post your test code?

    Certainly:

    import sencode
    import marshal
    import time

    value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]

    t = time.clock()
    x = marshal.dumps(value)
    print "marshal enc T:", time.clock() - t

    t = time.clock()
    x = marshal.loads(x)
    print "marshal dec T:", time.clock() - t

    t = time.clock()
    x = sencode.dumps(value)
    print "sencode enc T:", time.clock() - t
    t = time.clock()
    x = sencode.loads(x)
    print "sencode dec T:", time.clock() - t
     
    simonwittber, Jun 1, 2005
    #7
  8. Andrew Dalke Guest

    simonwittber posted his test code.

    I took the code from the cookbook, called it "sencode", and
    added these two lines:

    dumps = encode
    loads = decode


    I then ran your test code (unchanged except that my newsreader
    folded the "value = ..." line) and got

    marshal enc T: 0.21
    marshal dec T: 0.4
    sencode enc T: 7.76
    sencode dec T: 11.56

    This is with Python 2.3; the stock one provided by Apple
    for my Mac.

    I expected the numbers to be like this because the marshal
    code is used to make and read the .pyc files and is supposed
    to be pretty fast.

    BTW, I tried the performance approach I outlined earlier.
    The numbers aren't much better

    marshal enc T: 0.2
    marshal dec T: 0.38
    sencode2 enc T: 7.16
    sencode2 dec T: 9.49


    I changed the format a little bit; dicts are treated a bit
    differently.


    from struct import pack, unpack
    from cStringIO import StringIO

    class EncodeError(Exception):
        pass

    class DecodeError(Exception):
        pass

    def encode(data):
        f = StringIO()
        _encode(data, f.write)
        return f.getvalue()

    def _encode(data, write, pack = pack):
        # The original code uses the equivalent of "type(data) is list";
        # I preserve that behavior

        T = type(data)

        if T is int:
            write("I")
            write(pack("!i", data))
        elif T is list:
            write("L")
            write(pack("!L", len(data)))
            # Assumes len and 'for ... in' aren't lying
            for item in data:
                _encode(item, write)
        elif T is tuple:
            write("T")
            write(pack("!L", len(data)))
            # Assumes len and 'for ... in' aren't lying
            for item in data:
                _encode(item, write)
        elif T is str:
            write("S")
            write(pack("!L", len(data)))
            write(data)
        elif T is long:
            s = hex(data)[2:-1]
            write("B")
            write(pack("!i", len(s)))
            write(s)
        elif T is type(None):
            write("N")
        elif T is float:
            write("F")
            write(pack("!f", data))
        elif T is dict:
            # dicts are written as len(data) key/value pairs
            write("D")
            write(pack("!L", len(data)))
            for k, v in data.items():
                _encode(k, write)
                _encode(v, write)
        else:
            raise EncodeError((data, T))


    def decode(s):
        """
        Decode a binary string into the original Python types.
        """
        buffer = StringIO(s)
        return _decode(buffer.read)

    def _decode(read, unpack = unpack):
        code = read(1)
        if code == "I":
            return unpack("!i", read(4))[0]
        if code == "D":
            # read 2*size items and pair them back up into a dict
            size = unpack("!L", read(4))[0]
            x = [_decode(read) for i in range(size*2)]
            return dict(zip(x[0::2], x[1::2]))
        if code == "T":
            size = unpack("!L", read(4))[0]
            return tuple([_decode(read) for i in range(size)])
        if code == "L":
            size = unpack("!L", read(4))[0]
            return [_decode(read) for i in range(size)]
        if code == "N":
            return None
        if code == "S":
            size = unpack("!L", read(4))[0]
            return read(size)
        if code == "F":
            return unpack("!f", read(4))[0]
        if code == "B":
            size = unpack("!L", read(4))[0]
            return long(read(size), 16)
        raise DecodeError(code)


    dumps = encode
    loads = decode


    I wonder if this could be improved by a "struct2" module
    which could compile a pack/unpack format once. Eg,

    float_struct = struct2.struct("!f")

    float_struct.pack(f)
    return float_struct.unpack('?\x80\x00\x00')[0]
    which might be the same as
    return float_struct.unpack1('?\x80\x00\x00')
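
    (For what it's worth, a compiled-format API along these lines does exist
    in newer Pythons as struct.Struct in the standard struct module; a
    minimal sketch:)

    from struct import Struct

    float_struct = Struct("!f")
    packed = float_struct.pack(1.0)          # '?\x80\x00\x00'
    value = float_struct.unpack(packed)[0]   # 1.0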



    Andrew
     
    Andrew Dalke, Jun 1, 2005
    #8
  9. simonwittber Guest

    Ahh, that is the difference. I'm running Python 2.4. I've checked my
    benchmarks on a friend's machine, also running Python 2.4, and received
    the same results as on my machine.
    It would appear that the new version 1 format introduced in Python 2.4
    is much slower than version 0, when using the dumps function.
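
    (If the version 1 format is the culprit, marshal.dumps accepts an
    optional version argument in 2.4, so the old format can be requested
    explicitly for comparison; a quick check, assuming the same test value
    as above:)

    import marshal
    s0 = marshal.dumps(value, 0)   # request the old version 0 format
    s1 = marshal.dumps(value)      # default format (version 1 in Python 2.4)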

    Thanks for your feedback Andrew!

    Sw.
     
    simonwittber, Jun 1, 2005
    #9
  10. Andrew Dalke Guest

    Interesting. Hadn't noticed that change. Is dump(StringIO()) as
    slow?

    Andrew
     
    Andrew Dalke, Jun 1, 2005
    #10
  11. Reinhold Birkenfeld Guest

    Not so for me. My benchmarks suggest no change between 2.3 and 2.4.

    Reinhold
     
    Reinhold Birkenfeld, Jun 1, 2005
    #11
  12. mdoukidis Guest

    Running stest.py produced these results for me:

    marshal enc T: 12.5195908977
    marshal dec T: 0.134508715493
    sencode enc T: 3.75118904777
    sencode dec T: 5.86602012267
    11.9369997978
    0.109000205994
    True

    Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]
    on win32

    Notice the slow "marshal enc"oding.
    Overall this recipe is faster than marshal for me.

    Mark
     
    mdoukidis, Jun 2, 2005
    #12
  13. Paul Rubin Guest

    I think you should implement it as a C extension and/or write a PEP.
    This has been an unfilled need in Python for a while (SF RFE 467384).

    Note that using marshal is inappropriate, not only for security
    reasons, but because marshal is explicitly NOT guaranteed to
    interoperate across differing Python versions. You cannot assume that
    an object marshalled in Python 2.4 will unmarshal correctly in 2.5.
     
    Paul Rubin, Jun 2, 2005
    #13
  14. simonwittber Guest

    I think you should implement it as a C extension and/or write a PEP.
    I've submitted a proto-PEP to python-dev. It's coming up against many of
    the same objections as the RFE.

    Sw.
     
    simonwittber, Jun 19, 2005
    #14
  15. Paul Rubin Guest

    See also bug# 471893 where jhylton suggests a PEP. Something really
    ought to be done about this.
     
    Paul Rubin, Jun 19, 2005
    #15
  16. simonwittber Guest

    I know this, you know this... I don't understand why the suggestion is
    meeting so much resistance. This is something I needed for a real-world
    system which moves lots of data around to untrusted clients. Surely
    other people have had similar needs? Pickle and xmlrpclib simply are
    not up to the task, but perhaps Joe Programmer is content to use
    pickle and not care about the security issues.

    Oh well. I'm not sure what I can say to make the case any clearer...


    Sw.
     
    simonwittber, Jun 19, 2005
    #16
  17. simonwittber Guest

    simonwittber, Jun 19, 2005
    #17
  18. Paul Rubin Guest

    I don't think there's serious objection to a PEP, but I don't read
    python-dev. Maybe there was objection to some specific technical
    point in your PEP. Why don't you post it here?
     
    Paul Rubin, Jun 19, 2005
    #18
  19. Paul Rubin Guest

    It would be nice if you just posted the PEP.
     
    Paul Rubin, Jun 19, 2005
    #19
  20. simonwittber Guest

    Ok, I've attached the proto PEP below.

    Comments on the proto PEP and the implementation are appreciated.

    Sw.



    Title: Secure, standard serialization of simple Python types.

    Abstract

    This PEP suggests the addition of a module to the standard library,
    which provides a serialization class for simple Python types.


    Copyright

    This document is placed in the public domain.


    Motivation

    The standard library currently provides two modules which are used
    for object serialization. Pickle is not secure by its very nature,
    and the marshal module is clearly marked as being not secure in the
    documentation. The marshal module does not guarantee compatibility
    between Python versions. The proposed module will only serialize
    simple built-in Python types, and provide compatibility across
    Python versions.

    See RFE 467384 (on SourceForge) for more discussion on the above
    issues.


    Specification

    The proposed module should use the same API as the marshal module.

    dump(value, file)
        # serialize value, and write to open file object
    load(file)
        # read data from file object, unserialize and return an object
    dumps(value)
        # return the string that would be written to the file by dump
    loads(value)
        # unserialize and return object
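
    A short usage sketch of the intended API (the module name is taken from
    the reference implementation below; the data is only illustrative):

    import gherkin

    value = {"simon": "wittber", "data": [1, 2.5, (3, 4L)]}
    s = gherkin.dumps(value)
    assert gherkin.loads(s) == value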


    Reference Implementation

    http://metaplay.dyndns.org:82/~simon/gherkin.py.txt


    Rationale

    The marshal documentation explicitly states that it is unsuitable
    for unmarshalling untrusted data. It also explicitly states that
    the format is not compatible across Python versions.

    Pickle is compatible across versions, but also unsafe for loading
    untrusted data. Exploits demonstrating pickle vulnerability exist.

    xmlrpclib provides serialization functions, but is unsuitable when
    serializing large data structures, or when high performance is a
    requirement. If performance is an issue, a C-based accelerator
    module can be installed. If size is an issue, gzip can be used;
    however, this creates a mutually exclusive size/performance
    trade-off.

    Other existing formats, such as JSON and Bencode (BitTorrent), do
    not handle some marginally complex Python structures and/or all of
    the simple Python types.

    Time and space efficiency, and security do not have to be mutually
    exclusive features of a serializer. Python does not provide, in the
    standard library, a serializer which can work safely with untrusted
    data which is time and space efficient. The proposed gherkin module
    goes some way to achieving this. The format is simple enough to
    easily write interoperable implementations across platforms.
     
    simonwittber, Jul 5, 2005
    #20
