marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

Discussion in 'Python' started by bkustel@gmail.com, Jun 15, 2008.

  1. Guest

    I'm stuck on a problem where I want to use marshal for serialization
    (yes, yes, I know (c)Pickle is normally recommended here). I favor
    marshal for speed for the types of data I use.

    However, it seems that marshal.dumps() for large objects has a
    quadratic performance issue, which I'm assuming arises because it grows
    its memory buffer in constant increments. This causes a nasty slowdown for
    marshaling large objects. I thought I would get around this by passing
    a cStringIO.StringIO object to marshal.dump() instead but I quickly
    learned this is not supported (only true file objects are supported).

    Any ideas about how to get around the marshal quadratic issue? Any
    hope for a fix for that on the horizon? Thanks for any information.
    , Jun 15, 2008
    #1
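    For the file-object restriction described in the post above, one possible
    workaround is to hand marshal.dump() a real temporary file and read the
    bytes back. This is only a sketch: the helper name dumps_via_tempfile is
    made up for illustration, the code is written against a current Python
    rather than the 2.5 release discussed here, and whether a round trip
    through a file actually beats the buffer growth cost would need measuring.

```python
import marshal
import tempfile


def dumps_via_tempfile(obj):
    """Return the marshal bytes for obj, written through a real file.

    Hypothetical helper: it sidesteps the in-memory buffer that
    marshal.dumps() grows, at the price of file I/O.
    """
    with tempfile.TemporaryFile() as f:
        marshal.dump(obj, f)   # a true file object, as marshal.dump() expects
        f.seek(0)
        return f.read()


data = {"a": [1, 2, 3], "b": ("x", 2.5)}
restored = marshal.loads(dumps_via_tempfile(data))
```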

  2. TheSaint Guest

    Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

    On 16:04, Sunday 15 June 2008 wrote:

    > cStringIO.StringIO object to marshal.dump() instead but I quickly
    > learned this is not supported (only true file objects are supported).
    >
    > Any ideas about how to get around the marshal quadratic issue? Any
    > hope for a fix for that on the horizon?

    If you zip the cStringIO.StringIO object, would it be possible?

    --
    Mailsweeper Home : http://it.geocities.com/call_me_not_now/index.html
    TheSaint, Jun 15, 2008
    #2

  3. Peter Otten Guest

    Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

    wrote:

    > I'm stuck on a problem where I want to use marshal for serialization
    > (yes, yes, I know (c)Pickle is normally recommended here). I favor
    > marshal for speed for the types of data I use.
    >
    > However it seems that marshal.dumps() for large objects has a
    > quadratic performance issue which I'm assuming is that it grows its
    > memory buffer in constant increments. This causes a nasty slowdown for
    > marshaling large objects. I thought I would get around this by passing
    > a cStringIO.StringIO object to marshal.dump() instead but I quickly
    > learned this is not supported (only true file objects are supported).
    >
    > Any ideas about how to get around the marshal quadratic issue? Any
    > hope for a fix for that on the horizon? Thanks for any information.


    Here's how marshal resizes the string:

    newsize = size + size + 1024;
    if (newsize > 32*1024*1024) {
        newsize = size + 1024*1024;
    }

    Maybe you can split your large objects and marshal multiple objects to keep
    the size below the 32MB limit.
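    The splitting idea can be sketched roughly as follows, assuming the large
    object is a list that can be cut into slices. The names dumps_in_chunks
    and loads_from_chunks and the default chunk size are invented for this
    example; a suitable chunk size depends on how big each item is once
    marshaled.

```python
import marshal


def dumps_in_chunks(items, chunk_size=10000):
    """Marshal a long list as several smaller, independent byte strings."""
    return [
        marshal.dumps(items[i:i + chunk_size])
        for i in range(0, len(items), chunk_size)
    ]


def loads_from_chunks(chunks):
    """Rebuild the original list from the per-chunk marshal strings."""
    result = []
    for chunk in chunks:
        result.extend(marshal.loads(chunk))
    return result


data = list(range(25000))
chunks = dumps_in_chunks(data, chunk_size=10000)
roundtrip = loads_from_chunks(chunks)
```

    Each chunk gets its own buffer, so no single dumps() call has to grow a
    buffer to the size of the whole data set.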

    Peter
    Peter Otten, Jun 15, 2008
    #3
  4. On Jun 15, 1:04 am, wrote:
    > However it seems that marshal.dumps() for large objects has a
    > quadratic performance issue which I'm assuming is that it grows its
    > memory buffer in constant increments.


    Looking at the source at
    http://svn.python.org/projects/python/trunk/Python/marshal.c
    it looks like the relevant fragment is in w_more():

    . . .
    size = PyString_Size(p->str);
    newsize = size + size + 1024;
    if (newsize > 32*1024*1024) {
        newsize = size + 1024*1024;
    }
    if (_PyString_Resize(&p->str, newsize) != 0) {
    . . .

    When more space is needed, the resize operation over-allocates: the new
    size is double the current size plus 1K. This should give amortized O(1)
    performance just like list.append().

    However, when that strategy would request more than 32MB, the resizing
    becomes less aggressive and grows only in 1MB blocks, giving your
    observed nasty quadratic behavior.
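    To make the difference between the growth policies concrete, here is a
    small simulation. It is a model, not a measurement: it assumes the worst
    case where every resize copies the whole buffer (a real realloc can
    sometimes extend in place), the 50-byte starting size is an assumption,
    and the function names are invented for the example.

```python
def linear_1k(size):
    """Grow by a constant 1K per resize."""
    return size + 1024


def doubling_capped(size):
    """Double plus 1K, falling back to 1MB steps past 32MB (as in w_more)."""
    newsize = size + size + 1024
    if newsize > 32 * 1024 * 1024:
        newsize = size + 1024 * 1024
    return newsize


def resize_cost(target, grow, start=50):
    """Return (resizes, bytes_copied) needed to reach at least target bytes.

    Assumes each resize copies the entire current buffer (worst case).
    """
    size, resizes, copied = start, 0, 0
    while size < target:
        copied += size
        size = grow(size)
        resizes += 1
    return resizes, copied


ten_mb = 10 * 1024 * 1024
linear_resizes, linear_copied = resize_cost(ten_mb, linear_1k)
doubling_resizes, doubling_copied = resize_cost(ten_mb, doubling_capped)
print(linear_resizes, linear_copied)
print(doubling_resizes, doubling_copied)
```

    Under these assumptions the constant-increment policy needs thousands of
    resizes for a 10MB result, while doubling needs only a handful.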

    Raymond
    Raymond Hettinger, Jun 15, 2008
    #4
  5. John Machin Guest

    On Jun 15, 7:47 pm, Peter Otten <> wrote:
    > wrote:
    > > I'm stuck on a problem where I want to use marshal for serialization
    > > (yes, yes, I know (c)Pickle is normally recommended here). I favor
    > > marshal for speed for the types of data I use.

    >
    > > However it seems that marshal.dumps() for large objects has a
    > > quadratic performance issue which I'm assuming is that it grows its
    > > memory buffer in constant increments. This causes a nasty slowdown for
    > > marshaling large objects. I thought I would get around this by passing
    > > a cStringIO.StringIO object to marshal.dump() instead but I quickly
    > > learned this is not supported (only true file objects are supported).

    >
    > > Any ideas about how to get around the marshal quadratic issue? Any
    > > hope for a fix for that on the horizon? Thanks for any information.

    >
    > Here's how marshal resizes the string:
    >
    > newsize = size + size + 1024;
    > if (newsize > 32*1024*1024) {
    > newsize = size + 1024*1024;
    > }
    >
    > Maybe you can split your large objects and marshal multiple objects to keep
    > the size below the 32MB limit.
    >


    But that change went into the svn trunk on 11-May-2008; perhaps the OP
    is using a production release which would have the previous version,
    which is merely "newsize = size + 1024;".

    Do people really generate 32MB pyc files, or is stopping doubling at
    32MB just a safety valve in case someone/something runs amok?

    Cheers,
    John
    John Machin, Jun 15, 2008
    #5
  6. Peter Otten Guest

    Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

    John Machin wrote:

    >> Here's how marshal resizes the string:
    >>
    >> newsize = size + size + 1024;
    >> if (newsize > 32*1024*1024) {
    >> newsize = size + 1024*1024;
    >> }
    >>
    >> Maybe you can split your large objects and marshal multiple objects to
    >> keep the size below the 32MB limit.
    >>

    >
    > But that change went into the svn trunk on 11-May-2008; perhaps the OP
    > is using a production release which would have the previous version,
    > which is merely "newsize = size + 1024;".


    That is indeed much worse. Depending on what the OP means by "large objects"
    the problem may be fixed in subversion then.

    > Do people really generate 32MB pyc files, or is stopping doubling at
    > 32MB just a safety valve in case someone/something runs amok?


    A 32MB pyc would correspond to a module of roughly the same size. So
    someone/something runs amok in either case.

    Peter
    Peter Otten, Jun 15, 2008
    #6
  7. Raymond Hettinger wrote:
    > When more space is needed, the resize operation over-allocates by
    > double the previous need plus 1K. This should give amortized O(1)
    > performance just like list.append().
    >
    > However, when that strategy requests more than 32Mb, the resizing
    > becomes less aggressive and grows only in 1MB blocks and giving your
    > observed nasty quadratic behavior.


    The marshal code has been revamped in Python 2.6. The old code in Python
    2.5 uses a linear growth strategy:

    size = PyString_Size(p->str);
    newsize = size + 1024;
    if (_PyString_Resize(&p->str, newsize) != 0) {
        p->ptr = p->end = NULL;
    }

    Anyway marshal should not be used by user code to serialize objects.
    It's only meant for Python byte code. Please use the pickle/cPickle
    module instead.
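    As a rough illustration of the pickle route: pickle.dump() accepts any
    object with a write() method, so an in-memory buffer works where
    marshal.dump() wanted a real file. The sketch below uses io.BytesIO from
    a current Python in place of cStringIO, and the two helper names are
    made up for the example.

```python
import io
import pickle


def pickle_to_buffer(obj):
    """Serialize obj into an in-memory, file-like buffer."""
    buf = io.BytesIO()
    pickle.dump(obj, buf, pickle.HIGHEST_PROTOCOL)  # binary protocol
    return buf


def unpickle_from_buffer(buf):
    """Read back the object stored by pickle_to_buffer()."""
    buf.seek(0)
    return pickle.load(buf)


original = {"ids": list(range(1000)), "name": "example"}
copy = unpickle_from_buffer(pickle_to_buffer(original))
```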

    Christian
    Christian Heimes, Jun 15, 2008
    #7
  8. Guest

    On Jun 15, 3:16 am, John Machin <> wrote:
    > But that change went into the svn trunk on 11-May-2008; perhaps the OP
    > is using a production release which would have the previous version,
    > which is merely "newsize = size + 1024;".
    >
    > Do people really generate 32MB pyc files, or is stopping doubling at
    > 32MB just a safety valve in case someone/something runs amok?


    Indeed. I (the OP) am using a production release which has the 1k
    linear growth.
    I am seeing the problems with ~5MB and ~10MB sizes.
    Apparently this will be improved greatly in Python 2.6, at least up to
    the 32MB limit.

    Thanks all for responding.
    , Jun 15, 2008
    #8
  9. John Machin Guest

    On Jun 16, 1:08 am, wrote:
    > On Jun 15, 3:16 am, John Machin <> wrote:
    >
    > > But that change went into the svn trunk on 11-May-2008; perhaps the OP
    > > is using a production release which would have the previous version,
    > > which is merely "newsize = size + 1024;".

    >
    > > Do people really generate 32MB pyc files, or is stopping doubling at
    > > 32MB just a safety valve in case someone/something runs amok?

    >
    > Indeed. I (the OP) am using a production release which has the 1k
    > linear growth.
    > I am seeing the problems with ~5MB and ~10MB sizes.
    > Apparently this will be improved greatly in Python 2.6, at least up to
    > the 32MB limit.


    Apparently you intend to resist good advice and persist [accidental
    pun!] with marshal -- how much slower is cPickle for various sizes of
    data? What kinds of objects are you persisting?
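    One way to answer the "how much slower" question for a given data set is
    simply to time both serializers. The helper below is a hypothetical
    sketch (time_dumps is not part of any library), written for a current
    Python where cPickle has been folded into pickle; the numbers it returns
    depend entirely on the data and the interpreter version.

```python
import marshal
import pickle
import timeit


def time_dumps(data, repeat=3):
    """Return the best-of-N wall time in seconds for each serializer."""
    return {
        "marshal": min(timeit.repeat(
            lambda: marshal.dumps(data), number=1, repeat=repeat)),
        "pickle": min(timeit.repeat(
            lambda: pickle.dumps(data, pickle.HIGHEST_PROTOCOL),
            number=1, repeat=repeat)),
    }


sample = [(i, str(i), float(i)) for i in range(100000)]
timings = time_dumps(sample)
for name, seconds in sorted(timings.items()):
    print("%s: %.4f s" % (name, seconds))
```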
    John Machin, Jun 15, 2008
    #9
  10. On Jun 15, 8:08 am, wrote:
    > Indeed. I (the OP) am using a production release which has the 1k
    > linear growth.
    > I am seeing the problems with ~5MB and ~10MB sizes.
    > Apparently this will be improved greatly in Python 2.6, at least up to
    > the 32MB limit.


    I've just fixed this for Py2.5.3 and Py2.6. No more quadratic
    behavior.


    Raymond
    Raymond Hettinger, Jun 16, 2008
    #10

  11. >
    > Anyway marshal should not be used by user code to serialize objects.
    > It's only meant for Python byte code. Please use the pickle/cPickle
    > module instead.
    >
    > Christian


    Just for yucks let me point out that marshal has
    no real security concerns of interest to the non-paranoid,
    whereas pickle is a security disaster waiting to happen
    unless you are extremely cautious... yet again.
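    The pickle concern can be shown with a harmless example: an object's
    __reduce__ method may name any importable callable, and pickle.loads()
    will call it. The class name Sneaky is invented for this sketch, and
    sorted stands in for something far nastier. Note that this does not show
    marshal to be safe on untrusted input (the Python documentation warns
    against unmarshaling untrusted data as well); it only shows that marshal
    refuses to serialize arbitrary instances in the first place.

```python
import marshal
import pickle


class Sneaky(object):
    """Hypothetical class whose pickle form is 'call sorted([3, 1, 2])'."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and invokes it on load.
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Sneaky())
loaded = pickle.loads(payload)   # runs sorted([3, 1, 2]) during unpickling

try:
    marshal.dumps(Sneaky())      # marshal only handles a fixed set of types
    marshal_refused = False
except ValueError:
    marshal_refused = True
```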

    Sorry, I know even a monkey learns after 3 times...

    -- Aaron Watters

    ===
    http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=disaster
    Aaron Watters, Jun 18, 2008
    #11