Marshal Obj is String or Binary?

Discussion in 'Python' started by Mike, Jan 13, 2006.

  1. Mike

    Mike Guest

    Hi,

    The example below shows that result of a marshaled data structure is
    nothing but a string

    >>> data = {2:'two', 3:'three'}
    >>> import marshal
    >>> bytes = marshal.dumps(data)
    >>> type(bytes)
    <type 'str'>
    >>> bytes
    '{i\x02\x00\x00\x00t\x03\x00\x00\x00twoi\x03\x00\x00\x00t\x05\x00\x00\x00three0'

    Now, I need to store this data safely in my database as CLEAR TEXT, not
    BLOB. It seems to me that it should work just fine since it is string
    anyways. So, why does O'Reilly's Python Cookbook insist on saving
    it as a binary file and BLOB type?

    Am I missing out something?

    Thanks,
    Mike
     
    Mike, Jan 13, 2006
    #1

  2. In <>, Mike wrote:

    > The example below shows that result of a marshaled data structure is
    > nothing but a string
    >
    > >>> data = {2:'two', 3:'three'}
    > >>> import marshal
    > >>> bytes = marshal.dumps(data)
    > >>> type(bytes)
    > <type 'str'>
    > >>> bytes
    > '{i\x02\x00\x00\x00t\x03\x00\x00\x00twoi\x03\x00\x00\x00t\x05\x00\x00\x00three0'
    >
    > Now, I need to store this data safely in my database as CLEAR TEXT, not
    > BLOB. It seems to me that it should work just fine since it is string
    > anyways. So, why does O'Reilly's Python Cookbook insist on saving
    > it as a binary file and BLOB type?
    >
    > Am I missing out something?


    Yes, that a string is *binary* data. But only a subset of strings is safe
    to use as `TEXT` in databases. Do you see all those '\x??' escapes?
    '\x00' is *one* byte! A byte with the value zero. Something your DB
    doesn't allow in a `TEXT` type.

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Jan 13, 2006
    #2

  3. Mike

    Mike Guest

    Wait a sec. \x00 may represent a byte when unmarshaled, but as long as
    marshal likes it as \x00, I think my db is capable of storing \ x 0 0
    characters. What is the problem? Is it that \? I could escape that...
    actually I think my django framework already does that for me.

    Thanks,
    Mike
     
    Mike, Jan 14, 2006
    #3
  5. Mike

    Guest

    Try...

    >>> for i in bytes: print ord(i)


    or

    >>> len(bytes)


    What you see isn't always what you have. Your database is capable of
    storing \ x 0 0 characters, but your string contains a single byte of
    value zero. When Python displays the string representation to you, it
    escapes the values so they can be displayed.
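
The same point can be checked directly. The following is a minimal sketch, assuming Python 3 (where `marshal.dumps` returns a `bytes` object rather than the Python 2 `str` shown in the thread); the variable names `raw`, `shown` and `nul_count` are illustrative only.

```python
import marshal

data = {2: 'two', 3: 'three'}
raw = marshal.dumps(data)        # a bytes object in Python 3

# Each NUL is one byte with value 0, not the four characters \ x 0 0.
nul_count = raw.count(b'\x00')

# repr() is only the escaped, printable *display* form of those bytes.
shown = repr(raw)

print(len(raw), len(shown), nul_count)
```

The displayed form is longer than the data itself because every unprintable byte is expanded into a four-character escape for display.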

    casevh
     
    , Jan 14, 2006
    #5
  6. wrote:

    > Try...
    >
    > >>> for i in bytes: print ord(i)
    >
    > or
    >
    > >>> len(bytes)
    >
    > What you see isn't always what you have. Your database is capable of
    > storing \ x 0 0 characters, but your string contains a single byte of
    > value zero. When Python displays the string representation to you, it
    > escapes the values so they can be displayed.


    He can still store the repr of the string into the database, and then
    reconstruct it with eval:

    >>> bytes = "\x00\x01\x02"
    >>> bytes
    '\x00\x01\x02'
    >>> len(bytes)
    3
    >>> ord(bytes[0])
    0
    >>> rb = repr(bytes)
    >>> rb
    "'\\x00\\x01\\x02'"
    >>> len(rb)
    14
    >>> rb[0]
    "'"
    >>> rb[1]
    '\\'
    >>> rb[2]
    'x'
    >>> rb[3]
    '0'
    >>> rb[4]
    '0'
    >>> bytes2 = eval(rb)
    >>> bytes == bytes2
    True

    --
    Giovanni Bajo
     
    Giovanni Bajo, Jan 14, 2006
    #6
  7. Mike

    Mike Guest

    Thanks everyone. Storing complex structures as escaped strings seems
    broken, but I think I'll take my chances.

    Thanks,
    Mike
     
    Mike, Jan 14, 2006
    #7
  8. On Fri, 13 Jan 2006 22:20:27 -0800, Mike wrote:

    > Thanks everyone. Storing complex structures as escaped strings seems
    > broken, but I think I'll take my chances.



    Have you read the marshal reference?

    http://docs.python.org/lib/module-marshal.html

    marshal doesn't store data as escaped strings, it stores them as binary
    strings. When you print the binary string to the console, unprintable
    characters are shown escaped.

    I'm guessing you probably want to use pickle instead of marshal. marshal
    is intended only for dealing with .pyc files, and has some important
    limitations. pickle is intended to be a general purpose serializer.
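
As an illustration of that suggestion, here is a minimal sketch of a pickle round trip for the dictionary from the original post. It assumes Python 3; `blob` and `restored` are names chosen for this example. Note that unpickling data from an untrusted source is unsafe, so this only fits data the application wrote itself.

```python
import pickle

data = {2: 'two', 3: 'three'}

# Serialize with a binary protocol; the result is a bytes object.
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize and confirm the structure survived the round trip.
restored = pickle.loads(blob)
print(type(blob).__name__, restored == data)
```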


    --
    Steve.
     
    Steven D'Aprano, Jan 14, 2006
    #8
  9. Mike

    Max Guest

    Giovanni Bajo wrote:
    >>
    >>What you see isn't always what you have. Your database is capable of
    >>storing \ x 0 0 characters, but your string contains a single byte of
    >>value zero. When Python displays the string representation to you, it
    >>escapes the values so they can be displayed.

    >
    >
    > He can still store the repr of the string into the database, and then
    > reconstruct it with eval:
    >


    Yes, but len(repr('\x00')) is 4, while len('\x00') is 1. So if he uses
    BLOB his data will take almost a quarter of the space, compared to your
    method (stored as TEXT).

    --Max
     
    Max, Jan 14, 2006
    #9
  10. On Sat, 14 Jan 2006 12:36:59 +0200, Max wrote:

    >> He can still store the repr of the string into the database, and then
    >> reconstruct it with eval:
    >>

    >
    > Yes, but len(repr('\x00')) is 4, while len('\x00') is 1.


    Incorrect:

    >>> len(repr('\x00'))
    6
    >>> repr('\x00')
    "'\\x00'"



    > So if he uses
    > BLOB his data will take almost a quarter of the space, compared to your
    > method (stored as TEXT).


    Also incorrect. That depends utterly on which particular characters end up
    in the serialised data. You may or may not be able to predict what that
    mix may be.

    # nothing but printable data
    >>> s = ''.join(['a' for i in range(256)])
    >>> len(s)
    256
    >>> len(repr(s))
    258

    # nothing but unprintable data
    >>> s = ''.join(['\0' for i in range(256)])
    >>> len(s)
    256
    >>> len(repr(s))
    1026

    # one particular mix of both printable and unprintable data
    >>> s = ''.join([chr(i) for i in range(256)])
    >>> len(s)
    256
    >>> len(repr(s))
    737

    # a different mix of both printable and unprintable data
    >>> s = '+'.join([chr(i) for i in range(128)])
    >>> len(s)
    255
    >>> len(repr(s))
    352
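
For comparison, a text encoding with a fixed overhead avoids this dependence on the byte mix. The sketch below assumes Python 3 and uses base64, which nobody in the thread proposed; it is shown only as a contrast to repr(). `text_sizes` is a hypothetical helper name.

```python
import base64

def text_sizes(raw):
    """Return (raw length, repr length, base64 length) for a bytes value."""
    escaped = repr(raw)                           # Python-literal text form
    b64 = base64.b64encode(raw).decode('ascii')   # fixed 4-chars-per-3-bytes form
    return len(raw), len(escaped), len(b64)

samples = {
    'printable only': b'a' * 256,
    'NUL bytes only': b'\x00' * 256,
    'every byte value': bytes(range(256)),
}

for name, raw in samples.items():
    print(name, text_sizes(raw))
```

The repr() length swings with the content, while the base64 length depends only on the input length.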





    --
    Steven.
     
    Steven D'Aprano, Jan 14, 2006
    #10
  11. Max wrote:

    >>> What you see isn't always what you have. Your database is capable of
    >>> storing \ x 0 0 characters, but your string contains a single byte
    >>> of value zero. When Python displays the string representation to
    >>> you, it escapes the values so they can be displayed.

    >>
    >>
    >> He can still store the repr of the string into the database, and then
    >> reconstruct it with eval:
    >>

    >
    > Yes, but len(repr('\x00')) is 4, while len('\x00') is 1. So if he uses
    > BLOB his data will take almost a quarter of the space, compared to
    > your method (stored as TEXT).


    Sure, but he didn't ask for the best strategy to store the data in the
    database; he specified very clearly that he *can't* use BLOB, and asked how
    to use TEXT.
    --
    Giovanni Bajo
     
    Giovanni Bajo, Jan 14, 2006
    #11
  12. Mike

    Mike Guest

    Thanks everyone.

    Why Marshal & not Pickle: Well, Marshal is supposed to be faster. But
    then, if I wanted to do the whole repr()-eval() hack, I am already
    defeating the purpose by refusing to save bytes as bytes in terms of
    both size and speed.

    At this point, I am considering one of the following:
    - Save my structure as binary data, and reference the file from my db
    - Find a clean method of saving bytes into my db

    Thanks again,
    Mike
     
    Mike, Jan 14, 2006
    #12
  13. Mike

    Mike Meyer Guest

    "Giovanni Bajo" <> writes:
    > wrote:
    >> Try...
    >> >>> for i in bytes: print ord(i)
    >> or
    >> >>> len(bytes)
    >> What you see isn't always what you have. Your database is capable of
    >> storing \ x 0 0 characters, but your string contains a single byte of
    >> value zero. When Python displays the string representation to you, it
    >> escapes the values so they can be displayed.

    > He can still store the repr of the string into the database, and then
    > reconstruct it with eval:


    repr and eval are overkill for this, and as a result create a
    security hole. Using encode('string-escape') and
    decode('string-escape') will do the same job without the security
    hole:

    >>> bytes = '\x00\x01\x02'
    >>> bytes
    '\x00\x01\x02'
    >>> ord(bytes[0])
    0
    >>> rb = bytes.encode('string-escape')
    >>> rb
    '\\x00\\x01\\x02'
    >>> len(rb)
    12
    >>> rb[0]
    '\\'
    >>> bytes2 = rb.decode('string-escape')
    >>> bytes == bytes2
    True
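
The 'string-escape' codec exists only in Python 2. As a rough Python 3 counterpart, the sketch below keeps the repr() text form but parses it with `ast.literal_eval`, which accepts only literals and so avoids the arbitrary code execution that eval() allows. This is a swapped-in technique, not the method from the post; `safe_load` is a hypothetical helper name.

```python
import ast

def safe_load(text):
    """Parse a stored bytes literal, rejecting anything that is not one."""
    # literal_eval raises ValueError or SyntaxError for non-literal input,
    # such as function calls, instead of executing it.
    value = ast.literal_eval(text)
    if not isinstance(value, bytes):
        raise ValueError('expected a bytes literal')
    return value

raw = b'\x00\x01\x02'
text = repr(raw)              # escaped text form, safe for a TEXT column
print(safe_load(text) == raw)
```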


    <mike
    --
    Mike Meyer <> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
     
    Mike Meyer, Jan 14, 2006
    #13
  14. On Sat, 14 Jan 2006 13:50:24 -0800, Mike wrote:

    > Thanks everyone.
    >
    > Why Marshal & not Pickle: Well, Marshal is supposed to be faster.


    Faster than cPickle?

    Even faster would be to write your code in assembly, and dump that
    ridiculously bloated database and just write everything to raw bytes on
    an unformatted disk. Of course, it might take the programmer a thousand
    times longer to actually write the program, and there will probably be
    hundreds of bugs in it, but the important thing is that you'll save three
    or four milliseconds at runtime.

    Right?

    Unless you've actually done proper measurements of the time taken, with
    realistic sample data, worrying about saving a byte here and a
    millisecond there is just wasting your time, and is often
    counter-productive. Optimization without measurement is as likely to
    result in slower, fatter performance as it is faster and leaner.

    marshal is not designed to be portable across versions. Do you *really*
    think it is a good idea to tie the data in your database to one specific
    version of Python?


    > But
    > then, if I wanted to do the whole repr()-eval() hack, I am already
    > defeating the purpose by refusing to save bytes as bytes in terms of
    > both size and speed.
    >
    > At this point, I am considering one of the following:
    > - Save my structure as binary data, and reference the file from my db
    > - Find a clean method of saving bytes into my db


    Your database either can handle binary data, or it can't.

    If it can, then just use pickle with a binary protocol and be done with it.

    If it can't, then just use pickle with a plain text protocol and be done
    with it.
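
Both branches can be sketched in a few lines. This assumes Python 3, where pickle output is always a bytes object, so the text branch here wraps a binary pickle in base64 instead of using the plain text protocol the post refers to; that substitution is an assumption of this example. `to_blob`, `to_text` and `from_text` are hypothetical helper names.

```python
import base64
import pickle

def to_blob(obj):
    """Binary form, for a column type that accepts arbitrary bytes."""
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

def to_text(obj):
    """ASCII-only form, for a TEXT column that cannot hold NUL bytes."""
    return base64.b64encode(to_blob(obj)).decode('ascii')

def from_text(text):
    """Inverse of to_text(); only use on data the application stored itself."""
    return pickle.loads(base64.b64decode(text))

data = {2: 'two', 3: 'three'}
stored = to_text(data)
print(from_text(stored) == data)
```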

    Either way, you have to find a way to translate your Python data
    structures into something that you can feed to the database. Your database
    can't automatically suck data structures out of Python's working memory!
    So why re-invent the wheel? marshal is not recommended, but if you can
    live with the limitations of marshal then it might do the job. But trying
    to optimise code that hasn't even been written yet is a sure way to
    trouble.


    --
    Steven.
     
    Steven D'Aprano, Jan 14, 2006
    #14
  15. Mike

    Steve Holden Guest

    Mike wrote:
    > Hi,
    >
    > The example below shows that result of a marshaled data structure is
    > nothing but a string
    >
    >
    > >>> data = {2:'two', 3:'three'}
    > >>> import marshal
    > >>> bytes = marshal.dumps(data)
    > >>> type(bytes)
    > <type 'str'>
    > >>> bytes
    > '{i\x02\x00\x00\x00t\x03\x00\x00\x00twoi\x03\x00\x00\x00t\x05\x00\x00\x00three0'
    >
    > Now, I need to store this data safely in my database as CLEAR TEXT, not
    > BLOB. It seems to me that it should work just fine since it is string
    > anyways. So, why does O'Reilly's Python Cookbook insist on saving
    > it as a binary file and BLOB type?
    >

    Well, the Cookbook isn't an exhaustive list of everything you can do
    with Python, it's just a record of some of the things people *have* done.

    I presume your database has no datatype that will store binary data of
    indeterminate length? Clearly that would be the most satisfactory solution.

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC www.holdenweb.com
    PyCon TX 2006 www.python.org/pycon/
     
    Steve Holden, Jan 15, 2006
    #15
  16. Mike

    Mike Guest

    > Even faster would be to write your code in assembly, and dump that
    > ridiculously bloated database and just write everything to raw bytes on
    > an unformatted disk. Of course, it might take the programmer a thousand
    > times longer to actually write the program, and there will probably be
    > hundreds of bugs in it, but the important thing is that you'll save three
    > or four milliseconds at runtime.


    > Right?


    Correct. I didn't quite see the issue as assembly vs. Python, with a
    direct translation into programming hours. The structure I have in mind is
    meant to act as a dictionary that extends my db with a few table fields
    that could vary from one record to another and won't be queried on.
    Considering that every time my record is loaded its pickled or marshaled
    data has to be decoded, I figured the faster alternative would be better.
    As for the incompatibility issue, I figured that the day I upgrade my
    Python, I would write a Python script to upgrade the data. I take that back.

    > Your database either can handle binary data, or it can't.


    It can. It's my web framework that doesn't.

    > If it can, then just use pickle with a binary protocol and be done with it.


    That I will do.

    > Either way, you have to find a way to translate your Python data
    > structures into something that you can feed to the database. Your database
    > can't automatically suck data structures out of Python's working memory!
    > So why re-invent the wheel? marshal is not recommended, but if you can
    > live with the limitations of marshal then it might do the job. But trying
    > to optimise code that hasn't even been written yet is a sure way to
    > trouble.


    Thanks. Will do.

    Regards,
    Mike
     
    Mike, Jan 15, 2006
    #16
  18. Mike

    Mike Guest

    > Well, the Cookbook isn't an exhaustive list of everything you can do
    > with Python, it's just a record of some of the things people *have* done.


    Considering I am a newbie, it's a good start for me...

    > I presume your database has no datatype that will store binary data of
    > indeterminate length? Clearly that would be the most satisfactory solution.


    PostgreSQL. I think the only two things it doesn't do are wash my car and
    code my software. Well, that's until you use it in conjunction with
    Django; then the only work left is to wash my car, which I couldn't care
    less about either. We'll wait for some rain :)

    Mike
     
    Mike, Jan 15, 2006
    #18
  19. Mike

    Steve Holden Guest

    Mike wrote:
    >>Well, the Cookbook isn't an exhaustive list of everything you can do
    >>with Python, it's just a record of some of the things people *have* done.

    >
    >
    > Considering I am a newbie, it's a good start for me...
    >
    >
    >>I presume your database has no datatype that will store binary data of
    >>indeterminate length? Clearly that would be the most satisfactory solution.

    >
    >
    > PostgreSQL. I think the only two things it doesn't do are wash my car and
    > code my software. Well, that's until you use it in conjunction with
    > Django; then the only work left is to wash my car, which I couldn't care
    > less about either. We'll wait for some rain :)
    >

    So this question was primarily theoretical, right?

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC www.holdenweb.com
    PyCon TX 2006 www.python.org/pycon/
     
    Steve Holden, Jan 15, 2006
    #19
  20. Mike

    Mike Guest

    > So this question was primarily theoretical, right?

    Theoretical? Not really, Steve. I wanted to use Django's wonderful db
    framework to save a structure into my PostgreSQL database. Except there is
    no direct BLOB support for it yet. So I was trying to explore my
    options for saving this structure as clear text.

    Thanks,
    Mike
     
    Mike, Jan 15, 2006
    #20