Marshal Obj is String or Binary?

Mike

Hi,

The example below shows that the result of marshaling a data structure is
nothing but a string:
'{i\x02\x00\x00\x00t\x03\x00\x00\x00twoi\x03\x00\x00\x00t\x05\x00\x00\x00three0'

Now, I need to store this data safely in my database as CLEAR TEXT, not
BLOB. It seems to me that it should work just fine since it is a string
anyway. So why does O'Reilly's Python Cookbook insist on saving
it as a binary file and BLOB type?

Am I missing something?

Thanks,
Mike
 
Marc 'BlackJack' Rintsch

The example below shows that result of a marshaled data structure is
nothing but a string

'{i\x02\x00\x00\x00t\x03\x00\x00\x00twoi\x03\x00\x00\x00t\x05\x00\x00\x00three0'

Now, I need to store this data safely in my database as CLEAR TEXT, not
BLOB. It seems to me that it should work just fine since it is string
anyways. So, why does O'reilly's Python Cookbook is insisting in saving
it as a binary file and BLOB type?

Am I missing out something?

Yes, that a string is *binary* data. But only a subset of strings is safe
to use as `TEXT` in databases. Do you see all those '\x??' escapes?
'\x00' is *one* byte! A byte with the value zero. Something your DB
doesn't allow in a `TEXT` type.
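
A minimal sketch of the point above, written for Python 3 (the thread itself uses Python 2, where `marshal.dumps` returns a `str`). The dictionary is an assumption reconstructed from the marshaled output in the question.

```python
import marshal

# Assumed to resemble the structure from the question.
data = marshal.dumps({2: "two", 3: "three"})

print(type(data))            # marshal output is a byte string
print(data.count(b"\x00"))   # how many zero bytes the payload contains
print(len(b"\x00"))          # a zero byte is one byte, not four characters
```

The zero bytes counted here are what a `TEXT` column would typically reject.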

Ciao,
Marc 'BlackJack' Rintsch
 
Mike

Wait a sec. \x00 may represent a byte when unmarshaled, but as long as
marshal likes it as \x00, I think my db is capable of storing \ x 0 0
characters. What is the problem? Is it that \? I could escape that...
actually I think my django framework already does that for me.

Thanks,
Mike
 
casevh

Try...

What you see isn't always what you have. Your database is capable of
storing \ x 0 0 characters, but your string contains a single byte of
value zero. When Python displays the string representation to you, it
escapes the values so they can be displayed.
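
To make the difference between the stored byte and its displayed form concrete, here is a small sketch in Python 3 syntax (an assumption, since the thread predates it; in Python 2 the repr has no `b` prefix and is one character shorter).

```python
raw = b"\x00"        # what the string really holds: one byte, value zero
shown = repr(raw)    # what the interpreter displays for it

print(len(raw))      # length of the real data
print(len(shown))    # length of the escaped display text
print(raw[0])        # numeric value of the only byte
print("\\" in shown) # the backslash exists only in the display form
```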

casevh
 
Giovanni Bajo

Try...


What you see isn't always what you have. Your database is capable of
storing \ x 0 0 characters, but your string contains a single byte of
value zero. When Python displays the string representation to you, it
escapes the values so they can be displayed.

He can still store the repr of the string into the database, and then
reconstruct it with eval:
>>> bytes = "\x00\x01\x02"
>>> bytes
'\x00\x01\x02'
>>> len(bytes)
3
>>> ord(bytes[0])
0
>>> rb = repr(bytes)
>>> rb
"'\\x00\\x01\\x02'"
>>> len(rb)
14
>>> rb[0]
"'"
>>> rb[1]
'\\'
>>> rb[2]
'x'
>>> rb[3]
'0'
>>> rb[4]
'0'
>>> bytes2 = eval(rb)
>>> bytes == bytes2
True
 
Mike

Thanks everyone. It seems broken storing complex structures as escaped
strings, but I think I'll take my chances.

Thanks,
Mike
 
Steven D'Aprano

Thanks everyone. It seems broken storing complex structures as escaped
strings, but I think I'll take my chances.


Have you read the marshal reference?

http://docs.python.org/lib/module-marshal.html

marshal doesn't store data as escaped strings, it stores them as binary
strings. When you print the binary string to the console, unprintable
characters are shown escaped.

I'm guessing you probably want to use pickle instead of marshal. marshal
is intended only for dealing with .pyc files, and has some important
limitations. pickle is intended to be a general purpose serializer.
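
One of those limitations can be sketched as follows, assuming Python 3 and using a `datetime.date` value chosen purely for illustration (it does not come from the thread): marshal only handles a fixed set of core types, while pickle copes with most ordinary objects.

```python
import datetime
import marshal
import pickle

# Illustrative record; the date field is a hypothetical example.
value = {"when": datetime.date(2006, 1, 1)}

try:
    marshal.dumps(value)
    marshal_ok = True
except ValueError:
    # marshal refuses objects outside its supported core types
    marshal_ok = False

restored = pickle.loads(pickle.dumps(value))
print(marshal_ok, restored == value)
```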
 
Max

Giovanni said:
He can still store the repr of the string into the database, and then
reconstruct it with eval:

Yes, but len(repr('\x00')) is 4, while len('\x00') is 1. So if he uses
BLOB his data will take almost a quarter of the space, compared to your
method (stored as TEXT).

--Max
 
Steven D'Aprano

Yes, but len(repr('\x00')) is 4, while len('\x00') is 1.

Incorrect:
>>> repr('\x00')
"'\\x00'"

So if he uses
BLOB his data will take almost a quarter of the space, compared to your
method (stored as TEXT).

Also incorrect. That depends utterly on which particular characters end up
in the serialised data. You may or may not be able to predict what that
mix may be.

>>> # nothing but printable data
>>> s = ''.join(['a' for i in range(256)])
>>> len(s)
256
>>> len(repr(s))
258

>>> # nothing but unprintable data
>>> s = ''.join(['\0' for i in range(256)])
>>> len(s)
256
>>> len(repr(s))
1026

>>> # one particular mix of both printable and unprintable data
>>> s = ''.join([chr(i) for i in range(256)])
>>> len(s)
256
>>> len(repr(s))
737

>>> # a different mix of both printable and unprintable data
>>> s = '+'.join([chr(i) for i in range(128)])
>>> len(s)
255
>>> len(repr(s))
352
 
Giovanni Bajo

Max said:
Yes, but len(repr('\x00')) is 4, while len('\x00') is 1. So if he uses
BLOB his data will take almost a quarter of the space, compared to
your method (stored as TEXT).

Sure, but he didn't ask for the best strategy to store the data in the
database. He specified very clearly that he *can't* use BLOB, and asked how to
use TEXT.
 
Mike

Thanks everyone.

Why Marshal & not Pickle: Well, Marshal is supposed to be faster. But
then, if I wanted to do the whole repr()-eval() hack, I am already
defeating the purpose by refusing to save bytes as bytes in terms of
both size and speed.

At this point, I am considering one of the following:
- Save my structure as binary data, and reference the file from my db
- Find a clean method of saving bytes into my db
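
For the second option, one common approach that nobody in this thread proposed is to Base64-encode the serialized bytes so they fit in a text column. A minimal sketch, assuming Python 3 and a sample dictionary like the one in the original question:

```python
import base64
import pickle

record = {2: "two", 3: "three"}   # assumed sample data

blob = pickle.dumps(record, pickle.HIGHEST_PROTOCOL)  # arbitrary bytes
text = base64.b64encode(blob).decode("ascii")         # letters, digits, +, /, =

# Reading it back reverses the two steps.
restored = pickle.loads(base64.b64decode(text))
print(restored == record)
```

The encoded text is roughly a third larger than the raw bytes, which is the price of staying within a text-only field.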

Thanks again,
Mike
 
Mike Meyer

Giovanni Bajo said:
He can still store the repr of the string into the database, and then
reconstruct it with eval:

repr and eval are overkill for this, and as a result create a
security hole. Using encode('string-escape') and
decode('string-escape') will do the same job without the security
hole:
>>> bytes = '\x00\x01\x02'
>>> bytes
'\x00\x01\x02'
>>> ord(bytes[0])
0
>>> rb = bytes.encode('string-escape')
>>> rb
'\\x00\\x01\\x02'
>>> len(rb)
12
>>> rb[0]
'\\'
>>> bytes2 = rb.decode('string-escape')
>>> bytes == bytes2
True
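
The `string-escape` codec exists only in Python 2. As a rough Python 3 counterpart (an assumption, not something from this thread), the repr of a `bytes` object can be read back with `ast.literal_eval`, which accepts only literals and therefore avoids the problem with `eval`:

```python
import ast

payload = b"\x00\x01\x02"
text = repr(payload)               # escaped, ASCII-only display form
restored = ast.literal_eval(text)  # parses literals only, executes no code

print(text)
print(restored == payload)
```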

<mike
 
Steven D'Aprano

Thanks everyone.

Why Marshal & not Pickle: Well, Marshal is supposed to be faster.

Faster than cPickle?

Even faster would be to write your code in assembly, and dump that
ridiculously bloated database and just write everything to raw bytes on
an unformatted disk. Of course, it might take the programmer a thousand
times longer to actually write the program, and there will probably be
hundreds of bugs in it, but the important thing is that you'll save three
or four milliseconds at runtime.

Right?

Unless you've actually done proper measurements of the time taken, with
realistic sample data, worrying about saving a byte here and a
millisecond there is just wasting your time, and is often
counter-productive. Optimization without measurement is as likely to
result in slower, fatter performance as it is faster and leaner.
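
A measurement along those lines could be sketched with `timeit`. The sample dictionary and the repeat count below are placeholders; meaningful numbers need realistic data, and the results will differ between machines and Python versions.

```python
import marshal
import pickle
import timeit

# Placeholder sample; substitute data shaped like the real records.
sample = {i: "value %d" % i for i in range(100)}

marshal_time = timeit.timeit(
    lambda: marshal.loads(marshal.dumps(sample)), number=1000
)
pickle_time = timeit.timeit(
    lambda: pickle.loads(pickle.dumps(sample, pickle.HIGHEST_PROTOCOL)),
    number=1000,
)

print("marshal round trip: %.4f s" % marshal_time)
print("pickle round trip:  %.4f s" % pickle_time)
```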

marshal is not designed to be portable across versions. Do you *really*
think it is a good idea to tie the data in your database to one specific
version of Python?

But
then, if I wanted to do the whole repr()-eval() hack, I am already
defeating the purpose by refusing to save bytes as bytes in terms of
both size and speed.

At this point, I am considering one of the following:
- Save my structure as binary data, and reference the file from my db
- Find a clean method of saving bytes into my db

Your database either can handle binary data, or it can't.

If it can, then just use pickle with a binary protocol and be done with it.

If it can't, then just use pickle with a plain text protocol and be done
with it.
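
The two cases might look like this in Python 3 (a sketch; the sample dictionary is assumed from the original question). Protocol 0 is the text-oriented protocol. For this ASCII-only sample its output decodes cleanly as ASCII, though that may not hold for every kind of data.

```python
import pickle

record = {2: "two", 3: "three"}   # assumed sample data

# Binary protocol: compact, but may contain any byte value.
binary_form = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

# Protocol 0: text-oriented output.
text_form = pickle.dumps(record, protocol=0)

print(b"\x00" in text_form)        # whether any zero byte is present
print(text_form.decode("ascii"))   # the text that would go in the column
print(pickle.loads(binary_form) == pickle.loads(text_form) == record)
```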

Either way, you have to find a way to translate your Python data
structures into something that you can feed to the database. Your database
can't automatically suck data structures out of Python's working memory!
So why re-invent the wheel? marshal is not recommended, but if you can
live with the limitations of marshal then it might do the job. But trying
to optimise code that hasn't even been written yet is a sure way to
trouble.
 
Steve Holden

Mike said:
Hi,

The example below shows that result of a marshaled data structure is
nothing but a string



'{i\x02\x00\x00\x00t\x03\x00\x00\x00twoi\x03\x00\x00\x00t\x05\x00\x00\x00three0'

Now, I need to store this data safely in my database as CLEAR TEXT, not
BLOB. It seems to me that it should work just fine since it is string
anyways. So, why does O'reilly's Python Cookbook is insisting in saving
it as a binary file and BLOB type?

Well, the Cookbook isn't an exhaustive list of everything you can do
with Python, it's just a record of some of the things people *have* done.

I presume your database has no datatype that will store binary data of
indeterminate length? Clearly that would be the most satisfactory solution.

regards
Steve
 
Mike

Even faster would be to write your code in assembly, and dump that
ridiculously bloated database and just write everything to raw bytes on
an unformatted disk. Of course, it might take the programmer a thousand
times longer to actually write the program, and there will probably be
hundreds of bugs in it, but the important thing is that you'll save three
or four milliseconds at runtime.

Correct. I didn't quite see the issue as assembly vs. Python, with a
direct translation into programming hours. The structure in mind is meant
to act as a dictionary to extend my db with a few table fields that
could vary from one record to another and won't be queried on.
Considering that every time my record is loaded, its pickled or marshaled data
has to be decoded, I figured the faster alternative should be better.
As for the incompatibility issue, I figured the day I upgrade my Python,
I would write a Python script to upgrade the data. I take my words back.

Your database either can handle binary data, or it can't.

It can. It's my web framework that doesn't.

If it can, then just use pickle with a binary protocol and be done with it.

That I will do.

Either way, you have to find a way to translate your Python data
structures into something that you can feed to the database. Your database
can't automatically suck data structures out of Python's working memory!
So why re-invent the wheel? marshal is not recommended, but if you can
live with the limitations of marshal then it might do the job. But trying
to optimise code that hasn't even been written yet is a sure way to
trouble.

Thanks. Will do.

Regards,
Mike
 
Mike

Well, the Cookbook isn't an exhaustive list of everything you can do
with Python, it's just a record of some of the things people *have* done.

Considering I am a newbie, it's a good start for me...

I presume your database has no datatype that will store binary data of
indeterminate length? Clearly that would be the most satisfactory solution.

PostgreSQL. I think the only two things it doesn't do are wash my car and
code my software. Well, that's until you use it in conjunction with
Django; then the only work left is to wash my car, which I couldn't care
less about either. We'll wait for some rain :)

Mike
 
Steve Holden

Mike said:
Considering I am a newbie, it's a good start for me...

PostgreSQL. I think the only two things it doesn't do are wash my car and
code my software. Well, that's until you use it in conjunction with
Django; then the only work left is to wash my car, which I couldn't care
less about either. We'll wait for some rain :)

So this question was primarily theoretical, right?

regards
Steve
 
Mike

So this question was primarily theoretical, right?

Theoretical? Not really, Steve. I wanted to use Django's wonderful db
framework to save a structure into my PostgreSQL database, except there is no
direct BLOB support in it yet. So I was trying to explore my
options for saving this structure as clear text.

Thanks,
Mike
 
