The Art of Pickling: Binary vs Ascii difficulties

Bix

As this is my very first post, I'd like to give thanks to all who
support this with their help. Hopefully, this question hasn't been
answered (too many times) before...

If anyone could explain this behavior, I'd greatly appreciate it.

I'm leaving the example at the bottom. There is a variable, fmt,
within the test0 function which can be changed from -1
(pickle.HIGHEST_PROTOCOL) to 0 (ascii). The behavior between the two
pickle formats is not consistent. I'm hoping for an explanation and
a possible solution; I'd like to store my data in binary.

Thanks in advance!


# example.py
import pickle

class node (object):
    def __init__ (self, *args, **kwds):
        self.args = args
        self.kwds = kwds
        self.reset()

    def reset(self):
        self.name = None
        self.node = 'node'
        self.attributes = {}
        self.children = []
        self.update(*self.args,**self.kwds)

    def update(self,*args,**kwds):
        # only overwrite attributes that already exist
        for k,v in kwds.items():
            if k in self.__dict__.keys():
                self.__dict__[k] = v

def test0 (x,fmt=-1):
    fn = 'out.bin'
    pickle.Pickler(open(fn,'w'),fmt).dump(x)
    obj = pickle.Unpickler(open(fn,'r')).load()
    return obj

def test1 ():
    x = node()
    return test0(x)

def test2 ():
    x = node()
    y = node()
    x.children.append(y)
    return test0(x)

def test3 ():
    w = node()
    x = node()
    y = node()
    z = node()
    w.children.append(x)
    x.children.append(y)
    y.children.append(z)
    return test0(w)

def test4 ():
    w = node()
    x = node()
    y = node()
    z = node()
    w.children.append(x)
    x.children.append(y)
    y.children.append(z)
    return test0(w,0)

def makeAttempt(call,name):
    try:
        call()
        print '%s passed' % name
    except:
        print '%s failed' % name

if __name__ == "__main__":
    makeAttempt(test1,'test1') # should run
    makeAttempt(test2,'test2') # should run
    makeAttempt(test3,'test3') # should fail
    makeAttempt(test4,'test4') # should run
 
Josiah Carlson

I'm leaving the example at the bottom. There is a variable, fmt,
within the test0 function which can be changed from -1
(pickle.HIGHEST_PROTOCOL) to 0 (ascii). The behavior between the two
pickle formats is not consistent. I'm hoping for an explanation and
a possible solution; I'd like to store my data in binary.

If you want to store data in binary, and are running on windows, you
must make sure to open all files with the binary flag, 'b'.
pickle.Pickler(open(fn,'w'),fmt).dump(x)
obj = pickle.Unpickler(open(fn,'r')).load()

The above should be open(fn, 'wb')... and open(fn, 'rb')... respectively.

Changing those two made all of them pass for me, and I would expect no
less.
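For readers who want to try the fix in isolation, here is a minimal sketch of a binary-mode round trip. It is written for Python 3 (the thread's code is Python 2), and the helper name roundtrip and the sample dictionary are made up for illustration; they are not part of Bix's example.

```python
import os
import pickle
import tempfile

def roundtrip(obj, protocol=pickle.HIGHEST_PROTOCOL):
    """Pickle obj to a temporary file and load it back, always in binary mode."""
    fd, path = tempfile.mkstemp(suffix='.bin')
    os.close(fd)  # only the path is needed; the file is reopened below
    try:
        # 'wb' and 'rb' keep the platform from translating any bytes
        with open(path, 'wb') as dest:
            pickle.dump(obj, dest, protocol)
        with open(path, 'rb') as source:
            return pickle.load(source)
    finally:
        os.remove(path)

# hypothetical stand-in for the nested node objects in the question
tree = {'name': 'w', 'children': [{'name': 'x', 'children': []}]}
restored = roundtrip(tree)
```

The with blocks also close each file before the next step runs, so the reader never sees a half-flushed file.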

Oh, and so that you get into the habit early: tabs are frowned upon as
indentation in Python. The standard is 4 spaces, no tabs.

- Josiah
 
Andrew Dalke

Bix said:
I'm leaving the example at the bottom. There is a variable, fmt,
within the test0 function which can be changed from -1
(pickle.HIGHEST_PROTOCOL) to 0 (ascii). The behavior between the two
pickle formats is not consistent. I'm hoping for an explanation and
a possible solution; I'd like to store my data in binary.

What's the inconsistency?

Ahh, I see the comments down at the end of your file. I
assume you think they should all pass?

They all pass for me.

I'll guess you're on MS Windows. You need to open the file
in binary mode instead of text mode, which is the default.
Try changing

pickle.Pickler(open(fn,'w'),fmt).dump(x)
obj = pickle.Unpickler(open(fn,'r')).load()

to

pickle.Pickler(open(fn,'wb'),fmt).dump(x)
obj = pickle.Unpickler(open(fn,'rb')).load()


This isn't clear in the documentation, as Skip complained
about last year in the thread starting at
http://mail.python.org/pipermail/python-dev/2003-February/033362.html

Though to be precise, this isn't actually a pickle
issue.

Andrew
 
Scott David Daniels

Bix said:
As this is my very first post, I'd like to give thanks to all who
support this with their help. Hopefully, this question hasn't been
answered (too many times) before...
If anyone could explain this behavior, I'd greatly appreciate it.

You clearly spent some effort on this, but you could have boiled this
down to a smaller, more direct question.

The short answer is, "when reading and/or writing binary data,
the files must be opened in binary." Pickles in "ascii" are not
in a binary format, but the others are.

The longer answer includes:
You should handle files a bit more carefully. Don't presume they
automatically get closed.
I'd change:
> fn = 'out.bin'
> pickle.Pickler(open(fn,'w'),fmt).dump(w)
> obj = pickle.Unpickler(open(fn,'r')).load()
to:
fn = 'out.bin'
dest = open(fn, 'w')
try:
    pickle.Pickler(dest, fmt).dump(w)
finally:
    dest.close()
source = open(fn, 'r')
try:
    return pickle.Unpickler(source).load()
finally:
    source.close()

Then the problem (the mode in which you open the file) shows up to a
practiced eye.
dest = open(fn, 'w') ... source = open(fn, 'r')
should either be:
dest = open(fn, 'wb') ... source = open(fn, 'rb')
which works "OK" for ascii, but the file is then not in machine-native
text format,
or:
if fmt:
    readmode, writemode = 'rb', 'wb'
else:
    readmode, writemode = 'r', 'w'
...
dest = open(fn, writemode) ... source = open(fn, readmode)

By the way, the reason that a binary pickle sometimes survives text
mode (which is, I suspect, what is troubling you) is that only some
bytes are altered in text mode; a pickle that happens to contain none
of them comes through intact. On Windows and MS-DOS systems,
a byte with value 10 is written as a pair of bytes, 13 followed by 10.
On Apple systems, another translation happens. On unix (and hence
linux) there is no distinction between text mode and binary mode: the
C representation of a line break, '\n', is written as-is. This means
nobody on linux who ran your example saw a problem, I suspect.
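To make the translation concrete, here is a rough Python 3 sketch that imitates only the write-side rule described above (byte 10 becomes 13, 10) on an in-memory pickle. The function names simulate_text_mode_write and survives are invented for this illustration, and a real Windows text-mode file applies its rules on reading as well, so treat this as a model, not a reproduction of the original failure.

```python
import pickle

def simulate_text_mode_write(data):
    """Model of a DOS/Windows text-mode write: each byte 10 becomes 13, 10."""
    return data.replace(b'\n', b'\r\n')

def survives(data, expected):
    """True if the bytes still unpickle to the expected value."""
    try:
        return pickle.loads(data) == expected
    except Exception:
        return False

# In pickle protocol 2 the integer 10 is stored as a raw byte with value 10,
# which is exactly the byte a text-mode write would expand.
payload = pickle.dumps(10, protocol=2)
on_disk = simulate_text_mode_write(payload)

# The integer 11 contains no byte 10, so the same translation leaves it alone.
harmless = pickle.dumps(11, protocol=2)
```

Here survives(payload, 10) holds but survives(on_disk, 10) does not, while the pickle of 11 is unchanged by the translation. That is the "sometimes works" effect: whether a binary pickle breaks depends on which byte values it happens to contain.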

This C convention is a violation of the ASCII code as it was then
defined, in order to save a byte per line (treating '\n' as end-of-line,
not line-feed). An ASCII-conforming printer when fed 'a\nb\nc\r\n.\r\n'
should print:
a
 b
  c
.
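The printer behaviour described here can be checked with a small Python 3 sketch. It assumes a strict reading in which line feed only moves down one line and carriage return only moves back to the first column; the function name render_strict_ascii is made up for this example.

```python
def render_strict_ascii(text):
    """Lay out text as a printer would if LF only feeds a line and CR only returns."""
    rows = [[]]
    row = col = 0
    for ch in text:
        if ch == '\n':      # line feed: next line, same column
            row += 1
            if row == len(rows):
                rows.append([])
        elif ch == '\r':    # carriage return: first column, same line
            col = 0
        else:               # printable character: place it and advance
            line = rows[row]
            while len(line) <= col:
                line.append(' ')
            line[col] = ch
            col += 1
    return [''.join(line) for line in rows]

page = render_strict_ascii('a\nb\nc\r\n.\r\n')
```

Each bare '\n' leaves the print position one column further right, so the letters form a staircase until the first '\r\n' pair brings the position back to the left margin.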

My idea of the right question would be, roughly:

Why does test(0) succeed (pickle format 0 = ascii),
but test(-1) fail (pickle format -1 = pickle.HIGHEST_PROTOCOL)?
I am using Python 2.4 on Windows 2000.

import pickle

class node (object):
    def __init__ (self, *args, **kwds):
        self.args = args
        self.kwds = kwds
        self.reset()

    def reset(self):
        self.name = None
        self.node = 'node'
        self.attributes = {}
        self.children = []
        self.update(*self.args,**self.kwds)

    def update(self,*args,**kwds):
        # only overwrite attributes that already exist
        for k,v in kwds.items():
            if k in self.__dict__.keys():
                self.__dict__[k] = v

def test(fmt=-1):
    w = node()
    x = node()
    y = node()
    z = node()
    w.children.append(x)
    x.children.append(y)
    y.children.append(z)
    fn = 'out.bin'
    pickle.Pickler(open(fn,'w'),fmt).dump(w)
    obj = pickle.Unpickler(open(fn,'r')).load()
    return obj

The error message is:
Traceback (most recent call last):
  File "<pyshell#24>", line 1, in -toplevel-
    test()
  File "<pyshell#22>", line 11, in test
    obj = pickle.Unpickler(open(fn,'r')).load()
  File "C:\Python24\lib\pickle.py", line 872, in load
    dispatch[key](self)
  File "C:\Python24\lib\pickle.py", line 1189, in load_binput
    i = ord(self.read(1))
TypeError: ord() expected a character, but string of length 0 found


-Scott David Daniels
 
John Hunter

Andrew> Standards wonk that I am, I was curious about this. I've

Well, if you are a standards wonk and emacs user, you might have fun
with this little bit of python and emacs code. If you place rfc.py in
your PATH

#!/usr/bin/env python
# Print an RFC indicated by a command line arg to stdout
# > rfc.py 822

import urllib, sys

try: n = int(sys.argv[1])
except:
    print 'Example usage: %s 822' % sys.argv[0]
    sys.exit(1)

print urllib.urlopen('http://www.ietf.org/rfc/rfc%d.txt' % n).read()


and add this function to your .emacs

;;** RFC
(defun rfc (num)
  "Insert RFC indicated by num into buffer *RFC<num>*"
  (interactive "sRFC: ")
  (shell-command
   (concat "rfc.py " num)
   (concat "*RFC" num "*")))


You can get RFCs in your emacs buffer by doing

M-x rfc ENTER 20 ENTER

And now back to our regularly scheduled work day.

JDH
 
Andrew Dalke

Scott David Daniels
This C convention is a violation of the ASCII code as it was then
defined, in order to save a byte per line (treating '\n' as end-of-line,
not line-feed). An ASCII-conforming printer when fed 'a\nb\nc\r\n.\r\n'
should print:
a
 b
  c
.

Standards wonk that I am, I was curious about this. I've
never read the ASCII spec before. In my somewhat cursory
search I couldn't find something authoritative on-line that
claimed to be "the" ASCII spec. I did find RFC 20 "ASCII
format for network interchange" dated October 16, 1969,
so before the C convention was defined. Here's one copy
http://www.faqs.org/rfcs/rfc20.html

It says


LF (Line Feed): A format effector which controls the movement of
the printing position to the next printing line. (Applicable also to
display devices.) Where appropriate, this character may have the
meaning "New Line" (NL), a format effector which controls the
movement of the printing point to the first printing position on the
next printing line. Use of this convention requires agreement
between sender and recipient of data.

So it seems that it's not a violation, just a convention.

It happens that MS Windows and Unix (and old Macs) have
different conventions.
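As a small illustration of those conventions, the Python 3 sketch below writes the same three lines with each platform's line ending and then reads them back. The byte values used for each platform (10 for Unix, 13 then 10 for Windows, 13 for old Macs) are the commonly cited ones, and the names LINE_ENDINGS and encode_lines are invented for this example.

```python
# Line-ending bytes per platform convention (assumed, commonly cited values)
LINE_ENDINGS = {
    'unix': b'\n',          # LF alone acts as "new line"
    'windows': b'\r\n',     # CR followed by LF
    'classic_mac': b'\r',   # CR alone
}

def encode_lines(lines, convention):
    """Join ASCII lines using the line ending of the named convention."""
    ending = LINE_ENDINGS[convention]
    return b''.join(line.encode('ascii') + ending for line in lines)

lines = ['a', 'b', 'c']
encoded = {name: encode_lines(lines, name) for name in LINE_ENDINGS}

# bytes.splitlines() accepts all three endings, so the text is recoverable
# as long as sender and recipient agree that these bytes mark line breaks.
decoded = {name: data.splitlines() for name, data in encoded.items()}
```

The three encodings differ byte for byte, yet each decodes to the same list of lines, which is the sense in which the choice is an agreement between sender and recipient and not a property of the text itself.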

Andrew
 
Andrew Dalke

John said:
Well, if you are a standards wonk and emacs user, you might have fun
with this little bit of python and emacs code. If you place rfc.py in
your PATH

Huh. Never really figured out how to customize Lisp.

Another solution is to use a browser like Konqueror which
lets users define new "protocols" so that "rfc:20"
expands to the given URL.

Very handy for Qt programming because I can have
"qt:textedit" expand to the documentation for that
module.

Most of the specs I read, btw, aren't RFCs.

Andrew
 
Bengt Richter

Thanks. For win32 users with gvim I've modified it a little ...

---< vrfc.py >---------------------------------------
# vrfc.py
# to use in gvim on win32, put this file somewhere
# and put a vrfc.cmd file in one of your %PATH% directories
# (e.g. c:\util here) running python with a full path to this
# script (vrfc.py), e.g.,
#    +--< vrfc.cmd >------------+
#    |@python c:\util\vrfc.py %1|
#    +--------------------------+
# (This cmd file is necessary on NT4 and some other windows platforms
# in order for the output to be pipe-able back into gvim (or anywhere else)).
# Then you can insert an rfc into your current gvim editing using
#    :r!vrfc n
# where n is the rfc number
# Form feeds are converted to underline separators 78 chars wide
# and \r's if any are stripped for normalized output, in case.
#
import urllib, sys
try: n = int(sys.argv[1])
except:
    print 'Example usage: python %s 822' % sys.argv[0]
    sys.exit(1)
s = urllib.urlopen('http://www.ietf.org/rfc/rfc%d.txt' % n).read()
s = s.replace('\r','')
s = s.replace('\x0c','_'*78+'\n')
sys.stdout.write(s)
sys.stdout.close()
 
