Pickled text file causing ValueError (dos/unix issue)

Aki Niimura

Hello everyone,

I started to use pickle to store the latest user settings for the tool
I wrote. It writes out a pickled text file when it terminates and it
restores the settings when it starts.

It worked very nicely.

However, I got a ValueError when I started the tool from Unix when I
previously used the tool from Windows.

File "/usr/local/lib/python2.3/pickle.py", line 980, in load_string
raise ValueError, "insecure string pickle"
ValueError: insecure string pickle

If I run 'dos2unix <my.cfg> <my.cfg>' to convert the file, then
everything becomes fine.

I found a Python release note saying:
"pickle: Now raises ValueError when an invalid pickle that contains a
non-string repr where a string repr was expected. This behavior matches
cPickle."

I guess DOS text format is creating this problem.
My question is "Is there any elegant way to deal with this?".

I certainly can catch ValueError and run 'dos2unix' explicitly.
But I don't like such a crude solution.
Any suggestions would be highly appreciated.

Best regards,
Aki Niimura
 
Tim Peters

[Aki Niimura]
I started to use pickle to store the latest user settings for the tool
I wrote. It writes out a pickled text file when it terminates and it
restores the settings when it starts. ....
I guess DOS text format is creating this problem.

Yes.

My question is "Is there any elegant way to deal with this?".

Yes: regardless of platform, always open files used for pickles in
binary mode. That is, pass "rb" to open() when reading a pickle file,
and "wb" to open() when writing a pickle file. Then your pickle files
will work unchanged on all platforms. The same is true of files
containing binary data of any kind (and despite that pickle protocol 0
was called "text mode" for years, it's still binary data).
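
A minimal sketch of that advice, written for a current Python 3 interpreter rather than the Python 2.3 used in the original report. The function names, the settings dictionary, and the temporary file location are assumptions made for illustration; only the "rb"/"wb" modes and protocol 0 come from the thread.

```python
import os
import pickle
import tempfile


def save_settings(path, settings):
    # "wb": binary mode, so no newline translation happens on any platform.
    with open(path, "wb") as f:
        pickle.dump(settings, f, protocol=0)


def load_settings(path):
    # "rb": read back exactly the bytes that were written.
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical settings and file name, for illustration only.
settings = {"window": [800, 600], "recent": "report.txt"}
cfg_path = os.path.join(tempfile.mkdtemp(), "my.cfg")
save_settings(cfg_path, settings)
restored = load_settings(cfg_path)
```

Because both ends use binary mode, a file written this way on Windows should load unchanged on Unix, and the other way around.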
 
Irmen de Jong

Tim said:
Yes: regardless of platform, always open files used for pickles in
binary mode. That is, pass "rb" to open() when reading a pickle file,
and "wb" to open() when writing a pickle file. Then your pickle files
will work unchanged on all platforms. The same is true of files
containing binary data of any kind (and despite that pickle protocol 0
was called "text mode" for years, it's still binary data).

I've been wondering why there even is the choice between binary mode
and text mode. Why can't we just do away with the 'text mode' ?
What does it do, anyway? At least, if it does something, I'm sure
that it is something that could be done in Python itself if
really required to do so...

--Irmen
 
Tim Peters

[Irmen de Jong]
I've been wondering why there even is the choice between binary mode
and text mode. Why can't we just do away with the 'text mode' ?
What does it do, anyway? At least, if it does something, I'm sure
that it is something that could be done in Python itself if
really required to do so...

It's not Python's decision, it's the operating system's. Whether
there's an actual difference between text mode and binary mode is up
to the operating system, and, if there is an actual difference, every
detail about what the difference(s) consists of is also up to the
operating system. That differences may exist is reflected in the C
standard, and the rules for text-mode files are more restrictive than
most people would believe.

On Unixish systems, there's no difference. On Windows boxes, there
are conceptually small differences with huge consequences, and the
distinction appears to be kept just for backward-compatibility
reasons. On some other systems, text and binary files are entirely
different kinds of beasts.

If Python didn't offer text mode then it would be clumsy at best to
use Python to write ordinary human-readable text files in the format
that native software on Windows, and Mac Classic, and VAX (and ...)
expects (and the native format for text mode differs across all of
them). If Python didn't offer binary mode then it wouldn't be
possible to use Python to process data in binary files on Windows and
Mac Classic and VAX (and ...). If Python used its own
platform-independent file format, then it would end up creating files
that other programs wouldn't be able to deal with.

Live with it <wink>.
 
Serge Orlov

Irmen said:
I've been wondering why there even is the choice between binary mode
and text mode. Why can't we just do away with the 'text mode' ?

We can't because characters and bytes are not the same things. But I
believe what you're really complaining about is that "t" mode sometimes
mysteriously corrupts data if processed by the code that expects binary
files. In Python 3.0 it will be fixed because file.read will have to
return different objects: bytes for "b" mode, str for "t" mode. It
would be great if the file type were split into binfile and textfile,
removing the need for cryptic "b" and "t" modes, but I'm afraid that's too
much of a change even for Python 3.0
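
The bytes/str split described here can be sketched as follows, assuming a released Python 3 interpreter (the post itself predates Python 3). The file name and contents are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample file with Windows-style line endings.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"first line\r\nsecond line\r\n")

with open(path, "rb") as f:
    raw = f.read()    # bytes, exactly as stored on disk

with open(path, "r", encoding="ascii") as f:
    text = f.read()   # str, with "\r\n" translated to "\n" on reading
```

Since the two modes return different types, code that expects binary data tends to fail loudly when handed a text-mode file instead of silently reading altered bytes.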

Serge.
 
Irmen de Jong

Tim said:
That differences may exist is reflected in the C
standard, and the rules for text-mode files are more restrictive than
most people would believe.

Apparently. Because I know only about the Unix <-> Windows difference
(Windows converts \r\n <--> \n when using 'r' mode, right?).
So it's in the line endings.

Is there more obscure stuff going on on the other systems you
mentioned (Mac OS, VAX) ?

(That means that the bug in SimpleHTTPServer that my patch
839496 addressed also occurred on those systems? Or that
the patch may be incorrect after all??)

While your argument about why Python doesn't use its own platform-independent
file format is sound of course, I often find it a nuisance
that platform-specific things trickle through into Python itself and
ultimately into the programs you write. I sometimes feel that some
parts of Python expose the underlying C/os implementation
a bit too much. Python never claimed write once run anywhere (as
that other language does) but it would have been nice nevertheless ;-)
In practice it's just not possible I guess.

Thanks,
--Irmen
 
John Machin

[Aki Niimura]
I started to use pickle to store the latest user settings for the tool
I wrote. It writes out a pickled text file when it terminates and it
restores the settings when it starts. ...
I guess DOS text format is creating this problem.

Yes.

My question is "Is there any elegant way to deal with this?".

Yes: regardless of platform, always open files used for pickles in
binary mode. That is, pass "rb" to open() when reading a pickle file,
and "wb" to open() when writing a pickle file. Then your pickle files
will work unchanged on all platforms. The same is true of files
containing binary data of any kind (and despite that pickle protocol 0
was called "text mode" for years, it's still binary data).

Tim, the manual as of version 2.4 does _not_ mention the need to use
'b' on OSes where it makes a difference, not even in the examples at
the end of the chapter. Further, it still refers to protocol 0 as
'text' in several places. There is also a reference to protocol 0
files being viewable in a text editor.

In other words, enough to lead even the most careful Reader of TFM up
the garden path :)

Cheers,
John
 
Tim Peters

[Tim Peters]
[John Machin]
Tim, the manual as of version 2.4 does _not_ mention the need
to use 'b' on OSes where it makes a difference, not even in the
examples at the end of the chapter. Further, it still refers to
protocol 0 as 'text' in several places. There is also a reference to
protocol 0 files being viewable in a text editor.

In other words, enough to lead even the most careful Reader of
TFM up the garden path :)

Take the next step: submit a patch with corrected text. I'm not paid
to work on the Python docs either <0.5 wink>. (BTW, protocol 0 files
are viewable in a text editor regardless, although the line ends may
"look funny")
 
Tim Peters

[Tim Peters]
[Irmen de Jong]
Apparently. Because I know only about the Unix <-> Windows
difference (Windows converts \r\n <--> \n when using 'r' mode,
right?). So it's in the line endings.

That's one difference. The worse difference is that, in text mode on
Windows, the first instance of chr(26) in a file is taken as meaning
"that's the end of the file", no matter how many bytes may follow it.
That's fine by the C standard, because everything about a text-mode
file containing a chr(26) character is undefined.
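
The line-end translation can be reproduced on any platform with a small sketch. It assumes Python 3, where open() takes a newline argument; passing newline="\r\n" is used here only to emulate what Windows text mode does on output. The sample dictionary and file names are hypothetical, and the sketch does not cover the chr(26) behavior.

```python
import os
import pickle
import tempfile

# Protocol 0 pickle of a made-up settings dict; its arguments end in "\n".
data = pickle.dumps({"name": "tool", "size": 3}, protocol=0)

workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "text_mode.cfg")
binary_path = os.path.join(workdir, "binary_mode.cfg")

# Emulate Windows text mode: every "\n" written becomes "\r\n" on disk.
with open(text_path, "w", encoding="latin-1", newline="\r\n") as f:
    f.write(data.decode("latin-1"))

# Binary mode writes the pickle bytes untouched.
with open(binary_path, "wb") as f:
    f.write(data)

with open(text_path, "rb") as f:
    text_bytes = f.read()
with open(binary_path, "rb") as f:
    binary_bytes = f.read()

newline_count = data.count(b"\n")
```

The text-mode file ends up one byte longer per newline, so a reader that does no translation back sees a stray "\r" before each line end, which matches the Windows-to-Unix failure reported at the top of the thread.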

Is there more obscure stuff going on on the other systems you
mentioned (Mac OS, VAX) ?

I think on Mac Classic it was *just* line end differences. Native VAX
has many file formats. "Record-based" file formats used to be very
popular. There the OS saves meta-information in the file, such as
each record contains an offset to the start of the next record, and
may even contain an index structure to support random access to
records quickly (for example, "a line" may be a record, and "read the
last line" may go quickly). Read that in binary mode, and you'll be
reading up the bits in the index and offsets too, etc. IIRC, Unix was
actually quite novel at the time in insisting that all files were just
raw byte streams to the OS.

(That means that the bug in SimpleHTTPServer that my patch
839496 addressed also occurred on those systems? Or that
the patch may be incorrect after all??)

Don't know, and (sorry) no time to dig.

While your argument about why Python doesn't use its own
platform-independent file format is sound of course, I often find it
a nuisance that platform-specific things trickle through into Python
itself and ultimately in the programs you write. I sometimes feel
that some parts of Python expose the underlying C/os
implementation a bit too much. Python never claimed write once
run anywhere (as that other language does) but it would have
been nice nevertheless ;-)
In practice it's just not possible I guess.

It would be difficult at best. Python hides a lot of platform crap,
but generally where it's reasonably easy to hide. It's not easy to
hide native file conventions, partly because Python wouldn't play well
with *other* platform software if it did.

Remember that Guido worked on ABC before Python, and Python is in
(small) part a reaction against the extremes of ABC. ABC was 100%
platform-independent. You could read and write files from ABC.
However, the only files you could read from ABC were files that were
written by ABC -- and files written by ABC were essentially unusable
by other software. Socket semantics were also 100% portable in ABC:
it didn't have sockets, nor any way to extend the language to add
them. Etc -- ABC was a self-contained universe. "Plays well with
others" was a strong motivator for Python's design, and that often
means playing by others' rules.
 
Skip Montanaro

Tim> "Plays well with others" was a strong motivator for Python's
Tim> design, and that often means playing by others' rules.

My vote for QOTW... Is it too late to slip it into the Zen of Python?

Skip
 
Cameron Laird

.
.
.
reading up the bits in the index and offsets too, etc. IIRC, Unix was
actually quite novel at the time in insisting that all files were just
raw byte streams to the OS.

Not just "novel", but "puzzling" and even "controversial".
It was far from clear that the Unix way could be successful.
.
.
.
but generally where it's reasonably easy to hide. It's not easy to
hide native file conventions, partly because Python wouldn't play well
with *other* platform software if it did.

Remember that Guido worked on ABC before Python, and Python is in
(small) part a reaction against the extremes of ABC. ABC was 100%
platform-independent. You could read and write files from ABC.
However, the only files you could read from ABC were files that were
written by ABC -- and files written by ABC were essentially unusable
by other software. Socket semantics were also 100% portable in ABC:
it didn't have sockets, nor any way to extend the language to add
them. Etc -- ABC was a self-contained universe. "Plays well with
others" was a strong motivator for Python's design, and that often
means playing by others' rules.

At a slightly different level, that--not playing well enough
with others--is what held Smalltalk back. Again, a lot of
this stuff wasn't obvious at the time, even as late as 1990.
I think we understand better now that languages are secondary,
in that good developers can be productive with all sorts of
syntaxes and semantics; as a practical matter, daily struggles
have to do with the libraries or how the languages access what
is outside themselves.
 
Nick Coghlan

Skip said:
Tim> "Plays well with others" was a strong motivator for Python's
Tim> design, and that often means playing by others' rules.

My vote for QOTW... Is it too late to slip it into the Zen of Python?

It would certainly fit, and the existing koans don't really cover the concept.

Its addition also seems fitting in light of the current PEP 246 discussion which
is *all* about playing well with others :)

Cheers,
Nick.
 
