why is bytearray treated so inefficiently by pickle?

Irmen de Jong

Hi,

A bytearray is pickled (using max protocol) as follows:
pickletools.dis(pickle.dumps(bytearray([255]*10),2))
0: \x80 PROTO 2
2: c GLOBAL '__builtin__ bytearray'
25: q BINPUT 0
27: X BINUNICODE u'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
52: q BINPUT 1
54: U SHORT_BINSTRING 'latin-1'
63: q BINPUT 2
65: \x86 TUPLE2
66: q BINPUT 3
68: R REDUCE
69: q BINPUT 4
71: . STOP
(<type 'bytearray'>, (u'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff', 'latin-1'), None)


Is there a particular reason it is encoded so inefficiently? Most notably, the actual
*bytes* in the bytearray are represented by a UTF-8 encoded unicode string. When
unpickled, this has to be decoded into a unicode string and then encoded back into
bytes via latin-1. The thing being a bytearray, I would expect it to be pickled as
such: a sequence of bytes, and then converted back to a bytearray using the constructor
that takes the bytes directly (via the BINSTRING/BINBYTES pickle opcodes).

The above occurs both on Python 2.x and 3.x.
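The overhead is easy to measure; a quick sketch on Python 3 (exact sizes may vary a
little between versions), comparing protocol 2 with the highest available protocol:

```python
import pickle

ba = bytearray([255] * 1000)
proto2 = pickle.dumps(ba, 2)                      # payload via UTF-8 encoded unicode
best = pickle.dumps(ba, pickle.HIGHEST_PROTOCOL)  # payload as raw bytes
# Every byte >= 0x80 costs two bytes in the UTF-8 form, so the protocol 2
# pickle is roughly twice the size of the raw-bytes one.
print(len(proto2), len(best))
```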

Any ideas? Candidate for a patch?


Irmen.
 
Terry Reedy

> Is there a particular reason it is encoded so inefficiently? [...] The
> thing being a bytearray, I would expect it to be pickled as such: a
> sequence of bytes.
>
> Any ideas? Candidate for a patch?

Possibly. The two developers listed as particularly interested in pickle
are 'alexandre.vassalotti,pitrou' (antoine), so if you do open a tracker
issue, add them as nosy.

Take a look at http://www.python.org/dev/peps/pep-3154/
by Antoine Pitrou, or forward your message to him.
 
Irmen de Jong

> Possibly. The two developers listed as particularly interested in pickle are
> 'alexandre.vassalotti,pitrou' (antoine), so if you do open a tracker issue, add them
> as nosy.
>
> Take a look at http://www.python.org/dev/peps/pep-3154/
> by Antoine Pitrou, or forward your message to him.

Created a bug report + patches, http://bugs.python.org/issue13503
I've read the PEP, thanks, it was interesting. But I don't think my changes require a
new pickle protocol version bump.

Irmen
 
John Ladasky

On a related note, pickling of arrays of float64 objects, as generated
by the numpy package for example, is wildly inefficient with memory.
A half-million float64s require about 4 megabytes, but the pickle
file I generated from a numpy.ndarray of this size was 42 megabytes.

I know that numpy has its own pickle protocol, and that it's supposed
to help with this problem. Still, if this is a general problem with
Python and pickling numbers, it might be worth solving it in the
language itself.
 
Irmen de Jong

> [...] I know that numpy has its own pickle protocol, and that it's supposed
> to help with this problem. Still, if this is a general problem with
> Python and pickling numbers, it might be worth solving it in the
> language itself.

Python provides ample ways for custom types to influence the way they're
pickled (getstate/setstate, reduce).
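For instance, a class can sidestep the problem itself; a sketch with an invented
Packet class (hypothetical, purely for illustration) whose __reduce__ hands pickle
raw bytes instead of letting the default bytearray reduction run:

```python
import pickle

class Packet:
    """Hypothetical container for a mutable byte payload."""
    def __init__(self, payload):
        self.payload = bytearray(payload)

    def __reduce__(self):
        # Pickle the payload as raw bytes rather than going through
        # the inefficient unicode-string reduction of bytearray.
        return (Packet, (bytes(self.payload),))

restored = pickle.loads(pickle.dumps(Packet(b"\xff" * 4), 2))
print(restored.payload)
```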

Are numpy's arrays pickled similarly to Python's own array type? In
that case, when using Python 2.x, they're pickled very inefficiently
indeed (every element is encoded with its own token). In Python 3.x,
array pickling is very efficient because it stores the machine type
representation in the pickle.
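A sketch of the difference using the stdlib array module on Python 3 (sizes are
approximate): below protocol 3 the array is rebuilt element by element, while at
protocol 3 and above the pickle carries the raw machine representation:

```python
import array
import pickle

a = array.array('d', [i / 7 for i in range(1000)])       # 8000 bytes of doubles
per_item = len(pickle.dumps(a, 2))                       # one opcode per element
machine = len(pickle.dumps(a, pickle.HIGHEST_PROTOCOL))  # raw machine bytes
print(per_item, machine)
```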


Irmen
 
Robert Kern

> [...] I know that numpy has its own pickle protocol, and that it's supposed
> to help with this problem. Still, if this is a general problem with
> Python and pickling numbers, it might be worth solving it in the
> language itself.

It is. Use protocol=HIGHEST_PROTOCOL when dumping the array to a pickle.

[~]
|1> big = np.linspace(0.0, 1.0, 500000)

[~]
|2> import cPickle

[~]
|3> len(cPickle.dumps(big))
11102362

[~]
|4> len(cPickle.dumps(big, protocol=cPickle.HIGHEST_PROTOCOL))
4000135


The original conception for pickle was that it would have an ASCII
representation for optimal cross-platform compatibility. These were the days
when people still used FTP regularly, and you could easily (and silently!) screw
up binary data if you sent it in ASCII mode by accident. This necessarily
creates large files for numpy arrays. Further iterations on the pickling
protocol let numpy use raw binary data in the pickle. However, for backwards
compatibility, the default protocol is the one Python started out with. If you
explicitly use the most recent protocol, then you will get the efficiency benefits.
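The same effect shows up without numpy: protocol 0 writes each float as its ASCII
repr, while the binary protocols store a fixed 8-byte double per value. A sketch:

```python
import pickle

values = [i / 7 for i in range(1000)]       # most reprs run to 17+ characters
ascii_size = len(pickle.dumps(values, 0))   # protocol 0: 'F' + repr + newline
binary_size = len(pickle.dumps(values, 2))  # protocol 2: 9 bytes per float
print(ascii_size, binary_size)
```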

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Irmen de Jong

> On Nov 30, Irmen de Jong opened a tracker issue with a patch to improve bytearray
> pickling: http://bugs.python.org/issue13503
>
> Yesterday, Dec 5, Antoine Pitrou applied a revised fix:
> http://hg.python.org/cpython/rev/e2959a6a1440/
> The commit message:
> "Issue #13503: Use a more efficient reduction format for bytearrays with pickle
> protocol under Python 2."

Sure, but this patch only improved the pickle behavior of the bytearray type for
protocol level 3. It didn't touch Python 2.x, nor the pickling of arrays (array.array),
let alone numpy arrays.

Irmen
 
