why is bytearray treated so inefficiently by pickle?

Irmen de Jong

Hi,

A bytearray is pickled (using max protocol) as follows:
pickletools.dis(pickle.dumps(bytearray([255]*10),2))
0: \x80 PROTO 2
2: c GLOBAL '__builtin__ bytearray'
25: q BINPUT 0
27: X BINUNICODE u'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
52: q BINPUT 1
54: U SHORT_BINSTRING 'latin-1'
63: q BINPUT 2
65: \x86 TUPLE2
66: q BINPUT 3
68: R REDUCE
69: q BINPUT 4
71: . STOP
(<type 'bytearray'>, (u'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff', 'latin-1'), None)


Is there a particular reason it is encoded so inefficiently? Most notably, the actual
*bytes* in the bytearray are represented by a UTF-8 encoded unicode string. When
unpickled, this has to be decoded into a unicode string and then encoded back into
bytes via latin-1. The thing being a bytearray, I would expect it to be pickled as
such: a sequence of bytes, and then converted back to a bytearray using the constructor
that takes the bytes directly (via the BINSTRING/BINBYTES pickle opcodes).

The above occurs both on Python 2.x and 3.x.
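The overhead is easy to measure; a quick sketch on Python 3 (exact sizes may vary a
little between versions), comparing protocol 2 with the highest available protocol:

```python
import pickle

ba = bytearray([255] * 1000)
proto2 = pickle.dumps(ba, 2)                      # payload via UTF-8 encoded unicode
best = pickle.dumps(ba, pickle.HIGHEST_PROTOCOL)  # payload as raw bytes
# Every byte >= 0x80 costs two bytes in the UTF-8 form, so the protocol 2
# pickle is roughly twice the size of the raw-bytes one.
print(len(proto2), len(best))
```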

Any ideas? Candidate for a patch?


Irmen.
 
Terry Reedy

> Is there a particular reason it is encoded so inefficiently? [...] The
> thing being a bytearray, I would expect it to be pickled as such: a
> sequence of bytes.
>
> Any ideas? Candidate for a patch?

Possibly. The two developers listed as particularly interested in pickle
are 'alexandre.vassalotti,pitrou' (antoine), so if you do open a tracker
issue, add them as nosy.

Take a look at http://www.python.org/dev/peps/pep-3154/
by Antoine Pitrou, or forward your message to him.
 
Irmen de Jong

> Possibly. The two developers listed as particularly interested in pickle are
> 'alexandre.vassalotti,pitrou' (antoine), so if you do open a tracker issue, add them
> as nosy.
>
> Take a look at http://www.python.org/dev/peps/pep-3154/
> by Antoine Pitrou, or forward your message to him.

Created a bug report + patches, http://bugs.python.org/issue13503
I've read the PEP, thanks, it was interesting. But I don't think my changes require a
new pickle protocol version bump.

Irmen
 
John Ladasky

On a related note, pickling of arrays of float64 objects, as generated
by the numpy package for example, is wildly inefficient with memory.
A half-million float64s require about 4 megabytes, but the pickle
file I generated from a numpy.ndarray of this size was 42 megabytes.

I know that numpy has its own pickle protocol, and that it's supposed
to help with this problem. Still, if this is a general problem with
Python and pickling numbers, it might be worth solving it in the
language itself.
 
Irmen de Jong

> [...] I know that numpy has its own pickle protocol, and that it's supposed
> to help with this problem. Still, if this is a general problem with
> Python and pickling numbers, it might be worth solving it in the
> language itself.

Python provides ample ways for custom types to influence the way they're
pickled (getstate/setstate, reduce).
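For instance, a class can sidestep the problem itself; a sketch with an invented
Packet class (hypothetical, purely for illustration) whose __reduce__ hands pickle
raw bytes instead of letting the default bytearray reduction run:

```python
import pickle

class Packet:
    """Hypothetical container for a mutable byte payload."""
    def __init__(self, payload):
        self.payload = bytearray(payload)

    def __reduce__(self):
        # Pickle the payload as raw bytes rather than going through
        # the inefficient unicode-string reduction of bytearray.
        return (Packet, (bytes(self.payload),))

restored = pickle.loads(pickle.dumps(Packet(b"\xff" * 4), 2))
print(restored.payload)
```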

Are numpy's arrays pickled similarly to Python's own array type? In
that case, when using Python 2.x, they're pickled very inefficiently
indeed (every element is encoded with its own token). In Python 3.x,
array pickling is very efficient because it stores the machine type
representation in the pickle.
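A sketch of the difference using the stdlib array module on Python 3 (sizes are
approximate): below protocol 3 the array is rebuilt element by element, while at
protocol 3 and above the pickle carries the raw machine representation:

```python
import array
import pickle

a = array.array('d', [i / 7 for i in range(1000)])       # 8000 bytes of doubles
per_item = len(pickle.dumps(a, 2))                       # one opcode per element
machine = len(pickle.dumps(a, pickle.HIGHEST_PROTOCOL))  # raw machine bytes
print(per_item, machine)
```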


Irmen
 
Robert Kern

> [...] I know that numpy has its own pickle protocol, and that it's supposed
> to help with this problem. Still, if this is a general problem with
> Python and pickling numbers, it might be worth solving it in the
> language itself.

It is. Use protocol=HIGHEST_PROTOCOL when dumping the array to a pickle.

[~]
|1> big = np.linspace(0.0, 1.0, 500000)

[~]
|2> import cPickle

[~]
|3> len(cPickle.dumps(big))
11102362

[~]
|4> len(cPickle.dumps(big, protocol=cPickle.HIGHEST_PROTOCOL))
4000135


The original conception for pickle was that it would have an ASCII
representation for optimal cross-platform compatibility. These were the days
when people still used FTP regularly, and you could easily (and silently!) screw
up binary data if you sent it in ASCII mode by accident. This necessarily
creates large files for numpy arrays. Further iterations on the pickling
protocol let numpy use raw binary data in the pickle. However, for backwards
compatibility, the default protocol is the one Python started out with. If you
explicitly use the most recent protocol, then you will get the efficiency benefits.
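The same effect shows up without numpy: protocol 0 writes each float as its ASCII
repr, while the binary protocols store a fixed 8-byte double per value. A sketch:

```python
import pickle

values = [i / 7 for i in range(1000)]       # most reprs run to 17+ characters
ascii_size = len(pickle.dumps(values, 0))   # protocol 0: 'F' + repr + newline
binary_size = len(pickle.dumps(values, 2))  # protocol 2: 9 bytes per float
print(ascii_size, binary_size)
```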

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Irmen de Jong

> On Nov 30, Irmen de Jong opened a tracker issue with a patch to improve bytearray
> pickling: http://bugs.python.org/issue13503
>
> Yesterday, Dec 5, Antoine Pitrou applied a revised fix:
> http://hg.python.org/cpython/rev/e2959a6a1440/
> The commit message:
> "Issue #13503: Use a more efficient reduction format for bytearrays with pickle
> protocol under Python 2."

Sure, but this patch only improved the pickle behavior of the bytearray type for
protocol level 3. It didn't touch Python 2.x, nor the pickling of arrays (array.array),
let alone numpy arrays.

Irmen
 
