zlib interface semi-broken

Travis · Feb 10, 2009

Hello all,

The zlib interface does not indicate when you've hit the end of a compressed stream.

The underlying zlib functionality provides for this.

With Python's zlib, the only signal is indirect: you have to feed the
decompressor bytes beyond the end of the compressed stream, and those
trailing uncompressed bytes end up in Decompress.unused_data.
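
To make the behaviour concrete, here is a minimal sketch. The payload is made up for illustration: one zlib stream followed directly by plain bytes, as a mixed protocol might send them.

```python
import zlib

# Hypothetical wire data: a zlib stream immediately followed by plain bytes.
payload = zlib.compress(b"compressed part") + b"PLAIN"

d = zlib.decompressobj()
out = d.decompress(payload)

# The decompressor stops at the end of the zlib stream. Whatever follows
# is not decompressed; it is left in unused_data.
print(out)
print(d.unused_data)
```

If nothing follows the compressed block, unused_data stays empty, so it cannot be used on its own to tell "stream finished" from "more compressed data still to come".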

As a result, if you've got a network protocol which mixes compressed
and non-compressed output, you may find a compressed block ending with
no uncompressed data following until you send another command -- which
a synchronous (non-pipelined) client will not send, because it is waiting
for the [compressed] data from the previous command to be finished.

The outcome is a protocol deadlock: the client waits for the end of the
compressed data, and the server waits for the next command.

A simple way to fix this would be to add a finished attribute to the
Decompress object.
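
For readers on newer interpreters: Python 3.3 and later expose such a flag under the name Decompress.eof. The sketch below assumes that attribute is available; the helper name read_compressed_block and the 16-byte chunking are made up for illustration.

```python
import zlib

def read_compressed_block(chunks):
    """Decompress chunks until the zlib stream ends.

    Returns (data, leftover, finished). Assumes Python 3.3+ for d.eof.
    """
    d = zlib.decompressobj()
    parts = []
    for chunk in chunks:
        parts.append(d.decompress(chunk))
        if d.eof:  # True once the end of the compressed stream is reached
            break
    return b"".join(parts), d.unused_data, d.eof

# Simulate a reply arriving in small network reads, with nothing after it.
stream = zlib.compress(b"reply " * 100)
chunks = [stream[i:i + 16] for i in range(0, len(stream), 16)]
data, leftover, finished = read_compressed_block(chunks)
print(finished, len(data), leftover)
```

The reader can stop as soon as finished is true, without needing any trailing uncompressed byte to arrive first.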

However, perhaps this would be a good time to discuss how this library
works; it is somewhat awkward and perhaps there are other changes which
would make it cleaner.

What does the python community think?

Paul Rubin · Feb 10, 2009

Travis said:
However, perhaps this would be a good time to discuss how this library
works; it is somewhat awkward and perhaps there are other changes which
would make it cleaner.

What does the python community think?

It is missing some other features too, like the ability to preload
a dictionary. I'd support extending the interface.
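
As a sketch of what dictionary preloading looks like: Python 3.3 and later accept a zdict argument on zlib.compressobj and zlib.decompressobj. The dictionary and message below are invented for illustration.

```python
import zlib

# Hypothetical shared dictionary: bytes expected to recur in messages.
shared_dict = b'{"status": "ok", "result": '
message = b'{"status": "ok", "result": 42}'

# Both sides must be given the same dictionary (assumes Python 3.3+).
c = zlib.compressobj(zdict=shared_dict)
packed = c.compress(message) + c.flush()

d = zlib.decompressobj(zdict=shared_dict)
unpacked = d.decompress(packed)
print(unpacked == message)
```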

Travis · Feb 10, 2009

Perhaps you could submit a patch with such a change?

Yes, I will try and get to that this week.

Well, it might be improvable, I haven't really looked. I personally
would like it and bz2 to get closer to each other in interface, rather
than to spread out. So if you are really opening up a can of worms,
I vote for two cans.

Well, I like this idea; perhaps this is a good time to discuss the
equivalent of some "abstract base classes", or "interfaces", for
compression.

As I see it, the fundamental abstractions are the stream-oriented
de/compression routines. Given those, one should easily be able to
implement one-shot de/compression of strings. In fact, that is the
way that zlib is implemented; the base functions are the
stream-oriented ones and there is a layer on top of convenience
functions that do one-shot compression and decompression.
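
A minimal sketch of that layering, using the existing zlib stream objects. The function names one_shot_compress and one_shot_decompress are made up here; they are not part of the zlib module.

```python
import zlib

def one_shot_compress(data, level=6):
    """One-shot compression built from the stream-oriented object."""
    c = zlib.compressobj(level)
    return c.compress(data) + c.flush()

def one_shot_decompress(blob):
    """One-shot decompression built from the stream-oriented object."""
    d = zlib.decompressobj()
    return d.decompress(blob) + d.flush()

text = b"the same bytes, over and over. " * 20
packed = one_shot_compress(text)
print(one_shot_decompress(packed) == text)
```

The output is interchangeable with the module's own convenience functions: zlib.decompress() reads what one_shot_compress() produces, and vice versa.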

After examining the bz2 module, I notice that it has a file-like
interface called BZ2File, which is roughly analogous to the gzip
module. That file interface could form a third API, and basically
conform to what Python expects of files.

So what I suggest is a common framework of three APIs: a sequential
compression/decompression API for streams, a layer (potentially
generic) on top of those for strings/buffers, and a third API for
file-like access. Presumably the file-like access can be implemented
on top of the sequential API as well.
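
One possible shape for the three layers, as a sketch only. Every class and function name here (StreamCompressor, ZlibCompressor, Bz2Compressor, compress_bytes, CompressedWriter) is hypothetical; only the zlib and bz2 calls underneath are real.

```python
import abc
import bz2
import io
import zlib


class StreamCompressor(abc.ABC):
    """Hypothetical common interface for sequential compressors."""

    @abc.abstractmethod
    def compress(self, data):
        """Consume some input and return whatever output is ready."""

    @abc.abstractmethod
    def flush(self):
        """Finish the stream and return the remaining output."""


class ZlibCompressor(StreamCompressor):
    def __init__(self):
        self._c = zlib.compressobj()

    def compress(self, data):
        return self._c.compress(data)

    def flush(self):
        return self._c.flush()


class Bz2Compressor(StreamCompressor):
    def __init__(self):
        self._c = bz2.BZ2Compressor()

    def compress(self, data):
        return self._c.compress(data)

    def flush(self):
        return self._c.flush()


def compress_bytes(compressor, data):
    """String/buffer layer: one-shot compression on top of the stream layer."""
    return compressor.compress(data) + compressor.flush()


class CompressedWriter:
    """File-like layer: write() and close() on top of the stream layer."""

    def __init__(self, fileobj, compressor):
        self._fileobj = fileobj
        self._compressor = compressor

    def write(self, data):
        self._fileobj.write(self._compressor.compress(data))
        return len(data)

    def close(self):
        # Emit the tail of the stream; the underlying file is left open.
        self._fileobj.write(self._compressor.flush())


buf = io.BytesIO()
writer = CompressedWriter(buf, ZlibCompressor())
writer.write(b"hello, ")
writer.write(b"world")
writer.close()
print(zlib.decompress(buf.getvalue()))
```

The two upper layers never touch zlib or bz2 directly, so they can stay in pure Python and work with any compressor that implements the two stream methods.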

If the sequential de/compression routines are indeed primitive, and
sufficient for the implementation of the other two APIs, then that
gives us the option of implementing the other "upper" two layers in
pure python, potentially simplifying the amount of extension code that
has to be written. I see that as desirable, since it gives us options
for writing the upper two layers: in pure Python, or by writing
extensions to the C code where available.

I seem to recall a number of ancillary functions in zlib, such as
those for loading a compression dictionary. There are also options
such as flushing the compression in order to be able to resynchronize
should part of the archive become garbled. Where these functions are
available, they could be implemented, though it would be desirable to
give them the same name in each module to allow client code to test
for their existence in a compression-agnostic way.
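
A short sketch of both points, using zlib's existing flush modes. Z_FULL_FLUSH is the mode zlib documents for restart points; the supports() helper is hypothetical and only illustrates the "test for existence" idea.

```python
import zlib

def supports(obj, feature):
    """Hypothetical helper: does this object offer the named method?"""
    return callable(getattr(obj, feature, None))

c = zlib.compressobj()
d = zlib.decompressobj()

has_copy = supports(c, "copy")
has_bogus = supports(c, "no_such_feature")

# A full flush pushes out everything compressed so far and resets the
# compression state, so decompression could restart from this point.
wire1 = c.compress(b"first record\n") + c.flush(zlib.Z_FULL_FLUSH)
first = d.decompress(wire1)

# The stream is still open; later data continues it until the final flush.
wire2 = c.compress(b"second record\n") + c.flush()
second = d.decompress(wire2)

print(first, second, has_copy, has_bogus)
```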

For what it's worth, I would rather see a pythonic interface to the
libraries than a simple-as-can-be wrapper around the C functions. I
personally find it annoying to have to drop down to non-OOP styles in
a python program in order to use a C library. It doesn't matter to me
whether the OOP layer is added atop the C library in pure python or in
the C-to-python binding; that is an implementation detail to me, and I
suspect to most python programmers. They don't care, they just want
it easy to use from python. If performance turns out to matter, and
the underlying compression library supports an "upper layer" in C,
then we have the option for using that code.

So my suggestion is that we (the python users) brainstorm on how we
want the API to look, and not focus on the underlying library except
insofar as it informs our discussion of the proper APIs - for example,
features such as flushing state, setting compression levels/windows,
or for resynchronization points.

My further suggestion is that we start with the sequential
de/compression, since it seems like a fundamental primitive.
De/compressing strings will be trivial, and the file-like interface is
already described by Python.

So my first suggestion on the stream de/compression API thread is:

The sequential de/compression needs to be capable of returning
more than just the de/compressed data. It should at least be
capable of returning end-of-stream conditions and possibly
other states as well. I see a few ways of implementing this:

1) The de/compression object holds state in various members such as
data input buffers, data output buffers, and a status member that
indicates conditions such as synchronization points or end-of-stream.
Member functions are called and primarily manipulate the data members
of the object.

2) The de/compression object has routines for reading de/compressed
data, and reports conditions such as end-of-stream or resynchronization
points as exceptions, much like the file class can throw EOFError. My problem
with this is that client code has to be cognizant of the possible
exceptions that might be thrown, and so one cannot easily add new
exceptions should the need arise. For example, if we add an exception
to indicate a possible resynchronization point, client code may not
be capable of handling it as a non-fatal exception.
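
The two options can be sketched side by side. All names here (StatefulDecompressor, RaisingDecompressor, EndOfStream, feed) are hypothetical, and both wrappers assume Python 3.3+ for the eof attribute of the underlying zlib object.

```python
import zlib


class EndOfStream(Exception):
    """Hypothetical exception used by option 2."""


class StatefulDecompressor:
    """Option 1: conditions are exposed as members of the object."""

    def __init__(self):
        self._d = zlib.decompressobj()
        self.finished = False
        self.leftover = b""

    def feed(self, data):
        out = self._d.decompress(data)
        self.finished = self._d.eof
        self.leftover = self._d.unused_data
        return out


class RaisingDecompressor:
    """Option 2: end-of-stream is reported by raising an exception."""

    def __init__(self):
        self._d = zlib.decompressobj()

    def feed(self, data):
        out = self._d.decompress(data)
        if self._d.eof:
            # The last piece of data has to travel inside the exception.
            raise EndOfStream(out)
        return out


wire = zlib.compress(b"abc") + b"X"

s = StatefulDecompressor()
data1 = s.feed(wire)

r = RaisingDecompressor()
try:
    data2 = r.feed(wire)
except EndOfStream as exc:
    data2 = exc.args[0]

print(data1, s.finished, s.leftover, data2)
```

With option 1, a caller that ignores a newly added member keeps working. With option 2, a caller that does not know about a newly added exception stops at the raise, which is the concern described above.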

Thoughts?

Paul Rubin · Feb 10, 2009

Scott David Daniels said:
I suspect that is why such an interface never came up (If
you can clone states, then you can say: "compress this, then use the
resultant state to compress/decompress others.")

The zlib C interface supports something like that. It is just not
exported to the python application. It should be.
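
For reference, Python's zlib does offer one form of state cloning, Compress.copy(). Whether it covers the use being discussed here is an assumption; the sketch below only shows the shared-prefix case, and the prefix and bodies are invented.

```python
import zlib

prefix = b"HEADER common to every message\n"

# Compress the shared prefix once and keep the resulting state.
base = zlib.compressobj()
head = base.compress(prefix)

def finish_with(suffix):
    """Clone the post-prefix state and complete one stream from it."""
    c = base.copy()
    return head + c.compress(suffix) + c.flush()

stream_a = finish_with(b"body A")
stream_b = finish_with(b"body B")

print(zlib.decompress(stream_a))
print(zlib.decompress(stream_b))
```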

Paul Rubin · Feb 11, 2009

Scott David Daniels said:
Seems like we may want to say things like, "synchronization points are
to be silently ignored."

That would completely break some useful possible applications, so should
be avoided.

Paul Rubin · Feb 11, 2009

Scott David Daniels said:
No, I mean that we, _the_users_of_the_interface_, may want to say, ....
That is, I'd like that behavior as an option.

I don't see any reason to want that (rather than letting the application
handle it) but I'll take your word for it.
