Deflate with urllib2...

Sam · Sep 9, 2008

I'm using urllib2 and accepting gzip and deflate.

It turns out that almost every site returns either normal text or
gzip. But I finally found one that returns deflate.

Here's how I un-gzip:
compressedstream = StringIO.StringIO(data)
gzipper = gzip.GzipFile(fileobj=compressedstream)
data = gzipper.read()

Un-gzipping works great!

Here's how I un-deflate (inflate??)
data = zlib.decompress(data)

Un-deflating doesn't work. I get "zlib.error: Error -3 while
decompressing data: incorrect header check"

I'm using python 2.5.2. Can someone tell me exactly how to handle
deflated web pages?

Thanks

Mohamed Yousef · Sep 9, 2008

Try this
http://www.paul.sladen.org/projects/pyflate/

Gabriel Genellina · Sep 10, 2008

zlib.decompress should work - can you provide a site that uses deflate to
test?

Sam · Sep 17, 2008

Gabriel, et al.

It's hard to find a web site that uses deflate these days.

Luckily, slashdot to the rescue.

I even wrote a test script.

If someone can tell me what's wrong that would be great.

Here's what I get when I run it:
Data is compressed using deflate. Length is: 107160
Traceback (most recent call last):
  File "my_deflate_test.py", line 19, in <module>
    data = zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

And here's my test script:

#!/usr/bin/env python

import urllib2
import zlib

opener = urllib2.build_opener()
opener.addheaders = [('Accept-encoding', 'deflate')]

stream = opener.open('http://www.slashdot.org')
data = stream.read()
encoded = stream.headers.get('Content-Encoding')

if encoded == 'deflate':
    print "Data is compressed using deflate. Length is: ", str(len(data))
    data = zlib.decompress(data)
    print "After uncompressing, length is: ", str(len(data))
else:
    print "Data is not deflated."

Gabriel Genellina · Sep 18, 2008

That error is legitimate. The slashdot server is sending bogus data:

py> s = socket.socket()
py> s.connect(('slashdot.org',80))
py> s.sendall("GET / HTTP/1.1\r\nHost: slashdot.org\r\nAccept-Encoding: deflate\r\n\r\n")
py> s.recv(500)
'HTTP/1.1 200 OK\r\nDate: Thu, 18 Sep 2008 20:48:34 GMT\r\nServer: Apache/1.3.41 (Unix) mod_perl/1.31-rc4\r\nSLASH_LOG_DATA: shtml\r\nX-Powered-By: Slash 2.005001220\r\nX-Bender: Alright! Closure!\r\nCache-Control: private\r\nPragma: private\r\nConnection: close\r\nContent-Type: text/html; charset=iso-8859-1\r\nVary: Accept-Encoding, User-Agent\r\nContent-Encoding: deflate\r\nTransfer-Encoding: chunked\r\n\r\n1c76\r\n\x02\x00\x00\x00\xff\xff\x00\xc1\x0f>\xf0<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n "http://www.w3.org/TR/html4/str...'

Note those 11 bytes starting with "\x02\x00\x00\x00\xff...", followed by the
page contents in plain text.

According to RFC 2616 (HTTP 1.1), the deflate content coding consists of
the "zlib" format defined in RFC 1950 in combination with the "deflate"
compression mechanism described in RFC 1951. RFC 1950 says that the lower
4 bits of the first byte in a zlib stream represent the compression
method; the only compression method defined is "deflate", with value 8. The
slashdot data contains a 2 instead, so it is not a valid zlib stream.
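
The RFC 1950 header check described above can be sketched in a few lines. This is only an illustration, not code from the exchange: the helper name looks_like_zlib_header is made up, and the sample bytes are the 11 bytes copied from the slashdot response. It is written to run on a current Python 3.

```python
import zlib

def looks_like_zlib_header(data):
    """Return True if the first two bytes form a valid RFC 1950 header.

    Hypothetical helper for illustration; it inspects the header only
    and does not try to decompress anything.
    """
    if len(data) < 2:
        return False
    cmf, flg = bytearray(data[:2])
    method = cmf & 0x0F            # CM: lower 4 bits, must be 8 (deflate)
    window_bits = (cmf >> 4) + 8   # CINFO: log2(window size) - 8, so at most 15
    # FCHECK is chosen so that the 16-bit value CMF*256 + FLG is a multiple of 31.
    check_ok = (cmf * 256 + flg) % 31 == 0
    return method == 8 and window_bits <= 15 and check_ok

# The 11 bytes that precede the plain-text page in the slashdot response.
slashdot_prefix = b"\x02\x00\x00\x00\xff\xff\x00\xc1\x0f>\xf0"

print(looks_like_zlib_header(slashdot_prefix))          # False: CM is 2, not 8
print(looks_like_zlib_header(zlib.compress(b"hello")))  # True
```

A check like this only tells you whether the zlib wrapper is present; it says nothing about whether the bytes that follow are a usable deflate stream.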

Your code is correct; try it with another server. I tested it against a
lighttpd server and it worked fine.

Sam · Sep 19, 2008

Gabriel...

I found a bunch of servers to test it on. It fails on every server I
could find (sans one).

Here are the ones it fails on:
slashdot.org
hotmail.com
godaddy.com
linux.com
lighttpd.net

I did manage to find one webserver it succeeded on: kenrockwel.com, a
domain squatter on a typo of one of my favorite photographer's websites
(the actual website is kenrockwell.com).

This squatter's site is indeed running lighttpd---but it appears to be
an earlier version, because the official lighttpd site fails on this
test.

We have all the major web servers failing the test:
* Apache 1.3
* Apache 2.2
* Microsoft-IIS/6.0
* lighttpd/1.5.0

So I think it's the python side that is wrong, regardless of what the
standard is.

What should I do next?

I've rewritten the code to make it easier to test. Just run it as is
and it will try all my test cases; or pass in a site on the command
line, and it will try just that.

Thanks!

#!/usr/bin/env python
"""Put the site you want to test as a command line parameter.
Otherwise tests the list of defaults."""

import urllib2
import zlib
import sys

opener = urllib2.build_opener()
opener.addheaders = [('Accept-encoding', 'deflate')]

try:
    sites = [sys.argv[1]]
except IndexError:
    sites = ['http://slashdot.org', 'http://www.hotmail.com',
             'http://www.godaddy.com', 'http://www.linux.com',
             'http://www.lighttpd.net', 'http://www.kenrockwel.com']

for site in sites:
    print "Trying: ", site
    stream = opener.open(site)
    data = stream.read()
    encoded = stream.headers.get('Content-Encoding')
    server = stream.headers.get('Server')

    print " %s - %s (%s)" % (site, server, encoded)

    if encoded == 'deflate':
        before = len(data)
        try:
            data = zlib.decompress(data)
            after = len(data)
            print " Able to decompress...went from %i to %i." % (before, after)
        except zlib.error:
            print " Errored out on this site."
    else:
        print " Data is not deflated."
    print

Sam · Sep 19, 2008

For those that are interested, but don't want to bother running the
program themselves, here's the output I get.

Trying: http://slashdot.org
http://slashdot.org - Apache/1.3.41 (Unix) mod_perl/1.31-rc4 (deflate)
Errored out on this site.

Trying: http://www.hotmail.com
http://www.hotmail.com - Microsoft-IIS/6.0 (deflate)
Errored out on this site.

Trying: http://www.godaddy.com
http://www.godaddy.com - Microsoft-IIS/6.0 (deflate)
Errored out on this site.

Trying: http://www.linux.com
http://www.linux.com - Apache/2.2.8 (Unix) PHP/5.2.5 (deflate)
Errored out on this site.

Trying: http://www.lighttpd.net
http://www.lighttpd.net - lighttpd/1.5.0 (deflate)
Errored out on this site.

Trying: http://www.kenrockwel.com
http://www.kenrockwel.com - lighttpd (deflate)
Able to decompress...went from 414 to 744.

Gabriel Genellina · Sep 19, 2008

I'll try to check later. Anyway, why are you so interested in deflate?
Both "deflate" and "gzip" coding use the same algorithm and generate
exactly the same compressed stream, the only difference being the header
and tail format. Have you found any server supporting deflate that doesn't
support gzip as well?
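
To make the header and trailer difference concrete, here is a small sketch. It is an illustration rather than anything from the exchange itself: the compress helper and the payload are made up, and it assumes the local zlib build accepts a wbits value of 16 + MAX_WBITS for gzip framing (zlib 1.2 and later do). It is written to run on a current Python 3.

```python
import zlib

def compress(payload, wbits):
    """Compress payload at level 6 with the framing selected by wbits."""
    c = zlib.compressobj(6, zlib.DEFLATED, wbits)
    return c.compress(payload) + c.flush()

payload = b"hello deflate " * 100  # made-up sample data

# Negative wbits: bare RFC 1951 deflate stream, no header or trailer.
raw = compress(payload, -zlib.MAX_WBITS)
# Plain wbits: RFC 1950 zlib framing (2-byte header, 4-byte Adler-32 trailer).
zlib_wrapped = compress(payload, zlib.MAX_WBITS)
# 16 + wbits: RFC 1952 gzip framing (10-byte minimal header,
# 8-byte trailer holding CRC-32 and the uncompressed length).
gzip_wrapped = compress(payload, 16 + zlib.MAX_WBITS)

print(len(raw), len(zlib_wrapped), len(gzip_wrapped))
# Strip each wrapper and compare what is left with the raw stream.
print(zlib_wrapped[2:-4] == raw)
print(gzip_wrapped[10:-8] == raw)
```

With the same compression settings, the bytes between header and trailer should come out identical, which is why a client that already handles gzip loses little by not asking for deflate.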

Gabriel Genellina · Sep 19, 2008

I've found the problem. The 2-byte zlib header is missing; the data begins
directly with the raw deflate stream. You can decode it by passing a
negative value for wbits:

try:
    data = zlib.decompress(data)
except zlib.error:
    data = zlib.decompress(data, -zlib.MAX_WBITS)

Note that this is clearly in violation of RFC 1950: the header is *not*
optional.

BTW, the curl developers had this same problem some time ago
<http://curl.haxx.se/mail/lib-2005-12/0130.html> and the proposed solution
is the same as above.
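
For anyone who wants to try the fallback without hitting a live server, here is a self-contained sketch. The inflate helper name and the sample payload are made up for the example, and the "raw" body is simulated by stripping the 2-byte header and 4-byte Adler-32 trailer from a normal zlib stream. It is written to run on a current Python 3; the zlib calls are the same ones used above.

```python
import zlib

def inflate(data):
    """Decode a 'deflate' body whether or not the zlib wrapper is present."""
    try:
        # Standards-compliant case: RFC 1950 header, deflate data, Adler-32.
        return zlib.decompress(data)
    except zlib.error:
        # Fallback: negative wbits makes zlib expect a raw RFC 1951 stream.
        return zlib.decompress(data, -zlib.MAX_WBITS)

# Simulated response bodies (made-up payload, no network access needed).
payload = b"<html>hello</html>" * 50
wrapped = zlib.compress(payload)  # what RFC 2616 asks for
raw = wrapped[2:-4]               # header and checksum stripped

print(inflate(wrapped) == payload)
print(inflate(raw) == payload)
```

One caveat with this approach: a raw stream carries no checksum, so when the fallback path is taken there is no Adler-32 integrity check on the result.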

This is the output from your test script modified as above. (Note that in
some cases, the compressed stream is larger than the uncompressed data):

Trying: http://slashdot.org
http://slashdot.org - Apache/1.3.41 (Unix) mod_perl/1.31-rc4 (deflate) len(deflate)=73174 len(gzip)=73208
Able to decompress...went from 73174 to 73073.

Trying: http://www.hotmail.com
http://www.hotmail.com - Microsoft-IIS/6.0 (deflate) len(deflate)=1609 len(gzip)=1635
Able to decompress...went from 1609 to 3969.

Trying: http://www.godaddy.com
http://www.godaddy.com - Microsoft-IIS/6.0 (deflate) len(deflate)=40646 len(gzip)=157141
Able to decompress...went from 40646 to 157141.

Trying: http://www.linux.com
http://www.linux.com - Apache/2.2.8 (Unix) PHP/5.2.5 (deflate) len(deflate)=52862 len(gzip)=52880
Able to decompress...went from 52862 to 52786.

Trying: http://www.lighttpd.net
http://www.lighttpd.net - lighttpd/1.5.0 (deflate) len(deflate)=5669 len(gzip)=5687
Able to decompress...went from 5669 to 15746.

Trying: http://www.kenrockwel.com
http://www.kenrockwel.com - lighttpd (deflate) len(deflate)=414 len(gzip)=426
Able to decompress...went from 414 to 744.

Sam · Sep 20, 2008

Gabriel...

Awesome! Thank you so much for the solution.

And yeah, I found exactly one website that, strangely enough, only does
deflate, not gzip. I'd rather not say which website it is, since it's
small and not mine. They may be few and far between, but they do
exist.

Thanks
