UnicodeDecodeError when fetching a web page

Barry

Hi,

The code below is giving me the error:

Traceback (most recent call last):
File "C:\Users\Administratör\Desktop\test.py", line 4, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1:
unexpected code byte


What am I doing wrong?

Thanks,

Barry

import urllib.request

request = urllib.request.Request(
    url='http://en.wiktionary.org/wiki/baby',
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) '
                           'Gecko/20071127 Firefox/2.0.0.11'})

response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')
 

Philip Semanchuk

Barry said:
Hi,

The code below is giving me the error:

Traceback (most recent call last):
File "C:\Users\Administratör\Desktop\test.py", line 4, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1:
unexpected code byte


What am I doing wrong?

Thanks,

Barry

import urllib.request

request = urllib.request.Request(
    url='http://en.wiktionary.org/wiki/baby',
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) '
                           'Gecko/20071127 Firefox/2.0.0.11'})

response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')


Well, for starters you're assuming that the response content is in
UTF-8. You need to examine the Content-Type header to see what the
encoding is. If it's not UTF-8, there's your problem.
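
For example, a minimal sketch of that check in Python 3's urllib (the URL and User-Agent string are just reused from the post above):

import urllib.request

request = urllib.request.Request(
    url='http://en.wiktionary.org/wiki/baby',
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) '
                           'Gecko/20071127 Firefox/2.0.0.11'})
response = urllib.request.urlopen(request)

# response.info() is an email.message.Message, so it can pull the
# charset parameter out of the Content-Type header for us.
charset = response.info().get_content_charset() or 'utf-8'
print('Declared charset:', charset)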


HTH
P
 

Barry

Philip Semanchuk said:
Well, for starters you're assuming that the response content is in
UTF-8. You need to examine the Content-Type header to see what the  
encoding is. If it's not UTF-8, there's your problem.

HTH
P

The content type is utf-8:

Date: Wed, 19 May 2010 19:17:39 GMT
Server: Apache
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Language: en
Vary: Accept-Encoding,Cookie
Last-Modified: Wed, 19 May 2010 10:10:34 GMT
Content-Encoding: gzip
Content-Length: 25247
Content-Type: text/html; charset=utf-8
X-Cache: HIT from sq61.wikimedia.org
X-Cache-Lookup: HIT from sq61.wikimedia.org:3128
Age: 520549
X-Cache: HIT from amssq32.esams.wikimedia.org
X-Cache-Lookup: HIT from amssq32.esams.wikimedia.org:3128
X-Cache: MISS from amssq37.esams.wikimedia.org
X-Cache-Lookup: MISS from amssq37.esams.wikimedia.org:80
Connection: close

Can it be that the page is corrupt? If so, how can I make the best of
the situation? Many other pages from this server work without problems.

Thanks!

Barry
 

Peter Otten

Barry said:
The content type is utf-8:

Date: Wed, 19 May 2010 19:17:39 GMT
Server: Apache
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Language: en
Vary: Accept-Encoding,Cookie
Last-Modified: Wed, 19 May 2010 10:10:34 GMT
Content-Encoding: gzip

But the data is gzipped. You have to uncompress it before decoding.
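
A rough sketch of that step (assuming Python 3.2 or later for gzip.decompress; on older versions, gzip.GzipFile over a BytesIO does the same job, as Rob shows below):

import gzip
import urllib.request

request = urllib.request.Request(
    'http://en.wiktionary.org/wiki/baby',
    headers={'User-Agent': 'Mozilla/5.0'})
raw = urllib.request.urlopen(request).read()

# Content-Encoding: gzip means the body is compressed bytes;
# decompress first, then decode the UTF-8 text inside.
html = gzip.decompress(raw).decode('utf-8')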

Peter
 

Rob Williscroft

Barry wrote in @m21g2000vbr.googlegroups.com in gmane.comp.python.general:
Hi,

The code below is giving me the error:

Traceback (most recent call last):
File "C:\Users\Administratör\Desktop\test.py", line 4, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1:
unexpected code byte


What am I doing wrong?

It may not be you: en.wiktionary.org is sending gzip-encoded content
back, and it seems to do this even if you set the Accept header, as in:

request.add_header( "Accept", "text/html" )

But maybe I'm not doing it correctly.

# encoding: utf-8
import urllib.request

request = urllib.request.Request(
    url='http://en.wiktionary.org/wiki/baby',
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) '
                           'Gecko/20071127 Firefox/2.0.0.11'})

response = urllib.request.urlopen(request)
info = response.info()
enc = info['Content-Encoding']
print("Encoding: " + enc)

from io import BytesIO
import gzip

buf = BytesIO(response.read())
unzipped = gzip.GzipFile("whatever", mode='rb', fileobj=buf)
html = unzipped.read().decode('utf-8')

print(html.encode("ascii", "backslashreplace"))

Rob.
 

Philip Semanchuk

Barry said:
The content type is utf-8:

Date: Wed, 19 May 2010 19:17:39 GMT
Server: Apache
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Language: en
Vary: Accept-Encoding,Cookie
Last-Modified: Wed, 19 May 2010 10:10:34 GMT
Content-Encoding: gzip
Content-Length: 25247
Content-Type: text/html; charset=utf-8
X-Cache: HIT from sq61.wikimedia.org
X-Cache-Lookup: HIT from sq61.wikimedia.org:3128
Age: 520549
X-Cache: HIT from amssq32.esams.wikimedia.org
X-Cache-Lookup: HIT from amssq32.esams.wikimedia.org:3128
X-Cache: MISS from amssq37.esams.wikimedia.org
X-Cache-Lookup: MISS from amssq37.esams.wikimedia.org:80
Connection: close

Looks like the content is gzipped. Have you unzipped it? Also, from
where are you getting those headers? The server might well send
different headers to your browser than to a urllib request.

Have you examined the raw content in a hex editor or in the debugger?
That would probably answer a lot of questions.
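
One rough way to do that inspection without a hex editor is to print the first few raw bytes before trying to decode them (URL and User-Agent are stand-ins):

import urllib.request

request = urllib.request.Request(
    'http://en.wiktionary.org/wiki/baby',
    headers={'User-Agent': 'Mozilla/5.0'})
raw = urllib.request.urlopen(request).read()

# Show the leading bytes as hex; gzip data always starts with 1f 8b.
print(' '.join('%02x' % byte for byte in raw[:8]))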

Can it be that the page is corrupt?

Of course that's always possible, but personally whenever I have to
decide whether bits are being flipped at random or my code is buggy,
it's almost always the latter.

If so, how can I make the best of the situation?

Depends on what you're trying to accomplish.



bye
Philip
 

John Machin

Rob Williscroft said:
Barry wrote in @m21g2000vbr.googlegroups.com in gmane.comp.python.general:


It may not be you: en.wiktionary.org is sending gzip-encoded content
back,

It sure is; here's where the offending 0x8b comes from:

"""ID1 (IDentification 1)
ID2 (IDentification 2)
These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139
(0x8b, \213), to identify the file as being in gzip format."""

(from http://www.faqs.org/rfcs/rfc1952.html)
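
Putting the thread's advice together, a rough sketch (again assuming Python 3.2+ for gzip.decompress, with the same stand-in URL and User-Agent) that only decompresses when the Content-Encoding header or those magic bytes say the body is gzip:

import gzip
import urllib.request

request = urllib.request.Request(
    'http://en.wiktionary.org/wiki/baby',
    headers={'User-Agent': 'Mozilla/5.0'})
response = urllib.request.urlopen(request)
raw = response.read()

# RFC 1952: a gzip stream starts with ID1=0x1f, ID2=0x8b.
if response.info().get('Content-Encoding') == 'gzip' or raw[:2] == b'\x1f\x8b':
    raw = gzip.decompress(raw)

html = raw.decode('utf-8')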
 
