Get document as normal text and not as binary data

Markus Franz · Mar 27, 2005

Hi.

I used urllib2 to load a html-document through http. But my problem
is:
The loaded contents are returned as binary data, that means that every
character is displayed like lÃ€Ãt, for example. How can I get the
contents as normal text?

My script was:

import urllib2
req = urllib2.Request(url)
f = urllib2.urlopen(req)
contents = f.read()
print contents
f.close()

Thanks!

Markus

Diez B. Roggisch · Mar 27, 2005

Markus said:
Hi.

I used urllib2 to load a html-document through http. But my problem
is:
The loaded contents are returned as binary data, that means that every
character is displayed like lÃ€Ãt, for example. How can I get the
contents as normal text?

You get what the server sends. That is always binary - either it _is_ a
binary file, or maybe in an unknown encoding.

Fredrik Lundh · Mar 27, 2005

Markus said:
I used urllib2 to load a html-document through http. But my problem
is: The loaded contents are returned as binary data, that means that every
character is displayed like lÃ?Ãt, for example. How can I get the
contents as normal text?

My script was:

import urllib2
req = urllib2.Request(url)
f = urllib2.urlopen(req)

adding

print f.headers

and checking the header fields (especially the content-type) may help you
figure out what's going on...

contents = f.read()
print contents
f.close()
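
A rough sketch of what checking the header could look like in code, assuming Python 3 (where urllib2 became urllib.request). The helper name charset_from_content_type and the utf-8 fallback are made up for illustration, not something from the script above:

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or `default`.

    `default` is an assumption of this sketch; a server that sends no
    charset may be using any encoding.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the charset and returns the
    # fallback when the header carries no charset parameter.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

With a live response in Python 3, f.headers is itself a Message subclass, so f.headers.get_content_charset() should give the same answer without the helper.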

</F>

Markus Franz · Mar 28, 2005

Diez said:
You get what the server sends. That is always binary - either it _is_ a
binary file, or maybe in an unknown encoding.

And how can I convert those binary data to a "normal" string with
"normal" characters?

Best regards

Markus

Diez B. Roggisch · Mar 28, 2005

Markus said:
And how can I convert those binary data to a "normal" string with
"normal" characters?

There is no "normal" - it's just bytes, and a string is just bytes. No
difference, no translation necessary.

As others have said: look at the HTTP headers to see what the server is trying
to transmit - maybe an image. The Content-Type header tells you that.

Or use wget to fetch the URL and look at what you get - it shouldn't look
different.

Diez B. Roggisch · Mar 28, 2005

Addendum: If you give us the url you're fetching data from, we might be able
to look at the delivered data ourselves.

Kent Johnson · Mar 28, 2005

Markus said:
Hi.

I used urllib2 to load a html-document through http. But my problem
is:
The loaded contents are returned as binary data, that means that every
character is displayed like lÃ€Ãt, for example. How can I get the
contents as normal text?

My guess is the html is utf-8 encoded - your sample looks like utf-8-interpreted-as-latin-1. Try
contents = f.read().decode('utf-8')
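
To show what I mean by utf-8-interpreted-as-latin-1, here is a small sketch in Python 3 syntax. The word "lädt" is only an assumed sample, not necessarily what the page contains:

```python
# Assumed sample text; the real page content is not known here.
text = "lädt"

# What the server sends: the text encoded as UTF-8 bytes.
raw = text.encode("utf-8")

# Decoding those bytes with the wrong codec produces the garbage.
garbled = raw.decode("latin-1")

# Decoding with the right codec gives the text back.
restored = raw.decode("utf-8")

print(repr(raw))
print(garbled)
print(restored)
```

Each non-ASCII character takes two bytes in UTF-8, and latin-1 shows every byte as a character of its own, which is where the extra "Ã" characters come from.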

Kent

Markus Franz · Mar 29, 2005

Kent said:
My guess is the html is utf-8 encoded - your sample looks like
utf-8-interpreted-as-latin-1. Try
contents = f.read().decode('utf-8')

YES! That helped!

I used the following:

....
contents = f.read().decode('utf-8')
contents = contents.encode('iso-8859-15')
....

That was the perfect solution for my problem! Thanks a lot!
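
For anyone finding this later, the two steps written out as a small function, in Python 3 syntax. The name transcode and the errors="replace" choice are my own assumptions for this sketch; without it, encode() raises UnicodeEncodeError on any character that iso-8859-15 cannot represent:

```python
def transcode(raw, src="utf-8", dst="iso-8859-15", errors="replace"):
    """Decode bytes from `src` and re-encode them as `dst`.

    Characters missing from `dst` are replaced with "?" when
    errors="replace"; pass errors="strict" to get an exception instead.
    """
    text = raw.decode(src)
    return text.encode(dst, errors)


utf8_bytes = "Straße 5 €".encode("utf-8")
print(repr(transcode(utf8_bytes)))
```

Decoding is left strict on purpose, so input that is not valid utf-8 fails loudly instead of being converted silently.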

Best regards

Markus

Markus Franz · Mar 29, 2005

Diez said:
Addendum: If you give us the url you're fetching data from, we might be able
to look at the delivered data ourselves.

To see my problem, please have a look at the document title of
<http://portal.suse.de/sdb/de/1997/01/xntp.html>

Markus
