Get document as normal text and not as binary data

Markus Franz

Hi.

I used urllib2 to load a html-document through http. But my problem
is:
The loaded contents are returned as binary data, that means that every
character is displayed like lÀÃt, for example. How can I get the
contents as normal text?

My script was:

import urllib2
req = urllib2.Request(url)
f = urllib2.urlopen(req)
contents = f.read()
print contents
f.close()

Thanks!

Markus
 
Diez B. Roggisch

Markus said:
Hi.

I used urllib2 to load a html-document through http. But my problem
is:
The loaded contents are returned as binary data, that means that every
character is displayed like lÀÃt, for example. How can I get the
contents as normal text?

You get what the server sends. That is always binary - either it _is_ a
binary file, or maybe in an unknown encoding.
 
Fredrik Lundh

Markus said:
I used urllib2 to load a html-document through http. But my problem
is: The loaded contents are returned as binary data, that means that every
character is displayed like lÀÃt, for example. How can I get the
contents as normal text?

My script was:

import urllib2
req = urllib2.Request(url)
f = urllib2.urlopen(req)
contents = f.read()
print contents
f.close()

Adding

print f.headers

right after the urlopen call and checking the header fields (especially
the content-type) may help you figure out what's going on...

</F>
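
To make the header check concrete, here is a minimal sketch of pulling the charset out of a Content-Type value. It is written for Python 3, where urllib2 became urllib.request; the function name charset_from_content_type and the "utf-8" fallback are assumptions for illustration, not something from the thread.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    `default` is an assumed fallback for headers that name no charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header does not carry one.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=ISO-8859-15"))
print(charset_from_content_type("text/html"))
```

In Python 3 the headers object of a urllib.request response is itself a Message subclass, so calling get_content_charset() on it directly should give the same answer for a live response.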
 
Markus Franz

Diez said:
You get what the server sends. That is always binary - either it _is_ a
binary file, or maybe in an unknown encoding.

And how can I convert those binary data to a "normal" string with
"normal" characters?

Best regards

Markus
 
Diez B. Roggisch

Markus said:
And how can I convert those binary data to a "normal" string with
"normal" characters?

There is no "normal" - it's just bytes, and a string is just bytes. No
difference, no translation necessary.

As others have said: look at the HTTP headers to see what the server is
actually transmitting - maybe an image. The Content-Type header gives you
the MIME type.

Or use wget to fetch the url and look what you get - it shouldn't look
different.
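
A rough sketch of that check, assuming Python 3 and a Content-Type value you have already read from the response; the helper name is_text_response is hypothetical. It only tells text from non-text (such as an image) by the main MIME type.

```python
from email.message import Message


def is_text_response(content_type):
    """Return True when the MIME main type is "text" (e.g. text/html)."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_maintype() == "text"


print(is_text_response("text/html; charset=utf-8"))
print(is_text_response("image/png"))
```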
 
Diez B. Roggisch

Addendum: If you give us the url you're fetching data from, we might be able
to look at the delivered data ourselves.
 
Kent Johnson

Markus said:
Hi.

I used urllib2 to load a html-document through http. But my problem
is:
The loaded contents are returned as binary data, that means that every
character is displayed like lÀÃt, for example. How can I get the
contents as normal text?

My guess is the html is utf-8 encoded - your sample looks like utf-8-interpreted-as-latin-1. Try
contents = f.read().decode('utf-8')

Kent
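
The effect described above can be reproduced without a network connection. This is a small Python 3 sketch; the sample word is an assumption, since the thread does not say what the page contained.

```python
# Hypothetical sample word ("läßt"); not taken from the original page.
original = "l\u00e4\u00dft"

raw = original.encode("utf-8")      # the bytes a UTF-8 server would send
garbled = raw.decode("latin-1")     # wrong codec: one character per byte
repaired = raw.decode("utf-8")      # right codec: original text restored

print(repr(raw))                    # b'l\xc3\xa4\xc3\x9ft'
print(len(original), len(garbled))  # 4 6
print(repaired == original)         # True
```

Each non-ASCII character takes two bytes in UTF-8, so reading those bytes as latin-1 turns every such character into two wrong ones.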
 
Markus Franz

Kent said:
My guess is the html is utf-8 encoded - your sample looks like
utf-8-interpreted-as-latin-1. Try
contents = f.read().decode('utf-8')

YES! That helped!

I used the following:

....
contents = f.read().decode('utf-8')
contents = contents.encode('iso-8859-15')
....

That was the perfect solution for my problem! Thanks a lot!

Best regards

Markus
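
The decode-then-encode step above can be wrapped up as follows. This is a sketch in Python 3 under stated assumptions: the function name transcode and the errors="replace" policy are illustrative choices, not part of the original script. The replace policy matters because ISO-8859-15 cannot represent every character that UTF-8 can.

```python
def transcode(raw, source="utf-8", target="iso-8859-15"):
    """Decode bytes from `source`, then re-encode them as `target`.

    Characters missing from the target charset become "?" instead of
    raising UnicodeEncodeError (an assumed policy for this sketch).
    """
    text = raw.decode(source)
    return text.encode(target, errors="replace")


utf8_bytes = "l\u00e4\u00dft \u20ac".encode("utf-8")  # hypothetical input
print(repr(transcode(utf8_bytes)))
```

If the target terminal or file can take UTF-8 directly, the second step is unnecessary and the decoded text can be used as is.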
 
