save gb-2312 web page in a .html file

Peter Pei

I am trying to read a web page and save it in a .html file. The problem is
that the web page is GB-2312 encoded, and I want to save it to the file either
in the same encoding or as Unicode. I have some code like this:
import urllib2

url = 'http://blah/'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req).read()

file = open('btchina.html','wb')
file.write(page.encode('gb-2312'))
file.close()

It is obviously not working, and I am hoping someone can help me.
 
Matt Nordhoff

.read() returns the bytes exactly how it downloads them. It doesn't
interpret them. If those bytes are GB-2312-encoded text, that's what
they are. There's no need to reencode them. Just .write(page) (of
course, this way you don't verify that it's correct).

(BTW, don't use 'file' as a variable name. It shadows the built-in 'file'
type in Python 2.)
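
A minimal sketch of the "just write the bytes" advice above. To stay runnable without a network connection, the `page` value here is a made-up stand-in for what `urllib2.urlopen(req).read()` would return, and `save_raw` is a hypothetical helper name, not something from the thread. The key points are the binary mode ('wb') and the absence of any `.encode()` call:

```python
import os
import tempfile


def save_raw(data, path):
    """Write already-encoded bytes to disk unchanged."""
    # 'wb' keeps the bytes exactly as downloaded: no newline translation
    # and no re-encoding step.
    with open(path, 'wb') as out:
        out.write(data)


# Assumption: stand-in for urlopen(req).read(); a tiny GB2312-encoded page.
page = u'<html><body>\u4e2d\u6587</body></html>'.encode('gb2312')

path = os.path.join(tempfile.mkdtemp(), 'btchina.html')
save_raw(page, path)

# Reading the file back in binary mode yields the identical byte string.
with open(path, 'rb') as saved:
    round_trip = saved.read()
```

Because nothing is decoded or encoded along the way, the saved file is byte-for-byte what the server sent, whatever encoding that happens to be.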
 
Peter Pei

You must be right, since I tried one page and it worked. But there is
something wrong with this particular page:
http://overseas.btchina.net/?categoryid=-1. When I open the saved file (with
IE7), it is all messed up.

url = 'http://overseas.btchina.net/?categoryid=-1'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req).read()

htmlfile = open('btchina.html','w')
htmlfile.write(page)
htmlfile.close()
 
Matt Nordhoff

I dunno. The file does specify its charset, so unless IE ignores that
and tries to guess and fails, it should work fine.
 
Martin v. Löwis

Matt said:
.read() returns the bytes exactly how it downloads them. It doesn't
interpret them. If those bytes are GB-2312-encoded text, that's what
they are. There's no need to reencode them. Just .write(page) (of
course, this way you don't verify that it's correct).

Alternatively, if the page is *not* gb-2312, you must first *decode*
it from its original encoding. Suppose the original encoding is
windows-1252, you do

page = page.decode("windows-1252")
page = page.encode("gb2312")  # Python's codec name has no hyphen

Of course, for HTML, that may be tricky, as the file may include
an encoding declaration (XML declaration or http-equiv header). So if
you recode it, you might have to change such declarations as well.
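
As a rough illustration of the recoding step and the declaration caveat, here is a sketch that converts a GB2312 page to UTF-8 and rewrites the charset named in the http-equiv header. `recode_html` is a hypothetical helper, the sample bytes are invented, and the regular expression is a deliberately naive assumption: it replaces every `charset=...` occurrence and does not handle XML declarations or unusual markup.

```python
import re


def recode_html(data, src, dst):
    """Decode bytes from src, update charset declarations, encode to dst."""
    text = data.decode(src)
    # Naive rewrite of charset=... so the declaration matches the new bytes.
    # A function is used as the replacement to avoid escape handling in dst.
    text = re.sub(r'(charset\s*=\s*["\']?)[\w-]+',
                  lambda m: m.group(1) + dst,
                  text, flags=re.IGNORECASE)
    return text.encode(dst)


# Assumption: a made-up page standing in for the downloaded bytes.
original = (u'<meta http-equiv="Content-Type" '
            u'content="text/html; charset=gb2312">'
            u'<p>\u4e2d\u6587</p>').encode('gb2312')

converted = recode_html(original, 'gb2312', 'utf-8')
```

If the declaration were left as gb2312 while the bytes became UTF-8, a browser that trusts the declaration would show garbage, which is the mismatch Martin is warning about.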

Regards,
Martin
 
Peter Pei

I "view sourced" the original web page in IE7, and it does specify:

<meta http-equiv="MSThemeCompatible" content="Yes">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

So sounds like the encoding is gb2312...
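
One way to go beyond reading the declaration by eye is to check whether the downloaded bytes really decode with the charset they claim. The sketch below is only an illustration: `declared_charset` and `decodes_cleanly` are hypothetical helper names, the sample bytes are invented, and the regular expression is a simplification that just looks for the first `charset=` in the raw bytes.

```python
import re


def declared_charset(data):
    """Return the first charset named in the raw HTML bytes, or None."""
    match = re.search(br'charset\s*=\s*["\']?([\w-]+)', data, re.IGNORECASE)
    return match.group(1).decode('ascii') if match else None


def decodes_cleanly(data, encoding):
    """True if every byte sequence in data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Assumption: a made-up page standing in for the downloaded bytes.
sample = (b'<meta http-equiv="Content-Type" '
          b'content="text/html; charset=gb2312">' +
          u'\u4e2d\u6587'.encode('gb2312'))

charset = declared_charset(sample)
ok = decodes_cleanly(sample, charset)
```

If a page declared as gb2312 fails this check, trying a superset codec such as 'gbk' or 'gb18030' is a reasonable next step, since pages labelled gb2312 sometimes contain characters outside that set.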
 
