save gb-2312 web page in a .html file

Peter Pei · Dec 26, 2007

I am trying to read a web page and save it in a .html file. The problem is
that the web page is GB-2312 encoded, and I want to save it to the file with
the same encoding or as Unicode. I have some code like this:

import urllib2

url = 'http://blah/'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req).read()

file = open('btchina.html','wb')
file.write(page.encode('gb-2312'))
file.close()

It is obviously not working, and I am hoping someone can help me.

Matt Nordhoff · Dec 26, 2007

.read() returns the bytes exactly as they were downloaded. It doesn't
interpret them. If those bytes are GB-2312-encoded text, that's what
they are. There's no need to re-encode them. Just .write(page) (of
course, this way you don't verify that the encoding is correct).

(BTW, don't use 'file' as a variable name. It shadows the built-in 'file'
type, which is what 'open()' returns.)
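
A minimal sketch of the advice above, written for Python 3 (where urllib2 became urllib.request) rather than the Python 2 used in this thread. The helper name save_bytes, the temporary output path, and the sample payload are all made up for illustration; the payload stands in for whatever urlopen(req).read() returned. The bytes are written unchanged in binary mode, and the strict decode is only an optional check that they really are valid in the encoding you expect.

```python
import os
import tempfile


def save_bytes(data, path, verify_encoding=None):
    """Write downloaded bytes to path unchanged.

    If verify_encoding is given, decode strictly first, so invalid data
    raises UnicodeDecodeError before anything is written.
    """
    if verify_encoding is not None:
        data.decode(verify_encoding)  # result discarded; this only validates
    with open(path, "wb") as f:       # binary mode: no newline translation
        f.write(data)


# Stand-in for the downloaded page: GB2312 bytes for a tiny document.
page = "<html><body>\u4e2d\u6587</body></html>".encode("gb2312")
out_path = os.path.join(tempfile.mkdtemp(), "sample_gb2312.html")
save_bytes(page, out_path, verify_encoding="gb2312")
```

Note that the Python codec is spelled "gb2312"; as far as I know the hyphenated "gb-2312" is not a registered alias and raises LookupError.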

Peter Pei · Dec 26, 2007

You must be right, since I tried one page and it worked. But there is
something wrong with this particular page:
http://overseas.btchina.net/?categoryid=-1. When I open the saved file (with
IE7), it is all messed up.

url = 'http://overseas.btchina.net/?categoryid=-1'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req).read()

htmlfile = open('btchina.html','w')
htmlfile.write(page)
htmlfile.close()

Matt Nordhoff · Dec 26, 2007

I dunno. The file does specify its charset, so unless IE ignores that
and tries to guess and fails, it should work fine.

Martin v. Löwis · Dec 26, 2007

Matt said:
.read() returns the bytes exactly how it downloads them. It doesn't
interpret them. If those bytes are GB-2312-encoded text, that's what
they are. There's no need to reencode them. Just .write(page) (of
course, this way you don't verify that it's correct).

Alternatively, if the page is *not* gb-2312, you must first *decode*
it from its original encoding and then encode it to the one you want.
Suppose the original encoding is windows-1252; then you do

page = page.decode("windows-1252")
page = page.encode("gb2312")

(Python's name for the codec is "gb2312", without the hyphen.)

Of course, for HTML, that may be tricky, as the file may include
an encoding declaration (XML declaration or http-equiv meta tag). So if
you recode it, you might have to change such declarations as well.
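
To illustrate the last point, here is a rough sketch (Python 3) of recoding a page and patching its declaration in the same step. It converts GB2312 to UTF-8, since the original question also mentioned saving as Unicode. The function name recode_html, the regular expression, and the sample markup are my own assumptions, not anything from the page in question. The pattern only handles a charset=... declaration; an XML declaration with encoding="..." would need its own handling.

```python
import re

# Matches the charset value in a declaration such as
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
CHARSET_RE = re.compile(r'(charset\s*=\s*["\']?)([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def recode_html(data, source_encoding, target_encoding):
    """Decode bytes, rewrite the first charset declaration, re-encode."""
    text = data.decode(source_encoding)
    # Keep the "charset=" prefix (group 1) and swap in the new name.
    text = CHARSET_RE.sub(lambda m: m.group(1) + target_encoding, text, count=1)
    return text.encode(target_encoding)


# Hypothetical sample page, encoded as GB2312.
original = (
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
    "\u4e2d\u6587"
).encode("gb2312")
converted = recode_html(original, "gb2312", "utf-8")
```

A regular expression is a blunt tool for HTML; it is used here only to keep the sketch short.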

Regards,
Martin

Peter Pei · Dec 26, 2007

I "view sourced" the original web page in IE7, and it does specify:

<meta http-equiv="MSThemeCompatible" content="Yes">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

So it sounds like the encoding is gb2312...
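
The same check can be done from Python instead of by eye. Below is a small sketch (Python 3) that pulls the declared charset out of the raw bytes and tests whether the bytes actually decode under it. The names declared_charset and decodes_as, the 2048-byte scan window, and the sample bytes are assumptions for illustration only.

```python
import re

META_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def declared_charset(data, scan_bytes=2048):
    """Return the charset named near the top of the raw HTML bytes, or None."""
    match = META_CHARSET_RE.search(data[:scan_bytes])
    return match.group(1).decode("ascii").lower() if match else None


def decodes_as(data, encoding):
    """True if data decodes strictly under encoding, False otherwise."""
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# Sample bytes modelled on the two meta tags quoted above.
head = (
    b'<meta http-equiv="MSThemeCompatible" content="Yes">\n'
    b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">\n'
)
page = head + "\u4e2d\u6587".encode("gb2312")

charset = declared_charset(page)   # 'gb2312' for this sample
ok = decodes_as(page, charset)     # True for this sample
```

If a page labelled gb2312 fails the strict check, it may be worth trying "gbk" or "gb18030", which are supersets of GB2312. Whether that applies to this particular page is only a guess.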

Alihuen · Dec 26, 2007
