save gb-2312 web page in a .html file

Discussion in 'Python' started by Peter Pei, Dec 26, 2007.

  1. Peter Pei

    Peter Pei Guest

    I am trying to read a web page and save it in a .html file. The problem is
    that the web page is GB-2312 encoded, and I want to save it to the file with
    the same encoding or unicode. I have some code like this:
    url = 'http://blah/'
    headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }

    req = urllib2.Request(url, None, headers)
    page = urllib2.urlopen(req).read()

    file = open('btchina.html','wb')
    file.write(page.encode('gb-2312'))
    file.close()

    It is obviously not working, and I am hoping someone can help me.
    Peter Pei, Dec 26, 2007
    #1

  2. Peter Pei wrote:
    > I am trying to read a web page and save it in a .html file. The problem is
    > that the web page is GB-2312 encoded, and I want to save it to the file with
    > the same encoding or unicode. I have some code like this:
    > url = 'http://blah/'
    > headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
    >
    > req = urllib2.Request(url, None, headers)
    > page = urllib2.urlopen(req).read()
    >
    > file = open('btchina.html','wb')
    > file.write(page.encode('gb-2312'))
    > file.close()
    >
    > It is obviously not working, and I am hoping someone can help me.


    .read() returns the bytes exactly as it downloads them; it doesn't
    interpret them. If those bytes are GB-2312-encoded text, that's what
    they are. There's no need to re-encode them. Just .write(page) (of
    course, this way you don't verify that the encoding is correct).

    (BTW, don't use 'file' as a variable name. It shadows the built-in
    'file' type, which is what open() returns.)
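
    A minimal sketch of the "just write the bytes" advice. It is written for
    Python 3 rather than the Python 2 urllib2 code in this thread, and the
    sample bytes stand in for the result of urlopen(req).read(); save_raw and
    the file name are hypothetical names used only for illustration.

```python
import os
import tempfile


def save_raw(page_bytes, path):
    """Write downloaded bytes to disk unchanged: binary mode, no re-encoding."""
    with open(path, "wb") as out:
        out.write(page_bytes)


# Stand-in for urlopen(req).read(): a tiny page already encoded as GB2312.
sample = u"<html><body>\u4e2d\u6587</body></html>".encode("gb2312")

path = os.path.join(tempfile.mkdtemp(), "page.html")
save_raw(sample, path)

# Read the file back in binary mode to confirm nothing was altered on the way.
with open(path, "rb") as saved_file:
    saved = saved_file.read()

print(saved == sample)
```

    Opening the file with 'wb' keeps the platform from translating line
    endings, so the saved file is byte-for-byte what the server sent.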
    --
    Matt Nordhoff, Dec 26, 2007
    #2

  3. Peter Pei

    Peter Pei Guest

    You must be right, since I tried one page and it worked. But there is
    something wrong with this particular page:
    http://overseas.btchina.net/?categoryid=-1. When I open the saved file (with
    IE7), it is all messed up.

    url = 'http://overseas.btchina.net/?categoryid=-1'
    headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
    req = urllib2.Request(url, None, headers)
    page = urllib2.urlopen(req).read()

    htmlfile = open('btchina.html','w')
    htmlfile.write(page)
    htmlfile.close()
    Peter Pei, Dec 26, 2007
    #3
  4. Peter Pei wrote:
    > You must be right, since I tried one page and it worked. But there is
    > something wrong with this particular page:
    > http://overseas.btchina.net/?categoryid=-1. When I open the saved file (with
    > IE7), it is all messed up.
    >
    > url = 'http://overseas.btchina.net/?categoryid=-1'
    > headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
    > req = urllib2.Request(url, None, headers)
    > page = urllib2.urlopen(req).read()
    >
    > htmlfile = open('btchina.html','w')
    > htmlfile.write(page)
    > htmlfile.close()


    I dunno. The file does specify its charset, so unless IE ignores that
    and tries to guess and fails, it should work fine.
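
    One way to narrow this down, sketched under assumptions: check whether the
    downloaded bytes really decode with the declared charset. It is not
    unusual for pages labelled gb2312 to contain characters that only exist in
    the GBK or GB18030 supersets, so those are tried next. This is Python 3;
    first_working_encoding and the candidate list are hypothetical, and
    nothing here is confirmed to be the cause of the problem in this thread.

```python
def first_working_encoding(page_bytes, candidates=("gb2312", "gbk", "gb18030")):
    """Return the first candidate codec that decodes the bytes cleanly, else None."""
    for name in candidates:
        try:
            page_bytes.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this codec, try the next (wider) one
        return name
    return None


# Simplified characters that exist in GB2312 itself.
plain = u"\u4e2d\u6587".encode("gb2312")

# A traditional character that GBK can encode but strict GB2312 cannot.
wider = u"\u570b".encode("gbk")

print(first_working_encoding(plain))
print(first_working_encoding(wider))
```

    If the result is anything other than the declared charset, the page's
    label does not strictly match its bytes, which is worth knowing before
    blaming the browser.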
    --
    Matt Nordhoff, Dec 26, 2007
    #4
  5. > .read() returns the bytes exactly how it downloads them. It doesn't
    > interpret them. If those bytes are GB-2312-encoded text, that's what
    > they are. There's no need to reencode them. Just .write(page) (of
    > course, this way you don't verify that it's correct).


    Alternatively, if the page is *not* gb-2312, you must first *decode*
    it from its original encoding. If, say, the original encoding is
    windows-1252, you would do:

    page = page.decode("windows-1252")
    page = page.encode("gb2312")  # Python's codec name has no hyphen

    Of course, for HTML, that may be tricky, as the file may include
    an encoding declaration (XML declaration or http-equiv header). So if
    you recode it, you might have to change such declarations as well.
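
    The recoding step, together with the declaration fix-up, could be
    sketched roughly as follows. This is Python 3, recode_html is a
    hypothetical helper, and it assumes an http-equiv style declaration, an
    ASCII-compatible target encoding, and a Python codec name that is also
    acceptable as the HTML charset label. A real page may need an HTML parser
    instead of a regular expression.

```python
import re

# Matches "charset=" plus an optional quote, then the charset name itself.
CHARSET_ATTR = re.compile(br"""(charset\s*=\s*["']?)[\w-]+""", re.IGNORECASE)


def recode_html(page_bytes, source_encoding, target_encoding):
    """Decode from the source encoding, re-encode, and update the first charset label."""
    text = page_bytes.decode(source_encoding)
    recoded = text.encode(target_encoding)
    # Assumption: the codec name doubles as the HTML label (true for "gb2312").
    label = target_encoding.encode("ascii")
    return CHARSET_ATTR.sub(lambda m: m.group(1) + label, recoded, count=1)


# Stand-in for a downloaded page that is really UTF-8.
sample = (
    u'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    u"<p>\u4e2d\u6587</p>"
).encode("utf-8")

result = recode_html(sample, "utf-8", "gb2312")
print(result.decode("gb2312"))
```

    Note that encode() raises UnicodeEncodeError if the text contains
    characters the target encoding cannot represent, so recoding into gb2312
    only works for text that fits its repertoire.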

    Regards,
    Martin
    Martin v. Löwis, Dec 26, 2007
    #5
  6. Peter Pei

    Peter Pei Guest

    I "view sourced" the original web page in IE7, and it does specify:

    <meta http-equiv="MSThemeCompatible" content="Yes">
    <meta http-equiv="Content-Type" content="text/html; charset=gb2312">

    So it sounds like the encoding is gb2312...
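
    The same check can be done from Python instead of "view source". A rough
    sketch in Python 3: declared_charset is a hypothetical helper, and the
    regular expression only handles the http-equiv form shown above, not every
    way a page can declare its encoding (the HTTP Content-Type header, for
    instance, is not consulted).

```python
import re

# Finds a charset=... value inside a single <meta ...> tag.
META_CHARSET = re.compile(br"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def declared_charset(page_bytes, scan_limit=4096):
    """Return the charset named in the first matching meta tag, lowercased, or None."""
    match = META_CHARSET.search(page_bytes[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# The two meta lines quoted in this post, as raw bytes.
head = (
    b'<meta http-equiv="MSThemeCompatible" content="Yes">\n'
    b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
)

print(declared_charset(head))
```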
    Peter Pei, Dec 26, 2007
    #6
  7. Alihuen

    Alihuen Guest

    Unsubscribe

    Alihuen, Dec 26, 2007
    #7
