Extracting titles from web pages, but sometimes getting disordered words

Discussion in 'Python' started by Frank Potter, Jan 27, 2007.

  1. Frank Potter

    Frank Potter Guest

    There are eleven web pages I want to deal with,
    from http://www.af.shejis.com/new_lw/html/125926.shtml
    to http://www.af.shejis.com/new_lw/html/125936.shtml

    Each of them uses the Chinese charset "gb2312", and Firefox
    displays all of them correctly, as readable Chinese.

    My job is to fetch every page, extract its HTML title, and
    display the title in a Linux shell terminal.

    My problem is that for some pages I get a human-readable title
    (in Chinese), but for other pages I get disordered words. Since
    every page uses the same charset, I don't know why I can't get
    every title in the same way.
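
    A quick way to check whether a wrong encoding guess is behind this
    is to print the encoding Beautiful Soup detects for each page.
    This is only a diagnostic sketch (soup.originalEncoding is
    Beautiful Soup 3's record of the charset it guessed):

    Code:
    import urllib2
    from BeautifulSoup import BeautifulSoup

    for i in xrange(125926, 125937):
        url = "http://www.af.shejis.com/new_lw/html/%d.shtml" % i
        soup = BeautifulSoup(urllib2.urlopen(url).read())
        # originalEncoding is whatever Beautiful Soup decided the page was in
        print url, soup.originalEncoding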

    Here's my python code, get_title.py :

    Code:
    #!/usr/bin/python
    import urllib2
    from BeautifulSoup import BeautifulSoup
    
    min_page=125926
    max_page=125936
    
    def make_page_url(page_index):
        return ur"".join([ur"http://www.af.shejis.com/new_lw/html/",
                          str(page_index), ur".shtml"])
    
    def get_page_title(page_index):
        url=make_page_url(page_index)
        print "now getting: ", url
        user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers={'User-Agent':user_agent}
        req=urllib2.Request(url,None,headers)
        response=urllib2.urlopen(req)
        #print response.info()
        page=response.read()
    
        #extract the title with Beautiful Soup
        soup=BeautifulSoup(page)
        full_title=str(soup.html.head.title.string)
    
        #title is in the format of "title --title"
        #use this code to delete the "--" and the duplicate title
        title=full_title[full_title.rfind('-')+1::]
    
        return title
    
    for i in xrange(min_page,max_page+1):   # +1 so the last page (125936) is included
        print get_page_title(i)
    
    Will somebody please help me out? Thanks in advance.
     
    Frank Potter, Jan 27, 2007
    #1

  2. Paul McGuire

    Paul McGuire Guest

    On Jan 27, 5:18 am, "Frank Potter" <> wrote:
    > <snip: original post and code quoted here>


    This pyparsing solution seems to extract what you were looking
    for, but I don't know whether it will render as Chinese or not.

    -- Paul

    from pyparsing import makeHTMLTags, SkipTo
    import urllib

    titleStart, titleEnd = makeHTMLTags("title")
    scanExpr = (titleStart + SkipTo("- -", include=True) +
                SkipTo(titleEnd).setResultsName("titleChars") + titleEnd)

    def extractTitle(htmlSource):
        titleSource = scanExpr.searchString(htmlSource, maxMatches=1)[0]
        return titleSource.titleChars


    for urlIndex in range(125926, 125936+1):
        url = "http://www.af.shejis.com/new_lw/html/%d.shtml" % urlIndex
        pg = urllib.urlopen(url)
        html = pg.read()
        pg.close()
        print url, ':', extractTitle(html)


    Gives:

    http://www.af.shejis.com/new_lw/html/125926.shtml : GSM±¾µØÍø×éÍø·½Ê½
    http://www.af.shejis.com/new_lw/html/125927.shtml : GSM±¾µØÍø×éÍø·½Ê½³õ̽
    http://www.af.shejis.com/new_lw/html/125928.shtml : GSMµÄÊý¾ÝÒµÎñ
    http://www.af.shejis.com/new_lw/html/125929.shtml : GSMµÄÊý¾ÝÒµÎñºÍ³ÐÔØÄÜÁ¦
    http://www.af.shejis.com/new_lw/html/125930.shtml : GSMµÄÍøÂçÑݽø-´ÓGSMµ½GPRSµ½3G £¨¸½Í¼£©
    http://www.af.shejis.com/new_lw/html/125931.shtml : GSM¶ÌÏûÏ¢ÒµÎñÔÚË®Çé×Ô¶¯²â±¨ÏµÍ³ÖеÄÓ¦ÓìØ
    http://www.af.shejis.com/new_lw/html/125932.shtml : £Ç£Ó£Í½»»»ÏµÍ³µÄÍøÂçÓÅ»¯
    http://www.af.shejis.com/new_lw/html/125933.shtml : GSMÇл»µô»°µÄ·ÖÎö¼°½â¾ö°ì·¨
    http://www.af.shejis.com/new_lw/html/125934.shtml : GSMÊÖ»ú²¦½ÐÊл°Ä£¿é¾ÖÓû§¹ÊÕϵÄÆÊÎö
    http://www.af.shejis.com/new_lw/html/125935.shtml : GSMÊÖ»úµ½WCDMAÖն˵ÄÑݱä
    http://www.af.shejis.com/new_lw/html/125936.shtml : GSMÊÖ»úµÄάÐÞ·½·¨
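
    For what it's worth, the "disordered" titles above look like
    gb2312 bytes being displayed as latin-1. If that is what happened,
    a quick round-trip recovers the Chinese; a sketch (assuming the
    mojibake arrived here intact):

    # -*- coding: utf-8 -*-
    garbled = u"GSM±¾µØÍø×éÍø·½Ê½"    # what the terminal displayed
    raw = garbled.encode('latin-1')    # back to the original gb2312 bytes
    print raw.decode('gb2312')         # -> GSM本地网组网方式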
     
    Paul McGuire, Jan 27, 2007
    #2

  3. Paul McGuire

    Paul McGuire Guest

    After looking at the pyparsing results, I think I see the problem with
    your original code. You are selecting only the characters after the
    rightmost "-" character, but you really want to select everything to
    the right of "- -". In some of the titles, the encoded Chinese
    includes a "-" character, so you are chopping off everything before
    that.

    Try changing your code to:
    title=full_title.split("- -")[1]

    I think then your original program will work.
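
    A tiny illustration of the failure mode, using a made-up ASCII
    title in place of the gb2312 bytes:

    full_title = "GSM Evolution - 3G - -GSM Evolution - 3G"
    print full_title[full_title.rfind('-')+1:]   # " 3G" -- chopped at a '-' inside the title
    print full_title.split("- -")[1]             # "GSM Evolution - 3G"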

    -- Paul
     
    Paul McGuire, Jan 27, 2007
    #3
  4. Frank Potter

    Frank Potter Guest

    Thank you, I tried again and figured it out.
    It's something to do with Beautiful Soup. I worked with it a year
    ago, also on Chinese HTML pages, and no errors happened then. I
    read the old code and found the difference: decode the page to
    unicode before feeding it to Beautiful Soup, and then everything
    is OK.
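
    A minimal sketch of that decode-first fix, in the same Python 2
    style as my script above (the gb2312 charset comes from the pages
    themselves; the final encode assumes a UTF-8 terminal):

    Code:
    import urllib2
    from BeautifulSoup import BeautifulSoup

    def get_page_title_unicode(url):
        page = urllib2.urlopen(url).read()
        # decode the raw gb2312 bytes first, so Beautiful Soup works
        # on unicode and never has to guess the encoding
        soup = BeautifulSoup(page.decode('gb2312', 'replace'))
        full_title = soup.html.head.title.string
        return full_title.split(u"- -")[1]   # drop the duplicate, per Paul's fix

    url = "http://www.af.shejis.com/new_lw/html/125926.shtml"
    print get_page_title_unicode(url).encode('utf-8')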

    On Jan 28, 3:26 am, "Paul McGuire" <> wrote:
    > <snip: Paul's reply quoted here>
     
    Frank Potter, Jan 28, 2007
    #4
