Extracting from web pages, but sometimes getting disordered words


Frank Potter

There are ten web pages I want to deal with.
from http://www.af.shejis.com/new_lw/html/125926.shtml
to http://www.af.shejis.com/new_lw/html/125936.shtml

Each of them uses the Chinese charset "gb2312", and Firefox
displays all of them correctly, as readable Chinese.

My job is to fetch every page, extract its HTML title, and
display the title in a Linux shell terminal.

My problem is that for some pages I get a human-readable title
(in Chinese), but for other pages I get disordered words. Since
every page uses the same charset, I don't understand why I can't
get every title the same way.

Here's my python code, get_title.py :

Code:
#!/usr/bin/python
import urllib2
from BeautifulSoup import BeautifulSoup

min_page=125926
max_page=125936

def make_page_url(page_index):
    return ur"".join([ur"http://www.af.shejis.com/new_lw/html/", str(page_index), ur".shtml"])

def get_page_title(page_index):
    url=make_page_url(page_index)
    print "now getting: ", url
    user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers={'User-Agent':user_agent}
    req=urllib2.Request(url,None,headers)
    response=urllib2.urlopen(req)
    #print response.info()
    page=response.read()

    #extract the title with Beautiful Soup
    soup=BeautifulSoup(page)
    full_title=str(soup.html.head.title.string)

    #title is in the format of "title --title"
    #use this code to delete the "--" and the duplicate title
    title=full_title[full_title.rfind('-')+1::]

    return title

for i in xrange(min_page,max_page):
    print get_page_title(i)

Will somebody please help me out? Thanks in advance.
 

Paul McGuire


This pyparsing solution seems to extract what you were looking for,
but I don't know if this will render to Chinese or not.

-- Paul

from pyparsing import makeHTMLTags,SkipTo
import urllib

titleStart,titleEnd = makeHTMLTags("title")
scanExpr = (titleStart + SkipTo("- -",include=True) +
            SkipTo(titleEnd).setResultsName("titleChars") + titleEnd)

def extractTitle(htmlSource):
    titleSource = scanExpr.searchString(htmlSource, maxMatches=1)[0]
    return titleSource.titleChars


for urlIndex in range(125926,125936+1):
    url = "http://www.af.shejis.com/new_lw/html/%d.shtml" % urlIndex
    pg = urllib.urlopen(url)
    html = pg.read()
    pg.close()
    print url,':',extractTitle(html)


Gives:

http://www.af.shejis.com/new_lw/html/125926.shtml : GSM±¾µØÍø×éÍø·½Ê½
http://www.af.shejis.com/new_lw/html/125927.shtml : GSM±¾µØÍø×éÍø·½Ê½³õ̽
http://www.af.shejis.com/new_lw/html/125928.shtml : GSMµÄÊý¾ÝÒµÎñ
http://www.af.shejis.com/new_lw/html/125929.shtml : GSMµÄÊý¾ÝÒµÎñºÍ³ÐÔØÄÜÁ¦
http://www.af.shejis.com/new_lw/html/125930.shtml : GSMµÄÍøÂçÑݽø-´ÓGSMµ½GPRSµ½3G £¨¸½Í¼£©
http://www.af.shejis.com/new_lw/html/125931.shtml : GSM¶ÌÏûÏ¢ÒµÎñÔÚË®Çé×Ô¶¯²â±¨ÏµÍ³ÖеÄÓ¦ÓìØ
http://www.af.shejis.com/new_lw/html/125932.shtml : £Ç£Ó£Í½»»»ÏµÍ³µÄÍøÂçÓÅ»¯
http://www.af.shejis.com/new_lw/html/125933.shtml : GSMÇл»µô»°µÄ·ÖÎö¼°½â¾ö°ì·¨
http://www.af.shejis.com/new_lw/html/125934.shtml : GSMÊÖ»ú²¦½ÐÊл°Ä£¿é¾ÖÓû§¹ÊÕϵÄÆÊÎö
http://www.af.shejis.com/new_lw/html/125935.shtml : GSMÊÖ»úµ½WCDMAÖն˵ÄÑݱä
http://www.af.shejis.com/new_lw/html/125936.shtml : GSMÊÖ»úµÄάÐÞ·½·¨
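Those garbled titles look like gb2312 bytes rendered in a single-byte encoding such as latin-1 (an assumption on my part, not something stated above). If that guess is right, the mis-decoding can be reversed; the sample bytes here are taken from the first garbled title:

```python
# -*- coding: utf-8 -*-
# Sketch (assumption: the terminal showed raw gb2312 bytes as latin-1).
# u'\xb1\xbe\xb5\xd8\xcd\xf8' is "±¾µØÍø", the start of the first garbled title.
garbled = u'\xb1\xbe\xb5\xd8\xcd\xf8'
raw_bytes = garbled.encode('latin-1')    # recover the original byte values
recovered = raw_bytes.decode('gb2312')   # decode them as the page's real charset
# recovered is now u'\u672c\u5730\u7f51', readable Chinese
```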
 

Paul McGuire

After looking at the pyparsing results, I think I see the problem with
your original code. You are selecting only the characters after the
rightmost "-" character, but you really want to select everything to
the right of "- -". In some of the titles, the encoded Chinese
includes a "-" character, so you are chopping off everything before
that.

Try changing your code to:
title=full_title.split("- -")[1]

I think then your original program will work.
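To illustrate with a made-up ASCII title in the same "title - -title" shape (the hyphenated title text here is hypothetical, standing in for the Chinese):

```python
# Hypothetical title whose right-hand half contains its own "-" character
full_title = "GSM Network-Evolution - -GSM Network-Evolution"

# rfind('-') locates the rightmost '-', which sits inside the title itself,
# so everything before it is chopped off:
chopped = full_title[full_title.rfind('-')+1:]   # "Evolution"

# splitting on the "- -" separator keeps the whole right-hand half:
kept = full_title.split("- -")[1]                # "GSM Network-Evolution"
```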

-- Paul
 

Frank Potter

Thank you, I tried again and figured it out.
It was an issue with Beautiful Soup. I worked with it a year ago, also
dealing with Chinese HTML pages, and nothing went wrong then. I read the
old code and found the difference: convert the page to unicode before
feeding it to Beautiful Soup, and then everything is OK.
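A minimal sketch of that fix (assuming the pages really are gb2312 as declared; the sample bytes are the gb2312 encoding of one title fragment, and the commented-out line shows where the parser call from the code above would go):

```python
# Sketch of the fix: decode the raw bytes to unicode BEFORE parsing.
page = b'\xb1\xbe\xb5\xd8\xcd\xf8'       # raw bytes, as from response.read()
unicode_page = page.decode('gb2312')     # now a unicode string
# soup = BeautifulSoup(unicode_page)     # the parser then sees unicode, not bytes
```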
 
