Lad -
Well, here's what I've got so far. I'll leave the extraction of the
description to you as an exercise, but as a clue, it looks like it is
delimited by "<b>View Detail</b></a></td></tr></tbody></table> <br>" at
the beginning, and "Quantity: 500<br>" at the end, where 500 could be
any number. The program below prints out:
['Title:', 'Sell 2.4GHz Wireless Mini Color Camera With Audio Function
Manufacturers Hong Kong - Exporters, Suppliers, Factories, Seller']
['Contact:', 'Mr. Simon Cheung']
['Company:', 'Lanjin Electronics Co., Ltd.']
['Address:', 'Rm 602, 6/F., Tung Ning Bldg., 2 Hillier Street, Sheung
Wan , Hong Kong\n , HK\n ( Hong Kong
)']
['Phone:', '852 35763877']
['Fax:', '852 31056238']
['Mobile:', '852-96439737']
So I think pyparsing will get you pretty far along the way. Code
attached below (unfortunately, I am posting thru Google Groups, which
strips leading whitespace, so I have inserted '.'s to preserve code
indentation; just strip the leading '.' characters).
-- Paul
===================================
from pyparsing import *
import urllib
# get input data
url = "http://www.ourglobalmarket.com/Test.htm"
page = urllib.urlopen( url )
pageHTML = page.read()
page.close()
#~ I would like to extract the title ( it is below Lanjin Electronics
#~ Co., Ltd. )
#~ (Sell 2.4GHz Wireless Mini Color Camera With Audio Function )
#~ description - below the title next to the picture
#~ Contact person
#~ Company name
#~ Address
#~ fax
#~ phone
#~ Website Address
LANGBRK = Literal("<")
RANGBRK = Literal(">")
SLASH = Literal("/")
tagAttr = Word(alphanums) + "=" + dblQuotedString
# helpers for defining HTML tag expressions
def startTag( tagname ):
.....return ( LANGBRK + CaselessLiteral(tagname) + \
................ZeroOrMore(tagAttr) + RANGBRK ).suppress()
def endTag( tagname ):
.....return ( LANGBRK + SLASH + CaselessLiteral(tagname) + \
................RANGBRK ).suppress()
def makeHTMLtags( tagname ):
.....return startTag(tagname), endTag(tagname)
def strong( expr ):
.....return strongStartTag + expr + strongEndTag
strongStartTag, strongEndTag = makeHTMLtags("strong")
titleStart, titleEnd = makeHTMLtags("title")
tdStart, tdEnd = makeHTMLtags("td")
h1Start, h1End = makeHTMLtags("h1")
title = titleStart + SkipTo( titleEnd ).setResultsName("title") + \
................titleEnd
contactPerson = tdStart + h1Start + \
................SkipTo( h1End ).setResultsName("contact")
company = ( tdStart + strong("Company:") + tdEnd + tdStart ) + \
................SkipTo( tdEnd ).setResultsName("company")
address = ( tdStart + strong("Address:") + tdEnd + tdStart ) + \
................SkipTo( tdEnd ).setResultsName("address")
phoneNum = ( tdStart + strong("Phone:") + tdEnd + tdStart ) + \
................SkipTo( tdEnd ).setResultsName("phoneNum")
faxNum = ( tdStart + strong("Fax:") + tdEnd + tdStart ) + \
................SkipTo( tdEnd ).setResultsName("faxNum")
mobileNum = ( tdStart + strong("Mobile:") + tdEnd + tdStart ) + \
................SkipTo( tdEnd ).setResultsName("mobileNum")
webSite = ( tdStart + strong("Website Address:") + tdEnd + tdStart ) + \
................SkipTo( tdEnd ).setResultsName("webSite")
scrapes = title | contactPerson | company | address | phoneNum | \
................faxNum | mobileNum | webSite
# use parse actions to remove hyperlinks
linkStart, linkEnd = makeHTMLtags("a")
linkExpr = linkStart + SkipTo( linkEnd ) + linkEnd
def stripHyperLink(s,l,t):
.....return [ t[0], linkExpr.transformString( t[1] ) ]
company.setParseAction( stripHyperLink )
# use parse actions to add labels for data elements that don't
# have labels in the HTML
def prependLabel(pre):
.....def prependAction(s,l,t):
.........return [pre] + t[:]
.....return prependAction
title.setParseAction( prependLabel("Title:") )
contactPerson.setParseAction( prependLabel("Contact:") )
for tokens,start,end in scrapes.scanString( pageHTML ):
.....print tokens
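
As a starting point for the description exercise mentioned at the top, here is a rough sketch. It uses the standard library's re module rather than pyparsing, purely so that it runs without anything extra installed. The two delimiters come from the clue above; the name extractDescription and the sample HTML are made up for illustration and are not taken from the real page, so treat this as one possible approach, not a tested solution for that site.

```python
import re

# Start delimiter copied from the clue in the post.
START = "<b>View Detail</b></a></td></tr></tbody></table> <br>"
# End delimiter: "Quantity: 500<br>", where 500 could be any number.
END_RE = r"Quantity:\s*\d+\s*<br>"

# Non-greedy match of everything between the two delimiters.
descriptionRe = re.compile(re.escape(START) + r"(.*?)" + END_RE,
                           re.DOTALL | re.IGNORECASE)

def extractDescription(html):
    """Return the text between the delimiters, or None if not found."""
    m = descriptionRe.search(html)
    if m is None:
        return None
    # Replace any remaining tags with spaces, then collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", m.group(1))
    return " ".join(text.split())

# Hypothetical sample input, NOT the actual page contents.
sampleHTML = (
    "<table><tbody><tr><td><a href='#'><b>View Detail</b></a></td></tr>"
    "</tbody></table> <br>"
    "Small wireless camera<br>with audio.<br>Quantity: 500<br>"
)

print(extractDescription(sampleHTML))
```

If the real page has different whitespace or tag case around the start delimiter, the pattern would need loosening; the same idea could also be written with pyparsing's SkipTo, in the style of the expressions above.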