Pyparsing - Dealing with a Blank Value

Steve

Hi All,

I've picked up the PyParsing module and am trying to figure out how to
do some simple parsing of HTML source code. My specific problem is
dealing with a <TD></TD> element that is blank.



from pyparsing import *

integer = Word("0123456789")

trStart = Literal("<TR>").suppress()
trEnd = Literal("</TR>").suppress()

tdStart = Literal("<TD>").suppress()
tdEnd = Literal("</TD>").suppress()

#dataItem = Word(alphas)
#BlankItem = Word('')  # unused attempt at matching an empty cell
# dataItem accepts spaces, commas and colons inside a cell
dataItem = Word(alphanums + " " + "," + ":")
MultiItem = Optional(OneOrMore(dataItem))

TestLine = ['<TR><TD>Group</TD><TD>Year</TD><TD>City</TD></TR>',
            '<TR><TD>AAA</TD><TD>1992</TD><TD>Los Angeles</TD></TR>',
            '<TR><TD>BBB</TD><TD>2007</TD><TD>Santa Cruz</TD></TR>',
            '<TR><TD></TD><TD>2001</TD><TD>Santa Cruz</TD></TR>']

# Parentheses let the expression continue across several lines
htmlLine = (trStart
            + tdStart + MultiItem.setResultsName('status') + tdEnd
            + tdStart + MultiItem.setResultsName('year') + tdEnd
            + tdStart + MultiItem.setResultsName('title') + tdEnd
            + trEnd)

for CurrentLine in TestLine:
    print('Line = ', CurrentLine)

    for srvrtokens, startloc, endloc in htmlLine.scanString(CurrentLine):
        print('tokens = %s %d %d \n' % (srvrtokens, startloc, endloc))


Output:

Line = <TR><TD>Group</TD><TD>Year</TD><TD>City</TD></TR>
tokens = ['Group', 'Year', 'City'] 0 49

Line = <TR><TD>AAA</TD><TD>1992</TD><TD>Los Angeles</TD></TR>
tokens = ['AAA', '1992', 'Los Angeles'] 0 54

Line = <TR><TD>BBB</TD><TD>2007</TD><TD>Santa Cruz</TD></TR>
tokens = ['BBB', '2007', 'Santa Cruz'] 0 53


*** Blank 1st element - only shows 2 elements - need 3 elements to be
consistent ***

Line = <TR><TD></TD><TD>2001</TD><TD>Santa Cruz</TD></TR>
tokens = ['2001', 'Santa Cruz'] 0 50


Any assistance would be greatly appreciated!

Steve
 
Gabriel Genellina

Sorry for not answering your question exactly, but I'd use
BeautifulSoup instead, it works even if the HTML is not well formed.


--
Gabriel Genellina
Softlab SRL






 
Paul McGuire

Just define a default value to be returned for MultiItem if the
Optional expression is not found:

MultiItem = Optional(OneOrMore(dataItem),default="")

Define default to be whatever string you choose.
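As a minimal sketch of how that could look end to end, the snippet below rebuilds the row grammar from the original post with the default in place and parses the row that has the blank first cell. The snake_case names and the cell() helper are only illustrative choices, not something from the original code.

```python
from pyparsing import Literal, OneOrMore, Optional, Word, alphanums

tr_start = Literal("<TR>").suppress()
tr_end = Literal("</TR>").suppress()
td_start = Literal("<TD>").suppress()
td_end = Literal("</TD>").suppress()

# Cell text may contain letters, digits, spaces, commas and colons.
data_item = Word(alphanums + " ,:")

# When no data_item is present, Optional contributes the default value
# instead of contributing nothing, so every row keeps three tokens.
multi_item = Optional(OneOrMore(data_item), default="")


def cell(name):
    """Illustrative helper: one <TD>...</TD> cell with a results name."""
    return td_start + multi_item.setResultsName(name) + td_end


row = tr_start + cell("status") + cell("year") + cell("title") + tr_end

result = row.parseString("<TR><TD></TD><TD>2001</TD><TD>Santa Cruz</TD></TR>")
print(list(result))  # expected: ['', '2001', 'Santa Cruz']
```

Any other placeholder string (for example "N/A") can be passed as the default if an empty string is awkward to handle downstream.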

-- Paul
 
Paul McGuire

I'd also suggest using the makeHTMLTags helper function for the TR and
TD tags:

trStart,trEnd = makeHTMLTags("TR")
tdStart,tdEnd = makeHTMLTags("TD")

makeHTMLTags includes a much more robust definition than just
Literal("<tag>"), including recognition of attributes and tolerance of
upper/lower case.
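A rough sketch of the difference, assuming the same cell grammar as in the original post: the tag expressions returned by makeHTMLTags are suppressed here so that only the cell text ends up in the results, and the sample row mixes upper and lower case and adds an attribute. The names row, cell and sample are made up for the illustration.

```python
from pyparsing import OneOrMore, Optional, Word, alphanums, makeHTMLTags

# makeHTMLTags returns a pair: (opening tag expression, closing tag expression)
tr_start, tr_end = makeHTMLTags("TR")
td_start, td_end = makeHTMLTags("TD")

data_item = Word(alphanums + " ,:")
multi_item = Optional(OneOrMore(data_item), default="")


def cell(name):
    """Illustrative helper: suppress the tags, keep only the cell text."""
    return td_start.suppress() + multi_item.setResultsName(name) + td_end.suppress()


row = (tr_start.suppress()
       + cell("status") + cell("year") + cell("title")
       + tr_end.suppress())

# Mixed-case tags and an attribute that a plain Literal("<TD>") would reject
sample = '<tr><td class="x"></td><TD>2001</TD><td>Santa Cruz</td></TR>'
result = row.parseString(sample)
print(list(result))
```

Without the suppress() calls the opening tags would add their own tokens (tag name and attributes) to the results, which may or may not be wanted.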

-- Paul
 
Steve

Hi Paul!

Thanks for your suggestions on the default value (I didn't know you
could do that!!) and the use of the makeHTMLTags helper!

Steve
 
