Parsing HTML

mtuller · Feb 10, 2007

Alright. I have tried everything I can find, but am not getting
anywhere. I have a web page that has data like this:

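<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
<span class="hpPageText" >7,428</span></td>
</tr>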

What is shown is only a small section.

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
the html with pyparsing, and the examples will get it to print all
instances with span, of which there are a hundred or so when I use:

for srvrtokens in printCount.searchString(printerListHTML):
    print srvrtokens

If I set the last line to srvrtokens[3] I get the values, but I don't
know how to grab a single line and then set that as a variable.

I have also tried Beautiful Soup, but had trouble understanding the
documentation, and HTMLParser doesn't seem to do what I want. Can
someone point me to a tutorial or give me some pointers on how to
parse html where there are multiple lines with the same tags and then
be able to go to a certain line and grab a value and set a variable's
value to that?

Thanks,

Mike

Samuel Karl Peterson · Feb 11, 2007

mtuller said:
Alright. I have tried everything I can find, but am not getting
anywhere. I have a web page that has data like this:
[snip]

What is shown is only a small section.

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database.
[snip]

I have also tried Beautiful Soup, but had trouble understanding the
documentation.

====================
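# BeautifulSoup 4 spelling; the 2007-era BeautifulSoup 3 import was
# "from BeautifulSoup import BeautifulSoup as parser"
from bs4 import BeautifulSoup as parser

soup = parser("""<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
<span class="hpPageText" >7,428</span></td>
</tr>""", "html.parser")

# first <td> with headers="col2_1"; its <span> holds the number
cell = soup.find('td', headers='col2_1')
value = int(cell.span.contents[0].replace(',', ''))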
====================

Hope that helped. This code assumes there aren't any td tags with
headers=col2_1 that come before the value you are trying to extract.
There are several ways to do things in BeautifulSoup. You should play
around with BeautifulSoup in the interactive prompt. It's simply
awesome if you don't need speed on your side.
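
For anyone who wants to stay inside the standard library, roughly the same lookup can be sketched with html.parser. This is only an illustration under assumptions: the class name CellGrabber and its attributes are made up here, and the sample markup assumes each number sits in a span inside its td (the sketch also works if the span is absent, since it just collects the cell's text).

```python
from html.parser import HTMLParser


class CellGrabber(HTMLParser):
    """Collect the text of every <td> whose headers attribute matches."""

    def __init__(self, headers_value):
        super().__init__()
        self.headers_value = headers_value  # e.g. "col2_1"
        self.in_target = False              # True while inside a matching <td>
        self.values = []                    # one text entry per matching cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # every new <td> resets the state, so an unclosed cell cannot leak
            self.in_target = dict(attrs).get("headers") == self.headers_value
            if self.in_target:
                self.values.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.values[-1] += data


html = """<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
</tr>"""

grabber = CellGrabber("col2_1")
grabber.feed(html)
grabber.close()

# take the first matching cell, drop whitespace and the thousands separator
value = int(grabber.values[0].strip().replace(",", ""))
print(value)
```

If the page has several rows, grabber.values holds one entry per matching cell in document order, so a particular row can be picked by index.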

Paul McGuire · Feb 11, 2007

mtuller said:
[snip]

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
the html with pyparsing, and the examples will get it to print all
instances with span, of which there are a hundred or so when I use:

for srvrtokens in printCount.searchString(printerListHTML):
    print srvrtokens

If I set the last line to srvrtokens[3] I get the values, but I don't
know how to grab a single line and then set that as a variable.

So what you are saying is that you need to make your pattern more
specific. So I suggest adding these items to your matching pattern:
- only match span if inside a <td> with attribute 'headers="col2_1"'
- only match if the span body is an integer (with optional comma
  separator for thousands)

This grammar adds these more specific tests for matching the input
HTML (note also the use of results names to make it easy to extract
the integer number, and a parse action added to integer to convert the
'33,699' string to the integer 33699).

-- Paul

htmlSource = """<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
<span class="hpPageText" >7,428</span></td>
</tr>"""

from pyparsing import makeHTMLTags, Word, nums, ParseException

tdStart, tdEnd = makeHTMLTags('td')
spanStart, spanEnd = makeHTMLTags('span')

def onlyAcceptWithTagAttr(attrname, attrval):
    # build a parse action that rejects a tag unless attrname == attrval
    def action(tagAttrs):
        if not (attrname in tagAttrs and tagAttrs[attrname] == attrval):
            raise ParseException("", 0, "")
    return action

tdStart.setParseAction(onlyAcceptWithTagAttr("headers", "col2_1"))
spanStart.setParseAction(onlyAcceptWithTagAttr("class", "hpPageText"))

# a run of digits and commas, converted to an int by the parse action
integer = Word(nums, nums + ',')
integer.setParseAction(lambda t: int("".join(c for c in t[0] if c != ',')))

patt = (tdStart + spanStart + integer.setResultsName("intValue") +
        spanEnd + tdEnd)

for matches in patt.searchString(htmlSource):
    print(matches.intValue)

prints:
33699
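
One caveat on the integer test: Word(nums, nums+',') accepts any run of digits and commas, so a string like '3,,6' would also pass. If stricter checking is wanted, the rule "an integer with optional comma separators for thousands" can be sketched with the standard library alone. The helper name parse_grouped_int is hypothetical, and this is only one possible reading of the rule, not part of the grammar above.

```python
import re

# plain digits ("33699") or groups of three separated by commas ("33,699")
_GROUPED_INT = re.compile(r"\d{1,3}(?:,\d{3})*|\d+")


def parse_grouped_int(text):
    """Return the int for strings like '33,699', or None if text is not one."""
    text = text.strip()
    if _GROUPED_INT.fullmatch(text) is None:
        return None
    return int(text.replace(",", ""))


print(parse_grouped_int("33,699"))
print(parse_grouped_int("1.0"))
```

Returning None instead of raising makes it easy to skip cells such as LETTER or 1.0 while looping over every td in a row.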

Frederic Rentsch · Feb 14, 2007

mtuller said:
[snip]

Posted problems rarely provide exhaustive information. It's just not
possible. I have been taking shots in the dark of late, suggesting a
stream-editing approach to extracting data from HTML files. The
mainstream approach is to use a parser (Beautiful Soup or pyparsing).
Often nothing more is attempted than the location and extraction of
some text, irrespective of page layout. This can sometimes be done
with a simple regular expression, or with a stream editor if a regular
expression gets too unwieldy. The advantage of the stream editor over
a parser is that it doesn't mobilize an arsenal of unneeded
functionality and therefore tends to be easier, faster and shorter to
implement. The editor's inability to understand structure isn't a
shortcoming when structure doesn't matter, and can even be an
advantage in the presence of malformed input that sends a parser on a
tough and potentially hazardous mission for no purpose at all.

SE doesn't impose the study of massive documentation, nor the
memorization of dozens of classes, methods and whatnot. The following
four lines would solve the OP's problem (provided the post really is
all there is to the problem):

>>> import re, SE # http://cheeseshop.python.org/pypi/SE/2.3
>>> Filter = SE.SE ('<EAT> "~(?i)col[0-9]_[0-9](.|\n)*?/td>~==SOME SPLIT MARK"')
>>> r = re.compile ('(?i)(col[0-9]_[0-9])(.|\n)*?([0-9,]+)</span')
>>> for line in Filter (s).split ('SOME SPLIT MARK'):
...     print r.search (line).group (1, 3)
...
('col2_1', '33,699')
('col3_1', '0')
('col4_1', '7,428')

-----------------------------------------------------------------------

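Input:

>>> s = '''
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
<span class="hpPageText" >7,428</span></td>
</tr>'''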

The SE object handles file input too:

>>> [...] # '' commands string output
...     print r.search (line).group (1, 3)
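
The "simple regular expression" route mentioned above can also be sketched with nothing but the re module. This is a minimal sketch, not the SE method: the names cell and cells are made up for illustration, and the pattern assumes every value sits in a span inside a td that carries a headers attribute, as in the sample input.

```python
import re

s = '''<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
<span class="hpPageText" >7,428</span></td>
</tr>'''

# group 1: the headers value; group 2: everything inside the following span
# (?i) ignores case, (?s) lets "." cross line breaks
cell = re.compile(r'(?is)headers="(col\d+_\d+)".*?<span[^>]*>(.*?)</span>')

# map each headers value to its raw cell text
cells = dict(cell.findall(s))

# pick one cell by name and convert it for the database insert
value = int(cells["col2_1"].replace(",", ""))
print(value)
```

Capturing the whole span body, rather than only the digits that sit right before the closing tag, keeps a value like 1.0 intact. Like any regex over HTML, it will break if the markup changes shape, so it suits quick extraction from pages whose layout is known.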
