HTML Parsing

mtuller · Feb 10, 2007

Alright. I have tried everything I can find, but am not getting
anywhere. I have a web page that has data like this:

<tr >
<td headers="col1_1" style="width:21%" >
LETTER</td>
<td headers="col2_1" style="width:13%; text-align:right" >
33,699</td>
<td headers="col3_1" style="width:13%; text-align:right" >
1.0</td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>

What is shown is only a small section.

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
the html with pyparsing, and the examples will get it to print all
instances with span, of which there are a hundred or so when I use:

for srvrtokens in printCount.searchString(printerListHTML):
    print srvrtokens

If I change the last line to print srvrtokens[3] I get the values, but I
don't know how to grab a single line and then set that as a variable.

I have also tried Beautiful Soup, but had trouble understanding the
documentation, and HTMLParser doesn't seem to do what I want. Can
someone point me to a tutorial or give me some pointers on how to
parse html where there are multiple lines with the same tags and then
be able to go to a certain line and grab a value and set a variable's
value to that?

Thanks,

Mike

Gabriel Genellina · Feb 10, 2007

mtuller wrote:
<tr >
<td headers="col1_1" style="width:21%" >
LETTER</td>
<td headers="col2_1" style="width:13%; text-align:right" >
33,699</td>
<td headers="col3_1" style="width:13%; text-align:right" >
1.0</td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
[...]
I have also tried Beautiful Soup, but had trouble understanding the
documentation, and HTMLParser doesn't seem to do what I want. Can[...]

Just try harder with BeautifulSoup; it should work fine for your use case.
Unfortunately I can't give you an example right now.

Ayaz Ahmed Khan · Feb 11, 2007

"mtuller" typed:

I have also tried Beautiful Soup, but had trouble understanding the
documentation

As Gabriel has suggested, spend a little more time going through the
documentation of BeautifulSoup. It is pretty easy to grasp.

I'll give you an example: I want to extract the text between the
following span tags in a large HTML source file.

u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'

John Machin · Feb 11, 2007

Ayaz Ahmed Khan wrote:

As Gabriel has suggested, spend a little more time going through the
documentation of BeautifulSoup. It is pretty easy to grasp.

I'll give you an example: I want to extract the text between the
following span tags in a large HTML source file.

u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'

One can even use ElementTree, if the HTML is well-formed. See below.
However if it is as ill-formed as the sample (4th "td" element not
closed; I've omitted it below), then the OP would be better off
sticking with Beautiful Soup.

C:\junk>type element_soup.py
from xml.etree import cElementTree as ET
import cStringIO

guff = """
<tr >
<td headers="col1_1" style="width:21%" >
LETTER</td>
<td headers="col2_1" style="width:13%; text-align:right" >
33,699</td>
<td headers="col3_1" style="width:13%; text-align:right" >
1.0</td>
</tr>
"""

tree = ET.parse(cStringIO.StringIO(guff))
for elem in tree.getiterator('td'):
    key = elem.get('headers')
    # The sample cells hold their text directly (no child element),
    # so read elem.text and strip the leading newline.
    value = elem.text.strip()
    print repr(key), repr(value)

C:\junk>\python25\python element_soup.py
'col1_1' 'LETTER'
'col2_1' '33,699'
'col3_1' '1.0'

HTH,
John
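For readers on Python 3, the same idea can be sketched roughly as follows. This is only an illustration under a few assumptions: the row is well-formed (the unclosed fourth td is left out, as above), the cell text sits directly inside each td, and the names ROW, cells_by_header and page_count are made up for this example.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the sample row (the unclosed fourth td is omitted).
ROW = """
<tr >
<td headers="col1_1" style="width:21%" >
LETTER</td>
<td headers="col2_1" style="width:13%; text-align:right" >
33,699</td>
<td headers="col3_1" style="width:13%; text-align:right" >
1.0</td>
</tr>
"""


def cells_by_header(markup):
    """Map each td's headers attribute to its stripped text."""
    root = ET.fromstring(markup)
    return {td.get("headers"): (td.text or "").strip()
            for td in root.iter("td")}


cells = cells_by_header(ROW)
# Drop the thousands separator before converting, so the value can be
# stored as a number in the database.
page_count = int(cells["col2_1"].replace(",", ""))
print(page_count)
```

Keying the cells by their headers attribute means the value can be picked out by name (here col2_1) rather than by position in the list of matches.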

Fredrik Lundh · Feb 11, 2007

John said:
One can even use ElementTree, if the HTML is well-formed. See below.
However if it is as ill-formed as the sample (4th "td" element not
closed; I've omitted it below), then the OP would be better off
sticking with Beautiful Soup

or get the best of both worlds:

http://effbot.org/zone/element-soup.htm

</F>

Stefan Behnel · Feb 25, 2007

John said:
One can even use ElementTree, if the HTML is well-formed. See below.
However if it is as ill-formed as the sample (4th "td" element not
closed; I've omitted it below), then the OP would be better off
sticking with Beautiful Soup

Or (as we were talking about the best of both worlds already) use lxml's HTML
parser, which is also capable of parsing pretty disgusting HTML-like tag soup.

Stefan
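If no third-party parser is available, the standard library's html.parser module (Python 3) can also get through the ill-formed sample, because it reports tags as events and does not require them to be balanced. The sketch below is only one possible approach: CellCollector, SAMPLE and page_count are hypothetical names, and it assumes each cell of interest carries a headers attribute as in the original post.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of each td, keyed by its headers attribute."""

    def __init__(self):
        super().__init__()
        self.cells = {}
        self._current = None  # headers value of the td being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current = dict(attrs).get("headers")
            if self._current is not None:
                self.cells[self._current] = ""

    def handle_endtag(self, tag):
        # A closing tr also ends any td that was never closed.
        if tag in ("td", "tr"):
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.cells[self._current] += data


# The sample from the original post, including the unclosed fourth td.
SAMPLE = """
<tr >
<td headers="col1_1" style="width:21%" >
LETTER</td>
<td headers="col2_1" style="width:13%; text-align:right" >
33,699</td>
<td headers="col3_1" style="width:13%; text-align:right" >
1.0</td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>
"""

parser = CellCollector()
parser.feed(SAMPLE)
parser.close()

cells = {key: text.strip() for key, text in parser.cells.items()}
# Remove the thousands separator so the value converts to an integer.
page_count = int(cells["col2_1"].replace(",", ""))
print(page_count)
```

On a full page with many rows, the headers values would need to be distinct per row for this lookup to work; otherwise the collector would have to track rows as well.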
