HTML Parsing

mtuller

Alright. I have tried everything I can find, but am not getting
anywhere. I have a web page that has data like this:

<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>

What is shown is only a small section.

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
the html with pyparsing, and the examples will get it to print all
instances with span, of which there are a hundred or so when I use:

for srvrtokens in printCount.searchString(printerListHTML):
    print srvrtokens

If I set the last line to srvrtokens[3] I get the values, but I don't
know how to grab a single line and then set that as a variable.

I have also tried Beautiful Soup, but had trouble understanding the
documentation, and HTMLParser doesn't seem to do what I want. Can
someone point me to a tutorial or give me some pointers on how to
parse html where there are multiple lines with the same tags and then
be able to go to a certain line and grab a value and set a variable's
value to that?

Thanks,

Mike
 
Gabriel Genellina

<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
[...]
I have also tried Beautiful Soup, but had trouble understanding the
documentation, and HTMLParser doesn't seem to do what I want. Can[...]

Just try harder with BeautifulSoup; it should work fine for your use case.
Unfortunately I can't give you an example right now.
 
Ayaz Ahmed Khan

"mtuller" typed:
I have also tried Beautiful Soup, but had trouble understanding the
documentation

As Gabriel has suggested, spend a little more time going through the
documentation of BeautifulSoup. It is pretty easy to grasp.

I'll give you an example: I want to extract the text between the
following span tags in a large HTML source file.

u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'
 
John Machin

Ayaz Ahmed Khan said:
As Gabriel has suggested, spend a little more time going through the
documentation of BeautifulSoup. It is pretty easy to grasp.

I'll give you an example: I want to extract the text between the
following span tags in a large HTML source file.



u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'

One can even use ElementTree, if the HTML is well-formed. See below.
However if it is as ill-formed as the sample (4th "td" element not
closed; I've omitted it below), then the OP would be better off
sticking with Beautiful Soup :)

C:\junk>type element_soup.py
from xml.etree import cElementTree as ET
import cStringIO

guff = """
<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
</tr>
"""

tree = ET.parse(cStringIO.StringIO(guff))
for elem in tree.getiterator('td'):
    key = elem.get('headers')
    assert elem[0].tag == 'span'
    value = elem[0].text
    print repr(key), repr(value)

C:\junk>\python25\python element_soup.py
'col1_1' 'LETTER'
'col2_1' '33,699'
'col3_1' '1.0'

HTH,
John
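
For readers on Python 3, where the cElementTree and cStringIO modules are gone, the same idea might look roughly like the sketch below. It assumes the well-formed fragment from John's script (unclosed fourth td omitted) and goes one step further by storing the col2_1 count in a variable as an integer, which is what the original question asked for. The names ROW, cell_text and page_count are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the sample row (the unclosed fourth <td> is omitted).
ROW = """
<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
</tr>
"""

row = ET.fromstring(ROW)


def cell_text(row, header_id):
    """Return the span text of the <td> whose headers attribute matches.

    Hypothetical helper: returns None if no such cell or span exists.
    """
    for td in row.iter("td"):
        if td.get("headers") == header_id:
            span = td.find("span")
            return span.text if span is not None else None
    return None


# Grab the single value and turn "33,699" into an integer for the database.
raw = cell_text(row, "col2_1")
page_count = int(raw.replace(",", ""))
print(page_count)
```

Looking the cell up by its headers attribute, rather than by position, should keep working if the number of span elements on the page changes.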
 
Stefan Behnel

John said:
One can even use ElementTree, if the HTML is well-formed. See below.
However if it is as ill-formed as the sample (4th "td" element not
closed; I've omitted it below), then the OP would be better off
sticking with Beautiful Soup :)

Or (as we were talking about the best of both worlds already) use lxml's HTML
parser, which is also capable of parsing pretty disgusting HTML-like tag soup.

Stefan
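
lxml is a third-party package, so as a dependency-free illustration of the same point (a parser that does not choke on the unclosed fourth td), here is a rough sketch using the standard library's html.parser in Python 3. This is a swapped-in technique rather than lxml itself, and CellCollector, PAGE and page_count are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser

# The fragment from the original post, including the unclosed fourth <td>.
PAGE = """
<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>
"""


class CellCollector(HTMLParser):
    """Map each td's headers attribute to the text of the span inside it."""

    def __init__(self):
        super().__init__()
        self.cells = {}        # e.g. {"col2_1": "33,699"}
        self._current = None   # headers value of the <td> being read
        self._in_span = False  # True while inside a <span> within a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current = dict(attrs).get("headers")
        elif tag == "span" and self._current is not None:
            self._in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag in ("td", "tr"):
            # A closing </tr> also ends a <td> that was never closed.
            self._current = None

    def handle_data(self, data):
        if self._in_span:
            previous = self.cells.get(self._current, "")
            self.cells[self._current] = previous + data.strip()


collector = CellCollector()
collector.feed(PAGE)
collector.close()

# Pick out the one value of interest and convert it for storage.
page_count = int(collector.cells["col2_1"].replace(",", ""))
print(page_count)
```

Because html.parser only reports tags as it meets them and does not build a tree, the missing closing tag never becomes an error; the trade-off is that the bookkeeping (which cell is open) has to be done by hand.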
 
