scraping nested tables with BeautifulSoup


Gonzillaaa

I'm trying to get the data on the "Central London Property Price Guide"
box at the left hand side of this page
http://www.findaproperty.com/regi0018.html

I have managed to get the data :) but when I start looking for tables I
only get tables of depth 1. How do I go about accessing inner tables?
The same happens for links...

this is what I've got so far

import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

data = urlopen('http://www.findaproperty.com/regi0018.html').read()
soup = BeautifulSoup(data)

for tables in soup('table'):
    table = tables('table')
    if not table: continue
    print table #this returns only 1 table

    #this doesn't work at all
    nested_table = table('table')
    print nested_table

all suggestions welcome
 

Kent Johnson

> I'm trying to get the data on the "Central London Property Price Guide"
> box at the left hand side of this page
> http://www.findaproperty.com/regi0018.html
>
> I have managed to get the data :) but when I start looking for tables I
> only get tables of depth 1. How do I go about accessing inner tables?
> The same happens for links...
>
> this is what I've got so far
>
> import sys
> from urllib import urlopen
> from BeautifulSoup import BeautifulSoup
>
> data = urlopen('http://www.findaproperty.com/regi0018.html').read()
> soup = BeautifulSoup(data)
>
> for tables in soup('table'):
>     table = tables('table')
>     if not table: continue
>     print table #this returns only 1 table

There's something fishy here. soup('table') should yield all the tables
in the document, even nested ones. For example, this program:

data = '''
<body>
<table width='100%'>
<tr><td>
<TABLE WIDTH='150'>
<tr><td>Stuff</td></tr>
</table>
</td></tr>
</table>
</body>
'''

from BeautifulSoup import BeautifulSoup as BS

soup = BS(data)
for table in soup('table'):
    print table.get('width')


prints:
100%
150
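Kent's point can be cross-checked without BeautifulSoup at all: a correct parser reports nested tables alongside their ancestors. Here is a sketch of the same document walked with Python 3's standard-library html.parser (the TableFinder class name is made up for this example):

```python
from html.parser import HTMLParser

class TableFinder(HTMLParser):
    """Record the width attribute of every <table> start tag, nested or not."""
    def __init__(self):
        super().__init__()
        self.widths = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <TABLE WIDTH=...>
        # also matches here
        if tag == 'table':
            self.widths.append(dict(attrs).get('width'))

data = """
<body>
<table width='100%'>
<tr><td>
<TABLE WIDTH='150'>
<tr><td>Stuff</td></tr>
</table>
</td></tr>
</table>
</body>
"""

finder = TableFinder()
finder.feed(data)
print(finder.widths)  # outer table first, then the nested one
```

Both widths come back in document order, matching the behaviour Kent describes for soup('table').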

Another tidbit - if I open the page in Firefox and save it, then load
that file into BeautifulSoup, it finds 25 tables and this code finds the
table you want:

from BeautifulSoup import BeautifulSoup
data2 = open('regi0018-firefox.html')
soup = BeautifulSoup(data2)

print len(soup('table'))

priceGuide = soup('table', dict(bgcolor="#e0f0f8", border="0",
        cellpadding="2", cellspacing="2", width="150"))[1]
print priceGuide.tr


prints:
25
<tr><td bgcolor="#e0f0f8" valign="top"><font face="Arial"
size="2"><b>Central London Property Price Guide</b></font></td></tr>


Looking at the saved file, Firefox has clearly done some cleanup. So I
think you have to look at why BS is not processing the original data the
way you want. It seems to be choking on something.

Kent
 

Gonzillaaa

Hey Kent,

thanks for your reply. How exactly did you save the file in Firefox? If
I save the file locally I get the same error.

print len(soup('table')) gives me 4 instead of 25
 

Kent Johnson

> Hey Kent,
>
> thanks for your reply. How exactly did you save the file in Firefox? If
> I save the file locally I get the same error.

I think I right-clicked on the page and chose "Save page as..."

Here is a program that shows where BS is choking. It finds the last leaf
node in the parse data by descending the last child of each node:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

data = urlopen('http://www.findaproperty.com/regi0018.html').read()
soup = BeautifulSoup(data)

tag = soup
while hasattr(tag, 'contents') and tag.contents:
    tag = tag.contents[-1]

print type(tag)
print tag


It prints:
<class 'BeautifulSoup.NavigableString'>

<!/BUTTONS>

<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=2 WIDTH=100% BGCOLOR=F0F0F0>
<TD ALIGN=left VALIGN=top>
<snip lots more>

So for some reason BS thinks that everything from <!BUTTONS> to the end
is a single string.
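Modern parsers resolve this ambiguity explicitly: HTML5 treats `<!` followed by anything other than a proper comment or doctype as a "bogus comment", which is exactly the rewrite Firefox performed on the `<!BUTTONS>` markers. Python 3's html.parser follows the same rule, as this small probe shows (the Probe class name is made up for this sketch):

```python
from html.parser import HTMLParser

class Probe(HTMLParser):
    """Collect how <!BUTTONS>-style markers are reported by the parser."""
    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)

probe = Probe()
probe.feed("<!BUTTONS>some buttons here<!/BUTTONS>")
print(probe.comments)  # both bogus markers are reported as comments
```

So the malformed `<!FOO>` markers that choke the old BeautifulSoup are simply demoted to comments by parsers that implement the HTML5 rule.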

Kent
 

Gonzillaaa

So it must be the malformed HTML comment that is confusing BS. I might
try different methods to see if I get the same problem...

thanks
 

Kent Johnson

> Hey Kent,
>
> thanks for your reply. How exactly did you save the file in Firefox? If
> I save the file locally I get the same error.

The Firefox version, among other things, turns all the funky <!FOO> and
<!/FOO> tags into comments. Here is a way to do the same thing with BS:

import re
from urllib import urlopen
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup

# This tells BS to turn <!FOO> into <!-- FOO --> which allows it
# to do a better job parsing this data
fixExclRe = re.compile(r'<!(?!--)([^>]+)>')
BeautifulStoneSoup.PARSER_MASSAGE.append( (fixExclRe, r'<!-- \1 -->') )

data = urlopen('http://www.findaproperty.com/regi0018.html').read()
soup = BeautifulSoup(data)

priceGuide = soup('table', dict(bgcolor="e0f0f8", border="0",
        cellpadding="2", cellspacing="2", width="150"))[1]
print priceGuide

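The massage regex can be tried on its own to see exactly what it rewrites. Note the `(?!--)` negative lookahead, which leaves genuine comments untouched:

```python
import re

# Kent's rule: turn <!FOO> (but not a real <!-- comment -->) into <!-- FOO -->
fixExclRe = re.compile(r'<!(?!--)([^>]+)>')

snippet = "<!BUTTONS><a href='x'>link</a><!/BUTTONS><!-- keep me -->"
fixed = fixExclRe.sub(r'<!-- \1 -->', snippet)
print(fixed)
# <!-- BUTTONS --><a href='x'>link</a><!-- /BUTTONS --><!-- keep me -->
```

Both the opening and closing markers become well-formed comments, while the real comment passes through unchanged.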

Kent
 

Gonzillaaa

Thanks Kent, that works perfectly. How can I strip all the HTML and
easily create a dictionary of {location: price}?
 

Kent Johnson

> Thanks Kent, that works perfectly. How can I strip all the HTML and
> easily create a dictionary of {location: price}?

This should help:

prices = priceGuide.table

for tr in prices:
    print tr.a.string, tr.a.findNext('font').string

Kent
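To go from printed pairs to the {location: price} dictionary the question asked for, collect the two strings per row and hand them to dict(). The idea can be sketched without BeautifulSoup against a mock of the price-guide rows using the standard library's html.parser (the row structure below is an assumption for illustration, not the real page markup):

```python
from html.parser import HTMLParser

# Mock rows shaped like the price guide: an <a> holding the location,
# then a <font> holding the price (assumed structure, not the real page)
data = """
<table>
<tr><td><a href='area1'>Chelsea</a></td><td><font>450,000</font></td></tr>
<tr><td><a href='area2'>Camden</a></td><td><font>320,000</font></td></tr>
</table>
"""

class PricePairs(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []        # one [location, price] pair per row
        self.capture = None    # which tag's text we are waiting for

    def handle_starttag(self, tag, attrs):
        if tag in ('a', 'font'):
            self.capture = tag

    def handle_data(self, text):
        text = text.strip()
        if not text or self.capture is None:
            return
        if self.capture == 'a':
            self.pairs.append([text, None])
        else:  # the <font> text completes the current row's pair
            self.pairs[-1][1] = text
        self.capture = None

parser = PricePairs()
parser.feed(data)
prices = dict(parser.pairs)
print(prices)
```

With BeautifulSoup itself, the analogous one-liner over Kent's loop would be something like `dict((tr.a.string, tr.a.findNext('font').string) for tr in prices)` - untested here against the real page.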
 