scraping nested tables with BeautifulSoup


Gonzillaaa

I'm trying to get the data on the "Central London Property Price Guide"
box at the left hand side of this page
http://www.findaproperty.com/regi0018.html

I have managed to get the data :) but when I start looking for tables I
only get tables of depth 1. How do I go about accessing inner tables?
The same happens for links...

this is what I've got so far

import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

data = urlopen('http://www.findaproperty.com/regi0018.html').read()
soup = BeautifulSoup(data)

for tables in soup('table'):
    table = tables('table')
    if not table: continue
    print table #this returns only 1 table

    #this doesn't work at all
    nested_table = table('table')
    print nested_table

all suggestions welcome
 

Kent Johnson

> I'm trying to get the data on the "Central London Property Price Guide"
> box at the left hand side of this page
> http://www.findaproperty.com/regi0018.html
>
> I have managed to get the data :) but when I start looking for tables I
> only get tables of depth 1. How do I go about accessing inner tables?
> The same happens for links...
>
> this is what I've got so far
>
> import sys
> from urllib import urlopen
> from BeautifulSoup import BeautifulSoup
>
> data = urlopen('http://www.findaproperty.com/regi0018.html').read()
> soup = BeautifulSoup(data)
>
> for tables in soup('table'):
>     table = tables('table')
>     if not table: continue
>     print table #this returns only 1 table

There's something fishy here. soup('table') should yield all the tables
in the document, even nested ones. For example, this program:

data = '''
<body>
<table width='100%'>
<tr><td>
<TABLE WIDTH='150'>
<tr><td>Stuff</td></tr>
</table>
</td></tr>
</table>
</body>
'''

from BeautifulSoup import BeautifulSoup as BS

soup = BS(data)
for table in soup('table'):
    print table.get('width')


prints:
100%
150
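Kent's point can be cross-checked without BeautifulSoup at all: a correct parser reports nested tables alongside their ancestors. Here is a sketch of the same document walked with Python 3's standard-library html.parser (the TableFinder class name is made up for this example):

```python
from html.parser import HTMLParser

class TableFinder(HTMLParser):
    """Record the width attribute of every <table> start tag, nested or not."""
    def __init__(self):
        super().__init__()
        self.widths = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <TABLE WIDTH=...>
        # also matches here
        if tag == 'table':
            self.widths.append(dict(attrs).get('width'))

data = """
<body>
<table width='100%'>
<tr><td>
<TABLE WIDTH='150'>
<tr><td>Stuff</td></tr>
</table>
</td></tr>
</table>
</body>
"""

finder = TableFinder()
finder.feed(data)
print(finder.widths)  # outer table first, then the nested one
```

Both widths come back in document order, matching the behaviour Kent describes for soup('table').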

Another tidbit - if I open the page in Firefox and save it, then load
that file into BeautifulSoup, it finds 25 tables and this code finds the
table you want:

from BeautifulSoup import BeautifulSoup
data2 = open('regi0018-firefox.html')
soup = BeautifulSoup(data2)

print len(soup('table'))

priceGuide = soup('table', dict(bgcolor="#e0f0f8", border="0",
        cellpadding="2", cellspacing="2", width="150"))[1]
print priceGuide.tr


prints:
25
<tr><td bgcolor="#e0f0f8" valign="top"><font face="Arial"
size="2"><b>Central London Property Price Guide</b></font></td></tr>


Looking at the saved file, Firefox has clearly done some cleanup. So I
think you have to look at why BS is not processing the original data the
way you want. It seems to be choking on something.

Kent
 

Gonzillaaa

Hey Kent,

thanks for your reply. How exactly did you save the file in Firefox? If
I save the file locally I get the same error.

print len(soup('table')) gives me 4 instead of 25
 

Kent Johnson

> Hey Kent,
>
> thanks for your reply. How exactly did you save the file in Firefox? If
> I save the file locally I get the same error.

I think I right-clicked on the page and chose "Save page as..."

Here is a program that shows where BS is choking. It finds the last leaf
node in the parse data by descending the last child of each node:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

data = urlopen('http://www.findaproperty.com/regi0018.html').read()
soup = BeautifulSoup(data)

tag = soup
while hasattr(tag, 'contents') and tag.contents:
    tag = tag.contents[-1]

print type(tag)
print tag


It prints:
<class 'BeautifulSoup.NavigableString'>

<!/BUTTONS>

<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=2 WIDTH=100% BGCOLOR=F0F0F0>
<TD ALIGN=left VALIGN=top>
<snip lots more>

So for some reason BS thinks that everything from <!BUTTONS> to the end
is a single string.
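Modern parsers resolve this ambiguity explicitly: HTML5 treats `<!` followed by anything other than a proper comment or doctype as a "bogus comment", which is exactly the rewrite Firefox performed on the `<!BUTTONS>` markers. Python 3's html.parser follows the same rule, as this small probe shows (the Probe class name is made up for this sketch):

```python
from html.parser import HTMLParser

class Probe(HTMLParser):
    """Collect how <!BUTTONS>-style markers are reported by the parser."""
    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)

probe = Probe()
probe.feed("<!BUTTONS>some buttons here<!/BUTTONS>")
print(probe.comments)  # both bogus markers are reported as comments
```

So the malformed `<!FOO>` markers that choke the old BeautifulSoup are simply demoted to comments by parsers that implement the HTML5 rule.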

Kent
 

Gonzillaaa

So it must be the malformed HTML comment that is confusing BS. I might
try different methods to see if I get the same problem...

thanks
 

Kent Johnson

> Hey Kent,
>
> thanks for your reply. How exactly did you save the file in Firefox? If
> I save the file locally I get the same error.

The Firefox version, among other things, turns all the funky <!FOO> and
<!/FOO> tags into comments. Here is a way to do the same thing with BS:

import re
from urllib import urlopen
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup

# This tells BS to turn <!FOO> into <!-- FOO --> which allows it
# to do a better job parsing this data
fixExclRe = re.compile(r'<!(?!--)([^>]+)>')
BeautifulStoneSoup.PARSER_MASSAGE.append( (fixExclRe, r'<!-- \1 -->') )

data = urlopen('http://www.findaproperty.com/regi0018.html').read()
soup = BeautifulSoup(data)

priceGuide = soup('table', dict(bgcolor="e0f0f8", border="0",
        cellpadding="2", cellspacing="2", width="150"))[1]
print priceGuide

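The massage regex can be tried on its own to see exactly what it rewrites. Note the `(?!--)` negative lookahead, which leaves genuine comments untouched:

```python
import re

# Kent's rule: turn <!FOO> (but not a real <!-- comment -->) into <!-- FOO -->
fixExclRe = re.compile(r'<!(?!--)([^>]+)>')

snippet = "<!BUTTONS><a href='x'>link</a><!/BUTTONS><!-- keep me -->"
fixed = fixExclRe.sub(r'<!-- \1 -->', snippet)
print(fixed)
# <!-- BUTTONS --><a href='x'>link</a><!-- /BUTTONS --><!-- keep me -->
```

Both the opening and closing markers become well-formed comments, while the real comment passes through unchanged.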

Kent
 

Gonzillaaa

Thanks Kent, that works perfectly. How can I strip all the HTML and
easily create a dictionary of {location: price}?
 

Kent Johnson

> Thanks Kent, that works perfectly. How can I strip all the HTML and
> easily create a dictionary of {location: price}?

This should help:

prices = priceGuide.table

for tr in prices:
    print tr.a.string, tr.a.findNext('font').string

Kent
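To go from printed pairs to the {location: price} dictionary the question asked for, collect the two strings per row and hand them to dict(). The idea can be sketched without BeautifulSoup against a mock of the price-guide rows using the standard library's html.parser (the row structure below is an assumption for illustration, not the real page markup):

```python
from html.parser import HTMLParser

# Mock rows shaped like the price guide: an <a> holding the location,
# then a <font> holding the price (assumed structure, not the real page)
data = """
<table>
<tr><td><a href='area1'>Chelsea</a></td><td><font>450,000</font></td></tr>
<tr><td><a href='area2'>Camden</a></td><td><font>320,000</font></td></tr>
</table>
"""

class PricePairs(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []        # one [location, price] pair per row
        self.capture = None    # which tag's text we are waiting for

    def handle_starttag(self, tag, attrs):
        if tag in ('a', 'font'):
            self.capture = tag

    def handle_data(self, text):
        text = text.strip()
        if not text or self.capture is None:
            return
        if self.capture == 'a':
            self.pairs.append([text, None])
        else:  # the <font> text completes the current row's pair
            self.pairs[-1][1] = text
        self.capture = None

parser = PricePairs()
parser.feed(data)
prices = dict(parser.pairs)
print(prices)
```

With BeautifulSoup itself, the analogous one-liner over Kent's loop would be something like `dict((tr.a.string, tr.a.findNext('font').string) for tr in prices)` - untested here against the real page.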
 