scraping nested tables with BeautifulSoup

Discussion in 'Python' started by Gonzillaaa@gmail.com, Apr 4, 2006.

  1. Guest

    I'm trying to get the data on the "Central London Property Price Guide"
    box at the left hand side of this page
    http://www.findaproperty.com/regi0018.html

    I have managed to get the data :) but when I start looking for tables I
    only get tables of depth 1. How do I go about accessing inner tables?
    The same happens for links...

    this is what I've got so far:

    import sys
    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup

    data = urlopen('http://www.findaproperty.com/regi0018.html').read()
    soup = BeautifulSoup(data)

    for tables in soup('table'):
        table = tables('table')
        if not table: continue
        print table  # this returns only 1 table

    #this doesn't work at all

    nested_table = table('table')
    print nested_table

    all suggestions welcome
    , Apr 4, 2006
    #1

  2. Kent Johnson Guest

    wrote:
    > I'm trying to get the data on the "Central London Property Price Guide"
    > box at the left hand side of this page
    > http://www.findaproperty.com/regi0018.html
    >
    > I have managed to get the data :) but when I start looking for tables I
    > only get tables of depth 1. How do I go about accessing inner tables?
    > The same happens for links...
    >
    > this is what I've got so far:
    >
    > import sys
    > from urllib import urlopen
    > from BeautifulSoup import BeautifulSoup
    >
    > data = urlopen('http://www.findaproperty.com/regi0018.html').read()
    > soup = BeautifulSoup(data)
    >
    > for tables in soup('table'):
    >     table = tables('table')
    >     if not table: continue
    >     print table  # this returns only 1 table


    There's something fishy here. soup('table') should yield all the tables
    in the document, even nested ones. For example, this program:

    data = '''
    <body>
    <table width='100%'>
    <tr><td>
    <TABLE WIDTH='150'>
    <tr><td>Stuff</td></tr>
    </table>
    </td></tr>
    </table>
    </body>
    '''

    from BeautifulSoup import BeautifulSoup as BS

    soup = BS(data)
    for table in soup('table'):
        print table.get('width')


    prints:
    100%
    150

    Another tidbit - if I open the page in Firefox and save it, then open
    that file into BeautifulSoup, it finds 25 tables and this code finds the
    table you want:

    from BeautifulSoup import BeautifulSoup
    data2 = open('regi0018-firefox.html')
    soup = BeautifulSoup(data2)

    print len(soup('table'))

    priceGuide = soup('table', dict(bgcolor="#e0f0f8", border="0",
    cellpadding="2", cellspacing="2", width="150"))[1]
    print priceGuide.tr


    prints:
    25
    <tr><td bgcolor="#e0f0f8" valign="top"><font face="Arial"
    size="2"><b>Central London Property Price Guide</b></font></td></tr>


    Looking at the saved file, Firefox has clearly done some cleanup. So I
    think you have to look at why BS is not processing the original data the
    way you want. It seems to be choking on something.

    Kent
    Kent Johnson, Apr 4, 2006
    #2

  3. Guest

    Hey Kent,

    Thanks for your reply. How exactly did you save the file in Firefox? If
    I save the file locally I get the same error.

    print len(soup('table')) gives me 4 instead of 25
    , Apr 4, 2006
    #3
  4. Kent Johnson Guest

    wrote:
    > Hey Kent,
    >
    > Thanks for your reply. How exactly did you save the file in Firefox? If
    > I save the file locally I get the same error.


    I think I right-clicked on the page and chose "Save page as..."

    Here is a program that shows where BS is choking. It finds the last leaf
    node in the parse data by descending the last child of each node:

    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup

    data = urlopen('http://www.findaproperty.com/regi0018.html').read()
    soup = BeautifulSoup(data)

    tag = soup
    while hasattr(tag, 'contents') and tag.contents:
        tag = tag.contents[-1]

    print type(tag)
    print tag


    It prints:
    <class 'BeautifulSoup.NavigableString'>

    <!/BUTTONS>

    <TABLE BORDER=0 CELLSPACING=0 CELLPADDING=2 WIDTH=100% BGCOLOR=F0F0F0>
    <TD ALIGN=left VALIGN=top>
    <snip lots more>

    So for some reason BS thinks that everything from <!BUTTONS> to the end
    is a single string.
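    [Editor's aside, not part of the original thread: the era's SGML-based
    BeautifulSoup choked here, but modern parsers recover from these funky
    declarations. For comparison, Python 3's stdlib html.parser treats
    <!BUTTONS> as a bogus comment and keeps parsing the rest of the page:]

    ```python
    from html.parser import HTMLParser

    class EventLogger(HTMLParser):
        """Record comments and start tags to see how <!BUTTONS> is handled."""
        def __init__(self):
            super().__init__()
            self.events = []

        def handle_comment(self, data):
            self.events.append(('comment', data))

        def handle_starttag(self, tag, attrs):
            self.events.append(('start', tag))

    parser = EventLogger()
    parser.feed("<!BUTTONS><table><tr><td>Stuff</td></tr></table>")
    print(parser.events[:2])  # [('comment', 'BUTTONS'), ('start', 'table')]
    ```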

    Kent
    Kent Johnson, Apr 4, 2006
    #4
  5. Guest

    So it must be the malformed HTML comment that is confusing BS. I might
    try different methods to see if I get the same problem...

    thanks
    , Apr 4, 2006
    #5
  6. Kent Johnson Guest

    wrote:
    > Hey Kent,
    >
    > Thanks for your reply. How exactly did you save the file in Firefox? If
    > I save the file locally I get the same error.


    The Firefox version, among other things, turns all the funky <!FOO> and
    <!/FOO> tags into comments. Here is a way to do the same thing with BS:

    import re
    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup

    # This tells BS to turn <!FOO> into <!-- FOO --> which allows it
    # to do a better job parsing this data
    fixExclRe = re.compile(r'<!(?!--)([^>]+)>')
    BeautifulStoneSoup.PARSER_MASSAGE.append( (fixExclRe, r'<!-- \1 -->') )

    data = urlopen('http://www.findaproperty.com/regi0018.html').read()
    soup = BeautifulSoup(data)

    priceGuide = soup('table', dict(bgcolor="e0f0f8", border="0",
    cellpadding="2", cellspacing="2", width="150"))[1]
    print priceGuide
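    [To see what that massage rule does on its own, here is the same regex
    applied with plain re; the sample string below is invented for
    illustration, but uses the same <!BUTTONS> markers seen in the page.
    The (?!--) lookahead keeps genuine comments untouched:]

    ```python
    import re

    # Same pattern as above: match <!FOO> declarations, while the (?!--)
    # lookahead skips anything that is already a real <!-- comment -->.
    fixExclRe = re.compile(r'<!(?!--)([^>]+)>')

    sample = "<!BUTTONS><p>hi</p><!-- real comment --><!/BUTTONS>"
    fixed = fixExclRe.sub(r'<!-- \1 -->', sample)
    print(fixed)  # <!-- BUTTONS --><p>hi</p><!-- real comment --><!-- /BUTTONS -->
    ```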


    Kent
    Kent Johnson, Apr 4, 2006
    #6
  7. Guest

    Thanks Kent, that works perfectly. How can I strip all the HTML and
    easily create a dictionary of {location: price}?
    , Apr 4, 2006
    #7
  8. Kent Johnson Guest

    wrote:
    > Thanks Kent, that works perfectly. How can I strip all the HTML and
    > easily create a dictionary of {location: price}?


    This should help:

    prices = priceGuide.table

    for tr in prices('tr'):  # iterate only the <tr> tags, not stray whitespace nodes
        print tr.a.string, tr.a.findNext('font').string
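    [If BeautifulSoup isn't available, here is a rough stdlib-only sketch of
    the same idea. The two sample rows below are invented for illustration;
    the real page pairs each area name in an <a> tag with a price in a
    <font> tag, which is what the loop above relies on:]

    ```python
    import re

    # Invented sample rows mimicking the price-guide structure described above.
    rows = '''
    <tr><td><a href="area1.html">Belgravia</a> <font size="2">&pound;1,200,000</font></td></tr>
    <tr><td><a href="area2.html">Chelsea</a> <font size="2">&pound;950,000</font></td></tr>
    '''

    # Pair each <a>location</a> with the first following <font>price</font>.
    pair_re = re.compile(r'<a[^>]*>([^<]+)</a>.*?<font[^>]*>([^<]+)</font>', re.S)
    prices = dict(pair_re.findall(rows))
    print(prices)  # {'Belgravia': '&pound;1,200,000', 'Chelsea': '&pound;950,000'}
    ```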

    Kent
    Kent Johnson, Apr 4, 2006
    #8
