Beautiful Soup iterator question....

Discussion in 'Python' started by cjl, Apr 20, 2007.

  1. cjl

    cjl Guest

    P:

    I am screen-scraping a table. The table has an unknown number of rows,
    but each row has exactly 8 cells. I would like to extract the data
    from the cells, but the first three cells in each row have their data
    nested inside other tags.

    So I have the following code:

    for row in table.findAll("tr"):
        for cell in row.findAll("td"):
            print cell.contents[0]

    This code prints out all the data, but of course the first three cells
    still contain their unwanted tags.

    I would like to do something like this:

    for cell1, cell2, cell3, cell4, cell5, cell6, cell7, cell8 in row.findAll("td"):

    Then treat each cell differently.

    I can't figure this out. Can anyone point me in the right direction?

    -CJL
     
    cjl, Apr 20, 2007

  2. Steve Holden

    Steve Holden Guest

    cjl wrote:
    > P:
    >
    > I am screen-scraping a table. The table has an unknown number of rows,
    > but each row has exactly 8 cells. I would like to extract the data
    > from the cells, but the first three cells in each row have their data
    > nested inside other tags.
    >
    > So I have the following code:
    >
    > for row in table.findAll("tr"):
    >     for cell in row.findAll("td"):
    >         print cell.contents[0]
    >
    > This code prints out all the data, but of course the first three cells
    > still contain their unwanted tags.
    >
    > I would like to do something like this:
    >
    > for cell1, cell2, cell3, cell4, cell5, cell6, cell7, cell8 in row.findAll("td"):
    >
    > Then treat each cell differently.
    >
    > I can't figure this out. Can anyone point me in the right direction?
    >

    Did you try something like this (untested)?

    cell1, cell2, cell3, cell4, cell5, \
    cell6, cell7, cell8 = row.findAll("td")

    There is no need for the "for" if you want to handle each cell
    differently, since you won't be iterating over them. And, as you saw,
    the "for" version doesn't work unless row.findAll(...) returns a
    sequence of eight-item containers.
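
    A minimal sketch of the difference, using a plain list of strings as a
    stand-in for whatever row.findAll("td") returns (the cell values here
    are made up for illustration):

```python
# Hypothetical stand-in for row.findAll("td"): one row with eight cells.
cells = ["a", "b", "c", "d", "e", "f", "g", "h"]

# Direct unpacking: one name per cell, no loop needed.
cell1, cell2, cell3, cell4, cell5, cell6, cell7, cell8 = cells

# The "for" form tries to unpack each individual element into eight names,
# so it fails unless every element is itself an eight-item container.
try:
    for c1, c2, c3, c4, c5, c6, c7, c8 in cells:
        pass
    loop_error = None
except ValueError as exc:
    loop_error = exc
```

    The plain assignment only succeeds when the right-hand side has exactly
    eight items; any other length raises ValueError.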

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC/Ltd http://www.holdenweb.com
    Skype: holdenweb http://del.icio.us/steve.holden
    Recent Ramblings http://holdenweb.blogspot.com
     
    Steve Holden, Apr 20, 2007

  3. Paul McGuire

    Paul McGuire Guest

    On Apr 20, 2:05 pm, Steve Holden wrote:
    <snip>
    >
    > did you try something like (untested)
    >
    > cell1, cell2, cell3, cell4, cell5, \
    > cell6, cell7, cell8 = row.findAll("td")
    >
    > There is no need for the "for" if you want to handle each cell
    > differently, since you won't be iterating over them. And, as you saw,
    > the "for" version doesn't work unless row.findAll(...) returns a
    > sequence of eight-item containers.
    >


    One defensive approach to handling rows that might have too few or too
    many elements is to pad the list with placeholders and then slice
    exactly the number of elements you need from it.

    cell1, cell2, cell3, cell4, cell5, \
    cell6, cell7, cell8 = (row.findAll("td") + [None]*8)[:8]
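
    A sketch of how the pad-and-slice idiom behaves, with plain lists
    standing in for the result of row.findAll("td"). The helper name
    pad_to_eight and the sample rows are hypothetical, not part of
    Beautiful Soup:

```python
def pad_to_eight(cells):
    """Return exactly eight items: pad short rows with None, truncate long ones."""
    return (list(cells) + [None] * 8)[:8]

short_row = ["a", "b", "c"]      # too few cells
long_row = list(range(10))       # too many cells

padded = pad_to_eight(short_row)
trimmed = pad_to_eight(long_row)

# Unpacking is now safe because the list always has eight items.
cell1, cell2, cell3, cell4, cell5, cell6, cell7, cell8 = padded
```

    Cells that were filled in with None need a check before any attribute
    access such as cell.contents, and extra cells in a long row are
    silently dropped.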

    -- Paul
     
    Paul McGuire, Apr 20, 2007
