python-parser running Beautiful Soup needs to be reviewed

Discussion in 'Python' started by Martin Kaspar, Dec 11, 2010.

  1. Hello community,

    I am new to Python and to Beautiful Soup as well.
    It is said to be a great tool for parsing and extracting content, so here I
    am:

    I want to take the content of a <td> tag of a table in an HTML
    document. For example, I have this table:

    <table class="bp_ergebnis_tab_info">
    <tr>
    <td>
    This is a sample text
    </td>

    <td>
    This is the second sample text
    </td>
    </tr>
    </table>

    How can I use Beautiful Soup to take the text "This is a sample text"?

    Should I use
    soup.findAll('table', attrs={'class': 'bp_ergebnis_tab_info'}) to get
    the whole table?

    See the target page: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323

    So, what do we have to do first?

    The first thing is to find the table.

    I do this with find. Using find rather than findAll returns the first
    match directly (rather than a list of all matches, in which case we'd
    have to add an extra [0] to take the first element of the list):


    table = soup.find('table', attrs={'class': 'bp_ergebnis_tab_info'})

    Then use find again, this time on the table, to find its first td:

    first_td = table.find('td')

    Then we have to use renderContents() to extract the textual contents:

    text = first_td.renderContents()

    ... and the job is done (though we may also want to use strip() to
    remove leading and trailing whitespace):

    trimmed_text = text.strip()

    This should give us:


    print trimmed_text
    This is a sample text

    as desired.
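
For readers on Python 3, the same steps can be sketched with the current bs4 package. This is only a minimal sketch, not the code from the post: it assumes the beautifulsoup4 package is installed, uses the built-in "html.parser" backend, and swaps renderContents() for get_text(), which returns the cell text as a string. The helper name first_cell_text is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

HTML = """
<table class="bp_ergebnis_tab_info">
<tr>
<td>
This is a sample text
</td>

<td>
This is the second sample text
</td>
</tr>
</table>
"""


def first_cell_text(html, css_class="bp_ergebnis_tab_info"):
    """Return the stripped text of the first <td> in the matching table, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: find the table by its class attribute.
    table = soup.find("table", attrs={"class": css_class})
    if table is None:
        return None
    # Step 2: search inside the table (not the whole document) for the first cell.
    first_td = table.find("td")
    if first_td is None:
        return None
    # Step 3: extract the text and trim surrounding whitespace.
    return first_td.get_text().strip()


if __name__ == "__main__":
    print(first_cell_text(HTML))
```

Searching from table rather than from soup matters on real pages, where an earlier table may contain the first td of the document.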


    What do you think about the code? I'd love to hear from you!

    greetings
    matze
    Martin Kaspar, Dec 11, 2010
    #1

  2. Stef Mientki (Guest)

    On 11-12-2010 17:24, Martin Kaspar wrote:
    > [snip]
    >
    > What do you think about the code? I love to hear from you!?

    I've no opinion.
    I'm just struggling with BeautifulSoup myself, finding it one of the toughest libs I've seen ;-)

    So the simplest solution I came up with:

    from BeautifulSoup import BeautifulSoup

    Text = """
    <table class="bp_ergebnis_tab_info">
    <tr>
    <td>
    This is a sample text
    </td>

    <td>
    This is the second sample text
    </td>
    </tr>
    </table>
    """
    Content = BeautifulSoup(Text)
    # contents[0] is the first child of the first <td>: its text node
    print Content.find('td').contents[0].strip()
    # prints: This is a sample text


    And now I wonder how to get the next contents !!

    cheers,
    Stef
    > greetings
    > matze
    Stef Mientki, Dec 11, 2010
    #2

  3. On Sat, 11 Dec 2010 22:38:43 +0100, Stef Mientki wrote:
    [snip]
    > So the simplest solution I came up with:
    >
    > Text = """
    > <table class="bp_ergebnis_tab_info">
    > <tr>
    > <td>
    > This is a sample text
    > </td>
    >
    > <td>
    > This is the second sample text
    > </td>
    > </tr>
    > </table>
    > """
    > Content = BeautifulSoup ( Text )
    > print Content.find('td').contents[0].strip()
    > >>> This is a sample text
    >
    > And now I wonder how to get the next contents !!


    Here's a suggestion:

    peter@eleodes:~$ python
    Python 2.5.2 (r252:60911, Jul 22 2009, 15:35:03)
    [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from BeautifulSoup import BeautifulSoup
    >>> Text = """
    ... <table class="bp_ergebnis_tab_info">
    ... <tr>
    ... <td>
    ... This is a sample text
    ... </td>
    ...
    ... <td>
    ... This is the second sample text
    ... </td>
    ... </tr>
    ... </table>
    ... """
    >>> Content = BeautifulSoup ( Text )
    >>> for xx in Content.findAll('td'):
    ...     print xx.contents[0].strip()
    ...
    This is a sample text
    This is the second sample text
    >>>
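
A rough Python 3 counterpart of the session above, plus one way to step from the first cell to the one after it. This is a sketch under assumptions: the beautifulsoup4 package (bs4) is installed, find_all is used as the bs4 spelling of findAll, and find_next_sibling("td") is used to skip the whitespace between cells. The helper names cell_texts and next_cell_text are hypothetical.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

HTML = """
<table class="bp_ergebnis_tab_info">
<tr>
<td>
This is a sample text
</td>

<td>
This is the second sample text
</td>
</tr>
</table>
"""


def cell_texts(html):
    """Return the stripped text of every <td> in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text().strip() for td in soup.find_all("td")]


def next_cell_text(html):
    """Return the text of the <td> that follows the first <td>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    first_td = soup.find("td")
    if first_td is None:
        return None
    # Passing a tag name skips the whitespace-only strings between the cells.
    second_td = first_td.find_next_sibling("td")
    if second_td is None:
        return None
    return second_td.get_text().strip()


if __name__ == "__main__":
    for text in cell_texts(HTML):
        print(text)
    print(next_cell_text(HTML))
```

The loop is usually the simpler choice when every cell is wanted; the sibling lookup is handy when only the neighbour of a known cell matters.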


    --
    To email me, substitute nowhere->spamcop, invalid->net.
    Peter Pearson, Dec 11, 2010
    #3
  4. On 11.12.2010 22:38, Stef Mientki wrote:
    > On 11-12-2010 17:24, Martin Kaspar wrote:
    >> [snip]
    >>
    >> What do you think about the code? I love to hear from you!?

    > I've no opinion.
    > I'm just struggling with BeautifulSoup myself, finding it one of the toughest libs I've seen ;-)


    Really? While I'm by no means an expert, I find it very easy to work
    with. It's very well structured IMHO.

    > So the simplest solution I came up with:
    > [snip]
    >
    > And now I wonder how to get the next contents !!


    Content = BeautifulSoup ( Text )
    for td in Content.findAll('td'):
        print td.string.strip() # or td.renderContents().strip()
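
One caveat with td.string, sketched here for Python 3 with bs4 (an assumption, since the thread itself uses the older BeautifulSoup module): in bs4, .string is None when a cell holds more than one child node, for example text followed by a <b> tag, so calling .strip() on it would raise AttributeError. get_text() joins all nested text instead. The MIXED markup and the helper name safe_cell_text are invented for illustration.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

# Hypothetical markup: the second cell mixes plain text with a nested tag.
MIXED = "<table><tr><td> plain </td><td>Hello <b>world</b></td></tr></table>"


def safe_cell_text(td):
    """Return all text inside a cell, nested tags included, stripped."""
    return td.get_text().strip()


soup = BeautifulSoup(MIXED, "html.parser")
cells = soup.find_all("td")

# .string is only set when the tag has exactly one child to descend into.
strings = [td.string for td in cells]
texts = [safe_cell_text(td) for td in cells]

if __name__ == "__main__":
    for raw, text in zip(strings, texts):
        print(repr(raw), "->", repr(text))
```

For the simple table in this thread both approaches give the same result; the difference only shows up once cells contain markup of their own.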
    Alexander Kapps, Dec 11, 2010
    #4
  5. Stef Mientki (Guest)

    >> I've no opinion.
    >> I'm just struggling with BeautifulSoup myself, finding it one of the toughest libs I've seen ;-)
    >
    > Really? While I'm by no means an expert, I find it very easy to work with. It's very well
    > structured IMHO.

    I think the cause lies in the documentation.
    The PySide documentation is much easier to understand (at least for me)

    http://www.pyside.org/docs/pyside/PySide/QtWebKit/QWebElement.html

    cheers,
    Stef
    Stef Mientki, Dec 12, 2010
    #5