BeautifulSoup: problems with parsing a website

Marco Hornung · May 28, 2008

Hi guys,

I'm using the Python library BeautifulSoup (BS) to parse some
information out of a German soccer website.
I spent some quality time with the BS docs, but I couldn't really
figure out how to get what I was looking for.

Here's the deal:
I want to parse the article shown on the website. To do so, I want to
use the tag <div class="txt_fliesstext"> as a starting point. Once I
have found that tag, I want to get all following "br" tags until a new
CSS class comes up.
I tried several options with findAll(), but nothing seems to work.
soup.findAll('br', attrs={'class':'txt_fliesstext'}, text=True) comes
back with a thousand additional tags that I don't want.
soup.findAll(attrs={'class':'txt_fliesstext'}) gives me a much better
result, but in this case I only get a few tags instead of all the
tags I want.

Any suggestions?
Thanks in advance!

Website:
http://www.bundesliga.de/de/liga/news/2007/index.php?f=94820.php
Some html-code of the website:
<div id="area_headline">
<div class="txt_headline_red">Erst Höhenflug, dann Absturz</div>
</div>
<div id="area_fliesstext">
<div class="txt_fliesstext_bold">Mit 28 Punkten stand der KSC
nach der Hinrunde sensationell auf Platz 6.</div>
<br><br>
<div class="txt_fliesstext">Doch in der Rückrunde brachen
die Badener regelrecht ein und holten nur noch 15 Zähler.<br />
<br />
43 Punkte reichten am Ende für den 11. Tabellenplatz, ein mehr
als respektables Ergebnis für einen Aufsteiger.<br />
<br />

Stefan Behnel · May 28, 2008

Marco said:
Hi guys,

.... and girls?

I'm using the Python library BeautifulSoup (BS) to parse some
information out of a German soccer website.

Consider using lxml.

http://codespeak.net/lxml

I want to parse the article shown on the website.

2007/index.php?f=94820.php")

To do so I want to
use the Tag " <div class="txt_fliesstext">" as a starting-point.

>>> div = tree.xpath('//div[@class = "txt_fliesstext"]')
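The line that loads the page is cut off in this copy of the reply, so here is a minimal sketch of the same lookup against an inline sample instead. The sample markup is modeled on the snippet from the question (with closing tags added), and using html.fromstring() on a string is an assumption made for illustration, not necessarily what the reply did. Note that xpath() returns a list, so the first match has to be picked out explicitly.

```python
from lxml import html

# Sample markup modeled on the snippet in the question; the closing tags
# and the shortened text are assumptions made for this sketch.
SAMPLE = """<div id="area_fliesstext">
<div class="txt_fliesstext_bold">Mit 28 Punkten stand der KSC auf Platz 6.</div>
<br><br>
<div class="txt_fliesstext">Doch in der Rueckrunde brachen die Badener ein.<br />
<br />
43 Punkte reichten am Ende.<br />
</div>
</div>"""

root = html.fromstring(SAMPLE)

# The predicate compares the whole class attribute, so the
# "txt_fliesstext_bold" div is not matched.
matches = root.xpath('//div[@class = "txt_fliesstext"]')
div = matches[0]

print(len(matches), div.tag, div.get("class"))
```

For the live page, the same XPath expression should apply once the document has been parsed from the URL.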

When
I have found the Tag I somehow want to get all following "br"-Tags

Following? Meaning: after the div?

Or within the div?
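The two readings of "following" map to two different XPath expressions. The sketch below shows both on a small made-up sample (the markup and the "txt_other" class are hypothetical): following-sibling::br selects "br" elements that come after the div on the same level, while .//br selects the ones nested inside it.

```python
from lxml import html

# Hypothetical sample: three <br> inside the div, one <br> after it.
SAMPLE = """<div id="area_fliesstext">
<div class="txt_fliesstext">First paragraph.<br /><br />Second paragraph.<br /></div>
<br />
<div class="txt_other">Unrelated box.</div>
</div>"""

root = html.fromstring(SAMPLE)
div = root.xpath('//div[@class = "txt_fliesstext"]')[0]

# <br> elements that are later siblings of the div ("after the div")
after_div = div.xpath("following-sibling::br")

# <br> elements anywhere below the div ("within the div")
within_div = div.xpath(".//br")

print(len(after_div), len(within_div))
```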

until a new CSS class comes up.

Ok, that's different.
...     if el.tag == "br":
...         print el.text # or whatever
...     elif el.tag == "span" or el.get("class"):
...         break
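The loop header did not survive in this copy, so the following is only a sketch of how the idea could look end to end, not a reconstruction of the original code. It assumes the loop walks the descendants of the div with iterdescendants(), and it runs on a made-up sample (the span and its "txt_other" class are hypothetical). One detail worth knowing: in lxml the text that follows a "br" is stored in that element's .tail, and .text of a "br" is None, so the sketch collects .tail.

```python
from lxml import html

# Hypothetical sample: article text separated by <br>, followed by an
# element with a different class that should end the collection.
SAMPLE = """<div id="area_fliesstext">
<div class="txt_fliesstext">First paragraph.<br /><br />Second paragraph.<br /><span class="txt_other">Caption</span>Trailing text.</div>
</div>"""

root = html.fromstring(SAMPLE)
div = root.xpath('//div[@class = "txt_fliesstext"]')[0]

paragraphs = []

# Text before the first child element belongs to the div itself.
if div.text and div.text.strip():
    paragraphs.append(div.text.strip())

# Assumption: walk everything below the div in document order.
for el in div.iterdescendants():
    if el.tag == "br":
        # Text after a <br> is the tail of that <br> element.
        if el.tail and el.tail.strip():
            paragraphs.append(el.tail.strip())
    elif el.tag == "span" or el.get("class"):
        # A span or any element carrying a class ends the article text.
        break

print(paragraphs)
```

If the article text on the real page continues in sibling elements rather than inside the div, iterating with div.itersiblings() instead would be the variant to try.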

Hope it helps.

Stefan
