BeautifulSoup: problems with parsing a website


Marco Hornung

Hi guys,

I'm using the Python library BeautifulSoup (BS) to parse some
information out of a German soccer website.
I spent some quality time with the BS docs, but I couldn't really
figure out how to get what I was looking for.

Here's the deal:
I want to parse the article shown on the website. To do so, I want to
use the tag <div class="txt_fliesstext"> as a starting point. Once
I have found that tag, I want to get all the following "br" tags
until a new CSS class comes up.
I tried several options with findAll(), but nothing seems to work.
For example, soup.findAll('br', attrs={'class':'txt_fliesstext'}, text=True)
returns a thousand additional tags that I don't want, while
soup.findAll(attrs={'class':'txt_fliesstext'}) gives me a much better
result but only returns a few tags instead of all the tags I want.

Any suggestions?
Thanks in advance!

Website:
http://www.bundesliga.de/de/liga/news/2007/index.php?f=94820.php
Some HTML code from the website:
<div id="area_headline">
<div class="txt_headline_red">Erst Höhenflug, dann Absturz</div>
</div>
<div id="area_fliesstext">
<div class="txt_fliesstext_bold">Mit 28 Punkten stand der KSC
nach der Hinrunde sensationell auf Platz 6.</div>
<br><br>
<div class="txt_fliesstext">Doch in der Rückrunde brachen
die Badener regelrecht ein und holten nur noch 15 Zähler.<br />
<br />
43 Punkte reichten am Ende für den 11. Tabellenplatz, ein mehr
als respektables Ergebnis für einen Aufsteiger.<br />
<br />
 

Stefan Behnel

Marco said:

Hy guys,

.... and girls?

I'm using the python-framework BeautifulSoup(BS) to parse some
information out of a german soccer-website.

Consider using lxml:

http://codespeak.net/lxml

I want to parse the article shown on the website.
2007/index.php?f=94820.php")
To do so I want to
use the Tag " <div class="txt_fliesstext">" as a starting-point.
>>> div = tree.xpath('//div[@class = "txt_fliesstext"]')

When
I have found the Tag I somehow want to get all following "br"-Tags

Following? Meaning: after the div?

Or within the div?

until there is a new CSS-Class Style is coming up.

Ok, that's different.
...     if el.tag == "br":
...         print(el.tail)  # text after the <br>; a <br> has no .text of its own
...     elif el.tag == "span" or el.get("class"):
...         break
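
The idea above can be sketched end to end as follows. This is only a rough sketch under a few assumptions: it uses Python 3 with lxml.html, it reads "all following br tags" as the <br> children inside the div, and it runs on the HTML excerpt quoted in the question. The closing </div> tags, the SAMPLE constant and the helper name article_paragraphs are made up for illustration and are not part of the original page or post.

```python
from lxml import html

# Excerpt quoted in the question. The two closing </div> tags at the end
# are an assumption; the quoted snippet is cut off before them.
SAMPLE = """
<div id="area_fliesstext">
<div class="txt_fliesstext_bold">Mit 28 Punkten stand der KSC
nach der Hinrunde sensationell auf Platz 6.</div>
<br><br>
<div class="txt_fliesstext">Doch in der Rückrunde brachen
die Badener regelrecht ein und holten nur noch 15 Zähler.<br />
<br />
43 Punkte reichten am Ende für den 11. Tabellenplatz, ein mehr
als respektables Ergebnis für einen Aufsteiger.<br />
<br />
</div>
</div>
"""


def article_paragraphs(markup):
    """Return the text chunks of the first <div class="txt_fliesstext">.

    Hypothetical helper: collects the div's leading text plus the text
    after each <br> child, stopping at the first child that is not a <br>.
    """
    root = html.document_fromstring(markup)
    divs = root.xpath('//div[@class = "txt_fliesstext"]')
    if not divs:
        return []
    div = divs[0]

    # Text before the first child lives in .text; text after a child
    # element lives in that child's .tail.
    chunks = [div.text or ""]
    for el in div:
        if el.tag != "br":
            break  # some other element starts here, so stop collecting
        chunks.append(el.tail or "")

    # Collapse line breaks and runs of spaces, drop empty chunks.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


if __name__ == "__main__":
    for paragraph in article_paragraphs(SAMPLE):
        print(paragraph)
```

The exact-match test @class = "txt_fliesstext" skips the txt_fliesstext_bold div, which is why only the body text comes back. To run this against the live page, the markup would have to be fetched first and passed in as a string.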

Hope it helps.

Stefan
 
