What's the best way to parse this HTML tag?


John Salerno

I'm using Beautiful Soup to extract some song information from a radio
station's website that lists the songs it plays as it plays them.
Getting the time that the song is played is easy, because the time is
wrapped in a <div> tag all by itself with a class attribute that has a
specific value I can search for. But the actual song title and artist
information is harder, because the HTML isn't quite as precise. Here's
a sample:

<div class="cmPlaylistContent">
<strong>
<a href="/lsp/t2995/">
Love Without End, Amen
</a>
</strong>
<br/>
<a href="/lsp/a436/">
George Strait
</a>
<br/>
<span class="sprite iconDownload">
</span>
Download Song:
<a href="http://itunes.apple.com/us/album/love-without-end-amen/id71416?i=71404&amp;uo=4">
iTunes
</a>
|
<a href="http://www.amazon.com/Love-Without-End-Amen/dp/B000V638BQ?SubscriptionId=1NXYFBZST44V8CCDK182&amp;tag=coxradiointer-20&amp;linkCode=xm2&amp;camp=2025&amp;creative=165953&amp;creativeASIN=B000V638BQ">
Amazon MP3
</a>
<br/>
<span class="sprite iconComments">
Comments  (1)
</span>
<span class="sprite iconVoteUp">
Votes  (1)
</span>
</div>

This is about as far as I can drill down without getting TOO specific.
I simply find the <div> tags with the "cmPlaylistContent" class. This
tag contains both the song title and the artist name, and sometimes
miscellaneous other information as well, like a way to vote for the
song or links to purchase it from iTunes or Amazon.

So my question is, given the above HTML, how can I best extract the
song title and artist name? It SEEMS like they are always the first
two pieces of information in the tag, such that:

for item in div.stripped_strings:
    print(item)

Love Without End, Amen
George Strait
Download Song:
iTunes
|
Amazon MP3
Comments  (1)
Votes  (1)

and I could simply get the first two items returned by that generator.
It's not quite as clean as I'd like, because I have no idea if
anything could ever be inserted before either of these items, thus
messing it all up.
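
For reference, a minimal sketch of the "first two strings" idea described above. It assumes Beautiful Soup 4 (the bs4 package) with the built-in "html.parser" backend; the helper name first_two_strings and the trimmed SAMPLE markup are made up for illustration. The length check is there so a short or empty <div> returns None instead of silently yielding a wrong pair.

```python
from itertools import islice

from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the post (illustrative only).
SAMPLE = """
<div class="cmPlaylistContent">
<strong><a href="/lsp/t2995/">Love Without End, Amen</a></strong>
<br/>
<a href="/lsp/a436/">George Strait</a>
<br/>
<span class="sprite iconDownload"></span>
Download Song:
</div>
"""


def first_two_strings(div):
    """Return (title, artist) from the first two non-empty strings, or None."""
    # stripped_strings is a generator; islice takes at most two items from it.
    strings = list(islice(div.stripped_strings, 2))
    if len(strings) < 2:
        return None
    return tuple(strings)


soup = BeautifulSoup(SAMPLE, "html.parser")
div = soup.find("div", class_="cmPlaylistContent")
result = first_two_strings(div)
print(result)
```

This relies purely on position, so it has exactly the weakness raised above: anything inserted ahead of the title shifts both values.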

I also don't want to rely on the <strong> tag, which makes me shudder,
or the <a> tag, because I don't know if they will always have an href.
Ideally, the <a> tags would also have had an attribute that labeled the
title as the title and the artist as the artist, but alas.....

Therefore, I appeal to your greater wisdom in these matters. Given
this HTML, is there a "best practice" for how to refer to the song
title and artist?

Thanks!
 
Roy Smith

John Salerno said:
Getting the time that the song is played is easy, because the time is
wrapped in a <div> tag all by itself with a class attribute that has a
specific value I can search for. But the actual song title and artist
information is harder, because the HTML isn't quite as precise. Here's
a sample:

<div class="cmPlaylistContent">
<strong>
<a href="/lsp/t2995/">
Love Without End, Amen
</a>
</strong>
<br/>
<a href="/lsp/a436/">
George Strait
</a>
[...]
Therefore, I appeal to your greater wisdom in these matters. Given
this HTML, is there a "best practice" for how to refer to the song
title and artist?

Obviously, any attempt at screen scraping is fraught with peril.
Beautiful Soup is a great tool but it doesn't negate the fact that
you've made a pact with the devil. That being said, if I had to guess,
here's your puppy:
<a href="/lsp/t2995/">
Love Without End, Amen
</a>

the thing to look for is an "a" element with an href that starts with
"/lsp/t", where "t" is for "track". Likewise:
<a href="/lsp/a436/">
George Strait
</a>

an href starting with "/lsp/a" is probably an artist link.

You owe the Oracle three helpings of tag soup.
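
A rough sketch of that href-prefix approach, assuming Beautiful Soup 4 with the "html.parser" backend. The "/lsp/t" and "/lsp/a" prefixes come from the sample above and are a guess about the site's URL scheme, not a documented contract; the helper name link_text is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Assumed URL scheme, inferred from the one sample in this thread.
TRACK_HREF = re.compile(r"^/lsp/t")
ARTIST_HREF = re.compile(r"^/lsp/a")

SAMPLE = """
<div class="cmPlaylistContent">
<strong><a href="/lsp/t2995/">Love Without End, Amen</a></strong>
<br/>
<a href="/lsp/a436/">George Strait</a>
<br/>
Download Song:
<a href="http://itunes.apple.com/us/album/love-without-end-amen/id71416">iTunes</a>
</div>
"""


def link_text(div, pattern):
    """Return the text of the first <a> whose href matches pattern, else None."""
    # Passing a compiled regex as the href filter matches it against the
    # attribute value; <a> tags with no href at all are skipped.
    link = div.find("a", href=pattern)
    return link.get_text(strip=True) if link else None


soup = BeautifulSoup(SAMPLE, "html.parser")
div = soup.find("div", class_="cmPlaylistContent")
title = link_text(div, TRACK_HREF)
artist = link_text(div, ARTIST_HREF)
print(title, "/", artist)
```

Because the patterns are anchored at the start of the href, the iTunes and Amazon links are not mistaken for track or artist links.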
 
John Salerno

Roy Smith said:
Obviously, any attempt at screen scraping is fraught with peril.
Beautiful Soup is a great tool but it doesn't negate the fact that
you've made a pact with the devil.  That being said, if I had to guess,
here's your puppy:
  <a href="/lsp/t2995/">
   Love Without End, Amen
  </a>

the thing to look for is an "a" element with an href that starts with
"/lsp/t", where "t" is for "track".  Likewise:
 <a href="/lsp/a436/">
  George Strait
 </a>

an href starting with "/lsp/a" is probably an artist link.

You owe the Oracle three helpings of tag soup.

Well, I had considered exactly that method, but I don't know for sure
if the titles and names will always have links like that, so I didn't
want to tie my programming to something so specific. But perhaps it's
still better than just taking the first two strings.
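
One way to hedge between the two ideas, sketched under the same assumptions (Beautiful Soup 4, "html.parser", and the guessed "/lsp/t" and "/lsp/a" prefixes): try the links first, fall back to the first two strings, and report which strategy produced the answer so a change in the markup is noticed rather than hidden. The function name extract_song and the method labels are hypothetical.

```python
import re

from bs4 import BeautifulSoup


def extract_song(div):
    """Return (title, artist, method), where method names the strategy used."""
    track = div.find("a", href=re.compile(r"^/lsp/t"))
    artist = div.find("a", href=re.compile(r"^/lsp/a"))
    if track and artist:
        return track.get_text(strip=True), artist.get_text(strip=True), "href"

    # Fallback: positional guess, used only when the links are missing.
    strings = list(div.stripped_strings)
    if len(strings) >= 2:
        return strings[0], strings[1], "position"

    return None, None, "failed"


linked = BeautifulSoup(
    '<div class="cmPlaylistContent">'
    '<strong><a href="/lsp/t2995/">Love Without End, Amen</a></strong><br/>'
    '<a href="/lsp/a436/">George Strait</a><br/></div>',
    "html.parser",
).div
unlinked = BeautifulSoup(
    '<div class="cmPlaylistContent">'
    "<strong>Some Title</strong><br/>Some Artist<br/></div>",
    "html.parser",
).div

print(extract_song(linked))
print(extract_song(unlinked))
```

Logging the method value over time would show how often the fallback fires, which is a cheap signal that the site has changed its markup.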
 
Roy Smith

John Salerno said:
Well, I had considered exactly that method, but I don't know for sure
if the titles and names will always have links like that, so I didn't
want to tie my programming to something so specific. But perhaps it's
still better than just taking the first two strings.

Such is the nature of screen scraping. For the most part, web pages are
not meant to be parsed. If you decide to go down the road of trying to
extract data from them, all bets are off. You look at the markup, take
your best guess, and go for it.

There's no magic here. Nobody can look at this HTML and come up with
some hard and fast rule for how you're supposed to parse it. And, even
if they could, it's all likely to change tomorrow when the site rolls
out their next UI makeover.
 
