fetching webpage and extracting contents

alfonsobaldaserra · Oct 4, 2010

hello

i am trying to write a script which will go to bbc's top 40 pages and
show only intended contents.

i have written a script

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
    open my $bbc, ">", "bbc.txt" or die "$!\n";
    print $bbc $res->decoded_content;
    close $bbc;
} else {
    die "could not fetch bbc.co.uk\n";
}

open my $bbc, "<", "bbc.txt" or die "$!\n";
while (<$bbc>) {
    print if m!<span class="artist">(.*)</span>!;
    print if m!<span class="track">(.*)</span>!;
    #next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
    #my ($foo) =~ m!<span class="artist">(.*)</span>!;
    #my ($bar) =~ m!<span class="track">(.*)</span>!;
    # print "$foo -> $bar\n";
}

__RESULT__
Tinie Tempah
Written In The Stars
Bruno Mars
Just The Way You Are (Amazing)
Labrinth
Let The Sun Shine
Adele
Make You Feel My Love
Taio Cruz
Dynamite

but i can't figure out

#1 how to parse $res->decoded_content without writing it to a file
because apparently the whole page is a single string

#2 how to show data in artist - track format, like
Tinie Tempah - Written In The Stars

#3 how to make this work
#next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
#my ($foo) =~ m!<span class="artist">(.*)</span>!;
#my ($bar) =~ m!<span class="track">(.*)</span>!;
# print "$foo -> $bar\n"

appreciate your time gents.

salute

alfonsobaldaserra · Oct 5, 2010

#1 how to parse $res->decoded_content without writing it to a file

because apparently the whole page is a single string

got it fixed by opening a fh to $res->decoded_content

#2 how to show data in artist - track format, like
Tinie Tempah - Written In The Stars

so the new code is

#!/usr/bin/perl

use strict;
#use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
    open my $bbc, "<", \$res->decoded_content or die "$!\n";
    while (defined (my $con = <$bbc>)) {
        chomp $con;
        next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
        my ($artist) = $con =~ m!<span class="artist">(.*?)</span>!;
        my ($track) = $con =~ m!<span class="track">(.*?)</span>!;
        print "$artist - $track\n";
    }
} else {
    die "could not fetch bbc.co.uk\n";
}

but the output is coming as

Tinie Tempah -
- Written In The Stars
Bruno Mars -
- Just The Way You Are (Amazing)
Labrinth -
- Let The Sun Shine
Adele -
- Make You Feel My Love

while it should have been

Tinie Tempah - Written In The Stars
Bruno Mars - Just The Way You Are (Amazing)
Labrinth - Let The Sun Shine
Adele - Make You Feel My Love

i can't figure out why this is happening.

any help guys?

thanku

alfonsobaldaserra · Oct 5, 2010

i got a real bad code working

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
    open my $bbc, "<", \$res->decoded_content or die "$!\n";
    while (defined (my $con = <$bbc>)) {
        chomp $con;
        next if $con =~ /^\s*$/;
        next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
        $con =~ s/^\s*|\s*$//g;
        if ($con =~ m!<span class="artist">(.*)</span>!) {
            print $1, " - ";
        } elsif ($con =~ m!<span class="track">(.*)</span>!) {
            print $1, "\n";
        }
    }
}

thank you gents for giving me a chance to do it myself.

though i am still looking for any improvements that you could
suggest

Peter Makholm · Oct 5, 2010

alfonsobaldaserra said:
i got a real bad code working

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
open my $bbc, "<", \$res->decoded_content or die "$!\n";

Don't do this. While possible, it is kind of obscure and should, in my
opinion, only be used when an existing interface requires a Perl file
handle.

Just split the content on newlines if you want to iterate over the
lines.
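
A minimal sketch of the split approach. The sample string stands in for $res->decoded_content so it runs offline; the span markup and the helper name content_lines are assumptions made for illustration, not the real page:

```perl
use strict;
use warnings;

# Hypothetical stand-in for $res->decoded_content.
my $content = join "\n",
    '<span class="artist">Tinie Tempah</span>',
    '',
    '<span class="track">Written In The Stars</span>',
    '';

# Split the string into lines directly; no filehandle is needed.
# Blank and whitespace-only lines are dropped, CRLF endings are tolerated.
sub content_lines {
    my ($text) = @_;
    return grep { /\S/ } split /\r?\n/, $text;
}

my @lines = content_lines($content);
print "$_\n" for @lines;
```

The loop body from the original script can then run over @lines unchanged.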

    while (defined (my $con = <$bbc>)) {
        chomp $con;
        next if $con =~ /^\s*$/;
        next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
        $con =~ s/^\s*|\s*$//g;
        if ($con =~ m!<span class="artist">(.*)</span>!) {
            print $1, " - ";
        } elsif ($con =~ m!<span class="track">(.*)</span>!) {
            print $1, "\n";
        }

Don't parse HTML by throwing naive regexes at the problem. This would
fail horribly if the BBC decided to remove unneeded newlines from their
content.

}
}

I would rather use one of the existing HTML parsing modules. One
option could be HTML::TreeBuilder. Based on a quick read of the
documentation, it would look something like this:

use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_content( $res->decoded_content );
for my $tag ( $html->find('span') ) {
    # attr() returns undef for a span without a class attribute
    my $class = $tag->attr('class') // '';

    if ( $class eq 'artist' ) {
        ...;
    } elsif ( $class eq 'track' ) {
        ...;
    }
}

This would be a much more robust solution. (But I don't parse HTML in
my day-to-day work, so I might not be up to date on the current set of
HTML parsers.)
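
To make the newline point concrete, here is a hedged, filled-in version of that loop run against a made-up sample. The markup is an assumption modelled on the class names in this thread, and chart_lines is a hypothetical helper name. The first pair sits on a single line, which would defeat a line-by-line regex but does not matter to the parser:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; the real page may be laid out differently.
my $sample = <<'EOHTML';
<html><body>
<span class="artist">Tinie Tempah</span><span class="track">Written In The Stars</span>
<span class="artist">Bruno Mars</span>
<span class="track">Just The Way You Are (Amazing)</span>
</body></html>
EOHTML

# Walk all <span> elements in document order and pair each artist
# with the track span that follows it.
sub chart_lines {
    my ($markup) = @_;
    my $html = HTML::TreeBuilder->new_from_content($markup);
    my ( @out, $artist );
    for my $tag ( $html->find('span') ) {
        my $class = $tag->attr('class') // '';    # spans without a class
        if ( $class eq 'artist' ) {
            $artist = $tag->as_text;
        }
        elsif ( $class eq 'track' && defined $artist ) {
            push @out, "$artist - " . $tag->as_text;
            undef $artist;
        }
    }
    $html->delete;    # free the parse tree explicitly
    return @out;
}

print "$_\n" for chart_lines($sample);
```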

//Makholm

sln · Oct 5, 2010

Along the lines of what you are doing, something like below.
-sln
-----------
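use strict;
use warnings;

my $string =<<EOHTML;
<html>

<span class="artist">Tinie Tempah</span>


<span class="track">Written In The Stars</span>

<span class="artist"> Bruno Mars </span>
<span class="track">Just The Way You Are (Amazing)</span>

<span class="artist">Labrinth</span>
<span class="track">Let The Sun Shine</span>

A song by Labrinth
<span class="artist">Adele </span>
<span class="track">Make You Feel My Love</span>
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>
</html>
EOHTML
my $artist = '';

# $1 is the class name (artist or track), $2 the trimmed span text
while ( $string =~
    / <span \s+ class \s* = \s* "(artist|track)" \s* >
        \s* (.*?) \s*
      <\/span\s*>
    /xsig )
{
    if ($1 eq 'artist') {
        $artist = $2;
    }
    else {
        if (length $artist) {
            print "$artist - $2\n";
        }
        $artist = '';
    }
}
print "\n";

## Alternate -
##

$artist = '';
my %tracks;

while ( $string =~
    / <span \s+ class \s* = \s* "(artist|track)" \s* >
        \s* (.*?) \s*
      <\/span\s*>
    /xsig )
{
    if ($1 eq 'artist') {
        $artist = $2;
    }
    else {
        push @{ $tracks{$artist} }, $2;
    }
}

for $artist (sort keys %tracks) {
    print "\n$artist\n";
    for my $track ( sort @{ $tracks{$artist} } ) {
        print " - $track\n";
    }
}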

alfonsobaldaserra · Oct 6, 2010

thank you for such beautiful code, sln.

though i am inclined towards peter's advice to use html parsers.
unfortunately, i couldn't get your code to work due to lack of usage
examples of HTML::TreeBuilder online.

does anybody happen to know a good html parser with some good examples
online?

Peter Makholm · Oct 6, 2010

alfonsobaldaserra said:
though i am inclined towards peter's advise to use html parsers.
unfortunately, i couldn't get your code to work due to lack of usage
examples of html::treebuilder online.

Huh?

http://www.perlmonks.org/?node_id=280461
http://search.cpan.org/perldoc?HTML::TreeBuilder
http://groups.google.com/group/comp.lang.perl.misc/msg/372b363f0e9be360

//Makholm

alfonsobaldaserra · Oct 21, 2010

thank you guys

i finally used the perlmonks link, read a little on cpan, and here i am

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Tree;
use LWP::Simple;

my $uri = "http://www.bbc.co.uk/radio1/chart/singles";

my $html = get($uri) or die "could not fetch $uri\n";
my $tree = HTML::Tree->new();
$tree->parse($html);
$tree->eof;    # signal that the whole document has been parsed

my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
my @track = $tree->look_down('_tag' , 'span', 'class', 'track');

foreach my $i (0..$#artist) {
    print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}

again i am wondering if there is a better way to group these two
arrays together instead of the way i did

foreach my $i (0..$#artist) {
    print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}

thank you

Peter Makholm · Oct 21, 2010

alfonsobaldaserra said:
my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
my @track = $tree->look_down('_tag' , 'span', 'class', 'track');

foreach my $i (0..$#artist) {
print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}

again i am wondering if there is a better way to group these two
arrays together instead of the way i did

It all depends on the HTML. But looking at the URL you posted, it looks
like you're looking for a structure like this:

<a class="artist-link" href="/music/artists/ba7d2626-38ce-4859-8495-bdb5732715c4" id="link-13">
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>
</a>

What you could do is iterate over all the <a class="artist-link">
nodes and then look for the artist and track below each
node. Untested, but something like this:

for my $link ( $tree->look_down(_tag => 'a', class => 'artist-link') ) {
    my $artist = $link->look_down(class => 'artist')->as_text;
    my $track  = $link->look_down(class => 'track' )->as_text;

    print "$artist - $track\n";
}
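
A self-contained variant of the same idea, as a sketch only. The sample markup, the href values and the helper name pairs_from are invented for illustration, modelled on the artist-link structure described above. It also guards against a link that lacks one of the two nodes, since look_down returns undef in scalar context when nothing matches and calling as_text on that would die:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup; the second link deliberately has no track.
my $sample = <<'EOHTML';
<html><body>
<a class="artist-link" href="/music/artists/example-1">
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>
</a>
<a class="artist-link" href="/music/artists/example-2">
<span class="artist">Adele</span>
</a>
</body></html>
EOHTML

# Return "artist - track" strings, one per complete artist-link node.
sub pairs_from {
    my ($markup) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($markup);
    my @pairs;
    for my $link ( $tree->look_down( _tag => 'a', class => 'artist-link' ) ) {
        # Scalar context: first matching descendant, or undef if none.
        my $artist = $link->look_down( class => 'artist' );
        my $track  = $link->look_down( class => 'track' );
        next unless defined $artist && defined $track;   # skip incomplete entries
        push @pairs, $artist->as_text . ' - ' . $track->as_text;
    }
    $tree->delete;    # free the parse tree explicitly
    return @pairs;
}

print "$_\n" for pairs_from($sample);
```

Searching below each link keeps an artist and its track together even when an entry is incomplete, which two parallel arrays indexed by position cannot guarantee.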

//Makholm

alfonsobaldaserra · Oct 21, 2010

Peter Makholm said:
for my $link ( $tree->look_down(_tag => 'a', class => 'artist-link') ) {
    my $artist = $link->look_down(class => 'artist')->as_text;
    my $track  = $link->look_down(class => 'track' )->as_text;

    print "$artist - $track\n";
}

//Makholm

thank you again makholm, your code worked sexily without any
modification
