fetching webpage and extracting contents

Discussion in 'Perl Misc' started by alfonsobaldaserra, Oct 4, 2010.

  1. hello

    i am trying to write a script which will go to bbc's top 40 pages and
    show only intended contents.

    i have written a script

    #!/usr/bin/perl

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);
    $ua->env_proxy;

    my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

    if ($res->is_success) {
    open my $bbc, ">", "bbc.txt" or die "$!\n";
    print $bbc $res->decoded_content;
    close $bbc;
    } else {
    die "could not fetch bbc.co.uk\n";
    }

    open my $bbc, "<", "bbc.txt";
    while (<$bbc>) {
    print if m!<span class="artist">(.*)</span>!;
    print if m!<span class="track">(.*)</span>!;
    #next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
    #my ($foo) =~ m!<span class="artist">(.*)</span>!;
    #my ($bar) =~ m!<span class="track">(.*)</span>!;
    # print "$foo -> $bar\n";
    }

    __RESULT__
    <span class="artist">Tinie Tempah</span>
    <span class="track">Written In The Stars</span>
    <span class="artist">Bruno Mars</span>
    <span class="track">Just The Way You Are (Amazing)</span>
    <span class="artist">Labrinth</span>
    <span class="track">Let The Sun Shine</span>
    <span class="artist">Adele</span>
    <span class="track">Make You Feel My Love</span>
    <span class="artist">Taio Cruz</span>
    <span class="track">Dynamite</span>



    but i can't figure out

    #1 how to parse $res->decoded_content without writing it to a file
    because apparently the whole page is a single string

    #2 how to show data in artist - track format, like
    Tinie Tempah - Written In The Stars

    #3 how to make this work
    #next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
    #my ($foo) =~ m!<span class="artist">(.*)</span>!;
    #my ($bar) =~ m!<span class="track">(.*)</span>!;
    # print "$foo -> $bar\n"

    appreciate your time gents.

    salute :)
     
    alfonsobaldaserra, Oct 4, 2010
    #1

  2. > #1 how to parse $res->decoded_content without writing it to a file
    > because apparently the whole page is a single string


    got it fixed by opening a fh to $res->decoded_content

    > #2 how to show data in artist - track format, like
    > Tinie Tempah - Written In The Stars



    so the new code is

    #!/usr/bin/perl

    use strict;
    #use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);
    $ua->env_proxy;

    my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

    if ($res->is_success) {
        open my $bbc, "<", \$res->decoded_content or die "$!\n";
        while (defined (my $con = <$bbc>)) {
            chomp $con;
            next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
            my ($artist) = $con =~ m!<span class="artist">(.*?)</span>!;
            my ($track) = $con =~ m!<span class="track">(.*?)</span>!;
            print "$artist - $track\n";
        }

    } else {
        die "could not fetch bbc.co.uk\n";
    }


    but the output is coming as

    Tinie Tempah -
    - Written In The Stars
    Bruno Mars -
    - Just The Way You Are (Amazing)
    Labrinth -
    - Let The Sun Shine
    Adele -
    - Make You Feel My Love

    while it should have been

    Tinie Tempah - Written In The Stars
    Bruno Mars - Just The Way You Are (Amazing)
    Labrinth - Let The Sun Shine
    Adele - Make You Feel My Love

    i cant figure out why this is happening.

    any help guys?

    thanku :)
     
    alfonsobaldaserra, Oct 5, 2010
    #2
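A note on the split output above: each line of the page holds either an artist span or a track span, never both, so on every pass one of the two captures is undefined (and with `use warnings` commented out, nothing complains). A minimal sketch of one way around it, assuming the artist line always comes before its track line; `pair_lines` and the sample string are illustrative stand-ins, not part of the original script:

```perl
use strict;
use warnings;

# Pair each artist line with the track line that follows it.
# $content stands in for $res->decoded_content (sample data, no live fetch).
sub pair_lines {
    my ($content) = @_;
    my @pairs;
    my $artist;
    for my $line ( split /\n/, $content ) {
        if ( $line =~ m!<span class="artist">(.*?)</span>! ) {
            $artist = $1;    # remember it until the track shows up
        }
        elsif ( defined $artist
            && $line =~ m!<span class="track">(.*?)</span>! )
        {
            push @pairs, "$artist - $1";
            undef $artist;    # wait for the next artist
        }
    }
    return @pairs;
}

my $content = join "\n",
    '<span class="artist">Tinie Tempah</span>',
    '<span class="track">Written In The Stars</span>',
    '<span class="artist">Bruno Mars</span>',
    '<span class="track">Just The Way You Are (Amazing)</span>';

print "$_\n" for pair_lines($content);
```

This still depends on the page keeping one span per line, so it is a stopgap rather than a robust parser.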

  3. i got a real bad code working :)

    #!/usr/bin/perl

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);
    $ua->env_proxy;

    my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

    if ($res->is_success) {
        open my $bbc, "<", \$res->decoded_content or die "$!\n";
        while (defined (my $con = <$bbc>)) {
            chomp $con;
            next if $con =~ /^\s*$/;
            next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
            $con =~ s/^\s*|\s*$//g;
            if ($con =~ m!<span class="artist">(.*)</span>!) {
                print $1, " - ";
            } elsif ($con =~ m!<span class="track">(.*)</span>!) {
                print $1, "\n";
            }
        }
    }


    thank you gents for giving me a chance to do it myself.

    though i am still looking for any improvements that you could
    suggest :)
     
    alfonsobaldaserra, Oct 5, 2010
    #3
  4. alfonsobaldaserra <> writes:

    > i got a real bad code working :)
    >
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    > use LWP::UserAgent;
    >
    > my $ua = LWP::UserAgent->new;
    > $ua->timeout(10);
    > $ua->env_proxy;
    >
    > my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");
    >
    > if ($res->is_success) {
    > open my $bbc, "<", \$res->decoded_content or die "$!\n";


    Don't do this. While possible, it is kind of obscure and should, in my
    opinion, only be used when an existing interface requires a Perl file
    handle.

    Just split the content on newlines if you want to iterate over the
    lines.
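A minimal sketch of the split-on-newlines suggestion, with no file handle involved. The helper name `content_lines` and the sample string are made up for illustration; in the script above the argument would be `$res->decoded_content`:

```perl
use strict;
use warnings;

# Return the non-blank lines of a string, trimmed of surrounding whitespace.
sub content_lines {
    my ($content) = @_;
    my @lines;
    for my $line ( split /\r?\n/, $content ) {
        $line =~ s/^\s+|\s+$//g;    # trim both ends
        next if $line eq '';        # skip blank lines
        push @lines, $line;
    }
    return @lines;
}

# Sample data standing in for the fetched page.
my $content = qq{  <span class="artist">Labrinth</span>\n\n}
            . qq{  <span class="track">Let The Sun Shine</span>\n};

print "$_\n" for content_lines($content);
```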

    > while (defined (my $con = <$bbc>)) {
    > chomp $con;
    > next if $con =~ /^\s*$/;
    > next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
    > $con =~ s/^\s*|\s*$//g;
    > if ($con =~ m!<span class="artist">(.*)</span>!) {
    > print $1, " - ";
    > } elsif ($con =~ m!<span class="track">(.*)</span>!) {
    > print $1, "\n";
    > }


    Don't parse HTML by throwing naive regexes at the problem. This would
    fail horribly if the BBC decided to remove unneeded newlines from their
    content.
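To make the failure concrete, here is a small sketch that runs the thread's artist regex over the same chart entry twice: once with one span per line, once with the newline removed. The `artists` helper and both sample strings are illustrative only:

```perl
use strict;
use warnings;

# The same chart entry: one span per line, then with the newline removed.
my $pretty   = qq{<span class="artist">Adele</span>\n}
             . qq{<span class="track">Make You Feel My Love</span>\n};
my $minified = qq{<span class="artist">Adele</span>}
             . qq{<span class="track">Make You Feel My Love</span>};

# Apply the line-oriented artist regex from the thread to every line.
sub artists {
    my ($content) = @_;
    my @found;
    for my $line ( split /\n/, $content ) {
        push @found, $1 if $line =~ m!<span class="artist">(.*)</span>!;
    }
    return @found;
}

print "pretty:   $_\n" for artists($pretty);
# pretty:   Adele
print "minified: $_\n" for artists($minified);
# minified: Adele</span><span class="track">Make You Feel My Love
```

The greedy `(.*)` runs on to the last `</span>` on the line. A non-greedy `(.*?)` would rescue this particular case, but the if/elsif logic would then never see the track, since both spans now share a line.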

    > }
    > }


    I would rather use one of the existing HTML parsing modules. One
    option could be HTML::TreeBuilder. Based on a quick read of the
    documentation it would look something like this:

    my $html = HTML::TreeBuilder->new_from_content( $res->decoded_content );
    for my $tag ( $html->find('span') ) {
        my $class = $tag->attr('class') // '';   # not every span has a class

        if ( $class eq 'artist' ) {
            ...;
        } elsif ( $class eq 'track' ) {
            ...;
        }
    }

    This would be a much more robust solution. (But I don't parse HTML in
    my day to day work, so I might not be up to date on the current set of
    HTML parsers.)

    //Makholm
     
    Peter Makholm, Oct 5, 2010
    #4
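For anyone looking for a filled-in version of the HTML::TreeBuilder outline, here is one possible sketch. It assumes HTML::TreeBuilder is installed from CPAN and that each artist span is followed by its track span in document order; `chart_entries` and the sample markup are illustrative, and a real script would pass in `$res->decoded_content`:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Walk every <span> in document order and pair each artist with the
# next track that follows it.
sub chart_entries {
    my ($html_text) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html_text);
    my ( @entries, $artist );
    for my $tag ( $tree->find('span') ) {
        my $class = $tag->attr('class') // '';    # spans without a class are ignored
        if ( $class eq 'artist' ) {
            $artist = $tag->as_trimmed_text;
        }
        elsif ( $class eq 'track' && defined $artist ) {
            push @entries, "$artist - " . $tag->as_trimmed_text;
            undef $artist;
        }
    }
    $tree->delete;    # release the parse tree
    return @entries;
}

# Sample markup on a single line, the case that trips up the regex version.
my $sample = '<p><span class="artist">Taio Cruz</span>'
           . '<span class="track">Dynamite</span></p>';

print "$_\n" for chart_entries($sample);
```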
  5. alfonsobaldaserra

    Guest

    On Tue, 5 Oct 2010 01:13:03 -0700 (PDT), alfonsobaldaserra <> wrote:

    >i got a real bad code working :)
    >
    >#!/usr/bin/perl
    >
    >use strict;
    >use warnings;
    >use LWP::UserAgent;
    >
    >my $ua = LWP::UserAgent->new;
    >$ua->timeout(10);
    >$ua->env_proxy;
    >
    >my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");
    >
    >if ($res->is_success) {
    > open my $bbc, "<", \$res->decoded_content or die "$!\n";
    > while (defined (my $con = <$bbc>)) {
    > chomp $con;
    > next if $con =~ /^\s*$/;
    > next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
    > $con =~ s/^\s*|\s*$//g;
    > if ($con =~ m!<span class="artist">(.*)</span>!) {
    > print $1, " - ";
    > } elsif ($con =~ m!<span class="track">(.*)</span>!) {
    > print $1, "\n";
    > }
    > }
    >}
    >
    >
    >thank you gents for giving me a chance to do it myself.
    >
    >though i am still looking for any improvements that you could
    >suggest :)


    Along the lines of what you are doing, something like below.
    -sln
    -----------
    use strict;
    use warnings;

    my $string =<<EOHTML;
    <html>
    <span class="artist">
    Tinie Tempah
    </span>
    <span class="track">
    Written In The Stars
    </span>
    <span class="artist"> Bruno Mars </span>
    <span class="track">Just The Way You Are (Amazing)</span>
    <span class="artist">
    Labrinth</span>
    <span class="track">Let The Sun Shine
    </span>
    <span class="track">A song by Labrinth</span>
    <span class="artist">Adele </span>
    <span class="track">Make You Feel My Love</span>
    <span class="artist">Taio Cruz</span>
    <span class="track">Dynamite</span>
    </html>
    EOHTML
    my $artist;

    while ( $string =~
    / <span \s+ class \s* = \s* ['"]\s* (artist|track) \s*['"] \s* >
    \s* (.*?) \s*
    <\/span\s*>
    /xsig )
    {
    if ($1 eq 'artist') {
    $artist = $2;
    }
    else {
    if (length $artist) {
    print "$artist - $2\n";
    }
    $artist = '';
    }
    }
    print "\n";

    ## Alternate -
    ##

    $artist = '';
    my %tracks;

    while ( $string =~
    / <span \s+ class \s* = \s* ['"]\s* (artist|track) \s*['"] \s* >
    \s* (.*?) \s*
    <\/span\s*>
    /xsig )
    {
    if ($1 eq 'artist') {
    $artist = $2;
    }
    else {
    push @{ $tracks{$artist} }, $2;
    }
    }

    for $artist (sort keys %tracks) {
    print "\n$artist\n";
    for my $track ( sort @{ $tracks{$artist} } ) {
    print " - $track\n"
    }
    }
     
    , Oct 5, 2010
    #5
  6. thank you for such beautiful codes sln.

    though i am inclined towards peter's advice to use html parsers.
    unfortunately, i couldn't get your code to work due to lack of usage
    examples of html::treebuilder online.

    does anybody happen to know a good html parser with some good examples
    online?
     
    alfonsobaldaserra, Oct 6, 2010
    #6
  7. Peter Makholm, Oct 6, 2010
    #7
  8. > Huh?
    >
    > http://www.perlmonks.org/?node_id=2...roup/comp.lang.perl.misc/msg/372b363f0e9be360
    >
    > //Makholm


    thank you guys :)

    i finally utilised perlmonks link, read a little at cpan and here i am

    #!/usr/bin/perl

    use strict;
    use warnings;
    use HTML::Tree;
    use LWP::Simple;

    my $uri = "http://www.bbc.co.uk/radio1/chart/singles";

    my $html = get($uri);
    my $tree = HTML::Tree->new();
    $tree->parse($html);

    my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
    my @track = $tree->look_down('_tag' , 'span', 'class', 'track');

    foreach my $i (0..$#artist) {
    print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
    }


    again i am wondering if there is a better way to group these two
    arrays together instead of the way i did

    foreach my $i (0..$#artist) {
    print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
    }

    thank you
     
    alfonsobaldaserra, Oct 21, 2010
    #8
  9. alfonsobaldaserra <> writes:

    > my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
    > my @track = $tree->look_down('_tag' , 'span', 'class', 'track');
    >
    > foreach my $i (0..$#artist) {
    > print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
    > }
    >
    > again i am wondering if there is a better way to group these two
    > arrays together instead of the way i did


    It all depends on the HTML. But looking at the URL you posted, it looks
    like you're looking for a structure like this:

    <a class="artist-link" href="/music/artists/ba7d2626-38ce-4859-8495-bdb5732715c4" id="link-13">
    <span class="artist">Taio Cruz</span>
    <span class="track">Dynamite</span>
    </a>

    What you could do is iterate over all the <a class="artist-link">
    nodes and then look for the artist and track below each
    node. Untested, but something like this:

    for my $link ( $tree->look_down(_tag => 'a', class => 'artist-link') ) {
    my $artist = $link->look_down(class => 'artist')->as_text;
    my $track = $link->look_down(class => 'track' )->as_text;

    print "$artist - $track\n";
    }

    //Makholm
     
    Peter Makholm, Oct 21, 2010
    #9
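A self-contained variant of the per-link approach, as a sketch only. It assumes HTML::TreeBuilder from CPAN; `linked_entries` is a hypothetical helper, the first link in the sample reuses the markup quoted above, and the second (with a placeholder href) is invented to show what happens when a span is missing. `look_down` in scalar context returns the first match or undef, so both results are checked before calling a method on them:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Collect "artist - track" from each <a class="artist-link"> node,
# skipping links that lack either span.
sub linked_entries {
    my ($html_text) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html_text);
    my @entries;
    for my $link ( $tree->look_down( _tag => 'a', class => 'artist-link' ) ) {
        my $artist = $link->look_down( class => 'artist' );
        my $track  = $link->look_down( class => 'track' );
        next unless defined $artist && defined $track;
        push @entries,
            $artist->as_trimmed_text . ' - ' . $track->as_trimmed_text;
    }
    $tree->delete;    # release the parse tree
    return @entries;
}

my $sample = <<'HTML';
<a class="artist-link" href="/music/artists/ba7d2626-38ce-4859-8495-bdb5732715c4" id="link-13">
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>
</a>
<a class="artist-link" href="#">
<span class="artist">No Track Here</span>
</a>
HTML

print "$_\n" for linked_entries($sample);
```

Because `new_from_content` parses and finishes the document in one call, there is no separate `parse`/`eof` step to forget.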
  10. > for my $link ( $tree->look_down(_tag => 'a', class => 'artist-link') ) {
    >     my $artist = $link->look_down(class => 'artist')->as_text;
    >     my $track  = $link->look_down(class => 'track' )->as_text;
    >
    >     print "$artist - $track\n";
    >
    > }
    >
    > //Makholm


    thank you again makholm, your code worked sexily without any
    modification :)
     
    alfonsobaldaserra, Oct 21, 2010
    #10
