Find string in web page

Discussion in 'Perl Misc' started by Kirk Larsen, Jul 9, 2003.

  1. Kirk Larsen

    Kirk Larsen Guest

    Sounds simple enough. I need to retrieve the source from a web page
    and then find a link in that web page that ends with a string which I
    have stored in a variable. Can someone please post or direct me to a
    sample of how to do this? Thanks!
    Kirk Larsen, Jul 9, 2003
    #1

  2. Greg Bacon

    Greg Bacon Guest

    In article <>,
    Kirk Larsen <> wrote:

    : Sounds simple enough. I need to retrieve the source from a web page
    : and then find a link in that web page that ends with a string which I
    : have stored in a variable. Can someone please post or direct me to a
    : sample of how to do this? Thanks!

    Try this on for size:

    % cat try
    #! /usr/local/bin/perl

    use strict;
    use warnings;

    use HTML::Parser;
    use LWP::UserAgent;
    use URI::URL;
    use Data::Dumper;

    sub make_parser {
        my $inside;    # true between <a href=...> and </a>
        my %attr;      # attributes of the current <a> tag
        my $text;      # link text collected so far
        my @links;     # [ text, href ] pairs

        my $record = sub {
            my $state = Dumper {
                inside => $inside,
                attr   => \%attr,
                text   => $text,
            };

            my @cond = (
                [ sub { $inside },     "not inside" ],
                [ sub { %attr },       "no attr"    ],
                [ sub { $attr{href} }, "no href"    ],
            );

            my $ok = 1;
            for (@cond) {
                my($check,$msg) = @$_;

                unless ($check->()) {
                    warn "$0: $msg:\n$state ";
                    $ok = 0;
                }
            }

            push @links => [ $text || '<empty>', $attr{href} ] if $ok;

            $inside = 0;
            %attr = ();
            $text = '';
        };

        my $start_h = sub {
            my $tag = shift;
            return unless $tag eq 'a';

            if ($inside) {
                warn "$0: already inside";
                $record->();
            }

            my $attr = shift;
            return unless $attr->{href};

            %attr = %$attr;
            $inside = 1;
        };

        my $text_h = sub {
            return unless $inside;

            $text .= shift;
        };

        my $end_h = sub {
            my $tag = shift;
            return unless $tag eq 'a';

            return unless $inside;

            $record->();
        };

        my $p = HTML::Parser->new(
            api_version => 3,
            start_h => [ $start_h, "tagname, attr" ],
            text_h  => [ $text_h,  "dtext" ],
            end_h   => [ $end_h,   "tagname" ],
        );

        ($p, sub { @links });    # the parser, plus a closure returning the links
    }

    sub usage () { "Usage: $0 search-pattern\n" }

    ## main
    die usage unless @ARGV;

    my $pat = shift;
    my $lookfor = eval { qr/$pat/ };
    die "$0: bad pattern: $pat" unless $lookfor;

    my $url = "http://www.cpan.org/";
    my $ua = LWP::UserAgent->new;

    my($p,$links) = make_parser;

    # Request document and parse it as it arrives
    my $res = $ua->request(
        HTTP::Request->new(GET => $url),
        sub { $p->parse($_[0]) }
    );

    my $base = $res->base;
    for ($links->()) {
        my($text,$href) = @$_;

        next unless $text =~ /$lookfor$/;    # matches the link text, not the href

        my $url = url($href, $base)->abs;

        $text =~ s/\s+/ /g;
        print "$text:\n $url\n";
    }
    % ./try 's$'
    Perl modules:
     http://www.cpan.org/modules/index.html
    Perl scripts:
     http://www.cpan.org/scripts/index.html
    Perl recent arrivals:
     http://www.cpan.org/RECENT.html
    CPAN sites:
     http://www.cpan.org/SITES.html
    CPAN sites:
     http://mirrors.cpan.org/
    CPAN modules, distributions, and authors:
     http://search.cpan.org/
    CPAN Frequently Asked Questions:
     http://www.cpan.org/misc/cpan-faq.html
    Perl Mailing Lists:
     http://lists.cpan.org/
    Perl Bookmarks:
     http://bookmarks.cpan.org/
    % ./try '('
    ./try: bad pattern: ( at ./try line 95.
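
The second run above shows the script rejecting an invalid pattern instead of dying inside the match. A minimal sketch of that guard on its own, using only core Perl; the helper name compile_pattern is made up for illustration and is not part of the script:

```perl
use strict;
use warnings;

# Hypothetical helper: compile a user-supplied pattern.
# Returns a compiled regex, or undef if the pattern is not valid
# (the reason is then available in $@).
sub compile_pattern {
    my ($pat) = @_;
    my $re = eval { qr/$pat/ };
    return $re;
}

for my $pat ('s$', '(') {
    if (defined compile_pattern($pat)) {
        print "ok:  $pat\n";
    }
    else {
        print "bad: $pat\n";
    }
}
```

Wrapping qr// in eval is what lets the caller print its own message for input such as '(' rather than aborting with Perl's regex compilation error.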

    Hope this helps,
    Greg
    --
    In a system of full capitalism, there should be (but, historically, has not
    yet been) a complete separation of state and economics, in the same way and
    for the same reasons as the separation of state and church.
    -- Ayn Rand
    Greg Bacon, Jul 9, 2003
    #2

  3. Greg Bacon

    Greg Bacon Guest

    In article <>,
    Kirk Larsen <> wrote:

    : Can't seem to get it to work. It just outputs nothing. Am I doing
    : something wrong, or is there another way? I did print out my search
    : string var and verified that it is in the source I'm searching, so
    : that's not the problem. Thanks again!

    Out of the box, does the code produce the same output as shown in
    my followup?

    What are you looking for? It looks like I was forcing the match to
    be at the end:

    next unless $text =~ /$lookfor$/;

    If you don't want to look at the end, change that to

    next unless $text =~ /$lookfor/;

    It would also help if you showed your code, but, as always with
    Usenet, cutting-and-pasting megabytes of source code isn't useful.
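
One possible reason for empty output, offered as a guess since the failing code was not posted: the script tests the link text, while the original question asked about links (hrefs) that end with a string. A small core-Perl sketch of the difference, using made-up sample data in the same [ text, href ] shape; matching_links and the sample entries are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical sample data: [ link text, href ] pairs.
my @links = (
    [ 'Perl modules', 'modules/index.html' ],
    [ 'CPAN sites',   'SITES.html' ],
    [ 'Search',       'http://search.cpan.org/' ],
);

# Return the pairs whose chosen field matches $re.
# $field is 0 for the link text, 1 for the href.
sub matching_links {
    my ($re, $field, @pairs) = @_;
    return grep { $_->[$field] =~ $re } @pairs;
}

my $want = 'index.html';

# \Q...\E makes the stored string literal; the trailing $ anchors it at the end.
my @href_end = matching_links(qr/\Q$want\E$/, 1, @links);
my @text_end = matching_links(qr/\Q$want\E$/, 0, @links);
my @text_any = matching_links(qr/sites/i,     0, @links);

printf "href ends with '%s': %d\n", $want, scalar @href_end;
printf "text ends with '%s': %d\n", $want, scalar @text_end;
printf "text contains 'sites': %d\n", scalar @text_any;
```

With this sample data the href test finds one link and the text test finds none, which is the kind of mismatch that would produce no output.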

    Greg
    --
    The greatest dangers to liberty lurk in insidious encroachment by men
    of zeal, well-meaning but without understanding.
    -- Justice Louis D. Brandeis
    Greg Bacon, Jul 10, 2003
    #3
  4. Mina Naguib

    Mina Naguib Guest

    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    Kirk Larsen wrote:
    > Sounds simple enough. I need to retrieve the source from a web page


    use LWP::Simple;

    > and then find a link in that web page that ends with a string which I
    > have stored in a variable.


    There are a few ways to do this. I prefer HTML::TokeParser:

    > Can someone please post or direct me to a
    > sample of how to do this? Thanks!



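    my $url = 'http://www.freebsd.org';
    my $match = 'man.cgi';

    use LWP::Simple;
    use HTML::TokeParser;

    my $document = get($url);
    die "Failed to retrieve document\n" unless defined $document;

    my $parser = HTML::TokeParser->new(\$document);

    # get_tag returns [ $tag, \%attr, ... ] for each <a> start tag
    while (my $token = $parser->get_tag("a")) {
        my $href = $token->[1]->{href};
        next unless defined $href;    # skip anchors without an href

        # \Q...\E matches $match literally, so "." is not a wildcard
        if ($href =~ /\Q$match\E$/) {
            print "I matched $href\n";
        }
    }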

    For more information, see http://search.cpan.org/dist/HTML-Parser/lib/HTML/TokeParser.pm and
    http://search.cpan.org/dist/libwww-perl/lib/LWP/Simple.pm.

    Note that links are often relative, which means you'll often get a link to "something.html" instead
    of "http://domain.com/dir/something.html". It'll be up to you to extrapolate the domain and
    directory structure of the original URL (and append to it the link data, as well as possibly take
    into account any ../.././ calls) to determine the full URL to call next.
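
The relative-link resolution described above does not have to be done by hand. A minimal sketch, assuming the URI module from CPAN is installed; the helper name absolutize and the example URLs are made up for illustration. (The first script in this thread does the same job with url($href, $base)->abs from URI::URL.)

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve $href against the URL of the page it came from.
# URI->new_abs handles plain relative names, ../ segments, and hrefs that
# are already absolute.
sub absolutize {
    my ($href, $base) = @_;
    return URI->new_abs($href, $base)->as_string;
}

my $base = 'http://domain.com/dir/index.html';    # example page URL

for my $href ('something.html', '../other/page.html', 'http://example.org/x') {
    print "$href => ", absolutize($href, $base), "\n";
}
```

Here "something.html" resolves to "http://domain.com/dir/something.html", and an href that is already absolute comes back unchanged.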

    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.2.1 (GNU/Linux)
    Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

    iD8DBQE/DkfieS99pGMif6wRApEdAJwIJrCRTLNOgtsxCSUYCY7NyO6/AgCZATFH
    cc0PEq+mFhTbBDrQ/79fah4=
    =/K0i
    -----END PGP SIGNATURE-----
    Mina Naguib, Jul 11, 2003
    #4