Find string in web page

Kirk Larsen

Sounds simple enough. I need to retrieve the source from a web page
and then find a link in that web page that ends with a string which I
have stored in a variable. Can someone please post or direct me to a
sample of how to do this? Thanks!
 
Greg Bacon

: Sounds simple enough. I need to retrieve the source from a web page
: and then find a link in that web page that ends with a string which I
: have stored in a variable. Can someone please post or direct me to a
: sample of how to do this? Thanks!

Try this on for size:

% cat try
#! /usr/local/bin/perl

use strict;
use warnings;

use HTML::Parser;
use LWP::UserAgent;
use URI::URL;
use Data::Dumper;

sub make_parser {
    my $inside;    # true while between <a href=...> and </a>
    my %attr;      # attributes of the current <a> tag
    my $text;      # link text collected so far
    my @links;     # [ text, href ] pairs found

    my $record = sub {
        my $state = Dumper {
            inside => $inside,
            attr   => \%attr,
            text   => $text,
        };

        my @cond = (
            [ sub { $inside },     "not inside" ],
            [ sub { %attr },       "no attr"    ],
            [ sub { $attr{href} }, "no href"    ],
        );

        my $ok = 1;
        for (@cond) {
            my($check,$msg) = @$_;

            unless ($check->()) {
                warn "$0: $msg:\n$state ";
                $ok = 0;
            }
        }

        push @links => [ $text || '<empty>', $attr{href} ] if $ok;

        $inside = 0;
        %attr   = ();
        $text   = '';
    };

    my $start_h = sub {
        my $tag = shift;
        return unless $tag eq 'a';

        if ($inside) {
            warn "$0: already inside";
            $record->();
        }

        my $attr = shift;
        return unless $attr->{href};

        %attr   = %$attr;
        $inside = 1;
    };

    my $text_h = sub {
        return unless $inside;

        $text .= shift;
    };

    my $end_h = sub {
        my $tag = shift;
        return unless $tag eq 'a';

        return unless $inside;

        $record->();
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $start_h, "tagname, attr" ],
        text_h      => [ $text_h,  "dtext"         ],
        end_h       => [ $end_h,   "tagname"       ],
    );

    ($p, sub { @links });
}

sub usage () { "Usage: $0 search-pattern\n" }

## main
die usage unless @ARGV;

my $pat = shift;
my $lookfor = eval { qr/$pat/ };
die "$0: bad pattern: $pat" unless $lookfor;

my $url = "http://www.cpan.org/";
my $ua = LWP::UserAgent->new;

my($p,$links) = make_parser;

# Request document and parse it as it arrives
my $res = $ua->request(
    HTTP::Request->new(GET => $url),
    sub { $p->parse($_[0]) }
);

my $base = $res->base;
for ($links->()) {
    my($text,$href) = @$_;

    next unless $text =~ /$lookfor$/;    # tests the link text, not the href

    my $url = url($href, $base)->abs;

    $text =~ s/\s+/ /g;
    print "$text:\n $url\n";
}
% ./try 's$'
Perl modules:
http://www.cpan.org/modules/index.html
Perl scripts:
http://www.cpan.org/scripts/index.html
Perl recent arrivals:
http://www.cpan.org/RECENT.html
CPAN sites:
http://www.cpan.org/SITES.html
CPAN sites:
http://mirrors.cpan.org/
CPAN modules, distributions, and authors:
http://search.cpan.org/
CPAN Frequently Asked Questions:
http://www.cpan.org/misc/cpan-faq.html
Perl Mailing Lists:
http://lists.cpan.org/
Perl Bookmarks:
http://bookmarks.cpan.org/
% ./try '('
./try: bad pattern: ( at ./try line 95.

Hope this helps,
Greg
 
Greg Bacon

: Can't seem to get it to work. It just outputs nothing. Am I doing
: something wrong, or is there another way? I did print out my search
: string var and verified that it is in the source I'm searching, so
: that's not the problem. Thanks again!

Out of the box, does the code produce the same output as shown in
my followup?

What are you looking for? It looks like I was forcing the match to
be at the end:

next unless $text =~ /$lookfor$/;

If you don't want to look at the end, change that to

next unless $text =~ /$lookfor/;
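
To make the difference concrete, here is a minimal sketch rather than a drop-in fix. The link list is made-up sample data and matching_links is a hypothetical helper; it also assumes the variable holds a literal string, hence the \Q...\E quoting. Note that the script above tests the link text, so a string that only occurs in the href would have to be matched against the second field instead.

```perl
use strict;
use warnings;

# Hypothetical sample data: [ link text, href ] pairs, not from a real page.
my @links = (
    [ 'Perl modules', 'modules/index.html' ],
    [ 'CPAN sites',   'SITES.html'         ],
    [ 'Manual pages', 'cgi/man.cgi'        ],
);

# Return the links whose chosen field (0 = text, 1 = href) matches $re.
sub matching_links {
    my ($re, $field, @candidates) = @_;
    return grep { $_->[$field] =~ $re } @candidates;
}

my $want = 'man.cgi';

# \Q...\E quotes regex metacharacters, so "." matches only a literal dot.
my $anchored   = qr/\Q$want\E$/;    # must sit at the very end
my $unanchored = qr/\Q$want\E/;     # may appear anywhere

printf "href ends with %s: %s\n", $want, $_->[1]
    for matching_links($anchored, 1, @links);
printf "href contains %s: %s\n", $want, $_->[1]
    for matching_links($unanchored, 1, @links);
printf "text matches %s: %s\n", $want, $_->[0]
    for matching_links($unanchored, 0, @links);
```

With this sample data the first two loops each find the cgi/man.cgi entry, while the third finds nothing because no link text contains the string.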

It would also help if you showed your code, but, as always with
Usenet, cutting-and-pasting megabytes of source code isn't useful.

Greg
 
Mina Naguib

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Kirk said:
: Sounds simple enough. I need to retrieve the source from a web page

use LWP::Simple;

: and then find a link in that web page that ends with a string which I
: have stored in a variable.

There are a few ways to do this. I prefer HTML::TokeParser.

: Can someone please post or direct me to a
: sample of how to do this? Thanks!


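use strict;
use warnings;

use LWP::Simple;
use HTML::TokeParser;

my $url   = 'http://www.freebsd.org';
my $match = 'man.cgi';

# get() returns undef on failure
my $document = get($url) or die "Failed to retrieve document\n";

my $parser = HTML::TokeParser->new(\$document);

# get_tag("a") returns [ $tag, \%attr, ... ] for each <a> start tag
while (my $token = $parser->get_tag("a")) {
    my $href = $token->[1]{href};
    next unless defined $href;    # skip anchors without an href

    # \Q...\E treats $match as a literal string, not a regex
    if ($href =~ /\Q$match\E$/) {
        print "I matched $href\n";
    }
}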

For more information, see http://search.cpan.org/dist/HTML-Parser/lib/HTML/TokeParser.pm and
http://search.cpan.org/dist/libwww-perl/lib/LWP/Simple.pm.

Note that links are often relative, which means you'll often get a link to "something.html" instead
of "http://domain.com/dir/something.html". It'll be up to you to extrapolate the domain and
directory structure of the original URL (and append to it the link data, as well as possibly take
into account any ../.././ calls) to determine the full URL to call next.
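
As a rough sketch of that last step, the URI module can do the resolution for you: URI->new_abs takes the link and the URL of the page it came from. The absolutize helper and the example URLs below are made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Resolve a possibly relative href against the URL of the page it came from.
# Hypothetical helper; an href that is already absolute comes back unchanged.
sub absolutize {
    my ($href, $base) = @_;
    return URI->new_abs($href, $base)->as_string;
}

my $base = 'http://domain.com/dir/page.html';    # hypothetical page URL

print absolutize($_, $base), "\n"
    for 'something.html', '../other/index.html', '/top.html',
        'http://www.cpan.org/';
```

In the loop over get_tag("a") you would pass the href and the original $url to such a helper before fetching the next page.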

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE/DkfieS99pGMif6wRApEdAJwIJrCRTLNOgtsxCSUYCY7NyO6/AgCZATFH
cc0PEq+mFhTbBDrQ/79fah4=
=/K0i
-----END PGP SIGNATURE-----
 
