Kev
I'm writing a script, part of a larger program, that indexes a defined list
of websites. The portion I'm working on needs to find all pages ending in
.htm or .html so that I can search and index them. So far the script maps
out all the links on a page. Can anyone help me eliminate the links that
don't end in .htm / .html?
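#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::Simple;

my $base_url = "http://www.cnn.com";
my %seen;

# Passing $base_url makes LinkExtor return absolute URLs.
my $parser = HTML::LinkExtor->new(undef, $base_url);
my $html = get($base_url);
die "Could not fetch $base_url\n" unless defined $html;
$parser->parse($html)->eof;
my @links = $parser->links;

# Each entry is [tag, attr_name => attr_value, ...].
foreach my $linkarray (@links)
{
    my @element = @$linkarray;
    my $elt_type = shift @element;
    while (@element)
    {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
    }
}

# Print each unique URL once, sorted.
for (sort keys %seen)
{
    print $_, "\n";
}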
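For reference, here is a rough sketch of the kind of filter I'm after. It is only an assumption about how this could be done, not a tested part of the script: is_html_link is a hypothetical helper name, and the sample URLs are made up for illustration. It drops any query string or fragment and then checks whether what remains ends in .htm or .html, ignoring case.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the URL's path ends in .htm or .html.
sub is_html_link {
    my ($url) = @_;
    (my $path = "$url") =~ s/[?#].*//s;    # drop query string and fragment
    return $path =~ /\.html?$/i ? 1 : 0;   # matches .htm and .html, any case
}

# Made-up sample URLs, just to exercise the filter.
my @sample = (
    "http://www.cnn.com/index.html",
    "http://www.cnn.com/WORLD/page.htm?ref=home",
    "http://www.cnn.com/images/logo.gif",
    "http://www.cnn.com/",
);

print "$_\n" for grep { is_html_link($_) } @sample;
```

In the script above, the same test could presumably go into the final loop, e.g. for (grep { is_html_link($_) } sort keys %seen). Note that this would also drop directory-style URLs such as http://www.cnn.com/ that may well serve HTML, so it only does what I literally asked for.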
K.