Script Help

Kev

I'm writing a script that is part of a larger program to index a defined
list of websites. The portion I'm working on finds all pages ending in
.htm or .html so that I can search and index them. So far the script maps
out all the links on a page. Can anyone help me eliminate the links that
do not end in .htm or .html?

#!/usr/bin/perl

use HTML::LinkExtor;
use LWP::Simple;

$base_url = "http://www.cnn.com";
$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse(get($base_url))->eof;
@links = $parser->links;

foreach $linkarray (@links)
{
    my @element = @$linkarray;
    my $elt_type = shift @element;
    while (@element)
    {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
    }
}

for (sort keys %seen)
{
    print $_, "\n";
}

K.
 
Kev

Jim, Purl Gurl: thanks.


Jim Gibson said:
use strict;

my $base_url ...
my $parser = ...
my @links = ...
my %seen;


You can start by eliminating img and script links:

next if $elt_type =~ /^(?:img|script)$/i;

You can accept only URLs with '.htm' in them:

$seen{$attr_value}++ if $attr_value =~ /\.htm/;

leaving 59 links out of the 278 you started with (today anyway).

You will miss some HTML links that don't have explicit file names but
are depending on the server to supply index.html or its ilk if only a
directory name is given. There might be some false matches for files
that have '.htm' in them somewhere other than the end, but finding the
end of a file name in a URL seems a bit tricky.
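
Putting the two filters above into the original loop might look like the sketch below. The @links data is made up for illustration (example.com URLs in the shape HTML::LinkExtor->links returns), so it runs without fetching anything, and filter_links is a hypothetical helper name. The element test uses a grouped alternation so that both anchors apply to each tag name.

```perl
use strict;
use warnings;

# Keep attribute values that contain '.htm', skipping img and script tags.
# Each entry of @links is [ $tag, $attr_name => $attr_value, ... ].
sub filter_links {
    my (@links) = @_;
    my %seen;
    for my $linkarray (@links) {
        my ($elt_type, @element) = @$linkarray;
        next if $elt_type =~ /^(?:img|script)$/i;
        while (my ($attr_name, $attr_value) = splice(@element, 0, 2)) {
            $seen{$attr_value}++ if $attr_value =~ /\.htm/;
        }
    }
    return sort keys %seen;
}

# Hypothetical sample data, not real output from any site.
my @sample = (
    [ a      => href => 'http://www.example.com/world/index.html' ],
    [ img    => src  => 'http://www.example.com/logo.htm.gif' ],
    [ script => src  => 'http://www.example.com/js/main.js' ],
    [ a      => href => 'http://www.example.com/about.htm' ],
    [ a      => href => 'http://www.example.com/search?q=news' ],
);

print "$_\n" for filter_links(@sample);
```

With this sample only the two a-tag links containing '.htm' are printed; the image is dropped by its tag, not by its name.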
 
Ben Morrow

Jim Gibson said:
You can accept only URLs with '.htm' in them:
$seen{$attr_value}++ if $attr_value =~ /\.htm/;
You will miss some HTML links that don't have explicit file names but
are depending on the server to supply index.html or its ilk if only a
directory name is given.

...or otherwise are text/html but not named with .html, as with most
CGIs. Since you're using LWP anyway, you can make a HEAD request for
each page (after eliminating scripts/imgs) and check the Content-Type.
This will (should) be rather faster than making a full request, since
no body is transferred.
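
A minimal sketch of that HEAD check, assuming LWP::UserAgent is used instead of LWP::Simple so the response object can be inspected. The helper names is_html_response and url_is_html are hypothetical, as is the choice to also accept application/xhtml+xml. URLs come from the command line, so nothing is fetched unless you pass some.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# True if the response succeeded and declares an HTML media type.
sub is_html_response {
    my ($res) = @_;
    return 0 unless $res->is_success;
    # In scalar context content_type gives the lowercased media type
    # with parameters such as charset stripped off.
    my $type = $res->content_type;
    return ($type eq 'text/html' || $type eq 'application/xhtml+xml') ? 1 : 0;
}

# Send a HEAD request and classify the reply; no body is downloaded.
sub url_is_html {
    my ($ua, $url) = @_;
    return is_html_response($ua->head($url));
}

my $ua = LWP::UserAgent->new(timeout => 10);
for my $url (@ARGV) {
    printf "%s %s\n", (url_is_html($ua, $url) ? 'html ' : 'other'), $url;
}
```

Some servers answer HEAD with an error or a different type than they would for GET, so a real crawler would probably want a fallback for those cases.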

Jim Gibson said:
There might be some false matches for files
that have '.htm' in them somewhere other than the end, but finding the
end of a file name in a URL seems a bit tricky.

$seen{$attr_value}++ if $attr_value =~ / \. htm l? (?: $ | \? ) /x;

Ben
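
Another way to find the end of the file name is to let the URI module split the URL and test only the path component, which also sets aside any "#fragment" as well as the query string. This is a sketch, not something from the posts above: has_html_extension is a hypothetical helper, the example.com URLs are invented, and the match is case-insensitive by assumption.

```perl
use strict;
use warnings;
use URI;

# True if the URL's path component ends in .htm or .html.
sub has_html_extension {
    my ($url) = @_;
    my $uri = URI->new($url);
    # Schemes such as mailto: have no path method, so guard the call.
    return 0 unless $uri->can('path');
    return $uri->path =~ /\.html?\z/i ? 1 : 0;
}

for my $url (
    'http://www.example.com/world/index.html',
    'http://www.example.com/about.htm?lang=en#top',
    'http://www.example.com/logo.htm.gif',
    'http://www.example.com/news/',
    'mailto:someone@example.com',
) {
    printf "%-5s %s\n", (has_html_extension($url) ? 'keep' : 'skip'), $url;
}
```

The first two URLs are kept and the other three are skipped. Directory-style links such as /news/ are still missed, which is the gap the HEAD request approach covers.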
 
