Script Help

Discussion in 'Perl Misc' started by Kev, Oct 30, 2003.

  1. Kev

    Kev Guest

    I'm writing a script that is part of a larger script to index a defined list
    of websites. The portion that I'm working on is used to find all pages
    ending in .htm / .html so that I can search those pages and index them. I
    got the script to map out all the links. Can anyone help in eliminating the
    non .htm / html links obtained?

    #!/usr/bin/perl

    use HTML::LinkExtor;
    use LWP::Simple;

    $base_url = "http://www.cnn.com";
    $parser=HTML::LinkExtor->new(undef, $base_url);
    $parser->parse(get($base_url))->eof;
    @links=$parser->links;

    foreach $linkarray(@links)
    {
        my @element = @$linkarray;
        my $elt_type = shift @element;
        while (@element)
        {
            my ($attr_name, $attr_value) = splice(@element, 0, 2);
            $seen{$attr_value}++;
        }
    }

    for (sort keys %seen)
    {
        print $_, "\n";
    }

    K.
     
    Kev, Oct 30, 2003
    #1

  2. Kev

    Kev Guest

    Jim / Purl Gurl thanks.


    "Jim Gibson" <> wrote in message
    news:301020031444148775%...
    > In article <vifob.37757$>, Kev
    > <> wrote:
    >
    > > I'm writing a script that is part of a larger script to index a defined
    > > list of websites. The portion that I'm working on is used to find all pages
    > > ending in .htm / .html so that I can search those pages and index them. I
    > > got the script to map out all the links. Can anyone help in eliminating the
    > > non .htm / html links obtained?
    > >
    > > #!/usr/bin/perl

    >
    > use strict;
    >
    > >
    > > use HTML::LinkExtor;
    > > use LWP::Simple;
    > >
    > > $base_url = "http://www.cnn.com";

    >
    > my $base_url ...
    >
    > > $parser=HTML::LinkExtor->new(undef, $base_url);

    >
    > my $parser = ...
    >
    > > $parser->parse(get($base_url))->eof;
    > > @links=$parser->links;

    >
    > my @links = ...
    > my %seen;
    >
    > >
    > > foreach $linkarray(@links)
    > > {
    > > my @element = @$linkarray;
    > > my $elt_type = shift @element;

    >
    > You can start by eliminating img and script links:
    >
    > next if $elt_type =~ /^(?:img|script)$/i;
    >
    >
    > > while (@element)
    > > {
    > > my ($attr_name, $attr_value) = splice(@element, 0, 2);
    > > $seen{$attr_value}++;

    >
    > You can accept only URLs with '.htm' in them:
    >
    > $seen{$attr_value}++ if $attr_value =~ /\.htm/;
    > > }
    > > }
    > >
    > > for (sort keys %seen)
    > > {
    > > print $_, "\n";
    > > }
    > >

    >
    > leaving 59 links out of the 278 you started with (today anyway).
    >
    > You will miss some HTML links that don't have explicit file names but
    > are depending on the server to supply index.html or its ilk if only a
    > directory name is given. There might be some false matches for files
    > that have '.htm' in them somewhere other than the end, but finding the
    > end of a file name in a URL seems a bit tricky.
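
Put together, the fragments in the quoted reply might look something like the sketch below. Nobody in the thread posted it as a whole, so treat it as an assumption-laden assembly: the sub name html_links is hypothetical, the page is only fetched when a URL is passed on the command line, and the filter is still the loose "contains .htm" test from the reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::LinkExtor;
use LWP::Simple qw(get);

# Hypothetical helper: return the sorted, de-duplicated links found in $html
# whose URL contains ".htm", skipping <img> and <script> elements.
sub html_links {
    my ($html, $base_url) = @_;

    # Passing a base URL makes the parser return absolute URLs.
    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($html);
    $parser->eof;

    my %seen;
    for my $link ($parser->links) {
        my ($elt_type, @attrs) = @$link;

        # The group makes both anchors apply to both tag names.
        next if $elt_type =~ /^(?:img|script)$/i;

        while (my ($attr_name, $attr_value) = splice(@attrs, 0, 2)) {
            $seen{$attr_value}++ if $attr_value =~ /\.htm/;
        }
    }

    my @links = sort keys %seen;
    return @links;
}

# Fetch and list a page only when a URL is given on the command line.
if (@ARGV) {
    my $base_url = shift @ARGV;
    my $html = get($base_url);
    die "Could not fetch $base_url\n" unless defined $html;
    print "$_\n" for html_links($html, $base_url);
}
```

Keeping the parsing in a sub that takes the HTML as a string means the filter can be tried on a saved page without going to the network each time.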
     
    Kev, Oct 30, 2003
    #2

  3. Ben Morrow

    Ben Morrow Guest

    Jim Gibson <> wrote:
    > In article <vifob.37757$>, Kev
    > <> wrote:
    >
    > > I'm writing a script that is part of a larger script to index a
    > > defined list of websites. The portion that I'm working on is used
    > > to find all pages ending in .htm / .html so that I can search
    > > those pages and index them. I got the script to map out all the
    > > links. Can anyone help in eliminating the non .htm / html links
    > > obtained?

    <snip>
    >
    > You can accept only URLs with '.htm' in them:
    > $seen{$attr_value}++ if $attr_value =~ /\.htm/;

    <snip>
    >
    > You will miss some HTML links that don't have explicit file names but
    > are depending on the server to supply index.html or its ilk if only a
    > directory name is given.


    ...or otherwise are text/html but not named with .html, such as most
    CGIs, for instance. Since you're using LWP anyway, you can make a HEAD
    request for each page (after eliminating scripts/imgs) and check the
    type. This will (should) be rather faster than making a full request.
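
A minimal sketch of that HEAD check, assuming LWP::UserAgent is used for the request. The helper names is_html_response and is_html_url are made up for illustration, the 10 second timeout is an arbitrary choice, and a server that refuses HEAD or reports a different type for HEAD than for GET would need extra handling that is not shown here.

```perl
use strict;
use warnings;

use LWP::UserAgent;

# Hypothetical helper: true if the response succeeded and is served as text/html.
# In scalar context content_type returns the lower-cased media type without
# parameters such as charset.
sub is_html_response {
    my ($response) = @_;
    return $response->is_success && $response->content_type eq 'text/html' ? 1 : 0;
}

# Hypothetical helper: issue a HEAD request for $url and test the reply.
sub is_html_url {
    my ($ua, $url) = @_;
    return is_html_response($ua->head($url));
}

# Check any URLs given on the command line; do nothing otherwise.
if (@ARGV) {
    my $ua = LWP::UserAgent->new(timeout => 10);
    for my $url (@ARGV) {
        print "$url\n" if is_html_url($ua, $url);
    }
}
```

In the indexing script this would replace the filename test: each link that survives the img/script filter is passed to is_html_url and kept only if it returns true.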

    > There might be some false matches for files
    > that have '.htm' in them somewhere other than the end, but finding the
    > end of a file name in a URL seems a bit tricky.


    $seen{$attr_value}++ if $attr_value =~ / \. htm l? (?: $ | \? ) /x;
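
A variant of the same idea, sketched on the assumption that the URI module (which LWP already depends on) is acceptable: let URI split off the path, then anchor the extension test at the end of it. That way a query string or a #fragment after the file name does not get in the way. The sub name has_html_extension is hypothetical, and the sketch expects absolute http or https URLs, which is what HTML::LinkExtor hands back when it is given a base URL.

```perl
use strict;
use warnings;

use URI;

# Hypothetical helper: true if $url is an http(s) URL whose path ends in
# .htm or .html, regardless of any query string or fragment after it.
sub has_html_extension {
    my ($url) = @_;
    my $uri = URI->new($url);

    # Relative URLs and schemes such as mailto: are rejected here.
    my $scheme = $uri->scheme;
    return 0 unless defined $scheme && $scheme =~ /\Ahttps?\z/;

    # path() excludes the query string and the fragment.
    return $uri->path =~ /\.html?\z/i ? 1 : 0;
}
```

Used in the loop, this would read: $seen{$attr_value}++ if has_html_extension($attr_value);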

    Ben

    --
    . | .
    \ / The clueometer is reading zero.
    . .
    __ <-----@ __
     
    Ben Morrow, Oct 30, 2003
    #3
