chadda
I hardcode the categories into the script because I have no idea how
to make the script traverse a site whose urls lead to other urls
that in turn lead to still more urls. On top of that, I want the
script to follow only the urls that have certain words in them.
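One way the traversal described above could be sketched is a breadth-first queue plus a hash of urls already seen, following a link only when it matches a pattern. This is a rough starting point under stated assumptions, not a finished crawler: the names crawl, $fetch, $want and $max_pages are made up for illustration, and the example.com pages are a hypothetical in-memory site so the sketch runs without touching the network.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Breadth-first crawl (hypothetical helper, not part of any module).
# $fetch is a code ref that takes a URL and returns the page's HTML, or
# undef on failure; $want is a regex a link must match to be followed;
# $max_pages caps how many pages are fetched.
sub crawl {
    my ($start, $want, $fetch, $max_pages) = @_;
    my %seen  = ($start => 1);   # urls already queued, so none is fetched twice
    my @queue = ($start);
    my @visited;

    while (@queue and @visited < $max_pages) {
        my $url  = shift @queue;
        my $html = $fetch->($url);
        next unless defined $html;
        push @visited, $url;

        # Passing a base URL makes HTML::LinkExtor return absolute links.
        my $extor = HTML::LinkExtor->new(undef, $url);
        $extor->parse($html);
        $extor->eof;

        for my $link ($extor->links) {
            my ($tag, %attr) = @$link;
            next unless $tag eq 'a' and defined $attr{href};
            my $next = "$attr{href}";      # stringify the URI object
            next unless $next =~ $want;    # only follow matching links
            push @queue, $next unless $seen{$next}++;
        }
    }
    return @visited;
}

# Hypothetical in-memory site standing in for real HTTP requests.
my %site = (
    'http://example.com/catalog/index.html' =>
        '<a href="laptops.html">L</a> <a href="/about.html">A</a>',
    'http://example.com/catalog/laptops.html' =>
        '<a href="index.html">back</a> <a href="item1.html">one</a>',
    'http://example.com/catalog/item1.html' => '<p>no links</p>',
);

my @pages = crawl(
    'http://example.com/catalog/index.html',
    qr{/catalog/},
    sub { $site{ $_[0] } },
    50,
);
print "$_\n" for @pages;
```

Against a real site, the fetch callback could presumably wrap an LWP::UserAgent object instead, for example sub { my $r = $browser->get($_[0]); $r->is_success ? $r->decoded_content : undef }, and the page cap keeps a mistake in the pattern from walking the whole site.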
Anyhow, when I do something like the following...
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use LWP::UserAgent;
use HTML::LinkExtor;

my @urls;

# for privoxy
my $browser = LWP::UserAgent->new;
$browser->proxy( ['http', 'https'], "http://localhost:8118" );

# my categories
my $acer_laptops = 'http://www.doba.com/catalog/search/search.php?filters[submit]=advanced&filters[inc_noimage]=0&filters[inc_outofstock]=0&filters[inc_discontinued]=0&filters[inc_refurbished]=1&filters[inc_pro_only]=0&filters[min_qty]=0&filters[category]=112666';

# fetch through $browser so the privoxy setting is actually used
# (LWP::Simple's get() would bypass it)
my $response = $browser->get($acer_laptops);
die "Can't fetch $acer_laptops: ", $response->status_line, "\n"
    unless $response->is_success;
my $html = $response->decoded_content;

my $get_links = HTML::LinkExtor->new;
$get_links->parse($html);
$get_links->eof;
my @links = $get_links->links;

foreach (@links) {
    # $_ contains [type, [name, value], ...]
    shift @$_;
    while (my ($name, $value) = splice(@$_, 0, 2)) {
        if ($value =~ /\/catalog\/search\/search_hit/) {
            push(@urls, $value);
            push(@urls, "\n");
            #print " $name -> $value\n";
        }
    }
}
print @urls;
I get the following
../buildfile.pl
/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html
/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html
/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html
/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html
/catalog/search/search_hit.php?product_id=2994617&location=/catalog/2994617.html
/catalog/search/search_hit.php?product_id=2994617&location=/catalog/2994617.html
/catalog/search/search_hit.php?product_id=3041783&location=/catalog/3041783.html
/catalog/search/search_hit.php?product_id=3041783&location=/catalog/3041783.html
/catalog/search/search_hit.php?product_id=3117275&location=/catalog/3117275.html
/catalog/search/search_hit.php?product_id=3117275&location=/catalog/3117275.html
/catalog/search/search_hit.php?product_id=3132778&location=/catalog/3132778.html
/catalog/search/search_hit.php?product_id=3132778&location=/catalog/3132778.html
/catalog/search/search_hit.php?product_id=3137118&location=/catalog/3137118.html
/catalog/search/search_hit.php?product_id=3137118&location=/catalog/3137118.html
/catalog/search/search_hit.php?product_id=3137121&location=/catalog/3137121.html
/catalog/search/search_hit.php?product_id=3137121&location=/catalog/3137121.html
/catalog/search/search_hit.php?product_id=3137123&location=/catalog/3137123.html
/catalog/search/search_hit.php?product_id=3137123&location=/catalog/3137123.html
/catalog/search/search_hit.php?product_id=3137124&location=/catalog/3137124.html
/catalog/search/search_hit.php?product_id=3137124&location=/catalog/3137124.html
/catalog/search/search_hit.php?product_id=3610730&location=/catalog/3610730.html
/catalog/search/search_hit.php?product_id=3610730&location=/catalog/3610730.html
/catalog/search/search_hit.php?product_id=3610734&location=/catalog/3610734.html
/catalog/search/search_hit.php?product_id=3610734&location=/catalog/3610734.html
Half of the URLs are duplicates: every link shows up twice. I know
there is a Perl FAQ entry on removing duplicate hash entries. The
question is, how would I set up a hash when the only values I have
are the urls? Also, input on how to improve my code is more than
welcome.
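For reference, the usual idiom for this kind of de-duplication uses each url as a hash key rather than a value, so the hash only records whether a url has been seen before. The sketch below is a minimal illustration with sample data shaped like the output above; the sub name unique and the variable @unique_urls are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the distinct elements of a list, keeping first-seen order.
# Each element becomes a key of %seen; the value is only a counter,
# so grep keeps an element the first time its count is still zero.
sub unique {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Sample data: every link appears twice, as in the output above.
my @urls = (
    '/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html',
    '/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html',
    '/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html',
    '/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html',
);

my @unique_urls = unique(@urls);
print "$_\n" for @unique_urls;
```

With this approach the newline is added at print time instead of being pushed onto the array as a separate element, which keeps the array holding urls only.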