How would I set up a hash for the following

chadda

I hardcode the categories into the script because I have no idea how
to make the script traverse a site whose URLs lead to other URLs,
which in turn lead to still more URLs. On top of that, I want the
script to follow only the URLs that have certain words in them.

Anyhow, when I do something like the following...


#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;
use HTML::LinkExtor;

my @urls;

# for privoxy
my $browser = LWP::UserAgent->new;
$browser->proxy( [ 'http', 'https' ], 'http://localhost:8118' );

# my categories
my $acer_laptops = 'http://www.doba.com/catalog/search/search.php?'
    . 'filters[submit]=advanced&filters[inc_noimage]=0'
    . '&filters[inc_outofstock]=0&filters[inc_discontinued]=0'
    . '&filters[inc_refurbished]=1&filters[inc_pro_only]=0'
    . '&filters[min_qty]=0&filters[category]=112666';

# fetch through $browser so the proxy setting above is actually used
my $response = $browser->get($acer_laptops);
die "Could not fetch page: ", $response->status_line, "\n"
    unless $response->is_success;
my $html = $response->decoded_content;

my $get_links = HTML::LinkExtor->new;
$get_links->parse($html);
$get_links->eof;

my @links = $get_links->links;
foreach (@links) {
    # $_ contains [type, name, value, ...]
    shift @$_;
    while ( my ( $name, $value ) = splice( @$_, 0, 2 ) ) {
        if ( $value =~ m{/catalog/search/search_hit} ) {
            push( @urls, $value );
            #print " $name -> $value\n";
        }
    }
}

# one URL per line
print "$_\n" for @urls;

I get the following
../buildfile.pl
/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html
/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html
/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html
/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html
/catalog/search/search_hit.php?product_id=2994617&location=/catalog/2994617.html
/catalog/search/search_hit.php?product_id=2994617&location=/catalog/2994617.html
/catalog/search/search_hit.php?product_id=3041783&location=/catalog/3041783.html
/catalog/search/search_hit.php?product_id=3041783&location=/catalog/3041783.html
/catalog/search/search_hit.php?product_id=3117275&location=/catalog/3117275.html
/catalog/search/search_hit.php?product_id=3117275&location=/catalog/3117275.html
/catalog/search/search_hit.php?product_id=3132778&location=/catalog/3132778.html
/catalog/search/search_hit.php?product_id=3132778&location=/catalog/3132778.html
/catalog/search/search_hit.php?product_id=3137118&location=/catalog/3137118.html
/catalog/search/search_hit.php?product_id=3137118&location=/catalog/3137118.html
/catalog/search/search_hit.php?product_id=3137121&location=/catalog/3137121.html
/catalog/search/search_hit.php?product_id=3137121&location=/catalog/3137121.html
/catalog/search/search_hit.php?product_id=3137123&location=/catalog/3137123.html
/catalog/search/search_hit.php?product_id=3137123&location=/catalog/3137123.html
/catalog/search/search_hit.php?product_id=3137124&location=/catalog/3137124.html
/catalog/search/search_hit.php?product_id=3137124&location=/catalog/3137124.html
/catalog/search/search_hit.php?product_id=3610730&location=/catalog/3610730.html
/catalog/search/search_hit.php?product_id=3610730&location=/catalog/3610730.html
/catalog/search/search_hit.php?product_id=3610734&location=/catalog/3610734.html
/catalog/search/search_hit.php?product_id=3610734&location=/catalog/3610734.html

50% of the URLs are duplicates. I know there is a Perl FAQ entry for
removing duplicate hash entries. The question is, how would I set up a
hash when the only values are the URLs? Also, input on how to improve
my code is more than welcome.
 
Jürgen Exner

> 50% of the URLS are duplicates. I know there is a perl faq for
> removing duplicate hash entries. The question is, how would I set up a
> hash when the only values are the urls?

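Create a hash. For each URL, add an entry to that hash where the key is
the URL and the value is 1 (or even leave the value undefined; you will
never use it anyway). Adding the same URL again just overwrites the
existing entry, so each URL ends up in the hash exactly once.
To retrieve all URLs, just call keys() on the hash.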

jue
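
The approach described above might look roughly like the following. This is only a minimal sketch: the @urls list is hypothetical sample data standing in for whatever the link extraction produced, and %seen and @unique are names picked for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for the extracted links.
my @urls = (
    '/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html',
    '/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html',
    '/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html',
    '/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html',
);

# Key = URL, value = 1. A repeated URL just overwrites its own entry.
my %seen;
$seen{$_} = 1 for @urls;

# keys() returns each URL once, in no particular order,
# so sort if a stable order matters.
my @unique = sort keys %seen;

print "$_\n" for @unique;
```

Note that keys() does not preserve the order in which the URLs were found; the sort here is only to make the output predictable.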
 
chadda

> Create a hash, for each URL add an entry in that hash where the key is
> the URL and the value is 1 (or even leave the value undefined, you will
> never use it anyway).
> To retrieve all URLs just do a keys() on the hash.
>
> jue


Got it. Thanks.
 
Gibbering

> I hardcode the categories into the script because I have no idea how
> to make the script traverse a site that has urls going to other urls
> that in turn is going to other urls. Added on top of that, I want the
> script to only follow the urls that have certain words in them.
>
> Anyhow, when I do something like the following...
>
> #!/usr/bin/perl
>
> *SNIP*

What you probably want to do is hash the product ids ... something
like:

use Data::Dumper;

my %ids;
for ( $get_links->links ) {
    my $url = pop @$_;    # last element of each link is an attribute value
    my ($id) = $url =~ /product_id=(\d+)/;
    ++$ids{$id} if $id;   # count how often each product id appears
}

print Dumper \%ids;
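
If the URLs themselves are wanted rather than just the ids, the same idea can be turned into a filter that keeps the first URL seen for each product id and preserves the original order. This is a sketch under assumptions: @urls is hypothetical sample data, and the `grep { !$seen{...}++ }` idiom is the usual Perl way to drop duplicates, not something taken from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for the extracted links.
my @urls = (
    '/catalog/search/search_hit.php?product_id=3137118&location=/catalog/3137118.html',
    '/catalog/search/search_hit.php?product_id=3137118&location=/catalog/3137118.html',
    '/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html',
);

# Keep a URL only the first time its product id is seen.
# $seen{$id}++ is false (0) on the first visit and true afterwards.
my %seen;
my @unique = grep {
    my ($id) = /product_id=(\d+)/;
    defined $id && !$seen{$id}++;
} @urls;

print "$_\n" for @unique;
```

URLs without a product_id are dropped by this filter, which matches the `if $id` guard in the snippet above.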
 
