How would I set up a hash for the following

Discussion in 'Perl Misc' started by chadda, May 20, 2008.

  1. chadda

    chadda Guest

    I hardcode the categories into the script because I have no idea how
    to make the script traverse a site whose URLs lead to other URLs,
    which in turn lead to still more URLs. On top of that, I want the
    script to follow only the URLs that contain certain words.
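
    The traversal described above can be sketched as a breadth-first walk
    with a hash of already-seen URLs and a keyword filter. This is only a
    sketch under stated assumptions: crawl(), the $fetch_links callback, the
    page limit, and the example.com URLs are all hypothetical. The callback
    stands in for whatever fetches a page and returns its absolute links
    (for example LWP plus HTML::LinkExtor), so the logic runs without a
    network connection.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Breadth-first traversal that only follows URLs matching $wanted.
# $fetch_links is a hypothetical callback: given a URL, it returns the
# list of absolute URLs found on that page.
sub crawl {
    my ($start, $wanted, $fetch_links, $max_pages) = @_;
    my %seen  = ($start => 1);   # URLs already queued, so cycles terminate
    my @queue = ($start);
    my @visited;

    while (@queue and @visited < $max_pages) {
        my $url = shift @queue;
        push @visited, $url;
        for my $link ($fetch_links->($url)) {
            next unless $link =~ $wanted;   # keyword filter
            next if $seen{$link}++;         # skip anything queued before
            push @queue, $link;
        }
    }
    return @visited;
}

# Stand-in for a real site: page => links on that page (made-up URLs).
my %fake_site = (
    'http://example.com/' =>
        ['http://example.com/catalog/a', 'http://example.com/about'],
    'http://example.com/catalog/a' =>
        ['http://example.com/catalog/b', 'http://example.com/'],
    'http://example.com/catalog/b' =>
        ['http://example.com/catalog/a'],
);

my @pages = crawl(
    'http://example.com/',
    qr{/catalog/},                            # only follow catalog links
    sub { @{ $fake_site{ $_[0] } || [] } },   # fake fetcher for the sketch
    100,                                      # safety limit on pages visited
);
print "$_\n" for @pages;
```

    The %seen hash is what keeps pages that link back to each other from
    looping forever, and the page limit is a second safety net for large
    sites.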

    Anyhow, when I do something like the following...


    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::TokeParser;
    use LWP::Simple;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    my @urls;

    #for privoxy
    my $browser = LWP::UserAgent->new;
    $browser->proxy( ['http', 'https' ], "http://localhost:8118");

    #my categories
    my $acer_laptops = 'http://www.doba.com/catalog/search/search.php?filters[submit]=advanced&filters[inc_noimage]=0&filters[inc_outofstock]=0&filters[inc_discontinued]=0&filters[inc_refurbished]=1&filters[inc_pro_only]=0&filters[min_qty]=0&filters[category]=112666';

    my $html = get($acer_laptops);

    my $get_links = HTML::LinkExtor->new;
    $get_links->parse($html);

    my @links = $get_links->links;
    foreach (@links) {
        # $_ contains [tag, name => value, ...]
        shift @$_;
        while (my ($name, $value) = splice(@$_, 0, 2)) {
            if ($value =~ /\/catalog\/search\/search_hit/) {
                push(@urls, $value);
                push(@urls, "\n");
                #print " $name -> $value\n";
            }
        }
    }

    I get the following
    ../buildfile.pl
    /catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html
    /catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html
    /catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html
    /catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html
    /catalog/search/search_hit.php?product_id=2994617&location=/catalog/2994617.html
    /catalog/search/search_hit.php?product_id=2994617&location=/catalog/2994617.html
    /catalog/search/search_hit.php?product_id=3041783&location=/catalog/3041783.html
    /catalog/search/search_hit.php?product_id=3041783&location=/catalog/3041783.html
    /catalog/search/search_hit.php?product_id=3117275&location=/catalog/3117275.html
    /catalog/search/search_hit.php?product_id=3117275&location=/catalog/3117275.html
    /catalog/search/search_hit.php?product_id=3132778&location=/catalog/3132778.html
    /catalog/search/search_hit.php?product_id=3132778&location=/catalog/3132778.html
    /catalog/search/search_hit.php?product_id=3137118&location=/catalog/3137118.html
    /catalog/search/search_hit.php?product_id=3137118&location=/catalog/3137118.html
    /catalog/search/search_hit.php?product_id=3137121&location=/catalog/3137121.html
    /catalog/search/search_hit.php?product_id=3137121&location=/catalog/3137121.html
    /catalog/search/search_hit.php?product_id=3137123&location=/catalog/3137123.html
    /catalog/search/search_hit.php?product_id=3137123&location=/catalog/3137123.html
    /catalog/search/search_hit.php?product_id=3137124&location=/catalog/3137124.html
    /catalog/search/search_hit.php?product_id=3137124&location=/catalog/3137124.html
    /catalog/search/search_hit.php?product_id=3610730&location=/catalog/3610730.html
    /catalog/search/search_hit.php?product_id=3610730&location=/catalog/3610730.html
    /catalog/search/search_hit.php?product_id=3610734&location=/catalog/3610734.html
    /catalog/search/search_hit.php?product_id=3610734&location=/catalog/3610734.html

    Half of the URLs are duplicates. I know there is a Perl FAQ entry on
    removing duplicates with a hash. The question is, how would I set up a
    hash when the only values are the URLs? Also, input on how to improve
    my code is more than welcome.
     
    chadda, May 20, 2008
    #1

  2. Create a hash, for each URL add an entry in that hash where the key is
    the URL and the value is 1 (or even leave the value undefined, you will
    never use it anyway).
    To retrieve all URLs just do a keys() on the hash.
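
    A minimal sketch of this approach, assuming the URLs are already
    collected in an array (the two sample URLs are taken from the output in
    the question; the variable names are only illustrative). The second half
    shows an order-preserving variant, the grep idiom given in perlfaq4,
    because keys() returns the entries in no particular order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data: two URLs from the question, each appearing twice.
my @urls = (
    '/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html',
    '/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html',
    '/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html',
    '/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html',
);

# The URL is the key; the value is never used, so 1 is as good as anything.
my %seen;
$seen{$_} = 1 for @urls;

# keys() has no defined order, so sort if the order matters.
my @unique = sort keys %seen;

# Variant that keeps the first occurrence of each URL in its original order.
my %first;
my @unique_in_order = grep { !$first{$_}++ } @urls;

print "$_\n" for @unique_in_order;
```

    Because a hash can hold each key only once, assigning the same URL a
    second time simply overwrites the first entry, which is what removes
    the duplicates.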

    jue
     
    Jürgen Exner, May 20, 2008
    #2

  3. chadda

    chadda Guest


    Got it. Thanks.
     
    chadda, May 20, 2008
    #3
  4. Gibbering

    Gibbering Guest

    *SNIP*

    What you probably want to do is hash the product ids ... something
    like:

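    use Data::Dumper;

    # $get_links is the HTML::LinkExtor object from the original script
    my %ids;
    for ($get_links->links) {
        my $url = pop @$_;                       # last attribute value of the link
        my ($id) = $url =~ /product_id=(\d+)/;   # undef if there is no product id
        ++$ids{$id} if $id;                      # count occurrences per id
    }

    print Dumper \%ids;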
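
    If the URL itself is still needed afterwards, one possible variation is
    to store the first URL seen for each product id instead of a count. This
    is a sketch only: the %url_for name and the hardcoded sample list are
    assumptions for illustration, standing in for the values extracted from
    $get_links->links.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data: URLs from the question plus one link without a product id.
my @links = (
    '/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html',
    '/catalog/search/search_hit.php?product_id=2969975&location=/catalog/2969975.html',
    '/catalog/index.html',
    '/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html',
    '/catalog/search/search_hit.php?product_id=2988526&location=/catalog/2988526.html',
);

my %url_for;    # product id => first URL seen with that id
for my $url (@links) {
    # Skip links that carry no product id at all.
    my ($id) = $url =~ /product_id=(\d+)/ or next;
    $url_for{$id} = $url unless exists $url_for{$id};
}

# Print one line per product, ordered numerically by id.
for my $id (sort { $a <=> $b } keys %url_for) {
    print "$id => $url_for{$id}\n";
}
```

    Keying on the product id also collapses URLs that differ only in other
    query parameters, which may or may not be what is wanted.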
     
    Gibbering, May 20, 2008
    #4