Need ideas on how to make this code faster than a speeding turtle

Discussion in 'Perl Misc' started by chadda@lonemerchant.com, May 15, 2008.

  1. Guest

    I'll eventually have the input file filled with 350 million items.
    Right now there is only one:

    $ more input
    3308191

    The following program reads in the number from the file named 'input'
    and builds a url from this number. I have lynx then dump the data into
    a file called 'out' and then just grep the entire thing for the
    Product ID, Item ID, SKU, UPC, and weight.


    m-net% more parse.pl
    #!/usr/bin/perl -w

    my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
    my $temp;

    open(IN, '<', 'input') || die "cant open: $!";
    $read = <IN>;
    chomp($read);
    $build = "http://www.doba.com/members/catalog/".$read.".html";
    $temp = `lynx -accept_all_cookies -dump $build`;
    open(OUTFILE, '>out');
    print OUTFILE $temp;
    close OUTFILE;

    open(OUT, '<', 'out') || die "cant open: $!";
    @shit = <OUT>;

    @product = grep(/Product ID/, @shit);
    @id = grep(/Item ID/, @shit);
    @sku = grep(/SKU/, @shit);
    @upc = grep(/UPC/, @shit); # this part doesn't grep UPC correctly. I
                               # get some extra data after UPC.
    @weight = grep(/Weight/, @shit);

    print @product;
    print @id;
    print @sku;
    print @upc;
    print @weight;

    % ./parse.pl
    Product ID: 3308191
    Item ID: 3653992
    SKU: 8930
    UPC: 896207999816 Condition: refurbished
    Weight: 4.7 lbs.
     
    , May 15, 2008
    #1

  2. Uri Guttman Guest

    Re: Need ideas on how to make this code faster than a speeding turtle

    >>>>> "c" == chadda <> writes:


    i have to know if you could write this mess any slower? you are doing
    everything possible to slow you down.

    c> open(IN, '<', 'input') || die "cant open: $!";
    c> $read = <IN>;
    c> chomp($read);
    c> $build = "http://www.doba.com/members/catalog/".$read.".html";
    c> $temp = `lynx -accept_all_cookies -dump $build`;

    why are you calling out to a program when perl can load web pages just
    fine with LWP? did you even look for web stuff on cpan?

    c> open(OUTFILE, '>out');
    c> print OUTFILE $temp;
    c> close OUTFILE;

    c> open(OUT, '<', 'out') || die "cant open: $!";
    c> @shit = <OUT>;

    why are you writing out the output of lynx JUST TO READ IT BACK IN
    AGAIN? this is the most absurd part of this program.

    you have the text in $temp. you know how to use backticks but why do you
    do the file write and reading back in? if you assigned the backticks to
    an array you would get the same thing as in @shit without the wasted
    effort.

    also calling it @shit is not a good thing.

    c> @product = grep(/Product ID/, @shit);
    c> @id = grep(/Item ID/, @shit);
    c> @sku = grep(/SKU/, @shit);
    c> @upc = grep(/UPC/, @shit); #this part doesn't grep UPC correctly. I
    c> get some extra data after UPC.

    that is a problem with the format of the html page. html isn't line
    oriented and you are grepping over lines. the proper way to deal with
    html is with a parser. or in special very well defined cases with
    regexes to actually grab what you want from the text. whole html lines
    are almost never what you want.

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Free Perl Training --- http://perlhunter.com/college.html ---------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
     
    Uri Guttman, May 15, 2008
    #2
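A minimal sketch of the regex approach suggested in post #2, assuming the text returned by `lynx -dump` contains the labels in the same form as the sample output in post #1 ("Product ID: 3308191", "UPC: 896207999816 Condition: refurbished", and so on). The helper name `extract_fields` is hypothetical and not from the thread; the real page layout may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: pull only the wanted values out of the dumped text
# instead of keeping whole lines.
sub extract_fields {
    my ($text) = @_;
    my %data;

    # Numeric fields: capture just the digits that follow the label, so
    # trailing text such as "Condition: refurbished" is left out.
    for my $label ( 'Product ID', 'Item ID', 'SKU', 'UPC' ) {
        $data{$label} = $1 if $text =~ /\Q$label\E:\s*(\d+)/;
    }

    # Weight is assumed to be a decimal number followed by the unit "lbs".
    $data{Weight} = $1 if $text =~ /Weight:\s*([\d.]+\s*lbs)/;

    return \%data;
}
```

Called with the whole dump as one string (backticks in scalar context, as with $temp in post #1), this returns a hash reference keyed by label, and the UPC entry holds only the number.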

  3. Guest

    On May 15, 1:37 pm, Uri Guttman <> wrote:
    > >>>>> "c" == chadda <> writes:

    >
    > i have to know if you could write this mess any slower? you are doing
    > everything possible to slow you down.


    I know I shouldn't criticize free help, but you seem to have some anger
    management issues.
    >
    > c> open(IN, '<', 'input') || die "cant open: $!";
    > c> $read = <IN>;
    > c> chomp($read);
    > c> $build = "http://www.doba.com/members/catalog/".$read.".html";
    > c> $temp = `lynx -accept_all_cookies -dump $build`;
    >
    > why are you calling out to a program when perl can load web pages just
    > fine with LWP? did you even look for web stuff on cpan?
    >

    Would using LWP speed up the code? By the way, this code is meant to
    run on a server with restricted access, i.e. I can't install stuff from
    cpan on that server.

    > c> open(OUTFILE, '>out');
    > c> print OUTFILE $temp;
    > c> close OUTFILE;
    >
    > c> open(OUT, '<', 'out') || die "cant open: $!";
    > c> @shit = <OUT>;
    >
    > why are you writing out the output of lynx JUST TO READ IT BACK IN
    > AGAIN? this is the most absurd part of this program.
    >
    > you have the text in $temp. you know how to use backticks but why do you
    > do the file write and reading back in? if you assigned the backticks to
    > an array you would get the same thing as in @shit without the wasted
    > effort.
    >
    > also calling it @shit is not a good thing.
    >

    Huh? Are you saying I don't need the 'out' file?

    > c> @product = grep(/Product ID/, @shit);
    > c> @id = grep(/Item ID/, @shit);
    > c> @sku = grep(/SKU/, @shit);
    > c> @upc = grep(/UPC/, @shit); #this part doesn't grep UPC correctly. I
    > c> get some extra data after UPC.
    >
    > that is a problem with the format of the html page. html isn't line
    > oriented and you are grepping over lines. the proper way to deal with
    > html is with a parser. or in special very well defined cases with
    > regexes to actually grab what you want from the text. whole html lines
    > are almost never what you want.
    >
    > uri
    >
     
    , May 15, 2008
    #3
  4. Guest

    On May 15, 2:21 pm, wrote:
    > On May 15, 1:37 pm, Uri Guttman <> wrote:
    >
    > > >>>>> "c" == chadda <> writes:

    >
    > > i have to know if you could write this mess any slower? you are doing
    > > everything possible to slow you down.

    >
    > I know I shouldn't criticize free help, but you seem to have some anger
    > management issues.
    >
    > > c> open(IN, '<', 'input') || die "cant open: $!";
    > > c> $read = <IN>;
    > > c> chomp($read);
    > > c> $build = "http://www.doba.com/members/catalog/".$read.".html";
    > > c> $temp = `lynx -accept_all_cookies -dump $build`;

    >
    > > why are you calling out to a program when perl can load web pages just
    > > fine with LWP? did you even look for web stuff on cpan?

    >
    > Would using LWP speed up the code? By the way, this code is meant to
    > run on a server with restricted access. Ie, I can't install stuff from
    > cpan on that server.
    >
    >
    >
    > > c> open(OUTFILE, '>out');
    > > c> print OUTFILE $temp;
    > > c> close OUTFILE;

    >
    > > c> open(OUT, '<', 'out') || die "cant open: $!";
    > > c> @shit = <OUT>;

    >
    > > why are you writing out the output of lynx JUST TO READ IT BACK IN
    > > AGAIN? this is the most absurd part of this program.

    >
    > > you have the text in $temp. you know how to use backticks but why do you
    > > do the file write and reading back in? if you assigned the backticks to
    > > an array you would get the same thing as in @shit without the wasted
    > > effort.

    >
    > > also calling it @shit is not a good thing.

    >
    > Huh? Are you saying I don't need the 'out' file?


    Maybe something like this?
    % more parse.pl
    #!/usr/bin/perl -w

    my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
    my @temp;

    open(IN, '<', 'input') || die "cant open: $!";
    $read = <IN>;
    chomp($read);
    $build = "http://www.doba.com/members/catalog/".$read.".html";
    @temp = `lynx -accept_all_cookies -dump $build`;

    @product = grep(/Product ID/, @temp);
    @id = grep(/Item ID/, @temp);
    @sku = grep(/SKU/, @temp);
    @upc = grep(/UPC/, @temp);
    @weight = grep(/Weight/, @temp);

    print @product;
    print @id;
    print @sku;
    print @upc;
    print @weight;


    However, I don't know how to use LWP. Again, would the code run faster
    if I used LWP?
     
    , May 15, 2008
    #4
  5. Uri Guttman Guest

    Re: Need ideas on how to make this code faster than a speeding turtle

    >>>>> "c" == chadda <> writes:

    c> On May 15, 1:37 pm, Uri Guttman <> wrote:
    >> >>>>> "c" == chadda <> writes:

    >>
    >> i have to know if you could write this mess any slower? you are doing
    >> everything possible to slow you down.


    c> I know I shouldn't criticize free help, but you seem to have some anger
    c> management issues.

    nope. i have bad code anger issues. i deal with this in code reviews all
    the time. i just don't get how people come up with wacky and slow ways
    to do things. i have seen worse code that read in files, parsed them,
    wrote them out (untouched) and read them in again.



    >>

    c> open(IN, '<', 'input') || die "cant open: $!";
    c> $read = <IN>;
    c> chomp($read);
    c> $build = "http://www.doba.com/members/catalog/".$read.".html";
    c> $temp = `lynx -accept_all_cookies -dump $build`;
    >>
    >> why are you calling out to a program when perl can load web pages just
    >> fine with LWP? did you even look for web stuff on cpan?
    >>

    c> Would using LWP speed up the code? By the way, this code is meant to
    c> run on a server with restricted access. Ie, I can't install stuff from
    c> cpan on that server.

    if you have access to load scripts you can load pure perl modules
    too. this is an FAQ.

    c> open(OUTFILE, '>out');
    c> print OUTFILE $temp;
    c> close OUTFILE;
    >>

    c> open(OUT, '<', 'out') || die "cant open: $!";
    c> @shit = <OUT>;
    >>
    >> why are you writing out the output of lynx JUST TO READ IT BACK IN
    >> AGAIN? this is the most absurd part of this program.
    >>
    >> you have the text in $temp. you know how to use backticks but why do you
    >> do the file write and reading back in? if you assigned the backticks to
    >> an array you would get the same thing as in @shit without the wasted
    >> effort.
    >>
    >> also calling it @shit is not a good thing.
    >>

    c> Huh? Are you saying I don't need the 'out' file?

    yes. why do you think you need that file? you call backticks and get the
    html page in $temp. why do you think you need a file to process that
    data? you already have it inside perl.

    uri

     
    Uri Guttman, May 15, 2008
    #5
  6. Uri Guttman Guest

    Re: Need ideas on how to make this code faster than a speeding turtle

    >>>>> "c" == chadda <> writes:

    >> Huh? Are you saying I don't need the 'out' file?


    yes.

    c> Maybe something like this?
    c> % more parse.pl
    c> #!/usr/bin/perl -w

    c> my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
    c> my @temp;

    c> open(IN, '<', 'input') || die "cant open: $!";
    c> $read = <IN>;
    c> chomp($read);
    c> $build = "http://www.doba.com/members/catalog/".$read.".html";
    c> @temp = `lynx -accept_all_cookies -dump $build`;

    c> @product = grep(/Product ID/, @temp);
    c> @id = grep(/Item ID/, @temp);
    c> @sku = grep(/SKU/, @temp);
    c> @upc = grep(/UPC/, @temp);
    c> @weight = grep(/Weight/, @temp);

    c> print @product;
    c> print @id;
    c> print @sku;
    c> print @upc;
    c> print @weight;


    c> However, I don't know how to use LWP. Again, would the code run faster
    c> if I used LWP?

    better but forking off lynx is still slow. LWP should be much faster. if
    you want speed (and with the data size you have, you want it), use LWP.

    depending on how fast you need it (cpu usage will spike with the greps
    you have) you can also change all that to parse out what you want with
    regexes. (again, that assumes a known fixed html page layout which you
    seem to have).

    uri

     
    Uri Guttman, May 15, 2008
    #6
  7. Gordon Etly Guest

    wrote:
    > On May 15, 1:37 pm, Uri Guttman <> wrote:
    > chadda <> writes:


    > > i have to know if you could write this mess any slower? you are
    > > doing
    > > everything possible to slow you down.


    > I know I shouldn't criticize free help, but you seem to have some anger
    > management issues.


    He seems to constantly come across this way. I really wish he could see
    things from other points of view.
    ....


    As a simple answer, take a look at LWP::UserAgent
    (http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
    as a good start in the right direction.

    --
    G.Etly
     
    Gordon Etly, May 15, 2008
    #7
  8. "Gordon Etly" <> wrote in
    news::

    > wrote:
    >> On May 15, 1:37 pm, Uri Guttman <> wrote:
    >> chadda <> writes:

    >
    >> > i have to know if you could write this mess any slower? you are
    >> > doing
    >> > everything possible to slow you down.

    >
    >> I know I shouldn't criticize free help, but you seem to have some anger
    >> management issues.

    >
    > He seems to constantly come across this way. I really wish he could
    > see things from other points of view.
    > ...
    >
    >
    > As a simple answer, take a look at LWP::UserAgent
    > (http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
    > as a good start in the right direction.


    All the OP needs is LWP::Simple and HTML::TableExtract.

    In fact, I wrote a whole script that took only 0.8 seconds to download
    and parse a single page (of course, with more id's in a file, the only
    real limit on the speed is the network latency and transfer speed) but I
    have decided not to post it as I do not know what his intentions are.

    As for you, pick a posting id and stick with it.

    PLONKETY PLONK!

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
     
    A. Sinan Unur, May 15, 2008
    #8
  9. Guest

    On May 15, 3:16 pm, "Gordon Etly" <> wrote:
    > wrote:
    > > On May 15, 1:37 pm, Uri Guttman <> wrote:
    > > chadda <> writes:
    > > > i have to know if you could write this mess any slower? you are
    > > > doing
    > > > everything possible to slow you down.

    > > I know I shouldn't criticize free help, but you seem to have some anger
    > > management issues.

    >
    > He seems to constantly come across this way. I really wish he could see
    > things from other points of view.
    > ...
    >
    > As a simple answer, take a look at LWP::UserAgent
    > (http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
    > as a good start in the right direction.
    >
    > --
    > G.Etly



    I just tried LWP, and now I can't get the code to work for the life of
    me. Here is what I attempted

    #!/usr/bin/perl -w

    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Cookies;

    my ($read, $build, @product, @id, @sku, @upc, @weight);
    my @temp;

    open(IN, '<', 'input') || die "cant open: $!";
    $read = <IN>;
    chomp($read);
    $build = 'http://www.doba.com/members/catalog/'.$read.'.html';
    #@temp = `lynx -accept_all_cookies -dump $build`;

    my $ua = LWP::UserAgent->new;
    $ua->agent("OMEGA SPARC DESTROYER/69");

    my $request = HTTP::Request->new('GET');
    $request->url($build);

    my $cookie_jar = HTTP::Cookies->new;
    $cookie_jar->add_cookie_header($request);

    my $response = $ua->request($request);

    my $code = $response->code;
    print $code;

    @temp = $request->content;

    @product = grep(/Product ID/, @temp);
    @id = grep(/Item ID/, @temp);
    @sku = grep(/SKU/, @temp);
    @upc = grep(/UPC/, @temp);
    @weight = grep(/Weight/, @temp);

    print @product;
    print @id;
    print @sku;
    print @upc;
    print @weight;

    % ./parse.pl
    500%
     
    , May 16, 2008
    #9
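A hedged sketch of how the fetch in post #9 might be written with LWP::UserAgent. One thing visible in the posted code is that it reads `$request->content` (the body of the outgoing request) rather than the content of `$response`. The 500 itself may also be an error generated inside LWP, for example when the connection fails, and printing `status_line` shows the accompanying message. The helper names `make_agent` and `fetch_lines` and the 30-second timeout are assumptions for illustration, not part of the thread.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Build a user agent with an in-memory cookie jar, roughly standing in
# for what `lynx -accept_all_cookies` did in the earlier versions.
sub make_agent {
    my $ua = LWP::UserAgent->new( timeout => 30 );   # timeout is an assumption
    $ua->cookie_jar( HTTP::Cookies->new );
    return $ua;
}

# Fetch $url and return the body split into lines. Dies with the status
# line (code plus message) if the request did not succeed.
sub fetch_lines {
    my ( $ua, $url ) = @_;
    my $response = $ua->get($url);
    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;

    # Read the body from the response, not from the request.
    return split /^/, $response->decoded_content;
}
```

Note that this returns raw HTML lines rather than the rendered text lynx produced, so grepping for whole lines will pick up markup as well.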
  10. wrote in
    news::

    > On May 15, 3:16 pm, "Gordon Etly" <> wrote:
    >> wrote:
    >> > On May 15, 1:37 pm, Uri Guttman <> wrote:
    >> > chadda <> writes:
    >> > > i have to know if you could write this mess any slower? you are
    >> > > doing
    >> > > everything possible to slow you down.
    >> > I know I shouldn't criticize free help, but you seem to have some
    >> > anger management issues.


    ....

    >> As a simple answer, take a look at LWP::UserAgent
    >> (http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
    >> as a good start in the right direction.


    ....

    > I just tried LWP, and now I can't get the code to work for the life of
    > me. Here is what I attempted


    As I mentioned elsewhere, all you need is LWP::Simple.

    So, here is a fish for you:

    C:\Temp> cat p.pl
    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::TokeParser;
    use LWP::Simple;


    my ($input_file) = @ARGV;
    die "No input file specified\n" unless defined $input_file;

    open my $INPUT, '<', $input_file
        or die "Cannot open '$input_file': $!";

    ID:
    while ( my $id = <$INPUT> ) {
        chomp $id;

        my $url = make_url( $id );
        my $html = get $url;

        unless ( defined $html ) {
            warn "Error downloading from '$url'\n";
            next ID;
        }

        my $parser = HTML::TokeParser->new( \$html );

        TABLE:
        while ( my $token = $parser->get_tag('table') ) {
            if ( lc $token->[1]{id} eq 'product_details' ) {
                my $td = $parser->get_tag('td');
                last TABLE unless $td;
                my $cell = $parser->get_text('/td');
                my %data;
                while ( $cell =~ /\s*([^:]+?):\s+(\d+)\s+/g ) {
                    $data{$1} = $2;
                }
                use Data::Dumper;
                print Dumper \%data;
            }
        }
    }

    sub make_url {
        return
            sprintf q{http://www.doba.com/members/catalog/%s.html}, $_[0];
    }

    __END__

    C:\Temp> timethis p list

    $VAR1 = {
        'Product ID' => '3308191',
        'UPC' => '896207999816',
        'Item ID' => '3653992',
        'SKU' => '8930'
    };

    TimeThis : Command Line : p list
    TimeThis : Start Time : Thu May 15 18:19:28 2008
    TimeThis : End Time : Thu May 15 18:19:29 2008
    TimeThis : Elapsed Time : 00:00:01.062

    Comparing this to the overhead of an empty script:

    C:\Temp> cat t.pl
    #!/usr/bin/perl

    use strict;
    use warnings;

    C:\Temp> timethis t

    TimeThis : Command Line : t
    TimeThis : Start Time : Thu May 15 18:20:38 2008
    TimeThis : End Time : Thu May 15 18:20:38 2008
    TimeThis : Elapsed Time : 00:00:00.218

    It took 0.844 seconds to retrieve and parse the required information. Of
    course, the time cost would be better amortized if you ran a lot of
    these queries.



     
    A. Sinan Unur, May 16, 2008
    #10
  11. Uri Guttman Guest

    Re: Need ideas on how to make this code faster than a speeding turtle

    >>>>> "GE" == Gordon Etly <> writes:

    GE> wrote:
    >> On May 15, 1:37 pm, Uri Guttman <> wrote:
    >> chadda <> writes:


    >> > i have to know if you could write this mess any slower? you are
    >> > doing
    >> > everything possible to slow you down.


    >> I know I shouldn't criticize free help, but you seem to have some anger
    >> management issues.


    GE> He seems to constantly come across this way. I really wish he could see
    GE> things from other points of view.
    GE> ...

    as usual, no help from you.

    GE> As a simple answer, take a look at LWP::UserAgent
    GE> (http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
    GE> as a good start in the right direction.

    which i already told him and we have already improved his code a good
    deal. try to keep up.

    uri

     
    Uri Guttman, May 16, 2008
    #11
  12. Guest

    On May 15, 4:14 pm, "A. Sinan Unur" <> wrote:
    > wrote innews::
    >
    > > On May 15, 3:16 pm, "Gordon Etly" <> wrote:
    > >> wrote:
    > >> > On May 15, 1:37 pm, Uri Guttman <> wrote:
    > >> > chadda <> writes:
    > >> > > i have to know if you could write this mess any slower? you are
    > >> > > doing
    > >> > > everything possible to slow you down.
    > >> > I know I shouldn't criticize free help, but you seem to have some
    > >> > anger management issues.

    >
    > ...
    >
    > >> As a simple answer, take a look at LWP::UserAgent
    > >> (http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
    > >> as a good start in the right direction.

    >
    > ...
    >
    > > I just tried LWP, and now I can't get the code to work for the life of
    > > me. Here is what I attempted

    >
    > As I mentioned elsewhere, all you need is LWP::Simple.
    >
    > So, here is a fish for you:
    >
    > C:\Temp> cat p.pl
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > use HTML::TokeParser;
    > use LWP::Simple;
    >
    > my ($input_file) = @ARGV;
    > die "No input file specified\n" unless defined $input_file;
    >
    > open my $INPUT, '<', $input_file
    > or die "Cannot open '$input_file': $!";
    >
    > ID:
    > while ( my $id = <$INPUT> ) {
    > chomp $id;
    >
    > my $url = make_url( $id );
    > my $html = get $url;
    >
    > unless ( defined $html ) {
    > warn "Error downloading from '$url'\n";
    > next ID;
    > }
    >
    > my $parser = HTML::TokeParser->new( \$html );
    >
    > TABLE:
    > while ( my $token = $parser->get_tag('table') ) {
    > if ( lc $token->[1]{id} eq 'product_details' ) {
    > my $td = $parser->get_tag('td');
    > last TABLE unless $td;
    > my $cell = $parser->get_text('/td');
    > my %data;
    > while ( $cell =~ /\s*([^:]+?):\s+(\d+)\s+/g ) {
    > $data{$1} = $2;
    > }
    > use Data::Dumper;
    > print Dumper \%data;
    > }
    > }
    >
    > }
    >
    > sub make_url {
    > return
    > sprintf q{http://www.doba.com/members/catalog/%s.html}, $_[0];
    >
    > }
    >
    > __END__
    >
    > C:\Temp> timethis p list
    >
    > $VAR1 = {
    > 'Product ID' => '3308191',
    > 'UPC' => '896207999816',
    > 'Item ID' => '3653992',
    > 'SKU' => '8930'
    > };
    >
    > TimeThis : Command Line : p list
    > TimeThis : Start Time : Thu May 15 18:19:28 2008
    > TimeThis : End Time : Thu May 15 18:19:29 2008
    > TimeThis : Elapsed Time : 00:00:01.062
    >
    > Comparing this to the overhead of an empty script:
    >
    > C:\Temp> cat t.pl
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > C:\Temp> timethis t
    >
    > TimeThis : Command Line : t
    > TimeThis : Start Time : Thu May 15 18:20:38 2008
    > TimeThis : End Time : Thu May 15 18:20:38 2008
    > TimeThis : Elapsed Time : 00:00:00.218
    >
    > It took 0.844 seconds to retrieve and parse the required information. Of
    > course, the time cost would be better amortized if you ran a lot of
    > these queries.
    >
    > --
    > A. Sinan Unur <>
    > (remove .invalid and reverse each component for email address)
    >
    > comp.lang.perl.misc guidelines on the WWW:http://www.rehabitation.com/clpmisc/



    When I try to run this code, I keep getting a blank url.
     
    , May 16, 2008
    #12
  13. wrote in
    news::

    [ Do not quote in full. Do not quote sigs. ]

    > On May 15, 4:14 pm, "A. Sinan Unur" <> wrote:


    >> So, here is a fish for you:
    >>
    >> C:\Temp> cat p.pl
    >> #!/usr/bin/perl
    >>
    >> use strict;
    >> use warnings;
    >>
    >> use HTML::TokeParser;
    >> use LWP::Simple;
    >>
    >> my ($input_file) = @ARGV;
    >> die "No input file specified\n" unless defined $input_file;
    >>
    >> open my $INPUT, '<', $input_file
    >> or die "Cannot open '$input_file': $!";
    >>
    >> ID:
    >> while ( my $id = <$INPUT> ) {
    >> chomp $id;
    >>
    >> my $url = make_url( $id );
    >> my $html = get $url;
    >>
    >> unless ( defined $html ) {
    >> warn "Error downloading from '$url'\n";
    >> next ID;
    >> }
    >>
    >> my $parser = HTML::TokeParser->new( \$html );
    >>
    >> TABLE:
    >> while ( my $token = $parser->get_tag('table') ) {
    >> if ( lc $token->[1]{id} eq 'product_details' ) {
    >> my $td = $parser->get_tag('td');
    >> last TABLE unless $td;
    >> my $cell = $parser->get_text('/td');
    >> my %data;
    >> while ( $cell =~ /\s*([^:]+?):\s+(\d+)\s+/g ) {
    >> $data{$1} = $2;
    >> }
    >> use Data::Dumper;
    >> print Dumper \%data;
    >> }
    >> }
    >>
    >> }
    >>
    >> sub make_url {
    >> return
    >> sprintf q{http://www.doba.com/members/catalog/%s.html}, $_[0];
    >>
    >> }
    >>
    >> __END__


    ....

    > When I try to run this code, I keep getting a blank url.


    Well, did you provide it with a file containing the id numbers? How do
    you know the URL is blank? Did you modify the code? If you did, why did
    you not post the relevant modifications?

    I would have normally put the id number in the __DATA__ section, but
    since you implied that you already had an input file with id numbers, I
    followed your example.

    In any case, unless you take active steps to help others help you, this
    will be the sum total of the help I will provide you.

    Sinan
     
    A. Sinan Unur, May 16, 2008
    #13
  14. wrote:
    >I 'll eventually have the input file filled with 350 million items.
    >Right now there is only one
    >
    >$more input
    >3308191
    >
    >The following program reads in the number from the file named 'input'
    >and builds a url form this number. Then it builds a url from this
    >number. I have lynx then dump the data into a file called 'out' and
    >then just grep the entire thing for the Product Number, Product ID,
    >SKU, UPC, and weight.
    >
    >
    >m-net% more parse.pl
    >#!/usr/bin/perl -w
    >
    >my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
    >my $temp;
    >
    >open(IN, '<', 'input') || die "cant open: $!";
    >$read = <IN>;


    I suppose you want to turn that line into a while loop once you have
    more than a single item to process.
    However, considering network latency and response times, it may very
    well be worthwhile to trigger multiple HTTP requests in parallel, so
    that your processing code never has to wait for network responses.

    Others have already mentioned the remaining issues: shelling out to an
    expensive external process, that expensive but useless temporary file,
    and trying to parse HTML code using REs.

    jue
     
    Jürgen Exner, May 16, 2008
    #14
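One way the parallel idea from post #14 could be sketched, assuming plain process-level parallelism with `fork` (the post does not name a mechanism, so this is a substituted technique, not Jürgen's). The names `partition_ids` and `run_in_parallel` are hypothetical; `$work` stands for whatever code fetches and parses one ID.

```perl
use strict;
use warnings;

# Split a list of IDs round-robin into $n chunks (some may be empty).
sub partition_ids {
    my ( $ids, $n ) = @_;
    my @chunks;
    push @chunks, [] for 1 .. $n;
    my $i = 0;
    push @{ $chunks[ $i++ % $n ] }, $_ for @$ids;
    return @chunks;
}

# Fork one child per non-empty chunk; each child calls $work->($id) for
# its own IDs. Returns the number of children that exited non-zero.
sub run_in_parallel {
    my ( $ids, $n, $work ) = @_;
    my @pids;
    for my $chunk ( partition_ids( $ids, $n ) ) {
        next unless @$chunk;
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {    # child: process the chunk, then exit
            my $ok = eval { $work->($_) for @$chunk; 1 };
            exit( $ok ? 0 : 1 );
        }
        push @pids, $pid;     # parent: remember the child
    }
    my $failed = 0;
    for my $pid (@pids) {
        waitpid( $pid, 0 );
        $failed++ if $? != 0;
    }
    return $failed;
}
```

While one child waits on the network, the others keep working. This sketch holds the whole ID list in memory, so a real run over a very large input file would need to feed the workers in batches.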
  15. "Gordon Etly" <> wrote:
    >He seems to constantly come across this way. I really wish he could see
    >things from other points of view.


    Are you the same moron who went into my killfile a few days ago as
    Gordon Etly <>? I guess everyone had filtered you so you
    had to create a new identity, right? Back you go where you came from!

    >As a simple answer, take a look at LWP::UserAgent
    >(http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
    >as a good start in the right direction.


    Yeah, it's easy enough to copy what other people had mentioned already.

    jue
     
    Jürgen Exner, May 16, 2008
    #15
  16. Re: Need ideas on how to make this code faster than a speeding turtle

    [A complimentary Cc of this posting was sent to
    Uri Guttman
    <>], who wrote in article <>:
    > better but forking off lynx is still slow. LWP should be much faster. if
    > you want speed (and with the data size you have, you want it), use LWP.


    This may depend on many parameters, but the overhead of system()ing
    may be quite low. The overhead of opening a new HTTP connection for
    each line may be larger. LWP will have a chance to use persistent
    connections...

    Yours,
    Ilya
     
    Ilya Zakharevich, May 16, 2008
    #16
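A short sketch of the persistent-connection point in post #16, assuming LWP::UserAgent is used for the fetches. The `keep_alive` constructor option gives the agent a connection cache so that successive requests to the same host can reuse a connection. The helper name `make_keepalive_agent` and the default of one cached connection are assumptions.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Create one agent for the whole run and reuse it for every ID, so the
# connection cache has a chance to keep the connection to the host open.
sub make_keepalive_agent {
    my ($max_connections) = @_;
    return LWP::UserAgent->new( keep_alive => $max_connections || 1 );
}
```

The agent has to be created once outside the per-ID loop; constructing a new agent for every request would start with an empty cache each time.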
  17. Dr.Ruud Guest

    schreef:

    > I know I shouldn't criticize free help, but you seem to have some anger
    > management issues.


    *plonk*

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, May 16, 2008
    #17
  18. Gordon Etly Guest

    Uri Guttman wrote:
    > Gordon Etly <> writes:


    [please don't left pad quoted text with spaces]

    > > wrote:
    > > > On May 15, 1:37 pm, Uri Guttman <> wrote:
    > > > chadda <> writes:


    > > > > i have to know if you could write this mess any slower? you are
    > > > > doing
    > > > > everything possible to slow you down.


    > > > I know I shouldn't criticize free help, but you seem to have some
    > > > anger management issues.


    > > He seems to constantly come across this way. I really wish he could
    > > see things from other points of view.
    > > ...


    > as usual, no help from you.


    I'm just pointing out what is. It's you who keep bringing this upon
    yourself. You are constantly rude and arrogant to people, then you
    wonder why people sometimes post back, like the OP did. If you can't
    handle receiving comments about what you post, then don't post. If you
    can't take it, don't dish it out.

    > > As a simple answer, take a look at LWP::UserAgent
    > > (http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
    > > as a good start in the right direction.


    > which i already told him and we have already improved his code a good
    > deal. try to keep up.


    I would think someone who has been on Usenet as long as you would know
    that posts don't always come down at the same time (or order) from every
    server. Case in point, I had not seen such a post mentioning it until
    later on.

    --
    G.Etly
     
    Gordon Etly, May 16, 2008
    #18
  19. "Gordon Etly" <> wrote:
    >I'm just pointing out what is. It's you who keep bringing this upon
    >yourself. You are constantly rude and arrogant to people, then you


    Changing your identity again because everyone filtered you?

    jue
     
    Jürgen Exner, May 16, 2008
    #19
  20. Gordon Etly Guest

    Jürgen Exner wrote:
    > "Gordon Etly" <> wrote:


    > > He seems to constantly come across this way. I really wish he could
    > > see things from other points of view.


    > I guess everyone had filtered you so you had to create a new identity


    I have not changed my identity. My name is Gordon Etly. I have not
    changed that part, nor made any attempt to hide it, so your statement is
    false.

    I happen to be a sys op for the company I work for, including our mail
    server, so I am able to add entries to /etc/aliases (which I commonly
    use to publish variants of my main email address so that any unwanted
    mailings can be easily stopped). I've never seen any rule saying "never
    change your email field", as that is anyone's right.

    > > As a simple answer, take a look at LWP::UserAgent
    > > (http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
    > > as a good start in the right direction.


    > Yeah, it's easy enough to copy what other people had mentioned
    > already.


    I had not seen that mentioned at all before I posted. Funny, I see you
    and your fellows do exactly this all the time (posting essentially the
    same answer that was already given by someone else), but now it's
    suddenly a bad thing. Please make up your minds.

    In this case, there were no replies mentioning LWP::UserAgent. Uri did
    mention LWP very briefly, but LWP has several modules. I was more
    specific.

    --
    G.Etly
     
    Gordon Etly, May 16, 2008
    #20
