Need ideas on how to make this code faster than a speeding turtle

C

chadda

I'll eventually have the input file filled with 350 million items.
Right now there is only one:

$more input
3308191

The following program reads in the number from the file named 'input'
and builds a url from this number. I have lynx then dump the data into
a file called 'out' and then just grep the entire thing for the Product
Number, Product ID, SKU, UPC, and weight.


m-net% more parse.pl
#!/usr/bin/perl -w

my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
my $temp;

open(IN, '<', 'input') || die "cant open: $!";
$read = <IN>;
chomp($read);
$build = "http://www.doba.com/members/catalog/".$read.".html";
$temp = `lynx -accept_all_cookies -dump $build`;
open(OUTFILE, '>out');
print OUTFILE $temp;
close OUTFILE;

open(OUT, '<', 'out') || die "cant open: $!";
@shit = <OUT>;

@product = grep(/Product ID/, @shit);
@id = grep(/Item ID/, @shit);
@sku = grep(/SKU/, @shit);
@upc = grep(/UPC/, @shit); # this part doesn't grep UPC correctly;
                           # I get some extra data after UPC.
@weight = grep(/Weight/, @shit);

print @product;
print @id;
print @sku;
print @upc;
print @weight;

% ./parse.pl
Product ID: 3308191
Item ID: 3653992
SKU: 8930
UPC: 896207999816 Condition: refurbished
Weight: 4.7 lbs.
 
U

Uri Guttman

i have to know if you could write this mess any slower? you are doing
everything possible to slow you down.

c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> $temp = `lynx -accept_all_cookies -dump $build`;

why are you calling out to a program when perl can load web pages just
fine with LWP? did you even look for web stuff on cpan?

c> open(OUTFILE, '>out');
c> print OUTFILE $temp;
c> close OUTFILE;

c> open(OUT, '<', 'out') || die "cant open: $!";
c> @shit = <OUT>;

why are you writing out the output of lynx JUST TO READ IT BACK IN
AGAIN? this is the most absurd part of this program.

you have the text in $temp. you know how to use backticks but why do you
do the file write and reading back in? if you assigned the backticks to
an array you would get the same thing as in @shit without the wasted
effort.

also calling it @shit is not a good thing.
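Uri's point fits in a couple of lines. A minimal sketch, with `echo` standing in for the real `lynx -accept_all_cookies -dump $build` call so it runs anywhere:

```perl
use strict;
use warnings;

# In list context, backticks return the command's output with one array
# element per line -- no temporary file needed. `echo` stands in here
# for the real lynx call.
my @lines = `echo "SKU: 8930"`;
my @sku = grep /SKU/, @lines;
print $sku[0];
```

Assigning the backticks to `@lines` gives the same per-line array the write-then-reread dance produced, minus two file operations per id.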

c> @product = grep(/Product ID/, @shit);
c> @id = grep(/Item ID/, @shit);
c> @sku = grep(/SKU/, @shit);
c> @upc = grep(/UPC/, @shit); # this part doesn't grep UPC correctly;
c>                            # I get some extra data after UPC.

that is a problem with the format of the html page. html isn't line
oriented and you are grepping over lines. the proper way to deal with
html is with a parser. or in special very well defined cases with
regexes to actually grab what you want from the text. whole html lines
are almost never what you want.

uri
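Uri's regex suggestion might look like the following sketch. The sample text mimics the lynx dump shown earlier, and the field list is taken from the OP's greps; one pass captures just the values, which also solves the "extra data after UPC" problem:

```perl
use strict;
use warnings;

# $text stands in for the lynx dump held in $temp. The anchored
# alternation matches only the wanted labels at the start of a line,
# and \S+ captures the value alone, dropping trailing text such as
# "Condition: refurbished".
my $text = "Product ID: 3308191\n"
         . "UPC: 896207999816 Condition: refurbished\n"
         . "Weight: 4.7 lbs.\n";
my %fields;
while ( $text =~ /^(Product ID|Item ID|SKU|UPC|Weight):\s*(\S+)/mg ) {
    $fields{$1} = $2;
}
print "$fields{'UPC'}\n";
```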
 
C

chadda

i have to know if you could write this mess any slower? you are doing
everything possible to slow you down.

I know I shouldn't criticize free help, but you seem to have some anger
management issues.
c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> $temp = `lynx -accept_all_cookies -dump $build`;

why are you calling out to a program when perl can load web pages just
fine with LWP? did you even look for web stuff on cpan?
Would using LWP speed up the code? By the way, this code is meant to
run on a server with restricted access, i.e., I can't install stuff
from cpan on that server.
c> open(OUTFILE, '>out');
c> print OUTFILE $temp;
c> close OUTFILE;

c> open(OUT, '<', 'out') || die "cant open: $!";
c> @shit = <OUT>;

why are you writing out the output of lynx JUST TO READ IT BACK IN
AGAIN? this is the most absurd part of this program.

you have the text in $temp. you know how to use backticks but why do you
do the file write and reading back in? if you assigned the backticks to
an array you would get the same thing as in @shit without the wasted
effort.

also calling it @shit is not a good thing.
Huh? Are you saying I don't need the 'out' file?
 
C

chadda

I know I shouldn't criticize free help, but you seem to have some anger
management issues.

Would using LWP speed up the code? By the way, this code is meant to
run on a server with restricted access. Ie, I can't install stuff from
cpan on that server.

Huh? Are you saying I don't need the 'out' file?

Maybe something like this?
% more parse.pl
#!/usr/bin/perl -w

my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
my @temp;

open(IN, '<', 'input') || die "cant open: $!";
$read = <IN>;
chomp($read);
$build = "http://www.doba.com/members/catalog/".$read.".html";
@temp = `lynx -accept_all_cookies -dump $build`;

@product = grep(/Product ID/, @temp);
@id = grep(/Item ID/, @temp);
@sku = grep(/SKU/, @temp);
@upc = grep(/UPC/, @temp);
@weight = grep(/Weight/, @temp);

print @product;
print @id;
print @sku;
print @upc;
print @weight;


However, I don't know how to use LWP. Again, would the code run faster
if I used LWP?
 
U

Uri Guttman

c> I know I shouldn't criticize free help, but you seem to have some anger
c> management issues.

nope. i have bad code anger issues. i deal with this in code reviews all
the time. i just don't get how people come up with wacky and slow ways
to do things. i have seen worse code that read in files, parsed them,
wrote them out (untouched) and read them in again.


c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> $temp = `lynx -accept_all_cookies -dump $build`;

c> Would using LWP speed up the code? By the way, this code is meant to
c> run on a server with restricted access. Ie, I can't install stuff from
c> cpan on that server.

if you have access to load scripts you can load pure perl modules
too. this is an FAQ.
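Uri's point can be shown end to end without root access. This sketch writes a toy module into a private directory under /tmp to simulate copying a pure-Perl distribution (say, LWP's lib/ tree) into a directory you own, then loads it through @INC:

```perl
use strict;
use warnings;
use File::Path qw(make_path);

# Simulate a per-user module directory: no root, no CPAN client needed.
# In practice you would copy the module's lib/ tree here instead of
# writing a toy module on the fly.
my $libdir = "/tmp/private_perl_lib";
make_path("$libdir/My");
open my $fh, '>', "$libdir/My/Demo.pm" or die "cannot write: $!";
print $fh "package My::Demo;\nsub hello { return 'loaded' }\n1;\n";
close $fh;

unshift @INC, $libdir;   # runtime equivalent of: use lib $libdir;
require My::Demo;        # found via the private directory
print My::Demo::hello(), "\n";
```

With a real module tree in place, `use lib "$ENV{HOME}/perllib";` at the top of the script achieves the same thing at compile time.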

c> open(OUTFILE, '>out');
c> print OUTFILE $temp;
c> close OUTFILE;

c> open(OUT, '<', 'out') || die "cant open: $!";
c> Huh? Are you saying I don't need the 'out' file?

yes. why do you think you need that file? you call backticks and get the
html page in $temp. why do you think you need a file to process that
data? you already have it inside perl.

uri
 
U

Uri Guttman

yes.

c> Maybe something like this?
c> % more parse.pl
c> #!/usr/bin/perl -w

c> my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
c> my @temp;

c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> @temp = `lynx -accept_all_cookies -dump $build`;

c> @product = grep(/Product ID/, @temp);
c> @id = grep(/Item ID/, @temp);
c> @sku = grep(/SKU/, @temp);
c> @upc = grep(/UPC/, @temp);
c> @weight = grep(/Weight/, @temp);

c> print @product;
c> print @id;
c> print @sku;
c> print @upc;
c> print @weight;


c> However, I don't know how to use LWP. Again, would the code run faster
c> if I used LWP?

better but forking off lynx is still slow. LWP should be much faster. if
you want speed (and with the data size you have, you want it), use LWP.

depending on how fast you need it (cpu usage will spike with the greps
you have) you can also change all that to parse out what you want with
regexes. (again, that assumes a known fixed html page layout which you
seem to have).

uri
 
A

A. Sinan Unur

He seems to constantly come across this way. I really wish he could
see things from other points of view.
...


As a simple answer, take a look at LWP::UserAgent
(http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
as a good start in the right direction.

All the OP needs is LWP::Simple and HTML::TableExtract.

In fact, I wrote a whole script that took only 0.8 seconds to download
and parse a single page (of course, with more id's in a file, the only
real limit on the speed is the network latency and transfer speed) but I
have decided not to post it as I do not know what his intentions are.

As for you, pick a posting id and stick with it.

PLONKETY PLONK!

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
C

chadda

He seems to constantly come across this way. I really wish he could see
things from other points of view.
...

As a simple answer, take a look at LWP::UserAgent
(http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
as a good start in the right direction.


I just tried LWP, and now I can't get the code to work for the life of
me. Here is what I attempted

#!/usr/bin/perl -w

use LWP::UserAgent;
use HTTP::Request;
use HTTP::Cookies;

my ($read, $build, @product, @id, @sku, @upc, @weight);
my @temp;

open(IN, '<', 'input') || die "cant open: $!";
$read = <IN>;
chomp($read);
$build = 'http://www.doba.com/members/catalog/'.$read.'.html';
#@temp = `lynx -accept_all_cookies -dump $build`;

my $ua = LWP::UserAgent->new;
$ua->agent("OMEGA SPARC DESTROYER/69");

my $request = HTTP::Request->new('GET');
$request->url($build);

my $cookie_jar = HTTP::Cookies->new;
$cookie_jar->add_cookie_header($request);

my $response = $ua->request($request);

my $code = $response->code;
print $code;

@temp = $request->content;

@product = grep(/Product ID/, @temp);
@id = grep(/Item ID/, @temp);
@sku = grep(/SKU/, @temp);
@upc = grep(/UPC/, @temp);
@weight = grep(/Weight/, @temp);

print @product;
print @id;
print @sku;
print @upc;
print @weight;

% ./parse.pl
500%
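There appear to be two bugs here beyond the 500 status: the code reads `$request->content` (the body of the outgoing GET, which is empty) rather than `$response->content`, and `->content` returns one big string, so the greps would see a single element instead of lines. A sketch of the fix, with the network parts replaced by a stand-in string so it runs offline:

```perl
use strict;
use warnings;

# $body stands in for what $response->content (not $request->content!)
# would return: one string. Splitting on /^/m restores the
# one-element-per-line shape the greps expect.
my $body = "Product ID: 3308191\nSKU: 8930\nUPC: 896207999816\n";
my @temp = split /^/m, $body;
my @sku  = grep /SKU/, @temp;
print @sku;
```

In the real script the corresponding lines would be roughly `my @temp = split /^/m, $response->content;` after checking `$response->is_success`.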
 
A

A. Sinan Unur

(e-mail address removed) wrote in
....

....

I just tried LWP, and now I can't get the code to work for the life of
me. Here is what I attempted

As I mentioned elsewhere, all you need is LWP::Simple.

So, here is a fish for you:

C:\Temp> cat p.pl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;
use LWP::Simple;


my ($input_file) = @ARGV;
die "No input file specified\n" unless defined $input_file;

open my $INPUT, '<', $input_file
    or die "Cannot open '$input_file': $!";

ID:
while ( my $id = <$INPUT> ) {
    chomp $id;

    my $url  = make_url( $id );
    my $html = get $url;

    unless ( defined $html ) {
        warn "Error downloading from '$url'\n";
        next ID;
    }

    my $parser = HTML::TokeParser->new( \$html );

    TABLE:
    while ( my $token = $parser->get_tag('table') ) {
        if ( lc $token->[1]{id} eq 'product_details' ) {
            my $td = $parser->get_tag('td');
            last TABLE unless $td;
            my $cell = $parser->get_text('/td');
            my %data;
            while ( $cell =~ /\s*([^:]+?):\s+(\d+)\s+/g ) {
                $data{$1} = $2;
            }
            use Data::Dumper;
            print Dumper \%data;
        }
    }
}

sub make_url {
    return sprintf q{http://www.doba.com/members/catalog/%s.html}, $_[0];
}

__END__

C:\Temp> timethis p list

$VAR1 = {
          'Product ID' => '3308191',
          'UPC' => '896207999816',
          'Item ID' => '3653992',
          'SKU' => '8930'
        };

TimeThis : Command Line : p list
TimeThis : Start Time : Thu May 15 18:19:28 2008
TimeThis : End Time : Thu May 15 18:19:29 2008
TimeThis : Elapsed Time : 00:00:01.062

Comparing this to the overhead of an empty script:

C:\Temp> cat t.pl
#!/usr/bin/perl

use strict;
use warnings;

C:\Temp> timethis t

TimeThis : Command Line : t
TimeThis : Start Time : Thu May 15 18:20:38 2008
TimeThis : End Time : Thu May 15 18:20:38 2008
TimeThis : Elapsed Time : 00:00:00.218

It took 0.844 seconds to retrieve and parse the required information. Of
course, the time cost would be better amortized if you ran a lot of
these queries.



--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
C

chadda

(e-mail address removed) wrote in
...


...

I just tried LWP, and now I can't get the code to work for the life of
me. Here is what I attempted

As I mentioned elsewhere, all you need is LWP::Simple.

So, here is a fish for you:

C:\Temp> cat p.pl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;
use LWP::Simple;

my ($input_file) = @ARGV;
die "No input file specified\n" unless defined $input_file;

open my $INPUT, '<', $input_file
    or die "Cannot open '$input_file': $!";

ID:
while ( my $id = <$INPUT> ) {
    chomp $id;

    my $url  = make_url( $id );
    my $html = get $url;

    unless ( defined $html ) {
        warn "Error downloading from '$url'\n";
        next ID;
    }

    my $parser = HTML::TokeParser->new( \$html );

    TABLE:
    while ( my $token = $parser->get_tag('table') ) {
        if ( lc $token->[1]{id} eq 'product_details' ) {
            my $td = $parser->get_tag('td');
            last TABLE unless $td;
            my $cell = $parser->get_text('/td');
            my %data;
            while ( $cell =~ /\s*([^:]+?):\s+(\d+)\s+/g ) {
                $data{$1} = $2;
            }
            use Data::Dumper;
            print Dumper \%data;
        }
    }
}

sub make_url {
    return sprintf q{http://www.doba.com/members/catalog/%s.html}, $_[0];
}

__END__

C:\Temp> timethis p list

$VAR1 = {
          'Product ID' => '3308191',
          'UPC' => '896207999816',
          'Item ID' => '3653992',
          'SKU' => '8930'
        };

TimeThis : Command Line : p list
TimeThis : Start Time : Thu May 15 18:19:28 2008
TimeThis : End Time : Thu May 15 18:19:29 2008
TimeThis : Elapsed Time : 00:00:01.062

Comparing this to the overhead of an empty script:

C:\Temp> cat t.pl
#!/usr/bin/perl

use strict;
use warnings;

C:\Temp> timethis t

TimeThis : Command Line : t
TimeThis : Start Time : Thu May 15 18:20:38 2008
TimeThis : End Time : Thu May 15 18:20:38 2008
TimeThis : Elapsed Time : 00:00:00.218

It took 0.844 seconds to retrieve and parse the required information. Of
course, the time cost would be better amortized if you ran a lot of
these queries.

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/


When I try to run this code, I keep getting a blank url.
 
A

A. Sinan Unur

(e-mail address removed) wrote in

[ Do not quote in full. Do not quote sigs. ]
So, here is a fish for you:

C:\Temp> cat p.pl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;
use LWP::Simple;

my ($input_file) = @ARGV;
die "No input file specified\n" unless defined $input_file;

open my $INPUT, '<', $input_file
    or die "Cannot open '$input_file': $!";

ID:
while ( my $id = <$INPUT> ) {
    chomp $id;

    my $url  = make_url( $id );
    my $html = get $url;

    unless ( defined $html ) {
        warn "Error downloading from '$url'\n";
        next ID;
    }

    my $parser = HTML::TokeParser->new( \$html );

    TABLE:
    while ( my $token = $parser->get_tag('table') ) {
        if ( lc $token->[1]{id} eq 'product_details' ) {
            my $td = $parser->get_tag('td');
            last TABLE unless $td;
            my $cell = $parser->get_text('/td');
            my %data;
            while ( $cell =~ /\s*([^:]+?):\s+(\d+)\s+/g ) {
                $data{$1} = $2;
            }
            use Data::Dumper;
            print Dumper \%data;
        }
    }
}

sub make_url {
    return sprintf q{http://www.doba.com/members/catalog/%s.html}, $_[0];
}

__END__
....

When I try to run this code, I keep getting a blank url.

Well, did you provide it with a file containing the id numbers? How do
you know the URL is blank? Did you modify the code? If you did, why did
you not post the relevant modifications?

I would have normally put the id number in the __DATA__ section, but
since you implied that you already had an input file with id numbers, I
followed your example.

In any case, unless you take active steps to help others help you, this
will be the sum total of the help I will provide you.

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
J

Jürgen Exner

I'll eventually have the input file filled with 350 million items.
Right now there is only one:

$more input
3308191

The following program reads in the number from the file named 'input'
and builds a url from this number. I have lynx then dump the data into
a file called 'out' and then just grep the entire thing for the Product
Number, Product ID, SKU, UPC, and weight.


m-net% more parse.pl
#!/usr/bin/perl -w

my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
my $temp;

open(IN, '<', 'input') || die "cant open: $!";
$read = <IN>;

I suppose you want to turn that line into a while loop once you have
more than one item to process.
However, considering network latency and response times it may very well
be worthwhile to trigger multiple HTTP requests in parallel, such that
your processing code will never have to wait for network responses.

Others have already mentioned the remaining issues: shelling out to an
expensive external process, the expensive but useless temporary file,
and trying to parse HTML with regexes.

jue
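Jürgen's parallelism suggestion can be sketched with nothing but core fork/waitpid; the actual fetch is stubbed out, and for 350 million ids you would cap the number of simultaneous children rather than fork one per id:

```perl
use strict;
use warnings;

# Issue several requests concurrently so total wall time is roughly one
# network round-trip, not one round-trip per id. Each child would fetch
# and parse a single page; here the fetch is a stub so this runs offline.
my @ids = (3308191, 3653992);
my @pids;
for my $id (@ids) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # child: fetch and parse one page here, e.g. (hypothetically)
        # my $html = get("http://www.doba.com/members/catalog/$id.html");
        exit 0;
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;   # parent waits for every child to finish
print scalar(@pids), " requests completed\n";
```

A bounded worker pool (or a module such as Parallel::ForkManager, where installable) keeps this from swamping either your machine or the remote server.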
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to Uri Guttman]
better but forking off lynx is still slow. LWP should be much faster. if
you want speed (and with the data size you have, you want it), use LWP.

This may depend on many parameters, but the overhead of system()ing
may be quite low. The overhead of opening a new HTTP connection for
each line may be larger. LWP will have a chance to use persistent
connections...

Yours,
Ilya
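Ilya's point about persistent connections can even be had without installing anything: HTTP::Tiny has shipped with Perl since 5.14 and reuses the TCP connection to a host across requests. A sketch, with the request itself commented out so it runs offline:

```perl
use strict;
use warnings;
use HTTP::Tiny;   # core module since Perl 5.14 -- no CPAN install needed

# keep_alive => 1 (the default) makes HTTP::Tiny hold the connection
# open between requests to the same host, avoiding a fresh TCP
# handshake for every id.
my $http = HTTP::Tiny->new( agent => 'parse.pl', keep_alive => 1 );
# my $res = $http->get($url);
# die "failed: $res->{status}" unless $res->{success};
# my @lines = split /^/m, $res->{content};
print ref($http), "\n";
```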
 
G

Gordon Etly

Uri said:
Gordon Etly <[email protected]> writes:

[please don't left pad quoted text with spaces]
as usual, no help from you.

I'm just pointing out what is. It's you who keep bringing this upon
yourself. You are constantly rude and arrogant to people, then you
wonder why people sometimes post back, like the OP did. If you can't
handle receiving comments about what you post, then don't post. If you
can't take it, don't dish it out.
which i already told him and we have already improved his code a good
deal. try to keep up.

I would think someone who has been on UseNet as long as you would know
that posts don't always come down at the same time (or order) from every
server. Case in point, I had not seen such a post mentioning it until
later on.
 
J

Jürgen Exner

Gordon Etly said:
I'm just pointing out what is. It's you who keep bringing this upon
yourself. You are constantly rude and arrogant to people, then you

Changing your identity again because everyone filtered you?

jue
 
G

Gordon Etly

I guess everyone had filtered you so you had to create a new identity

I have not changed my identity. My name is Gordon Etly. I have not
changed that part, nor made any attempt to hide it, so your statement is
false.

I happen to be a sys op for the company I work for, including our mail
server, so I am able to add entries to /etc/aliases (which I commonly
use to create public variants of my main email address so that any
unwanted mailings can be easily stopped). I've never seen any rule
saying "never change your email field", as that is anyone's right.
Yeah, it's easy enough to copy what other people had mentioned
already.

I had not seen that mentioned at all before I posted. Funny, I see you
and your fellows do exactly this all the time (posting essentially the
same answer that was already given by someone else), but now it's
suddenly a bad thing. Please make up your minds.

In this case, there were no replies mentioning LWP::UserAgent. Uri did
mention LWP very briefly, but LWP has several modules. I was more
specific.
 
