Need ideas on how to make this code faster than a speeding turtle

C

chadda

I'll eventually have the input file filled with 350 million items.
Right now there is only one:

$more input
3308191

The following program reads in the number from the file named 'input'
and builds a url from this number. I have lynx then dump the data into
a file called 'out' and then just grep the entire thing for the Product
Number, Product ID, SKU, UPC, and weight.


m-net% more parse.pl
#!/usr/bin/perl -w

my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
my $temp;

open(IN, '<', 'input') || die "cant open: $!";
$read = <IN>;
chomp($read);
$build = "http://www.doba.com/members/catalog/".$read.".html";
$temp = `lynx -accept_all_cookies -dump $build`;
open(OUTFILE, '>out');
print OUTFILE $temp;
close OUTFILE;

open(OUT, '<', 'out') || die "cant open: $!";
@shit = <OUT>;

@product = grep(/Product ID/, @shit);
@id = grep(/Item ID/, @shit);
@sku = grep(/SKU/, @shit);
@upc = grep(/UPC/, @shit); # this part doesn't grep UPC correctly;
                           # I get some extra data after UPC.
@weight = grep(/Weight/, @shit);

print @product;
print @id;
print @sku;
print @upc;
print @weight;

% ./parse.pl
Product ID: 3308191
Item ID: 3653992
SKU: 8930
UPC: 896207999816 Condition: refurbished
Weight: 4.7 lbs.
 
U

Uri Guttman

i have to know if you could write this mess any slower? you are doing
everything possible to slow you down.

c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> $temp = `lynx -accept_all_cookies -dump $build`;

why are you calling out to a program when perl can load web pages just
fine with LWP? did you even look for web stuff on cpan?

c> open(OUTFILE, '>out');
c> print OUTFILE $temp;
c> close OUTFILE;

c> open(OUT, '<', 'out') || die "cant open: $!";
c> @shit = <OUT>;

why are you writing out the output of lynx JUST TO READ IT BACK IN
AGAIN? this is the most absurd part of this program.

you have the text in $temp. you know how to use backticks but why do you
do the file write and reading back in? if you assigned the backticks to
an array you would get the same thing as in @shit without the wasted
effort.

also calling it @shit is not a good thing.
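Uri's point fits in a couple of lines. A minimal sketch, with `echo` standing in for the real `lynx -accept_all_cookies -dump $build` call so it runs anywhere:

```perl
use strict;
use warnings;

# In list context, backticks return the command's output with one array
# element per line -- no temporary file needed. `echo` stands in here
# for the real lynx call.
my @lines = `echo "SKU: 8930"`;
my @sku = grep /SKU/, @lines;
print $sku[0];
```

Assigning the backticks to `@lines` gives the same per-line array the write-then-reread dance produced, minus two file operations per id.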

c> @product = grep(/Product ID/, @shit);
c> @id = grep(/Item ID/, @shit);
c> @sku = grep(/SKU/, @shit);
c> @upc = grep(/UPC/, @shit); # this part doesn't grep UPC correctly;
c>                            # I get some extra data after UPC.

that is a problem with the format of the html page. html isn't line
oriented and you are grepping over lines. the proper way to deal with
html is with a parser. or in special very well defined cases with
regexes to actually grab what you want from the text. whole html lines
are almost never what you want.

uri
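Uri's regex suggestion might look like the following sketch. The sample text mimics the lynx dump shown earlier, and the field list is taken from the OP's greps; one pass captures just the values, which also solves the "extra data after UPC" problem:

```perl
use strict;
use warnings;

# $text stands in for the lynx dump held in $temp. The anchored
# alternation matches only the wanted labels at the start of a line,
# and \S+ captures the value alone, dropping trailing text such as
# "Condition: refurbished".
my $text = "Product ID: 3308191\n"
         . "UPC: 896207999816 Condition: refurbished\n"
         . "Weight: 4.7 lbs.\n";
my %fields;
while ( $text =~ /^(Product ID|Item ID|SKU|UPC|Weight):\s*(\S+)/mg ) {
    $fields{$1} = $2;
}
print "$fields{'UPC'}\n";
```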
 
C

chadda

i have to know if you could write this mess any slower? you are doing
everything possible to slow you down.

I know I shouldn't criticize free help, but you seem to have some anger
management issues.
c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> $temp = `lynx -accept_all_cookies -dump $build`;

why are you calling out to a program when perl can load web pages just
fine with LWP? did you even look for web stuff on cpan?
Would using LWP speed up the code? By the way, this code is meant to
run on a server with restricted access, i.e., I can't install stuff
from cpan on that server.
c> open(OUTFILE, '>out');
c> print OUTFILE $temp;
c> close OUTFILE;

c> open(OUT, '<', 'out') || die "cant open: $!";
c> @shit = <OUT>;

why are you writing out the output of lynx JUST TO READ IT BACK IN
AGAIN? this is the most absurd part of this program.

you have the text in $temp. you know how to use backticks but why do you
do the file write and reading back in? if you assigned the backticks to
an array you would get the same thing as in @shit without the wasted
effort.

also calling it @shit is not a good thing.
Huh? Are you saying I don't need the 'out' file?
 
C

chadda

I know I shouldn't criticize free help, but you seem to have some anger
management issues.

Would using LWP speed up the code? By the way, this code is meant to
run on a server with restricted access. Ie, I can't install stuff from
cpan on that server.

Huh? Are you saying I don't need the 'out' file?

Maybe something like this?
% more parse.pl
#!/usr/bin/perl -w

my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
my @temp;

open(IN, '<', 'input') || die "cant open: $!";
$read = <IN>;
chomp($read);
$build = "http://www.doba.com/members/catalog/".$read.".html";
@temp = `lynx -accept_all_cookies -dump $build`;

@product = grep(/Product ID/, @temp);
@id = grep(/Item ID/, @temp);
@sku = grep(/SKU/, @temp);
@upc = grep(/UPC/, @temp);
@weight = grep(/Weight/, @temp);

print @product;
print @id;
print @sku;
print @upc;
print @weight;


However, I don't know how to use LWP. Again, would the code run faster
if I used LWP?
 
U

Uri Guttman

c> I know I shouldn't criticize free help, but you seem to have some anger
c> management issues.

nope. i have bad code anger issues. i deal with this in code reviews all
the time. i just don't get how people come up with wacky and slow ways
to do things. i have seen worse code that read in files, parsed them,
wrote them out (untouched) and read them in again.


c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> $temp = `lynx -accept_all_cookies -dump $build`;

c> Would using LWP speed up the code? By the way, this code is meant to
c> run on a server with restricted access. Ie, I can't install stuff from
c> cpan on that server.

if you have access to load scripts you can load pure perl modules
too. this is an FAQ.
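Uri's point can be shown end to end without root access. This sketch writes a toy module into a private directory under /tmp to simulate copying a pure-Perl distribution (say, LWP's lib/ tree) into a directory you own, then loads it through @INC:

```perl
use strict;
use warnings;
use File::Path qw(make_path);

# Simulate a per-user module directory: no root, no CPAN client needed.
# In practice you would copy the module's lib/ tree here instead of
# writing a toy module on the fly.
my $libdir = "/tmp/private_perl_lib";
make_path("$libdir/My");
open my $fh, '>', "$libdir/My/Demo.pm" or die "cannot write: $!";
print $fh "package My::Demo;\nsub hello { return 'loaded' }\n1;\n";
close $fh;

unshift @INC, $libdir;   # runtime equivalent of: use lib $libdir;
require My::Demo;        # found via the private directory
print My::Demo::hello(), "\n";
```

With a real module tree in place, `use lib "$ENV{HOME}/perllib";` at the top of the script achieves the same thing at compile time.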

c> open(OUTFILE, '>out');
c> print OUTFILE $temp;
c> close OUTFILE;

c> open(OUT, '<', 'out') || die "cant open: $!";
c> Huh? Are you saying I don't need the 'out' file?

yes. why do you think you need that file? you call backticks and get the
html page in $temp. why do you think you need a file to process that
data? you already have it inside perl.

uri
 
U

Uri Guttman

yes.

c> Maybe something like this?
c> % more parse.pl
c> #!/usr/bin/perl -w

c> my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
c> my @temp;

c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> @temp = `lynx -accept_all_cookies -dump $build`;

c> @product = grep(/Product ID/, @temp);
c> @id = grep(/Item ID/, @temp);
c> @sku = grep(/SKU/, @temp);
c> @upc = grep(/UPC/, @temp);
c> @weight = grep(/Weight/, @temp);

c> print @product;
c> print @id;
c> print @sku;
c> print @upc;
c> print @weight;


c> However, I don't know how to use LWP. Again, would the code run faster
c> if I used LWP?

better but forking off lynx is still slow. LWP should be much faster. if
you want speed (and with the data size you have, you want it), use LWP.

depending on how fast you need it (cpu usage will spike with the greps
you have) you can also change all that to parse out what you want with
regexes. (again, that assumes a known fixed html page layout which you
seem to have).

uri
 
A

A. Sinan Unur

He seems to constantly come across this way. I really wish he could
see things from other points of view.
...


As a simple answer, take a look at LWP::UserAgent
(http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
as a good start in the right direction.

All the OP needs is LWP::Simple and HTML::TableExtract.

In fact, I wrote a whole script that took only 0.8 seconds to download
and parse a single page (of course, with more id's in a file, the only
real limit on the speed is the network latency and transfer speed) but I
have decided not to post it as I do not know what his intentions are.

As for you, pick a posting id and stick with it.

PLONKETY PLONK!

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
C

chadda

He seems to constantly come across this way. I really wish he could see
things from other points of view.
...

As a simple answer, take a look at LWP::UserAgent
(http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm),
as a good start in the right direction.


I just tried LWP, and now I can't get the code to work for the life of
me. Here is what I attempted

#!/usr/bin/perl -w

use LWP::UserAgent;
use HTTP::Request;
use HTTP::Cookies;

my ($read, $build, @product, @id, @sku, @upc, @weight);
my @temp;

open(IN, '<', 'input') || die "cant open: $!";
$read = <IN>;
chomp($read);
$build = 'http://www.doba.com/members/catalog/'.$read.'.html';
#@temp = `lynx -accept_all_cookies -dump $build`;

my $ua = LWP::UserAgent->new;
$ua->agent("OMEGA SPARC DESTROYER/69");

my $request = HTTP::Request->new('GET');
$request->url($build);

my $cookie_jar = HTTP::Cookies->new;
$cookie_jar->add_cookie_header($request);

my $response = $ua->request($request);

my $code = $response->code;
print $code;

@temp = $request->content;

@product = grep(/Product ID/, @temp);
@id = grep(/Item ID/, @temp);
@sku = grep(/SKU/, @temp);
@upc = grep(/UPC/, @temp);
@weight = grep(/Weight/, @temp);

print @product;
print @id;
print @sku;
print @upc;
print @weight;

% ./parse.pl
500%
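There appear to be two bugs here beyond the 500 status: the code reads `$request->content` (the body of the outgoing GET, which is empty) rather than `$response->content`, and `->content` returns one big string, so the greps would see a single element instead of lines. A sketch of the fix, with the network parts replaced by a stand-in string so it runs offline:

```perl
use strict;
use warnings;

# $body stands in for what $response->content (not $request->content!)
# would return: one string. Splitting on /^/m restores the
# one-element-per-line shape the greps expect.
my $body = "Product ID: 3308191\nSKU: 8930\nUPC: 896207999816\n";
my @temp = split /^/m, $body;
my @sku  = grep /SKU/, @temp;
print @sku;
```

In the real script the corresponding lines would be roughly `my @temp = split /^/m, $response->content;` after checking `$response->is_success`.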
 
A

A. Sinan Unur

(e-mail address removed) wrote in
....

....

I just tried LWP, and now I can't get the code to work for the life of
me. Here is what I attempted

As I mentioned elsewhere, all you need is LWP::Simple.

So, here is a fish for you:

C:\Temp> cat p.pl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;
use LWP::Simple;


my ($input_file) = @ARGV;
die "No input file specified\n" unless defined $input_file;

open my $INPUT, '<', $input_file
    or die "Cannot open '$input_file': $!";

ID:
while ( my $id = <$INPUT> ) {
    chomp $id;

    my $url  = make_url( $id );
    my $html = get $url;

    unless ( defined $html ) {
        warn "Error downloading from '$url'\n";
        next ID;
    }

    my $parser = HTML::TokeParser->new( \$html );

    TABLE:
    while ( my $token = $parser->get_tag('table') ) {
        if ( lc $token->[1]{id} eq 'product_details' ) {
            my $td = $parser->get_tag('td');
            last TABLE unless $td;
            my $cell = $parser->get_text('/td');
            my %data;
            while ( $cell =~ /\s*([^:]+?):\s+(\d+)\s+/g ) {
                $data{$1} = $2;
            }
            use Data::Dumper;
            print Dumper \%data;
        }
    }
}

sub make_url {
    return sprintf q{http://www.doba.com/members/catalog/%s.html}, $_[0];
}

__END__

C:\Temp> timethis p list

$VAR1 = {
          'Product ID' => '3308191',
          'UPC' => '896207999816',
          'Item ID' => '3653992',
          'SKU' => '8930'
        };

TimeThis : Command Line : p list
TimeThis : Start Time : Thu May 15 18:19:28 2008
TimeThis : End Time : Thu May 15 18:19:29 2008
TimeThis : Elapsed Time : 00:00:01.062

Comparing this to the overhead of an empty script:

C:\Temp> cat t.pl
#!/usr/bin/perl

use strict;
use warnings;

C:\Temp> timethis t

TimeThis : Command Line : t
TimeThis : Start Time : Thu May 15 18:20:38 2008
TimeThis : End Time : Thu May 15 18:20:38 2008
TimeThis : Elapsed Time : 00:00:00.218

It took 0.844 seconds to retrieve and parse the required information. Of
course, the time cost would be better amortized if you ran a lot of
these queries.



--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
C

chadda

(e-mail address removed) wrote in
...


...

I just tried LWP, and now I can't get the code to work for the life of
me. Here is what I attempted

As I mentioned elsewhere, all you need is LWP::Simple.

So, here is a fish for you:

C:\Temp> cat p.pl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;
use LWP::Simple;

my ($input_file) = @ARGV;
die "No input file specified\n" unless defined $input_file;

open my $INPUT, '<', $input_file
    or die "Cannot open '$input_file': $!";

ID:
while ( my $id = <$INPUT> ) {
    chomp $id;

    my $url  = make_url( $id );
    my $html = get $url;

    unless ( defined $html ) {
        warn "Error downloading from '$url'\n";
        next ID;
    }

    my $parser = HTML::TokeParser->new( \$html );

    TABLE:
    while ( my $token = $parser->get_tag('table') ) {
        if ( lc $token->[1]{id} eq 'product_details' ) {
            my $td = $parser->get_tag('td');
            last TABLE unless $td;
            my $cell = $parser->get_text('/td');
            my %data;
            while ( $cell =~ /\s*([^:]+?):\s+(\d+)\s+/g ) {
                $data{$1} = $2;
            }
            use Data::Dumper;
            print Dumper \%data;
        }
    }
}

sub make_url {
    return sprintf q{http://www.doba.com/members/catalog/%s.html}, $_[0];
}

__END__

C:\Temp> timethis p list

$VAR1 = {
          'Product ID' => '3308191',
          'UPC' => '896207999816',
          'Item ID' => '3653992',
          'SKU' => '8930'
        };

TimeThis : Command Line : p list
TimeThis : Start Time : Thu May 15 18:19:28 2008
TimeThis : End Time : Thu May 15 18:19:29 2008
TimeThis : Elapsed Time : 00:00:01.062

Comparing this to the overhead of an empty script:

C:\Temp> cat t.pl
#!/usr/bin/perl

use strict;
use warnings;

C:\Temp> timethis t

TimeThis : Command Line : t
TimeThis : Start Time : Thu May 15 18:20:38 2008
TimeThis : End Time : Thu May 15 18:20:38 2008
TimeThis : Elapsed Time : 00:00:00.218

It took 0.844 seconds to retrieve and parse the required information. Of
course, the time cost would be better amortized if you ran a lot of
these queries.

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/


When I try to run this code, I keep getting a blank url.
 
A

A. Sinan Unur

(e-mail address removed) wrote in

[ Do not quote in full. Do not quote sigs. ]
So, here is a fish for you:

C:\Temp> cat p.pl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;
use LWP::Simple;

my ($input_file) = @ARGV;
die "No input file specified\n" unless defined $input_file;

open my $INPUT, '<', $input_file
    or die "Cannot open '$input_file': $!";

ID:
while ( my $id = <$INPUT> ) {
    chomp $id;

    my $url  = make_url( $id );
    my $html = get $url;

    unless ( defined $html ) {
        warn "Error downloading from '$url'\n";
        next ID;
    }

    my $parser = HTML::TokeParser->new( \$html );

    TABLE:
    while ( my $token = $parser->get_tag('table') ) {
        if ( lc $token->[1]{id} eq 'product_details' ) {
            my $td = $parser->get_tag('td');
            last TABLE unless $td;
            my $cell = $parser->get_text('/td');
            my %data;
            while ( $cell =~ /\s*([^:]+?):\s+(\d+)\s+/g ) {
                $data{$1} = $2;
            }
            use Data::Dumper;
            print Dumper \%data;
        }
    }
}

sub make_url {
    return sprintf q{http://www.doba.com/members/catalog/%s.html}, $_[0];
}

__END__
....

When I try to run this code, I keep getting a blank url.

Well, did you provide it with a file containing the id numbers? How do
you know the URL is blank? Did you modify the code? If you did, why did
you not post the relevant modifications?

I would have normally put the id number in the __DATA__ section, but
since you implied that you already had an input file with id numbers, I
followed your example.

In any case, unless you take active steps to help others help you, this
will be the sum total of the help I will provide you.

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
J

Jürgen Exner

I'll eventually have the input file filled with 350 million items.
Right now there is only one:

$more input
3308191

The following program reads in the number from the file named 'input'
and builds a url from this number. I have lynx then dump the data into
a file called 'out' and then just grep the entire thing for the Product
Number, Product ID, SKU, UPC, and weight.


m-net% more parse.pl
#!/usr/bin/perl -w

my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
my $temp;

open(IN, '<', 'input') || die "cant open: $!";
$read = <IN>;

I suppose you want to turn that line into a while loop once you have
more than one item to process.
However, considering network latency and response times it may very well
be worthwhile to trigger multiple HTTP requests in parallel, such that
your processing code will never have to wait for network responses.

Others have already mentioned the remaining issues: shelling out to an
expensive external process, the expensive but useless temporary file,
and trying to parse HTML with regexes.

jue
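Jürgen's parallelism suggestion can be sketched with nothing but core fork/waitpid; the actual fetch is stubbed out, and for 350 million ids you would cap the number of simultaneous children rather than fork one per id:

```perl
use strict;
use warnings;

# Issue several requests concurrently so total wall time is roughly one
# network round-trip, not one round-trip per id. Each child would fetch
# and parse a single page; here the fetch is a stub so this runs offline.
my @ids = (3308191, 3653992);
my @pids;
for my $id (@ids) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # child: fetch and parse one page here, e.g. (hypothetically)
        # my $html = get("http://www.doba.com/members/catalog/$id.html");
        exit 0;
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;   # parent waits for every child to finish
print scalar(@pids), " requests completed\n";
```

A bounded worker pool (or a module such as Parallel::ForkManager, where installable) keeps this from swamping either your machine or the remote server.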
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to Uri Guttman]
better but forking off lynx is still slow. LWP should be much faster. if
you want speed (and with the data size you have, you want it), use LWP.

This may depend on many parameters, but the overhead of system()ing
may be quite low. The overhead of opening a new HTTP connection for
each line may be larger. LWP will have a chance to use persistent
connections...

Yours,
Ilya
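Ilya's point about persistent connections can even be had without installing anything: HTTP::Tiny has shipped with Perl since 5.14 and reuses the TCP connection to a host across requests. A sketch, with the request itself commented out so it runs offline:

```perl
use strict;
use warnings;
use HTTP::Tiny;   # core module since Perl 5.14 -- no CPAN install needed

# keep_alive => 1 (the default) makes HTTP::Tiny hold the connection
# open between requests to the same host, avoiding a fresh TCP
# handshake for every id.
my $http = HTTP::Tiny->new( agent => 'parse.pl', keep_alive => 1 );
# my $res = $http->get($url);
# die "failed: $res->{status}" unless $res->{success};
# my @lines = split /^/m, $res->{content};
print ref($http), "\n";
```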
 
G

Gordon Etly

Uri said:
Gordon Etly <[email protected]> writes:

[please don't left pad quoted text with spaces]
as usual, no help from you.

I'm just pointing out what is. It's you who keep bringing this upon
yourself. You are constantly rude and arrogant to people, then you
wonder why people sometimes post back, like the OP did. If you can't
handle receiving comments about what you post, then don't post. If you
can't take it, don't dish it out.
which i already told him and we have already improved his code a good
deal. try to keep up.

I would think someone who has been on UseNet as long as you would know
that posts don't always come down at the same time (or order) from every
server. Case in point, I had not seen such a post mentioning it until
later on.
 
J

Jürgen Exner

Gordon Etly said:
I'm just pointing out what is. It's you who keep bringing this upon
yourself. You are constantly rude and arrogant to people, then you

Changing your identity again because everyone filtered you?

jue
 
G

Gordon Etly

I guess everyone had filtered you so you had to create a new identity

I have not changed my identity. My name is Gordon Etly. I have not
changed that part, nor made any attempt to hide it, so your statement is
false.

I happen to be a sys op for the company I work for, including our mail
server, so I am able to add entries to /etc/aliases (which I commonly
use to create public variants of my main email address so that any
unwanted mailings can be easily stopped). I've never seen any rule
saying "never change your email field", as that is anyone's right.
Yeah, it's easy enough to copy what other people had mentioned
already.

I had not seen that mentioned at all before I posted. Funny, I see you
and your fellows do exactly this all the time (posting essentially the
same answer that was already given by someone else), but now it's
suddenly a bad thing. Please make up your minds.

In this case, there were no replies mentioning LWP::UserAgent. Uri did
mention LWP very briefly, but LWP has several modules. I was more
specific.
 
