Newbie LWP question - simulate browser?


philthym

Hi

As the title suggests, I am a Perl newbie. I am trying to monitor a
remote site and would like to time how long it takes to return all the
objects on a page, i.e. the HTML and all the associated GIFs, bits of
JavaScript, Java applets and so on.

Here is the code as it stands today:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use Time::HiRes qw(gettimeofday);

my $URL = "http://www.xyz.com/index.html";

my $usec1   = gettimeofday;
my $timenow = localtime();

# get() returns the page body, or undef on failure
my $HomePage = get($URL);

# /String/ stands in for some text expected on the page
if (defined $HomePage and $HomePage =~ /String/) {
    my $usec2   = gettimeofday;
    my $elapsed = $usec2 - $usec1;
    print "$timenow Page retrieved in $elapsed seconds\n";
}
else {
    print "$timenow Page not retrieved\n";
}

I'm not sure I understand the whole LWP get() thing! What I'm wondering
is: does this request cause the web server to return all the objects, or
just the HTML itself? If it's returning everything, does the second timer
fire only after all the objects have been returned? In other words, does
this code do what I want it to? If not, any ideas how I would achieve my
aim, please?

Any help would be greatly appreciated.

Thanks

Phil
 

Sherm Pendley

philthym said:
I'm not sure I understand the whole LWP get() thing! What I'm wondering
is: does this request cause the web server to return all the objects, or
just the HTML itself?

It does *exactly* what you ask it to, no more - it fetches index.html.
Parsing the HTML, extracting the <img ...> elements from it, and making
additional requests to the server to fetch the images they point to, will
require additional code.

Have a look at HTML::Parser - it's a good place to start.
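As a rough, untested sketch of the idea: fetch the page with
LWP::UserAgent, pull out the src attribute of each <img> tag with
HTML::Parser, fetch each image, and report the total elapsed time.
(www.xyz.com is just your placeholder URL.)

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Parser;
use URI;
use Time::HiRes qw(gettimeofday tv_interval);

my $url = "http://www.xyz.com/index.html";
my $ua  = LWP::UserAgent->new;

my $t0   = [gettimeofday];
my $resp = $ua->get($url);
die "Page not retrieved: ", $resp->status_line, "\n" unless $resp->is_success;

# Collect the src attribute of every <img> tag on the page
my @images;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tag, $attr) = @_;
            push @images, $attr->{src} if $tag eq 'img' and $attr->{src};
        },
        'tagname, attr',
    ],
);
$parser->parse($resp->decoded_content);
$parser->eof;

# Fetch each image, resolving relative URLs against the page URL,
# then report how long the whole lot took
$ua->get( URI->new_abs($_, $url) ) for @images;

my $elapsed = tv_interval($t0);
printf "Page plus %d images retrieved in %.3f seconds\n",
    scalar @images, $elapsed;

Bear in mind this fetches the images one after another, while a real
browser fetches several in parallel - so the timings won't match a
browser exactly.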

sherm--
 

Joe Smith

philthym said:
As the title suggests, I am a Perl newbie. I am trying to monitor a
remote site and would like to time how long it takes to return all the
objects on a page, i.e. the HTML and all the associated GIFs, bits of
JavaScript, Java applets and so on.

It's one thing to fetch a JavaScript file. It is quite another to fetch
the things that would be requested had the JavaScript been executed.
For that, you need a proxy that logs the requests from a real browser.

There are several, including the "Web Scraping Proxy"
http://www.research.att.com/~hpk/wsp/
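If you'd rather stay in Perl, here's a rough, untested sketch of the
same idea using the CPAN HTTP::Proxy module. Point your browser at
localhost port 8080 (an arbitrary choice) and it logs every request
the browser actually makes:

#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Proxy;
use HTTP::Proxy::HeaderFilter::simple;

# Log the URL of every request that passes through the proxy
my $proxy = HTTP::Proxy->new( port => 8080 );
$proxy->push_filter(
    request => HTTP::Proxy::HeaderFilter::simple->new(
        sub {
            my ( $self, $headers, $request ) = @_;
            print STDERR scalar(localtime), " ", $request->uri, "\n";
        }
    ),
);
$proxy->start;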

-Joe
 

philthym

Sherm Pendley said:
It does *exactly* what you ask it to, no more - it fetches index.html.
Parsing the HTML, extracting the <img ...> elements from it, and making
additional requests to the server to fetch the images they point to, will
require additional code.

Have a look at HTML::Parser - it's a good place to start.

sherm--

Thanks Sherm - I thought it would be something like that. I'll check out HTML::Parser.

Regards

Phil
 
