Invisible cache for LWP / Mechanize?

jakob.fix

Hi, I am using LWP to mirror the McAfee CommonUpdater ftp site for my
department. There's a cron job that runs twice a day to see if there
are any changes. If so, it mirrors the site.

My problem is that LWP seems to maintain a cache I don't know about.
When I request a URL and call the Mechanize ->find_all_links() method,
it does not always return the correct, current list of links. (code
snippet further down)

I guess it's something basic, but the list archive didn't throw up
anything useful.

Thanks for your help,
Jakob.

PS: If there's a more appropriate group/forum/list, thanks for pointing
it out to me.

sub updateLocalRepository {
    # remove previously downloaded files
    chdir($mcafee_dir);
    remove qw( * );

    $ua->agent("McAfee AutoUpdate");
    $ua->get( "$mcafee_url1" );
    $ua->reload(); # an attempt to empty the cache
    my @links = $ua->find_all_links(); # doesn't return current links

    for my $link ( @links ) {
        my $url = $link->url_abs;
        my $filename = $url;
        if ($filename eq $mcafee_url2 ) {
            &fetchRecursively( $link );
        }
        unless ($filename eq $mcafee_host or $filename =~ "Current") {
            &fetchFile( $filename, $url );
        }
    }
    return 0;
}
 
brian d foy

jakob.fix said:
My problem is that LWP seems to maintain a cache I don't know about.
When I request a URL and call the Mechanize ->find_all_links() method,
it does not always return the correct, current list of links. (code
snippet further down)

I tend to think WWW::Mechanize is the wrong tool for this. You can
do things just as easily with something like HTML::SimpleLinkExtor,
and I know that doesn't use a hidden cache :)
 
jfix

Thanks for your response, brian.

Problem is I am not sure HTML::SimpleLinkExtor works for ftp sites. I
am using LWP because it's the only module I found to be able to access
the ftp site from behind our authenticated HTTP firewall.

In the beginning I tried Net::FTP::Recursive, together with Net::Config
for the firewall configuration, but this didn't work either.

Well, I am not exactly stuck as it works from time to time, but it
would be nice to actually understand why LWP doesn't return the actual
state. May have to plunge into the module code to understand ...
Again, thanks for your reply,
 
brian d foy

jfix said:
Thanks for your response, brian.

Problem is I am not sure HTML::SimpleLinkExtor works for ftp sites. I
am using LWP because it's the only module I found to be able to access
the ftp site from behind our authenticated HTTP firewall.

SimpleLinkExtor doesn't do anything with the network. It just
extracts links from the HTML you give it. You can use LWP like
you are, but dump WWW::Mechanize if you only need it to extract links.
You won't have to worry about all of its baggage.
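
A minimal sketch of that separation, assuming HTML::SimpleLinkExtor is installed from CPAN. The HTML string is hypothetical and stands in for whatever body LWP fetched; the extractor itself never touches the network.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical HTML, standing in for the body of an LWP response.
my $html = <<'HTML';
<html><body>
<a href="update.zip">update.zip</a>
<a href="Current/">Current</a>
</body></html>
HTML

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);

# links() returns every link found in the parsed HTML.
my @links = $extor->links;
print "$_\n" for @links;
```

In a real script the string passed to parse() would be the content of the response returned by the user agent, so stale data from a cache upstream would still show up here unchanged.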
 
jfix

It seems that LWP itself is the source of the problem. So if LWP gives
me outdated data, then SimpleLinkExtor won't be able to fix that.
Thanks anyway,
Jakob.
 
jfix

Just thinking: I am behind a proxy/firewall. It must be this proxy
that caches previous results. Will have to look into ways to force the
proxy to not return cached responses, but fresh data.
Any ideas?
 
jfix

Hi Steven,
First see

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9

Then use HTTP::Headers to construct the appropriate headers;
use HTTP::Request to construct your request.

Use LWP::UserAgent or "Mech" to dispatch your request.

Thanks for your response. I was aware of the different no-cache
mechanisms, but I didn't know that they can be put in the HTTP request
as well; I thought only the server could specify such a header.

Admittedly, not very original, but this is what I finally ended up
doing:

### create "no-cache" header
my $http_headers = HTTP::Headers->new;
$http_headers->header("Cache-Control" => "no-cache");

### let's pretend to be the real thing
$ua->agent("McAfee AutoUpdate");
$ua->default_headers( $http_headers );

Thanks again for your help.
 
Gisle Aas

jfix said:
Admittedly, not very original, but this is what I finally ended up
doing:

### create "no-cache" header
my $http_headers = HTTP::Headers->new;
$http_headers->header("Cache-Control" => "no-cache");

### let's pretend to be the real thing
$ua->agent("McAfee AutoUpdate");
$ua->default_headers( $http_headers );

Hint: you can shorten this as:

$ua->agent("McAfee AutoUpdate"); # pretend to be the real thing
$ua->default_header( Cache_Control => "no-cache" );
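
A minimal sketch of the short form, assuming LWP::UserAgent is installed. It makes no network request and only inspects the default headers the agent would send. The Pragma header is an assumption added for proxies that only speak HTTP/1.0; it is not part of the code in this thread.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent("McAfee AutoUpdate");

# HTTP::Headers translates '_' in field names to '-', so the bareword
# Cache_Control ends up as the Cache-Control header.
$ua->default_header( Cache_Control => "no-cache" );

# Assumption: Pragma is added only for proxies that speak HTTP/1.0.
$ua->default_header( Pragma => "no-cache" );

# Show the default headers that would accompany every request.
my $sent = $ua->default_headers->as_string;
print $sent;
```

Because WWW::Mechanize is a subclass of LWP::UserAgent, the same default_header calls should work on a Mechanize object as well.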
 
jfix

Gisle,
Hint: you can shorten this as:

$ua->agent("McAfee AutoUpdate"); # pretend to be the real thing
$ua->default_header( Cache_Control => "no-cache" );

you're right, of course, as regards the shortcutting. however, using
strict I cannot use the bareword Cache_Control.

thanks for your reply.
 
Anno Siegel

Please give an attribution.
Gisle,

you're right, of course, as regards the shortcutting. however, using
strict I cannot use the bareword Cache_Control.

Yes, you can. See "perldoc perldata", look for "=>".
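
A minimal sketch of the point, using only core Perl. The hash name %header is hypothetical and chosen just for illustration.

```perl
use strict;
use warnings;

# The => operator quotes an identifier on its left, so this compiles
# under "use strict" without a bareword error.
my %header = ( Cache_Control => "no-cache" );

# The key is the plain string "Cache_Control".
print join( ", ", map { "$_ => $header{$_}" } keys %header ), "\n";
```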

Anno
 
Anno Siegel

jfix said:
Anno,

excuse me, but I don't understand what you mean with this.

An attribution is a line that identifies the poster of the text you
are replying to. You did give one this time: "Anno Siegel wrote:".

Please see the posting guidelines to this group. They are posted
frequently.

Anno
 
