Parallel LWP callback doesn't terminate.

Peter Hill

Hi,
I'm trying to get web documents returned for analysis using the RobotUA part
of LWP::Parallel, but for some reason the callback function never completes;
specifically in the sample code below (output at end) the line
print "We never get here.\n";
is never executed, which is where I would expect to call my analysis code.
What dumb error am I committing?
TIA
Peter Hill

#! /usr/bin/perl -w
use strict;
use LWP::Parallel::RobotUA qw(:CALLBACK);
my $MAX_SIZE = 100000; #bytes

my $ua = LWP::Parallel::RobotUA->new('foobar/1.0','(e-mail address removed)');
$ua -> delay(0.5);
$ua -> in_order (1); # handle requests in order of registration
$ua -> duplicates(0); # ignore duplicates
$ua -> timeout (2); # in seconds
$ua -> redirect (1); # follow redirects
$ua -> max_hosts(5);
$ua -> max_req(5);

# register initial request
addURL('http://www.cpan.org/');
# this is the main (implicit) loop
my $something = $ua -> wait(15);

sub callback_for_parse {
    my ($content, $response, $protocol, $entry) = @_;
    print "handling answer from ", $response->request->url, ": ",
        length($content), " bytes, Code ", $response->code, ", ",
        $response->message, "\n";
    if (length $content) {
        print "... received chunk ", length($content), " bytes, type ",
            $response->content_type, "\n";
        $response->add_content($content);
        if (length($response->content) < $MAX_SIZE
            and $response->content_type =~ /text\/html/i) {
            print "... returning ", length($content), "\n";
            # print "content is :".$content."\n";
            print "response is :".$response."\n";
            print "protocol is :".$protocol."\n";
            print "entry is :".$entry."\n";
            return length $content;
        }
        else {
            print "oversize or not text/html: content-type is ",
                $response->content_type, "\n";
        }
    }
    print "We never get here.\n";
    return C_ENDCON;
}

sub addURL {
    my $url = shift;
    my $request = HTTP::Request->new('GET', $url);
    $ua->register($request, \&callback_for_parse);
    print "... registered request for $url\n";
}

# output
... registered request for http://www.cpan.org/
handling answer from http://www.cpan.org/: 4138 bytes, Code 200, OK
... received chunk 4138 bytes, type text/html
... returning 4138
response is :HTTP::Response=HASH(0x155b87c)
protocol is :LWP::Parallel::Protocol::http=HASH(0x2951c18)
entry is :LWP::Parallel::UserAgent::Entry=HASH(0x28e8da8)
handling answer from http://www.cpan.org/: 1665 bytes, Code 200, OK
... received chunk 1665 bytes, type text/html
... returning 1665
response is :HTTP::Response=HASH(0x155b87c)
protocol is :LWP::Parallel::Protocol::http=HASH(0x2951c18)
entry is :LWP::Parallel::UserAgent::Entry=HASH(0x28e8da8)
 
xhoster

Peter Hill said:
Hi,
I'm trying to get web documents returned for analysis using the RobotUA
part of LWP::Parallel, but for some reason the callback function never
completes; specifically in the sample code below (output at end) the
line print "We never get here.\n";
is never executed, which is where I would expect to call my analysis
code. What dumb error am I committing?
....
sub callback_for_parse {
my ($content, $response, $protocol, $entry) = @_;
if (length $content) {
if (length($response->content) < $MAX_SIZE and
$response->content_type =~ /text\/html/i) { ....
return length $content;
}
else{
print "oversize or not text/html: content-type is ".
$response -> content_type."\n";
}
}
print "We never get here.\n";
return C_ENDCON;
}

As far as I can tell, the only error you are committing is in your
expectations, not in your code. The only way "We never get here"
should be printed is if you either get called with empty content (and why
would that happen? If there is nothing to send to the callback, why
call it?), or with an over-sized chunk. Otherwise, the
"return length $content;" will be activated, by-passing the print.

Xho
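
To make the control flow above concrete, here is a minimal sketch that involves no LWP code at all. It assumes a stripped-down copy of the branching in callback_for_parse and a made-up driver; simulated_callback, feed and FAKE_ENDCON are hypothetical names used only for illustration, and the response object is reduced to a plain hash.

```perl
use strict;
use warnings;

# Stand-in for the C_ENDCON constant exported by LWP::Parallel.
# The value is an assumption; only "negative means stop" matters here.
use constant FAKE_ENDCON => -1;

my $MAX_SIZE = 100_000;    # bytes, as in the original script

# Same branching as callback_for_parse, with the response reduced to a hash.
sub simulated_callback {
    my ($content, $state) = @_;
    if (length $content) {
        $state->{body} .= $content;
        if (length($state->{body}) < $MAX_SIZE
            and $state->{type} =~ m{text/html}i) {
            return length $content;    # early return: the tail is skipped
        }
    }
    $state->{reached_tail}++;          # the "We never get here" line
    return FAKE_ENDCON;
}

# Hypothetical driver: feeds chunks in order, stops on a negative return.
sub feed {
    my ($state, @chunks) = @_;
    for my $chunk (@chunks) {
        last if simulated_callback($chunk, $state) < 0;
    }
    return $state->{reached_tail} // 0;
}

my @chunks = ('x' x 4138, 'y' x 1665);
print "no empty final chunk: tail reached ",
    feed({ body => '', type => 'text/html' }, @chunks), " time(s)\n";
print "with empty final chunk: tail reached ",
    feed({ body => '', type => 'text/html' }, @chunks, ''), " time(s)\n";
```

With two non-empty text/html chunks the tail is never reached; it is reached only when an empty chunk is delivered, or when the size or content-type test fails.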
 
Peter Hill

xhoster said:
As far as I can tell, the only error you are committing is in your
expectations, not in your code. The only way "We never get here"
should be printed is if you either get called with empty content (and why
would that happen? If there is nothing to send to the callback, why
call it?), or with an over-sized chunk. Otherwise, the
"return length $content;" will be activated, by-passing the print.

Xho

Yes, thank you, that makes perfect sense. I was basing the callback function
on an article by Randal Schwartz ("Parallel Bad Links") but I can now see
that that doesn't work either; something must have changed since the article
was written. It appears that I need to do my analysis on each chunk as it is
returned rather than expecting to deal with a complete document.

Thanks,
Peter Hill.
 
xhoster

Peter Hill said:
Yes, thank you, that makes perfect sense. I was basing the callback
function on an article by Randal Schwartz ("Parallel Bad Links") but I can
now see that that doesn't work either; something must have changed since
the article was written.

Yep, I see that it used to call the callback one final time with
zero content length, but it no longer does. The change seems to have
happened in 2.54_19, in LWP/Parallel/Protocol/http.pm, with this line:

if ( $response && &headers($response) && length($buf)) {

The "&& length($buf)" didn't use to be there. It says it is a bug fix, but
now I'm starting to think it was more of a bug-introduction :)

Peter Hill said:
It appears that I need to do my analysis on each
chunk as it is returned rather than expecting to deal with a complete
document.

I believe you would now override the on_return in order to process the
"complete" document, but you could still use the callback to mark the
document as "complete" after maxsize is reached, even if it isn't truly
complete. (There is supposed to be a max_size attribute which will
automatically cap the size without needing to use callbacks at all, but
there doesn't seem to be any supported way to set this attribute!)

Xho
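
A rough sketch of what such an on_return override might look like is below. It assumes the ($self, $request, $response, $entry) argument list shown in the LWP::Parallel::UserAgent documentation, and that the body has already been accumulated in the response (for example via add_content in the callback, as in the original script). MyRobotUA, analyze_document and the my_results key are hypothetical names, and the eval guard is there only so the sketch loads on a machine without LWP::Parallel installed.

```perl
package MyRobotUA;
use strict;
use warnings;

# Assumption: LWP::Parallel is installed. The eval only keeps this sketch
# loadable without it; real code would simply "use parent" the class.
our @ISA;
BEGIN {
    @ISA = ('LWP::Parallel::RobotUA')
        if eval { require LWP::Parallel::RobotUA; 1 };
}

# Hypothetical analysis hook: replace the body with the real analysis code.
sub analyze_document {
    my ($url, $content) = @_;
    return sprintf "%s: %d bytes", $url, length $content;
}

# Invoked once per request when the connection (or its callback) finishes,
# so the accumulated document can be handled here in one piece.
sub on_return {
    my ($self, $request, $response, $entry) = @_;
    if ($response->is_success
        and $response->content_type =~ m{text/html}i) {
        my $summary = analyze_document($request->url, $response->content);
        push @{ $self->{my_results} }, $summary;    # hypothetical storage key
        print "$summary\n";
    }
    return;
}

package main;

# Usage sketch (needs network access, so it is left commented out):
# my $ua = MyRobotUA->new('foobar/1.0', 'me@example.invalid');
# $ua->register(HTTP::Request->new(GET => 'http://www.cpan.org/'));
# $ua->wait(15);
```

The per-chunk callback can still return C_ENDCON once the size limit is passed; in that case on_return would see whatever was accumulated up to that point rather than the full document.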
 
Peter Hill

xhoster said:
Yep, I see that it used to call the callback one final time with
zero content length, but it no longer does. The change seems to have
happened in 2.54_19, in LWP/Parallel/Protocol/http.pm, with this line:

if ( $response && &headers($response) && length($buf)) {

The "&& length($buf)" didn't use to be there. It says it is a bug fix, but
now I'm starting to think it was more of a bug-introduction :)

I believe you would now override the on_return in order to process the
"complete" document, but you could still use the callback to mark the
document as "complete" after maxsize is reached, even if it isn't truly
complete. (There is supposed to be a max_size attribute which will
automatically cap the size without needing to use callbacks at all, but
there doesn't seem to be any supported way to set this attribute!)

Xho

Thanks again for the detective work; I can see the way forward now, with an
overridden on_return.
Regards,
Peter Hill
 
