Parallel LWP callback doesn't terminate.

Peter Hill · Mar 23, 2006

Hi,
I'm trying to get web documents returned for analysis using the RobotUA part
of LWP::Parallel, but for some reason the callback function never completes;
specifically in the sample code below (output at end) the line
print "We never get here.\n";
is never executed, which is where I would expect to call my analysis code.
What dumb error am I committing?
TIA
Peter Hill

#! /usr/bin/perl -w
use strict;
use LWP::Parallel::RobotUA qw(:CALLBACK);
my $MAX_SIZE = 100000; #bytes

my $ua = LWP::Parallel::RobotUA->new('foobar/1.0','(e-mail address removed)');
$ua -> delay(0.5);
$ua -> in_order (1); # handle requests in order of registration
$ua -> duplicates(0); # ignore duplicates
$ua -> timeout (2); # in seconds
$ua -> redirect (1); # follow redirects
$ua -> max_hosts(5);
$ua -> max_req(5);

# register initial request
addURL('http://www.cpan.org/');
# this is the main (implicit) loop
my $something = $ua -> wait(15);

sub callback_for_parse {
    my ($content, $response, $protocol, $entry) = @_;
    print "handling answer from ",$response->request->url,": ",
        length($content), " bytes, Code ", $response->code, ", ",
        $response->message,"\n";
    if (length $content) {
        print "... received chunk ",length($content)," bytes, type ".
            $response->content_type."\n";
        $response->add_content($content);
        if (length($response->content) < $MAX_SIZE
            and $response->content_type =~ /text\/html/i) {
            print "... returning ",length($content)."\n";
            # print "content is :".$content."\n";
            print "response is :".$response."\n";
            print "protocol is :".$protocol."\n";
            print "entry is :".$entry."\n";
            return length $content;
        }
        else {
            print "oversize or not text/html: content-type is ".
                $response->content_type."\n";
        }
    }
    print "We never get here.\n";
    return C_ENDCON;
}

sub addURL {
    my $url = shift;
    my $request = HTTP::Request->new('GET', $url);
    $ua->register($request, \&callback_for_parse);
    print "... registered request for $url\n";
}

# output
.... registered request for http://www.cpan.org/
handling answer from http://www.cpan.org/: 4138 bytes, Code 200, OK
.... received chunk 4138 bytes, type text/html
.... returning 4138
response is :HTTP::Response=HASH(0x155b87c)
protocol is :LWP::Parallel::Protocol::http=HASH(0x2951c18)
entry is :LWP::Parallel::UserAgent::Entry=HASH(0x28e8da8)
handling answer from http://www.cpan.org/: 1665 bytes, Code 200, OK
.... received chunk 1665 bytes, type text/html
.... returning 1665
response is :HTTP::Response=HASH(0x155b87c)
protocol is :LWP::Parallel::Protocol::http=HASH(0x2951c18)
entry is :LWP::Parallel::UserAgent::Entry=HASH(0x28e8da8)

xhoster · Mar 23, 2006

Peter Hill said:
Hi,
I'm trying to get web documents returned for analysis using the RobotUA
part of LWP::Parallel, but for some reason the callback function never
completes; specifically in the sample code below (output at end) the
line print "We never get here.\n";
is never executed, which is where I would expect to call my analysis
code. What dumb error am I committing?
....
sub callback_for_parse {
my ($content, $response, $protocol, $entry) = @_;
if (length $content) {
if (length($response->content) < $MAX_SIZE and
$response->content_type =~ /text\/html/i) { ....
return length $content;
}
else{
print "oversize or not text/html: content-type is ".
$response -> content_type."\n";
}
}
print "We never get here.\n";
return C_ENDCON;
}

As far as I can tell, the only error you are committing is in your
expectations, not in your code. The only way "We never get here"
should be printed is if you either get called with empty content (and why
would that happen? If there is nothing to send to the callback, why
call it?), or with an over-sized chunk. Otherwise, the
"return length $content;" will be activated, by-passing the print.
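To make the control flow concrete, here is a minimal sketch that keeps the branch structure of callback_for_parse but replaces the LWP objects with plain scalars. The sub name callback_branch, the @trace array, and the string returned in place of C_ENDCON are made up for illustration; nothing here calls LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $MAX_SIZE = 100000;    # bytes, as in the original post
my @trace;                # records which branch each call took

# Same branch structure as callback_for_parse; the response object is
# replaced by a running byte total and a content-type string (assumption).
sub callback_branch {
    my ($chunk, $total_so_far, $content_type) = @_;
    if (length $chunk) {
        if ($total_so_far + length($chunk) < $MAX_SIZE
            and $content_type =~ m{text/html}i) {
            push @trace, 'early return';
            return length $chunk;    # the sub ends here for a normal chunk
        }
        push @trace, 'oversize or not text/html';
    }
    push @trace, 'end of sub';       # where "We never get here" would print
    return 'C_ENDCON';               # placeholder string, not the real constant
}

callback_branch('x' x 4138, 0,    'text/html');   # first chunk from the post
callback_branch('x' x 1665, 4138, 'text/html');   # second chunk from the post
callback_branch('',         5803, 'text/html');   # empty chunk
callback_branch('x' x 10,   0,    'image/png');   # wrong content type

print "$_\n" for @trace;
```

Only the last two calls reach the end of the sub; the two chunks that match the posted output both leave through the early return.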

Xho

Peter Hill · Mar 24, 2006

xhoster said:
As far as I can tell, the only error you are committing is in your
expectations, not in your code. The only way "We never get here"
should be printed is if you either get called with empty content (and why
would that happen? If there is nothing to send to the callback, why
call it?), or with an over-sized chunk. Otherwise, the
"return length $content;" will be activated, by-passing the print.

Xho

Yes, thank you, that makes perfect sense. I was basing the callback function
on an article by Randal Schwartz ("Parallel Bad Links") but I can now see
that that doesn't work either; something must have changed since the article
was written. It appears that I need to do my analysis on each chunk as it is
returned rather than expecting to deal with a complete document.

Thanks,
Peter Hill.

xhoster · Mar 24, 2006

Peter Hill said:
Yes, thank you, that makes perfect sense. I was basing the callback
function on an article by Randal Schwartz ("Parallel Bad Links") but I can
now see that that doesn't work either; something must have changed since
the article was written.

Yep, I see that it used to call the callback one final time with
zero content length, but it no longer does. The change seems to have
happened in 2.54_19, in LWP/Parallel/Protocol/http.pm, with this line:

if ( $response && &headers($response) && length($buf)) {

The "&& length($buf)" didn't use to be there. It says it is a bug fix, but
now I'm starting to think it was more of a bug introduction.
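A simplified model of the difference, not the module's actual read loop: the hypothetical deliver() sub below hands each read buffer to a callback, and the $skip_empty flag stands in for the added "&& length($buf)" test. The chunk sizes are the ones from the posted output, followed by an empty final read.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two data reads followed by an empty read at end of stream (assumed model).
my @reads = ('x' x 4138, 'y' x 1665, '');

# Hypothetical dispatcher: calls $callback once per buffer, optionally
# skipping empty buffers the way the "&& length($buf)" condition would.
sub deliver {
    my ($reads, $callback, $skip_empty) = @_;
    my $calls = 0;
    for my $buf (@$reads) {
        next if $skip_empty and !length $buf;
        $callback->($buf);
        $calls++;
    }
    return $calls;
}

my (@old_sizes, @new_sizes);
deliver(\@reads, sub { push @old_sizes, length $_[0] }, 0);   # no length test
deliver(\@reads, sub { push @new_sizes, length $_[0] }, 1);   # with length test

print "without length test: @old_sizes\n";   # 4138 1665 0
print "with length test:    @new_sizes\n";   # 4138 1665
```

In this model the zero-length call is the only one in which a callback written like callback_for_parse would fall through to its last two lines, so dropping it means the end of the document is never signalled to the callback.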

Peter Hill said:
It appears that I need to do my analysis on each
chunk as it is returned rather than expecting to deal with a complete
document.

I believe you would now override the on_return in order to process the
"complete" document, but you could still use the callback to mark the
document as "complete" after maxsize is reached, even if it isn't truly
complete. (There is supposed to be a max_size attribute which will
automatically cap the size without needing to use callbacks at all, but
there doesn't seem to be any supported way to set this attribute!)
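Roughly, the override pattern looks like the sketch below. To keep it runnable without the module or a network connection it uses a tiny stand-in base class; MockUA, AnalysingUA, and %analysed are made-up names, and the hook here receives just a URL and the joined content. As I read the LWP::Parallel::UserAgent documentation, the real hook is called as on_return($request, $response, $entry), so treat the argument list below as an assumption of the mock, not the module's API.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the parallel user agent; not part of LWP.
package MockUA;
sub new { my $class = shift; return bless { queue => [] }, $class }
sub register {
    my ($self, $url, @chunks) = @_;
    push @{ $self->{queue} }, { url => $url, chunks => [@chunks] };
    return;
}
sub on_return { return }    # default hook does nothing
sub wait {
    my $self = shift;
    for my $entry (@{ $self->{queue} }) {
        # Once all chunks have arrived, call the hook a single time.
        my $content = join '', @{ $entry->{chunks} };
        $self->on_return($entry->{url}, $content);
    }
    return;
}

# Subclass that does its analysis in the overridden hook.
package AnalysingUA;
our @ISA = ('MockUA');
our %analysed;              # url => size of the complete document
sub on_return {
    my ($self, $url, $content) = @_;
    $analysed{$url} = length $content;   # real analysis would go here
    return;
}

package main;
my $ua = AnalysingUA->new;
$ua->register('http://www.cpan.org/', 'a' x 4138, 'b' x 1665);
$ua->wait;
print "$_: $AnalysingUA::analysed{$_} bytes\n"
    for sort keys %AnalysingUA::analysed;
```

The point of the pattern is that the per-chunk callback only accumulates data, while the hook runs once per finished request and sees the whole document.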

Xho

Peter Hill · Mar 25, 2006

xhoster said:
Yep, I see that it used to call the callback one final time with
zero content length, but it no longer does. The change seems to have
happened in 2.54_19, in LWP/Parallel/Protocol/http.pm, with this line:

if ( $response && &headers($response) && length($buf)) {

The "&& length($buf)" didn't use to be there. It says it is a bug fix, but
now I'm starting to think it was more of a bug introduction.

I believe you would now override the on_return in order to process the
"complete" document, but you could still use the callback to mark the
document as "complete" after maxsize is reached, even if it isn't truly
complete. (There is supposed to be a max_size attribute which will
automatically cap the size without needing to use callbacks at all, but
there doesn't seem to be any supported way to set this attribute!)

Xho

Thanks again for the detective work; I can see the way forward now, with an
overridden on_return.
Regards,
Peter Hill
