Parallel LWP callback doesn't terminate.

Discussion in 'Perl Misc' started by Peter Hill, Mar 23, 2006.

  1. Peter Hill

    Peter Hill Guest

    Hi,
    I'm trying to get web documents returned for analysis using the RobotUA part
    of LWP::Parallel, but for some reason the callback function never completes;
    specifically in the sample code below (output at end) the line
    print "We never get here.\n";
    is never executed, which is where I would expect to call my analysis code.
    What dumb error am I committing?
    TIA
    Peter Hill

    #! /usr/bin/perl -w
    use strict;
    use LWP::Parallel::RobotUA qw(:CALLBACK);
    my $MAX_SIZE = 100000; # bytes

    my $ua = LWP::Parallel::RobotUA->new('foobar/1.0','');
    $ua -> delay(0.5);
    $ua -> in_order (1); # handle requests in order of registration
    $ua -> duplicates(0); # ignore duplicates
    $ua -> timeout (2); # in seconds
    $ua -> redirect (1); # follow redirects
    $ua -> max_hosts(5);
    $ua -> max_req(5);

    # register initial request
    addURL('http://www.cpan.org/');
    # this is the main (implicit) loop
    my $something = $ua -> wait(15);

    sub callback_for_parse {
        my ($content, $response, $protocol, $entry) = @_;
        print "handling answer from ",$response->request->url,": ",
            length($content), " bytes, Code ", $response->code, ", ",
            $response->message,"\n";
        if (length $content) {
            print "... received chunk ",length($content),
                " bytes, type ".$response->content_type."\n";
            $response->add_content($content);
            if (length($response->content) < $MAX_SIZE and
                $response->content_type =~ /text\/html/i) {
                print "... returning ",length($content)."\n";
                # print "content is :".$content."\n";
                print "response is :".$response."\n";
                print "protocol is :".$protocol."\n";
                print "entry is :".$entry."\n";
                return length $content;
            }
            else {
                print "oversize or not text/html: content-type is ".
                    $response->content_type."\n";
            }
        }
        print "We never get here.\n";
        return C_ENDCON;
    }

    sub addURL {
        my $url = shift;
        my $request = new HTTP::Request('GET', $url);
        $ua -> register($request,\&callback_for_parse);
        print "... registered request for $url\n";
    }

    # output
    ... registered request for http://www.cpan.org/
    handling answer from http://www.cpan.org/: 4138 bytes, Code 200, OK
    ... received chunk 4138 bytes, type text/html
    ... returning 4138
    response is :HTTP::Response=HASH(0x155b87c)
    protocol is :LWP::Parallel::Protocol::http=HASH(0x2951c18)
    entry is :LWP::Parallel::UserAgent::Entry=HASH(0x28e8da8)
    handling answer from http://www.cpan.org/: 1665 bytes, Code 200, OK
    ... received chunk 1665 bytes, type text/html
    ... returning 1665
    response is :HTTP::Response=HASH(0x155b87c)
    protocol is :LWP::Parallel::Protocol::http=HASH(0x2951c18)
    entry is :LWP::Parallel::UserAgent::Entry=HASH(0x28e8da8)
    Peter Hill, Mar 23, 2006
    #1

  2. Xho

    Guest

    "Peter Hill" <> wrote:
    > Hi,
    > I'm trying to get web documents returned for analysis using the RobotUA
    > part of LWP::Parallel, but for some reason the callback function never
    > completes; specifically in the sample code below (output at end) the
    > line print "We never get here.\n";
    > is never executed, which is where I would expect to call my analysis
    > code. What dumb error am I committing?


    ....
    > sub callback_for_parse {
    > my ($content, $response, $protocol, $entry) = @_;
    > if (length $content) {
    > if (length($response->content) < $MAX_SIZE and
    > $response->content_type =~ /text\/html/i) {

    ....
    > return length $content;
    > }
    > else{
    > print "oversize or not text/html: content-type is ".
    > $response -> content_type."\n";
    > }
    > }
    > print "We never get here.\n";
    > return C_ENDCON;
    > }
    >


    As far as I can tell, the only error you are committing is in your
    expectations, not in your code. The only way "We never get here"
    should be printed is if you either get called with empty content (and why
    would that happen? If there is nothing to send to the callback, why
    call it?), or with an over-sized chunk. Otherwise, the
    "return length $content;" will be activated, by-passing the print.
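
To make that control flow concrete, here is a minimal stand-alone sketch. It does not use LWP::Parallel at all: `callback_path` is a hypothetical helper that mirrors the branch structure of `callback_for_parse`, takes plain values instead of an HTTP::Response object, and returns a label instead of printing. The chunk sizes are taken from the sample output; the oversize and `image/png` cases are made up for illustration.

```perl
use strict;
use warnings;

my $MAX_SIZE = 100_000;    # bytes, same limit as the original post

# Hypothetical helper: same branches as callback_for_parse, but it returns
# a label describing which path was taken instead of printing.
sub callback_path {
    my ($chunk, $bytes_so_far, $content_type) = @_;
    if (length $chunk) {
        my $total = $bytes_so_far + length($chunk);
        if ($total < $MAX_SIZE and $content_type =~ m{text/html}i) {
            return 'early return';        # the "return length $content;" path
        }
    }
    return 'falls through to print';      # the "We never get here." path
}

my @cases = (
    [ 'x' x 4138, 0,      'text/html' ],  # first chunk from the sample output
    [ 'x' x 1665, 4138,   'text/html' ],  # second chunk
    [ '',         5803,   'text/html' ],  # empty chunk
    [ 'x' x 4096, 99_000, 'text/html' ],  # pushes the total past $MAX_SIZE
    [ 'x' x 512,  0,      'image/png' ],  # not text/html
);

for my $case (@cases) {
    my ($chunk, $so_far, $type) = @$case;
    printf "%5d bytes, %-9s => %s\n", length($chunk), $type,
        callback_path($chunk, $so_far, $type);
}
```

Only the last three cases reach the final statement, which matches the observation that two ordinary HTML chunks never print the message.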

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    , Mar 23, 2006
    #2

  3. Peter Hill

    Peter Hill Guest

    <> wrote in message
    news:<20060323110859.954$>...
    > "Peter Hill" <> wrote:

    [snip]
    > > print "We never get here.\n";
    > > return C_ENDCON;
    > > }
    > >

    >
    > As far as I can tell, the only error you are committing is in your
    > expectations, not in your code. The only way "We never get here"
    > should be printed is if you either get called with empty content (and why
    > would that happen? If there is nothing to send to the callback, why
    > call it?), or with an over-sized chunk. Otherwise, the
    > "return length $content;" will be activated, by-passing the print.
    >
    > Xho


    Yes, thank you, that makes perfect sense. I was basing the callback function
    on an article by Randal Schwartz ("Parallel Bad Links") but I can now see
    that that doesn't work either; something must have changed since the article
    was written. It appears that I need to do my analysis on each chunk as it is
    returned rather than expecting to deal with a complete document.

    Thanks,
    Peter Hill.
    Peter Hill, Mar 24, 2006
    #3
  4. Xho

    Guest

    "Peter Hill" <> wrote:
    > <> wrote in message
    > news:<20060323110859.954$>...
    > > "Peter Hill" <> wrote:

    > [snip]
    > > > print "We never get here.\n";
    > > > return C_ENDCON;
    > > > }
    > > >

    > >
    > > As far as I can tell, the only error you are committing is in your
    > > expectations, not in your code. The only way "We never get here"
    > > should be printed is if you either get called with empty content (and
    > > why would that happen? If there is nothing to send to the callback,
    > > why call it?), or with an over-sized chunk. Otherwise, the
    > > "return length $content;" will be activated, by-passing the print.
    > >


    > Yes, thank you, that makes perfect sense. I was basing the callback
    > function on an article by Randal Schwartz ("Parallel Bad Links") but I can
    > now see that that doesn't work either; something must have changed since
    > the article was written.


    Yep, I see that it used to call the callback one final time with
    zero content length, but it no longer does. The change seems to have
    happened in 2.54_19, in LWP/Parallel/Protocol/http.pm, with this line:

    if ( $response && &headers($response) && length($buf)) {

    The "&& length($buf)" didn't use to be there. It is described as a bug fix,
    but now I'm starting to think it was more of a bug-introduction :)
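
The effect of that extra condition can be illustrated with a small simulation. This is not the code from http.pm: `deliver` is a hypothetical stand-in for the read loop, and the `$skip_empty` flag plays the role of the added `length($buf)` check. The buffer sizes are the two chunks from the sample output followed by an empty read.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the read loop: hands each buffer to the callback,
# optionally skipping empty buffers the way a length($buf) guard would.
sub deliver {
    my ($buffers, $callback, $skip_empty) = @_;
    for my $buf (@$buffers) {
        next if $skip_empty and !length($buf);
        $callback->($buf);
    }
    return;
}

# Two data chunks followed by the empty read that marks end of input.
my @buffers = ('a' x 4138, 'b' x 1665, '');

for my $skip_empty (0, 1) {
    my @seen;
    deliver(\@buffers, sub { push @seen, length($_[0]) }, $skip_empty);
    my $label = $skip_empty ? 'with guard' : 'without guard';
    printf "%-14s callback saw chunk lengths: %s\n", $label, join(', ', @seen);
}
```

Without the guard the callback is invoked a third time with a zero-length chunk; with the guard that final call never happens, so a callback that waits for empty content to signal completion is never told the document has ended.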

    > It appears that I need to do my analysis on each
    > chunk as it is returned rather than expecting to deal with a complete
    > document.


    I believe you would now override the on_return in order to process the
    "complete" document, but you could still use the callback to mark the
    document as "complete" after maxsize is reached, even if it isn't truly
    complete. (There is supposed to be a max_size attribute which will
    automatically cap the size without needing to use callbacks at all, but
    there doesn't seem to be any supported way to set this attribute!)
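
A rough sketch of the on_return approach follows. It assumes the hook is called as `on_return($request, $response, $entry)`, as described in the LWP::Parallel::UserAgent documentation, and that the chunk callback has been accumulating the body with `add_content` as in the original post. So that it runs without LWP installed, `Sketch::Response` and `Sketch::UA` are hypothetical stand-ins: a real program would subclass LWP::Parallel::RobotUA and receive genuine HTTP::Request and HTTP::Response objects.

```perl
use strict;
use warnings;

# Stand-in for HTTP::Response with only the three methods used below.
# The real is_success accepts any 2xx code; this one is simplified.
package Sketch::Response;
sub new          { my ($class, %args) = @_; return bless {%args}, $class }
sub is_success   { return $_[0]{code} == 200 }
sub content_type { return $_[0]{content_type} }
sub content      { return $_[0]{content} }

# In a real program this package would inherit from LWP::Parallel::RobotUA;
# inheritance is left out here so the sketch has no dependencies.
package Sketch::UA;
my $MAX_SIZE = 100_000;    # bytes, as in the original post

sub new { my ($class) = @_; return bless { analysed => [] }, $class }

# Called once per finished request, so the whole accumulated body can be
# handed to the analysis code here instead of in the chunk callback.
sub on_return {
    my ($self, $request, $response, $entry) = @_;
    return unless $response->is_success;
    return unless $response->content_type =~ m{text/html}i;
    my $body = substr($response->content, 0, $MAX_SIZE);  # cap what we analyse
    push @{ $self->{analysed} }, { url => $request, bytes => length($body) };
    return;
}

package main;

my $ua = Sketch::UA->new;
$ua->on_return(
    'http://www.cpan.org/',    # a real call passes an HTTP::Request here
    Sketch::Response->new(
        code => 200, content_type => 'text/html', content => 'x' x 5803,
    ),
    undef,                     # the entry object is not needed in this sketch
);
printf "%s: %d bytes ready for analysis\n", $_->{url}, $_->{bytes}
    for @{ $ua->{analysed} };
```

With this split the chunk callback only accumulates content and enforces the size limit, while the per-document analysis happens once in on_return.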

    Xho

    , Mar 24, 2006
    #4
  5. Peter Hill

    Peter Hill Guest

    <> wrote in message
    news:20060324121011.116$...
    > "Peter Hill" <> wrote:
    > > <> wrote in message
    > > news:<20060323110859.954$>...
    > > > "Peter Hill" <> wrote:

    > > [snip]
    > > > > print "We never get here.\n";
    > > > > return C_ENDCON;
    > > > > }
    > > > >
    > > >
    > > > As far as I can tell, the only error you are committing is in your
    > > > expectations, not in your code. The only way "We never get here"
    > > > should be printed is if you either get called with empty content (and
    > > > why would that happen? If there is nothing to send to the callback,
    > > > why call it?), or with an over-sized chunk. Otherwise, the
    > > > "return length $content;" will be activated, by-passing the print.
    > > >

    >
    > > Yes, thank you, that makes perfect sense. I was basing the callback
    > > function on an article by Randal Schwartz ("Parallel Bad Links") but I
    > > can now see that that doesn't work either; something must have changed
    > > since the article was written.

    >
    > Yep, I see that it used to call the callback one final time with
    > zero content length, but it no longer does. The change seems to have
    > happened in 2.54_19, in LWP/Parallel/Protocol/http.pm, with this line:
    >
    > if ( $response && &headers($response) && length($buf)) {
    >
    > The "&& length($buf)" didn't use to be there. It is described as a bug
    > fix, but now I'm starting to think it was more of a bug-introduction :)
    >
    > > It appears that I need to do my analysis on each
    > > chunk as it is returned rather than expecting to deal with a complete
    > > document.

    >
    > I believe you would now override the on_return in order to process the
    > "complete" document, but you could still use the callback to mark the
    > document as "complete" after maxsize is reached, even if it isn't truly
    > complete. (There is supposed to be a max_size attribute which will
    > automatically cap the size without needing to use callbacks at all, but
    > there doesn't seem to be any supported way to set this attribute!)
    >
    > Xho


    Thanks again for the detective work; I can see the way forward now, with an
    overridden on_return.
    Regards,
    Peter Hill
    Peter Hill, Mar 25, 2006
    #5