reading LWP in chunks


Klaus

Hi Perl programmers,

I am trying to write a Module (its name will be LWP::Chunk) to
read arbitrarily big http-files sequentially in small chunks.

Let me give an example:

With the existing module LWP::UserAgent, you can say:

use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $response = $ua->get('http://search.cpan.org/');

With my new module LWP::Chunk (this module still needs to be written),
you would be able to say:

use LWP::Chunk;
my $ck = LWP::Chunk->new('http://search.cpan.org/',
    {csize => 1024, timeout => 10});
my $container = '';
while ($ck->read_chunk) {
    $container .= $ck->buffer;
    # do whatever you want to do here,
    # you are even allowed to go last;
}
if ($ck->there_was_an_error) {
    die "There has been an error (code=".$ck->errcode.")";
}
# here we have the data in $container

The problem is that I am stuck with writing method $ck->read_chunk. (I
want to read the next chunk of 1024 bytes).

I had a look at LWP::Simple and at LWP::UserAgent, but I could not
find any code that reads just the next 1024 bytes from
'http://search.cpan.org/'. (I don't want to read the whole data in one
go; I would rather read it in smaller chunks.)

Can anybody point me to the LWP-internals (maybe LWP::UserAgent,
HTTP::Request, HTTP::Response, etc... ???) which reads a chunk of
data ?

Thanks in advance.
 
Klaus

This already exists :). See the ->add_handler method of LWP::UserAgent
and the :content_cb and :read_size_hint parameters to ->get.
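For readers following along, the callback interface mentioned above can be sketched roughly as follows. This is only an illustration: fetch_in_chunks is a made-up helper name, not part of LWP, and the size passed as ':read_size_hint' is a hint, so the pieces handed to the callback are not guaranteed to be exactly that long.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# fetch_in_chunks is a hypothetical helper, not part of LWP.
# It fetches $url and hands the body to $on_chunk piece by piece.
sub fetch_in_chunks {
    my ($url, $size_hint, $on_chunk) = @_;
    my $ua = LWP::UserAgent->new(timeout => 10);
    $ua->env_proxy;
    return $ua->get(
        $url,
        ':read_size_hint' => $size_hint,   # a hint only, not a guarantee
        ':content_cb'     => sub {
            # LWP passes the data, the response object and the protocol object
            my ($data, $response, $protocol) = @_;
            $on_chunk->($data);
        },
    );
}

# Sketch of use (commented out so nothing touches the network on load):
#   my $container = '';
#   my $res = fetch_in_chunks('http://search.cpan.org/', 1024,
#                             sub { $container .= $_[0] });
#   die $res->status_line unless $res->is_success;
```

Note that the option names start with a colon, so they have to be quoted; the fat comma does not auto-quote them.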

Yes, I can see the add_handler method in LWP::UserAgent (and I can
also see the response_data => sub {...} section which, I think is most
interesting for my purposes):

There ** must ** be a loop somewhere deep inside LWP::UserAgent->get()
that says "...while (read_chunk(read_size)) { &response_data->(...); }
..."

At this point, I want to go ** deep ** into the guts of
LWP::UserAgent->get ( -- that could be inside LWP::UserAgent, inside
HTTP::Request, inside HTTP::Response, etc... -- ) to find that loop,
rip out that line that says "read_chunk()" and stick it into my new
module "LWP::Chunk" -- of course, I will make sure to read the license
document before ripping out anything.

I have dived into LWP::UserAgent, HTTP::Request, HTTP::Response, but I
can't find that elusive "...while (read_chunk(read_size))
{ &response_data->(...); } ..."

Can anybody cure my blindness ?
 
Klaus

Why? What is wrong with using the existing API?

Nothing, it's just yet another TIMTOWTDI for reading LWP. I personally
prefer reading in chunks using my own while-construct, while others
might prefer a simple call to LWP::UserAgent->get(...) using
callbacks.

I simply want to provide an LWP-module for those who prefer writing
their own while-construct, but to be honest, apart from myself, I
don't know how many there are amongst the perl user community who
prefer writing their own while-construct.
That loop is in LWP::Protocol::collect, which is called from
LWP::Protocol::http::request (which passes a callback to do the actual
reading).

Thank you very much for this nugget of information. This is most
useful for my future module LWP::Chunk.
 
Jim Gibson

Klaus said:
Nothing, it's just yet another TIMTOWTDI for reading LWP. I personally
prefer reading in chunks using my own while-construct, while others
might prefer a simple call to LWP::UserAgent->get(...) using
callbacks.

I believe Ben is suggesting that you implement your LWP::Chunk module
and its while-construct by using the existing add_handler method of
LWP::UserAgent, rather than extracting the code from there and putting
it into your own module. This is known as adding a "layer", and is
commonly done to make a complicated interface easier to use for some
common purpose.

The advantage of using an existing module is that you take advantage of
work already done. You also get to use any improvements or bug fixes of
the module you are using.

The disadvantage is that you are then dependent upon that module. If
the API ever changes, you will need to change your module. If support
for the module is ever dropped, you may need to rewrite your own
module. However, since LWP::UserAgent is a widely-used, mature module,
neither of these circumstances is likely.
I simply want to provide an LWP-module for those who prefer writing
their own while-construct, but to be honest, apart from myself, I
don't know how many there are amongst the perl user community who
prefer writing their own while-construct.

Nobody is suggesting that you do not write a module, just that you use
existing code as is, rather than extracting it and copying it.
 
Klaus

Yes, precisely. It would also be rather difficult to extract that bit of
functionality without losing a lot of the flexibility of LWP.

Yes, I can see now that ripping out that bit of functionality and
sticking it into my new module is very difficult, there is a lot of
flexibility in the LWP modules that would need to be re-/reverse-
engineered in my new module LWP::Chunk. I don't think I am able to
provide that flexibility in LWP::Chunk.

I don't think that dependence as such is a problem. I would love to be
dependent on a module LWP::UserAgent, but unfortunately I can't (-->
see my problem below)
Well, I kinda was suggesting Klaus didn't write a module :). I'm not
sure I see that

    $UA->get($url, ':read_size_hint' => 1024, ':content_cb' => sub {
        ...
    });

is that much less clear than

    while ($CH->get($url, ':chunk' => 1024)) {
        ...
    }

but, certainly, if you (Klaus) do there's no harm in a wrapper. In
particular, if you care that the chunks are *always* the right size
then, since LWP doesn't guarantee that, a wrapper that does its own
buffering would be necessary.
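The buffering such a wrapper would need can be sketched without touching LWP at all. The helper below, make_rechunker, is a hypothetical name used only for illustration: it accepts pieces of arbitrary size and hands on pieces of exactly the requested size, with a shorter remainder at the end.

```perl
use strict;
use warnings;

# make_rechunker is a hypothetical helper. It returns two closures:
#   $feed->($data)  accepts a piece of any size and calls $emit once for
#                   every complete block of exactly $size bytes;
#   $flush->()      emits whatever is left over (shorter than $size).
sub make_rechunker {
    my ($size, $emit) = @_;
    my $pending = '';
    my $feed = sub {
        $pending .= $_[0];
        while (length($pending) >= $size) {
            # 4-argument substr removes the block from $pending and returns it
            $emit->(substr($pending, 0, $size, ''));
        }
        return 1;
    };
    my $flush = sub {
        $emit->($pending) if length $pending;
        $pending = '';
        return;
    };
    return ($feed, $flush);
}
```

In an LWP setting, $feed would be called from inside the ':content_cb' callback and $flush once after ->get has returned.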

My problem is that I can't figure out for my life how to write a
closure :content_cb => sub {...} that, after reading one chunk, allows
me to jump completely out of all nested LWP subroutines, and then,
later, if and when I want to read another chunk, I need to jump back
exactly to that point into the closure where I left off. The point is
that reading one chunk is completely separated from reading the
following chunk, such that I could write a perl program that only
reads the first 5 chunks and then decides to stop and not to read any
further. With the closure in LWP::UserAgent I can't do this. As soon
as I say $UA->get($url, ':read_size_hint' => 1024, ':content_cb' => sub
{...}) I am committed to reading the whole file, there is nothing I can
do about this.

I think what I need is called "continuations", and I need it in Perl
5.12.
 
C.DeRykus

[ snip ]
My problem is that I can't figure out for my life how to write a
closure :content_cb => sub {...} that, after reading one chunk, allows
me to jump completely out of all nested LWP subroutines, and then,
later, if and when I want to read another chunk, I need to jump back
exactly to that point into the closure where I left off. The point is
that reading one chunk is completely separated from reading the
following chunk, such that I could write a perl program that only
reads the first 5 chunks and then decides to stop and not to read any
further. With the closure in LWP::UserAgent I can't do this. As soon
as I say $UA->get($url, ':read_size_hint' => 1024, ':content_cb' => sub
{...}) I am committed to reading the whole file, there is nothing I can
do about this.

Didn't try it but, as for a full read commitment,
doesn't the LWP::UserAgent 'Handlers' section show
how you might terminate early:

...
response_data => sub { my($response, $ua, $h, $data) = @_; ... }
This handler is called for each chunk of data
received for the response. The handler might
** croak ** to abort the request.

This handler needs to return a TRUE value to be
called again for subsequent chunks for the same
request.
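A minimal sketch of terminating early, using the ':content_cb' callback rather than a response_data handler (both run inside the same collecting loop). fetch_first_chunks is a hypothetical helper name. The sketch relies on LWP catching the die, recording its message in the X-Died response header and setting Client-Aborted to "die".

```perl
use strict;
use warnings;
use LWP::UserAgent;

# fetch_first_chunks is a hypothetical helper: it reads at most $max
# chunks of $url and then aborts the transfer by dying in the callback.
sub fetch_first_chunks {
    my ($url, $max, $size_hint) = @_;
    my @chunks;
    my $ua  = LWP::UserAgent->new(timeout => 10);
    my $res = $ua->get(
        $url,
        ':read_size_hint' => $size_hint,
        ':content_cb'     => sub {
            my ($data) = @_;
            push @chunks, $data;
            die "enough\n" if @chunks >= $max;   # aborts the request
        },
    );
    # LWP catches the die and marks the response as aborted
    my $aborted = defined $res->header('Client-Aborted');
    return (\@chunks, $aborted, $res);
}
```

This stops the download, but it does not allow resuming at the same point later; the remaining data is simply never read.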

Also a possibility for "jumping out entirely,then
resuming later" might be to just save a byte count
of how much's been read. Then, when you resume,
set a 'byte-range-resp-spec' header to pick up
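The byte-count idea can be sketched as below, assuming that what is meant is the HTTP Range request header and that the server honours range requests (it answers 206 Partial Content when it does). range_for is a hypothetical helper name.

```perl
use strict;
use warnings;

# range_for is a hypothetical helper: it builds the value of an HTTP
# Range request header covering $len bytes starting at offset $offset.
sub range_for {
    my ($offset, $len) = @_;
    return sprintf 'bytes=%d-%d', $offset, $offset + $len - 1;
}

# Sketch of use with LWP::UserAgent (commented out, needs a real server):
#   my $res = $ua->get($url, Range => range_for($bytes_read, 1024));
#   $bytes_read += length $res->content if $res->code == 206;
```

Each chunk then costs a separate request, so this trades efficiency for the freedom to stop and resume at will.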
 
Klaus

Klaus said:
My problem is that I can't figure out for my life how to write a
closure :content_cb => sub {...} that, after reading one chunk, allows
me to jump completely out of all nested LWP subroutines, and then,
later, if and when I want to read another chunk, I need to jump back
exactly to that point into the closure where I left off.
[...]
I think what I need is called "continuations", and I need it in Perl
5.12.

Oh! Sorry, I hadn't thought that far.

The obvious answer is 'use Coro', or rather Coro::LWP, but I'm not sure
how far I trust it. The fact it requires a rather invasive set of hacks
to get LWP to work right is... worrying.

The other obvious answer is 'use Net::HTTP::NB'. This will obviously
only support HTTP, and will require you to do all the redirect-following
logic and so on yourself, but gives you a basic non-blocking HTTP
client.
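A pull-style reader along these lines might look like the sketch below, modelled on the synopsis in the Net::HTTP::NB documentation. chunk_reader is a hypothetical helper name; there is no redirect handling, and error handling is kept to a bare minimum.

```perl
use strict;
use warnings;
use Net::HTTP::NB;
use IO::Select;

# chunk_reader is a hypothetical helper: it sends a GET request and
# returns a closure that yields the next piece of the body (at most
# $size bytes), or undef once the body has been read completely.
sub chunk_reader {
    my ($host, $path, $size, $timeout) = @_;
    my $s = Net::HTTP::NB->new(Host => $host) or die $@;
    $s->write_request(GET => $path);
    my $sel = IO::Select->new($s);

    # Wait until the complete header block has arrived.
    READ_HEADER: {
        die "Header timeout\n" unless $sel->can_read($timeout);
        my ($code, $mess, %h) = $s->read_response_headers;
        redo READ_HEADER unless $code;
    }

    return sub {
        while (1) {
            die "Body timeout\n" unless $sel->can_read($timeout);
            my $buf;
            my $n = $s->read_entity_body($buf, $size);
            die "Read error: $!\n" unless defined $n;
            return if $n == 0;        # end of body
            return $buf if $n > 0;    # got some data
            # $n == -1: nothing available yet, wait and try again
        }
    };
}

# Sketch of use (commented out so nothing touches the network on load):
#   my $next = chunk_reader('search.cpan.org', '/', 1024, 10);
#   my $seen = 0;
#   while (defined(my $chunk = $next->())) {
#       last if ++$seen >= 5;   # free to stop whenever you like
#   }
```

Because the caller drives the loop, reading one chunk is independent of reading the next, which is what the callback interface of LWP::UserAgent could not offer.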

Thanks, Net::HTTP::NB is exactly what I was looking for all along.

....and no need for me to write LWP::Chunk :)
 
