Opening files on the web for reading

Graham Stow

Can anyone give me some Perl code to open an html file on the web (i.e. an
html file stored on somebody else's web server and not mine), for reading. Or
is it more complicated than that?
 
Peter Makholm

Graham Stow said:
Can anyone give me some Perl code to open an html file on the web (i.e. an
html file stored on somebody else's web server and not mine), for reading. Or
is it more complicated than that?

You can use the LWP::Simple module. The example in the documentation
should tell you how to do it.
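
For reference, a minimal sketch of that approach, assuming LWP::Simple is installed. The helper name fetch_lines and the URL in the usage comment are made up for illustration; get() is the LWP::Simple function and returns undef on failure. Note that the whole page is held in memory.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch a page into memory and return it as a list of lines
# (line endings are kept). Dies if the request fails.
sub fetch_lines {
    my ($url) = @_;
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    return split /^/, $html;
}

# Usage: perl fetch_lines.pl http://www.example.com/
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print fetch_lines($ARGV[0]);
}
```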

//Makholm
 
Jürgen Exner

Graham Stow said:
Can anyone give me some Perl code to open an html file on the web (i.e. an
html file stored on somebody else's web server and not mine), for reading. Or
is it more complicated than that?

Is there anything wrong with the answer in "perldoc -q HTML":

How do I fetch an HTML file?

jue
 
xhoster

Jürgen Exner said:
Is there anything wrong with the answer in "perldoc -q HTML":

How do I fetch an HTML file?

Other than it not answering the question? At least on my Perl version,
none of the answers there return a file handle opened for reading. Now
maybe he is fine with downloading the entire file (either to disk or to
memory) and then reading from that, but I'd be inclined to give the benefit
of the doubt that he meant what he asked.
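
If downloading the whole page first is acceptable, an ordinary read handle can be opened over the in-memory copy. A minimal sketch, assuming LWP::UserAgent is installed; open_url is a hypothetical helper name. It uses the raw bytes from content(), because an in-memory handle cannot be opened on a string that contains characters above 0xFF.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Download a page, then return a read handle over the in-memory copy.
# The entire body is held in memory; nothing is streamed.
sub open_url {
    my ($url) = @_;
    my $res = LWP::UserAgent->new->get($url);
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;
    my $bytes = $res->content;    # raw, undecoded bytes
    open my $fh, '<', \$bytes or die "Cannot open in-memory handle: $!";
    return $fh;
}

# Usage: perl open_url.pl http://www.example.com/
if (@ARGV) {
    my $fh = open_url($ARGV[0]);
    while (my $line = <$fh>) {
        print "LINE: $line";
    }
}
```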

LWP::UserAgent using a callback with for example :content_cb would "stream"
the data back, but not via a file handle. One could probably come up with
an adaptor that ties a file handle front end to the callback backend.

There might be a more direct way, but I don't know what it is.
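
The callback approach described above might look like the sketch below. It assumes LWP::UserAgent; stream_url is a made-up helper name and the 8192-byte :read_size_hint is an arbitrary choice. The data arrives as chunks of arbitrary size rather than lines, and never through a file handle.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Pass the body of $url to $on_chunk piece by piece, so the whole
# document never has to sit in memory at once.
sub stream_url {
    my ($url, $on_chunk) = @_;
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get(
        $url,
        ':content_cb'     => sub { my ($chunk) = @_; $on_chunk->($chunk) },
        ':read_size_hint' => 8192,
    );
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;
    # LWP traps exceptions thrown by the callback and reports them here.
    if (my $died = $res->header('X-Died')) {
        die "Callback died: $died\n";
    }
    return $res;
}

# Usage: perl stream_url.pl http://www.example.com/
if (@ARGV) {
    my $bytes = 0;
    stream_url($ARGV[0], sub { $bytes += length $_[0] });
    print "Received $bytes bytes\n";
}
```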




Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
Jürgen Exner

Other than it not answering the question? At least on my Perl version,
none of the answers there return a file handle opened for reading. Now
maybe he is fine with downloading the entire file (either to disk or to
memory) and then reading from that, but I'd be inclined to give the benefit
of the doubt that he meant what he asked.

Fair enough. I interpreted "to open an html file on the web [...] for
reading" as he just wants to get the content of that file (which as we
all know may not be a file in the first place), not to actually have a
read file handle to a URL.
At the very least his terminology is sloppy and your interpretation may
very well be closer to his intentions.

jue
 
Ben Morrow

Quoth (e-mail address removed):
Other than it not answering the question? At least on my Perl version,
none of the answers there return a file handle opened for reading. Now
maybe he is fine with downloading the entire file (either to disk or to
memory) and then reading from that, but I'd be inclined to give the benefit
of the doubt that he meant what he asked.

LWP::UserAgent using a callback with for example :content_cb would "stream"
the data back, but not via a file handle. One could probably come up with
an adaptor that ties a file handle front end to the callback backend.

There might be a more direct way, but I don't know what it is.

IO::All::LWP

Ben
 
Tim Greer

Graham said:
Can anyone give me some Perl code to open an html file on the web
(i.e. an html file stored on somebody else's web server and not mine),
for reading. Or is it more complicated than that?

Are you just looking to read it and maybe check something, or parse it,
or download it/save it? There are many methods, but the best one could
depend on what your goals are.
 
C.DeRykus

...

LWP::UserAgent using a callback with for example :content_cb would "stream"
the data back, but not via a file handle. One could probably come up with
an adaptor that ties a file handle front end to the callback backend.

There might be a more direct way, but I don't know what it is.
Another possibility but still indirect
(and w/o graceful error handling):

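use LWP::Simple;
# Forking open: the parent gets a read handle on the child's STDOUT.
my $pid = open( my $fh, "-|" );
die "fork: $!" unless defined $pid;
if ($pid) { while (<$fh>) { ... } }   # parent: read the page line by line
else { getprint( ...); exit; }        # child: print the page into the pipe, then quit
....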
 
Ted Zlatanov

BM> IO::All::LWP

Unfortunately, the docs say "The bad news is that the whole file is
stored in memory after getting it or before putting it. This may cause
problems if you are dealing with multi-gigabyte files!"

It would be nice to have a buffered reader/writer which wouldn't grab
the whole file, using the LWP callbacks, as xhoster suggests... I
haven't seen such a module.

Ted
 
xhoster

Ted Zlatanov said:
BM> IO::All::LWP

Unfortunately, the docs say "The bad news is that the whole file is
stored in memory after getting it or before putting it. This may cause
problems if you are dealing with multi-gigabyte files!"

It would be nice to have a buffered reader/writer which wouldn't grab
the whole file, using the LWP callbacks, as xhoster suggests... I
haven't seen such a module.

And it doesn't seem as easy as I thought. In order for the callback to be
invoked, the thing invoking the callback has to be "in control". But to
read from a file handle, the thing reading is in control. You'd have to
fork a process and in one have the callback invoker in control, streaming
data to the other process as it comes in and the callback is invoked. So
then you would have portability problems.

It seems like it is easy to write a wrapper that turns an iterator into a
callback, but vice versa is not easy.
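
The easy direction can be sketched in a few lines of plain Perl. All names here (iterate_with_callback, make_line_iterator) are invented for illustration: the wrapper stays in control, pulls from the iterator, and pushes each item to the callback. The reverse would require the callback's caller to pause until the reader asks for more, which is the hard part.

```perl
use strict;
use warnings;

# Drive a callback from an iterator: keep pulling items until the
# iterator signals exhaustion by returning undef.
sub iterate_with_callback {
    my ($next, $callback) = @_;
    while (defined(my $item = $next->())) {
        $callback->($item);
    }
    return;
}

# A toy iterator: a closure that hands back one line per call, then undef.
sub make_line_iterator {
    my @lines = @_;
    return sub { return shift @lines };
}

my $iter = make_line_iterator("first\n", "second\n");
iterate_with_callback($iter, sub { print "LINE: $_[0]" });
```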

Xho

 
Ted Zlatanov

x> And it doesn't seem as easy as I thought. In order for the callback to be
x> invoked, the thing invoking the callback has to be "in control". But to
x> read from a file handle, the thing reading is in control. You'd have to
x> fork a process and in one have the callback invoker in control, streaming
x> data to the other process as it comes in and the callback is invoked. So
x> then you would have portability problems.

You can do it with buffering but it's ugly code I would not want to
write. It's very easy to get it wrong.

x> It seems like it is easy to write a wrapper that turns an iterator into a
x> callback, but vice versa is not easy.

Right, since iterators are stateful, so you have to manufacture and
preserve the state when you only have a callback.

Ted
 
Eric Pozharski

Ted Zlatanov said:
Unfortunately, the docs say "The bad news is that the whole file is
stored in memory after getting it or before putting it. This may cause
problems if you are dealing with multi-gigabyte files!"
It would be nice to have a buffered reader/writer which wouldn't grab
the whole file, using the LWP callbacks, as xhoster suggests... I
haven't seen such a module.

Obviously I've got something wrong (or, as ever, I'm incompetent). The
server must have a means of being told to stop feeding and to resume
feeding. Otherwise (if I understand networking the least bit) those
gigabytes would be buffered in the kernel. What don't I know?
 
Ben Morrow

Quoth (e-mail address removed):
And it doesn't seem as easy as I thought. In order for the callback to be
invoked, the thing invoking the callback has to be "in control". But to
read from a file handle, the thing reading is in control.

So use Net::HTTP::NB. Not quite as convenient as LWP::UA, but it
provides non-blocking reads.

It's a real shame Perl doesn't have a decent lightweight userland thread
library, as this sort of thing is exactly what it would be useful for.
If I *wanted* to write select loops, I'd be writing C; since I'm writing
Perl, it would be nice if perl could handle the messy stuff for me :).

Ben
 
Ben Morrow

Quoth Ted Zlatanov said:
On 25 Sep 2008 15:23:24 GMT (e-mail address removed) wrote:

x> It seems like it is easy to write a wrapper that turns an iterator into a
x> callback, but vice versa is not easy.

Right, since iterators are stateful, so you have to manufacture and
preserve the state when you only have a callback.

That's not the issue: callbacks in Perl are closures, so they do have
state. The trouble is that you would need LWP::UserAgent->simple_request
and whatever is driving the <$FH> loop to be coroutines, and Perl
doesn't have 'yield'.
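
To illustrate the point about closures carrying state, here is a small sketch in plain Perl. The name make_line_splitter is invented for this example. The returned callback keeps a private buffer between calls, so it can turn arbitrary chunks into whole lines, but it is still the caller that decides when it runs.

```perl
use strict;
use warnings;

# Return a chunk callback that remembers partial lines between calls.
# $on_line is invoked once per complete line. Call the returned sub with
# undef at the end to flush a last line that has no trailing newline.
sub make_line_splitter {
    my ($on_line) = @_;
    my $buf = '';                       # state captured by the closure
    return sub {
        my ($chunk) = @_;
        if (!defined $chunk) {          # end of input: flush the remainder
            $on_line->($buf) if length $buf;
            $buf = '';
            return;
        }
        $buf .= $chunk;
        while ((my $pos = index $buf, "\n") >= 0) {
            $on_line->(substr $buf, 0, $pos + 1, '');
        }
        return;
    };
}

my $feed = make_line_splitter(sub { print "LINE: $_[0]" });
$feed->($_) for "ab\ncd", "e\nf\n", undef;
```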

Just for fun, here's an implementation using Coro:

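#!/usr/bin/perl

use warnings;
use strict;

{
    package LWP::FH;

    use Coro;
    use Coro::Channel;
    use LWP::UserAgent;

    # Overloading <> lets the object be read line by line, like a file handle.
    use overload '<>' => sub {
        my ($s) = @_;
        my $pos;
        # Pull chunks off the channel until the buffer holds a complete line.
        while (($pos = index $s->{buf}, $/) < 0) {
            last if $s->{eof};
            my $new = $s->{ch}->get;
            if (defined $new) {
                $s->{buf} .= $new;
            }
            else {
                # undef on the channel marks the end of the response.
                $s->{eof} = 1;
            }
        }
        # At EOF, hand back any unterminated last line, then undef.
        return undef if $pos < 0 && !length($s->{buf});
        my $eol = $pos < 0 ? length($s->{buf}) : $pos + length($/);
        return substr $s->{buf}, 0, $eol, "";
    };

    my $UA = LWP::UserAgent->new;

    sub new {
        my ($c, $url) = @_;
        my $s = bless {
            buf => "",
            eof => 0,
            ch  => Coro::Channel->new(1),
        }, $c;
        # Run the request in its own coroutine; put() blocks while the
        # channel is full, which hands control back to the reader.
        async {
            my ($UA, $s) = @_;
            $UA->get(
                $url,
                ":content_cb" => sub {
                    $s->{ch}->put($_[0]);
                },
            );
            $s->{ch}->put(undef);
        } $UA, $s;
        return $s;
    }
}

my $FH = LWP::FH->new("http://perl.org");
while (<$FH>) {
    print "LINE: $_";
}

__END__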

Ben
 
Ben Morrow

Ted Zlatanov

BM> Yes. See
BM> http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Flow_control
BM> .

BM> Once the kernel buffers are full, the receiving end instructs the
BM> sending end to stop sending data.

Also, HTTP 1.1 supports partial transfers of data, so you can open a
persistent connection and keep requesting small pieces. I'd guess it's
better than TCP flow control if the goal was to allow random seeks, not
just sequential writes. Handling errors and chunk boundaries would
be... let's say "interesting to the right developer." :)
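
A rough sketch of the partial-transfer idea, assuming LWP::UserAgent and a server that honours the Range header. The helper names range_header and fetch_range, the 64 KB piece size, and the URL in the usage comment are all illustrative choices. Error handling is deliberately thin.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build the value of an HTTP Range header for $length bytes at $offset.
sub range_header {
    my ($offset, $length) = @_;
    return sprintf 'bytes=%d-%d', $offset, $offset + $length - 1;
}

# Fetch one slice of a resource. Returns the bytes, or undef when the
# server answers 416 (the requested range starts past the end).
sub fetch_range {
    my ($ua, $url, $offset, $length) = @_;
    my $res = $ua->get($url, Range => range_header($offset, $length));
    return undef if $res->code == 416;
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;
    die "Server did not honour the Range header\n" unless $res->code == 206;
    return $res->content;
}

# Usage: perl ranges.pl http://www.example.com/big.bin
if (@ARGV) {
    my $ua = LWP::UserAgent->new(keep_alive => 1);    # reuse the connection
    my ($offset, $size) = (0, 64 * 1024);
    binmode STDOUT;
    while (defined(my $piece = fetch_range($ua, $ARGV[0], $offset, $size))) {
        print $piece;
        $offset += length $piece;
        last if length($piece) < $size;               # short piece: last one
    }
}
```

Because each request names its own offset, a reader could also seek by simply changing $offset before the next call.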

Ted
 
