Opening files on the web for reading

Discussion in 'Perl Misc' started by Graham Stow, Sep 24, 2008.

  1. Graham Stow

    Graham Stow Guest

    Can anyone give me some Perl code to open an html file on the web (i.e. an
    html file stored on somebody else's web server and not mine) for reading?
    Or is it more complicated than that?
     
    Graham Stow, Sep 24, 2008
    #1

  2. "Graham Stow" <> writes:

    > Can anyone give me some Perl code to open an html file on the web (i.e. an
    > html file stored on somebody elses web server and not mine), for reading. Or
    > is it more complicated than that?


    You can use the LWP::Simple module. The example in the documentation
    should tell you how to do it.
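For example, a minimal sketch (the URL here is a placeholder, and note that get() returns undef on failure):

```perl
use strict;
use warnings;
use LWP::Simple;

# get() fetches the whole document into a scalar, or undef on failure.
my $html = get('http://www.example.com/page.html');
die "Couldn't fetch the page\n" unless defined $html;

# Then walk it "line by line", much as you would a file handle.
for my $line (split /^/, $html) {
    print $line;
}
```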

    //Makholm
     
    Peter Makholm, Sep 24, 2008
    #2

  3. "Graham Stow" <> wrote:
    >Can anyone give me some Perl code to open an html file on the web (i.e. an
    >html file stored on somebody elses web server and not mine), for reading. Or
    >is it more complicated than that?


    Is there anything wrong with the answer in "perldoc -q HTML":

    How do I fetch an HTML file?

    jue
     
    Jürgen Exner, Sep 24, 2008
    #3
  4. Graham Stow

    Guest

    Jürgen Exner <> wrote:
    > "Graham Stow" <> wrote:
    > >Can anyone give me some Perl code to open an html file on the web (i.e.
    > >an html file stored on somebody elses web server and not mine), for
    > >reading. Or is it more complicated than that?

    >
    > Is there anything wrong with the answer in "perldoc -q HTML":
    >
    > How do I fetch an HTML file?


    Other than it not answering the question? At least on my Perl version,
    none of the answers there return a file handle opened for reading. Now
    maybe he is fine with downloading the entire file (either to disk or to
    memory) and then reading from that, but I'd be inclined to give the benefit
    of the doubt that he meant what he asked.

    LWP::UserAgent using a callback with for example :content_cb would "stream"
    the data back, but not via a file handle. One could probably come up with
    an adaptor that ties a file handle front end to the callback backend.

    There might be a more direct way, but I don't know what it is.
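A sketch of that callback style, streaming to a local file rather than a file handle (URL and output filename are placeholders):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Each chunk is passed to the callback as it arrives off the wire,
# so the whole document never has to fit in memory at once.
open my $out, '>', 'page.html' or die "open: $!";
my $resp = $ua->get(
    'http://www.example.com/big.html',
    ':content_cb' => sub {
        my ($chunk, $response, $protocol) = @_;
        print {$out} $chunk;
    },
);
close $out or die "close: $!";
die $resp->status_line, "\n" unless $resp->is_success;
```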




    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Sep 24, 2008
    #4
  5. wrote:
    >Jürgen Exner <> wrote:
    >> "Graham Stow" <> wrote:
    >> >Can anyone give me some Perl code to open an html file on the web (i.e.
    >> >an html file stored on somebody elses web server and not mine), for
    >> >reading. Or is it more complicated than that?

    >>
    >> Is there anything wrong with the answer in "perldoc -q HTML":
    >>
    >> How do I fetch an HTML file?

    >
    >Other than it not answering the question? At least on my Perl version,
    >none of the answers there return a file handle opened for reading. Now
    >maybe he is fine with downloading the entire file (either to disk or to
    >memory) and then reading from that, but I'd be inclined to give the benefit
    >of the doubt that he meant what he asked.


    Fair enough. I interpreted "to open an html file on the web [...] for
    reading" as he just wants to get the content of that file (which, as we
    all know, may not be a file in the first place), not to actually have a
    read file handle to a URL.
    At the very least his terminology is sloppy, and your interpretation may
    very well be closer to his intentions.

    jue
     
    Jürgen Exner, Sep 24, 2008
    #5
  6. Graham Stow

    Ben Morrow Guest

    Quoth :
    > Jürgen Exner <> wrote:
    > > "Graham Stow" <> wrote:
    > > >Can anyone give me some Perl code to open an html file on the web (i.e.
    > > >an html file stored on somebody elses web server and not mine), for
    > > >reading. Or is it more complicated than that?

    > >
    > > Is there anything wrong with the answer in "perldoc -q HTML":
    > >
    > > How do I fetch an HTML file?

    >
    > Other than it not answering the question? At least on my Perl version,
    > none of the answers there return a file handle opened for reading. Now
    > maybe he is fine with downloading the entire file (either to disk or to
    > memory) and then reading from that, but I'd be inclined to give the benefit
    > of the doubt that he meant what he asked.
    >
    > LWP::UserAgent using a callback with for example :content_cb would "stream"
    > the data back, but not via a file handle. One could probably come up with
    > an adaptor that ties a file handle front end to the callback backend.
    >
    > There might be a more direct way, but I don't know what it is.


    IO::All::LWP
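If I'm reading its docs right, usage is something like this (URL is a placeholder; IO::All's overloaded '<' slurps into the scalar):

```perl
use strict;
use warnings;
use IO::All;
use IO::All::LWP;   # teaches io() about http:// and the other LWP schemes

# Slurp a remote page into a scalar via the overloaded '<' operator.
my $content;
$content < io('http://www.example.com/');
print $content;
```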

    Ben

    --
    The Earth is degenerating these days. Bribery and corruption abound.
    Children no longer mind their parents, every man wants to write a book,
    and it is evident that the end of the world is fast approaching.
    Assyrian stone tablet, c.2800 BC
     
    Ben Morrow, Sep 24, 2008
    #6
  7. Graham Stow

    Tim Greer Guest

    Graham Stow wrote:

    > Can anyone give me some Perl code to open an html file on the web
    > (i.e. an html file stored on somebody elses web server and not mine),
    > for reading. Or is it more complicated than that?


    Are you just looking to read it and maybe check something, or parse it,
    or download it/save it? There are many methods, but the best one could
    depend on what your goals are.
    --
    Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
    Industry's most experienced staff! -- Web Hosting With Muscle!
     
    Tim Greer, Sep 24, 2008
    #7
  8. Graham Stow

    C.DeRykus Guest

    On Sep 24, 8:47 am, wrote:
    > ...
    >
    > LWP::UserAgent using a callback with for example :content_cb would "stream"
    > the data back, but not via a file handle. One could probably come up with
    > an adaptor that ties a file handle front end to the callback backend.
    >
    > There might be a more direct way, but I don't know what it is.
    >

    Another possibility, but still indirect
    (and w/o graceful error handling): a piped open forks a child that
    streams the page back to the parent through the pipe:

    use LWP::Simple;
    my $pid = open( my $fh, "-|" );
    die "fork: $!" unless defined $pid;
    if ($pid) { while (<$fh>) { ... } }
    else { getprint( ... ); }
    ...



    --
    Charles DeRykus
     
    C.DeRykus, Sep 25, 2008
    #8
  9. Graham Stow

    Ted Zlatanov Guest

    On Wed, 24 Sep 2008 18:25:48 +0100 Ben Morrow <> wrote:

    BM> Quoth :
    >> Jürgen Exner <> wrote:
    >> > "Graham Stow" <> wrote:
    >> > >Can anyone give me some Perl code to open an html file on the web (i.e.
    >> > >an html file stored on somebody elses web server and not mine), for
    >> > >reading. Or is it more complicated than that?
    >> >
    >> > Is there anything wrong with the answer in "perldoc -q HTML":
    >> >
    >> > How do I fetch an HTML file?

    >>
    >> Other than it not answering the question? At least on my Perl version,
    >> none of the answers there return a file handle opened for reading. Now
    >> maybe he is fine with downloading the entire file (either to disk or to
    >> memory) and then reading from that, but I'd be inclined to give the benefit
    >> of the doubt that he meant what he asked.
    >>
    >> LWP::UserAgent using a callback with for example :content_cb would "stream"
    >> the data back, but not via a file handle. One could probably come up with
    >> an adaptor that ties a file handle front end to the callback backend.
    >>
    >> There might be a more direct way, but I don't know what it is.


    BM> IO::All::LWP

    Unfortunately, the docs say "The bad news is that the whole file is
    stored in memory after getting it or before putting it. This may cause
    problems if you are dealing with multi-gigabyte files!"

    It would be nice to have a buffered reader/writer which wouldn't grab
    the whole file, using the LWP callbacks, as xhoster suggests... I
    haven't seen such a module.

    Ted
     
    Ted Zlatanov, Sep 25, 2008
    #9
  10. Graham Stow

    Guest

    Ted Zlatanov <> wrote:
    > On Wed, 24 Sep 2008 18:25:48 +0100 Ben Morrow <> wrote:
    >
    > BM> IO::All::LWP
    >
    > Unfortunately, the docs say "The bad news is that the whole file is
    > stored in memory after getting it or before putting it. This may cause
    > problems if you are dealing with multi-gigabyte files!"
    >
    > It would be nice to have a buffered reader/writer which wouldn't grab
    > the whole file, using the LWP callbacks, as xhoster suggests... I
    > haven't seen such a module.


    And it doesn't seem as easy as I thought. In order for the callback to be
    invoked, the thing invoking the callback has to be "in control". But to
    read from a file handle, the thing reading is in control. You'd have to
    fork a process and in one have the callback invoker in control, streaming
    data to the other process as it comes in and the callback is invoked. So
    then you would have portability problems.

    It seems like it is easy to write a wrapper that turns an iterator into a
    callback, but vice versa is not easy.
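The easy direction, for comparison (names here are invented for illustration):

```perl
use strict;
use warnings;

# Turn an iterator into a callback driver: keep pulling items until the
# iterator is exhausted (returns undef), handing each one to the callback.
sub drive_callback {
    my ($iter, $cb) = @_;
    while (defined(my $item = $iter->())) {
        $cb->($item);
    }
}

# A toy iterator standing in for "read the next chunk".
my @chunks = ("first chunk\n", "second chunk\n");
my $next_chunk = sub { shift @chunks };

drive_callback($next_chunk, sub { print $_[0] });
```

Going the other way, the callback's caller holds the control flow, which is exactly the problem described above.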

    Xho

     
    , Sep 25, 2008
    #10
  11. Graham Stow

    Ted Zlatanov Guest

    On 25 Sep 2008 15:23:24 GMT wrote:

    x> Ted Zlatanov <> wrote:
    >> It would be nice to have a buffered reader/writer which wouldn't grab
    >> the whole file, using the LWP callbacks, as xhoster suggests... I
    >> haven't seen such a module.


    x> And it doesn't seem as easy as I thought. In order for the callback to be
    x> invoked, the thing invoking the callback has to be "in control". But to
    x> read from a file handle, the thing reading is in control. You'd have to
    x> fork a process and in one have the callback invoker in control, streaming
    x> data to the other process as it comes in and the callback is invoked. So
    x> then you would have portability problems.

    You can do it with buffering but it's ugly code I would not want to
    write. It's very easy to get it wrong.

    x> It seems like it is easy to write a wrapper that turns an iterator into a
    x> callback, but vice versa is not easy.

    Right, since iterators are stateful, so you have to manufacture and
    preserve the state when you only have a callback.

    Ted
     
    Ted Zlatanov, Sep 25, 2008
    #11
  12. Ted Zlatanov <> wrote:
    *SKIP*
    > Unfortunately, the docs say "The bad news is that the whole file is
    > stored in memory after getting it or before putting it. This may cause
    > problems if you are dealing with multi-gigabyte files!"


    > It would be nice to have a buffered reader/writer which wouldn't grab
    > the whole file, using the LWP callbacks, as xhoster suggests... I
    > haven't seen such a module.


    Obviously I've got something wrong (or, as ever, I'm incompetent). The
    server must have some means of being told to stop feeding and resume
    feeding. Or (in case I understand networking at least a bit) those
    gigabytes would be buffered in the kernel. What am I missing?

    --
    Torvalds' goal for Linux is very simple: World Domination
     
    Eric Pozharski, Sep 25, 2008
    #12
  13. Graham Stow

    Ben Morrow Guest

    Quoth :
    > Ted Zlatanov <> wrote:
    > > On Wed, 24 Sep 2008 18:25:48 +0100 Ben Morrow <> wrote:
    > >
    > > BM> IO::All::LWP
    > >
    > > Unfortunately, the docs say "The bad news is that the whole file is
    > > stored in memory after getting it or before putting it. This may cause
    > > problems if you are dealing with multi-gigabyte files!"
    > >
    > > It would be nice to have a buffered reader/writer which wouldn't grab
    > > the whole file, using the LWP callbacks, as xhoster suggests... I
    > > haven't seen such a module.

    >
    > And it doesn't seem as easy as I thought. In order for the callback to be
    > invoked, the thing invoking the callback has to be "in control". But to
    > read from a file handle, the thing reading is in control.


    So use Net::HTTP::NB. Not quite as convenient as LWP::UA, but it
    provides non-blocking reads.
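For reference, the blocking Net::HTTP read loop looks roughly like this (adapted from the module's synopsis; Net::HTTP::NB layers select()-friendly non-blocking reads on top of the same interface, with read_entity_body returning rather than blocking when no data is ready):

```perl
use strict;
use warnings;
use Net::HTTP;

my $s = Net::HTTP->new(Host => 'www.example.com') or die $@;
$s->write_request(GET => '/', 'User-Agent' => 'Mozilla/5.0');
my ($code, $mess, %headers) = $s->read_response_headers;

while (1) {
    my $buf;
    my $n = $s->read_entity_body($buf, 1024);   # read up to 1024 bytes
    die "read failed: $!" unless defined $n;
    last unless $n;                             # 0 bytes: body finished
    print $buf;
}
```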

    It's a real shame Perl doesn't have a decent lightweight userland thread
    library, as this sort of thing is exactly what it would be useful for.
    If I *wanted* to write select loops, I'd be writing C; since I'm writing
    Perl, it would be nice if perl could handle the messy stuff for me :).

    Ben

    --
    'Deserve [death]? I daresay he did. Many live that deserve death. And some die
    that deserve life. Can you give it to them? Then do not be too eager to deal
    out death in judgement. For even the very wise cannot see all ends.'
     
    Ben Morrow, Sep 25, 2008
    #13
  14. Graham Stow

    Ben Morrow Guest

    Quoth Ted Zlatanov <>:
    > On 25 Sep 2008 15:23:24 GMT wrote:
    >
    > x> It seems like it is easy to write a wrapper that turns an iterator into a
    > x> callback, but vice versa is not easy.
    >
    > Right, since iterators are stateful, so you have to manufacture and
    > preserve the state when you only have a callback.


    That's not the issue: callbacks in Perl are closures, so they do have
    state. The trouble is that you would need LWP::UserAgent->simple_request
    and whatever is driving the <$FH> loop to be coroutines, and Perl
    doesn't have 'yield'.

    Just for fun, here's an implementation using Coro:

    #!/usr/bin/perl

    use warnings;
    use strict;

    {
        package LWP::FH;

        use Coro;
        use Coro::Channel;
        use LWP::UserAgent;

        use overload '<>' => sub {
            my ($s) = @_;
            my $eol;
            until (($eol = length($/) + index $s->{buf}, $/) > 0) {
                my $new = $s->{ch}->get;
                if (defined $new) {
                    $s->{buf} .= $new;
                }
                else {
                    $eol = length $s->{buf};
                    last;
                }
            }
            return substr $s->{buf}, 0, $eol, "";
        };

        my $UA = LWP::UserAgent->new;

        sub new {
            my ($c, $url) = @_;
            my $s = bless {
                buf => "",
                ch  => Coro::Channel->new(1),
            }, $c;
            async {
                my ($UA, $s) = @_;
                $UA->get(
                    $url,
                    ":content_cb" => sub {
                        $s->{ch}->put($_[0]);
                    },
                );
                $s->{ch}->put(undef);
            } $UA, $s;
            return $s;
        }
    }

    my $FH = LWP::FH->new("http://perl.org");
    while (<$FH>) {
        print "LINE: $_";
    }

    __END__

    Ben

    --
    If you put all the prophets, | You'd have so much more reason
    Mystics and saints | Than ever was born
    In one room together, | Out of all of the conflicts of time.
    The Levellers, 'Believers'
     
    Ben Morrow, Sep 25, 2008
    #14
  15. Graham Stow

    Ben Morrow Guest

    Quoth Eric Pozharski <>:
    > Ted Zlatanov <> wrote:
    > *SKIP*
    > > Unfortunately, the docs say "The bad news is that the whole file is
    > > stored in memory after getting it or before putting it. This may cause
    > > problems if you are dealing with multi-gigabyte files!"

    >
    > > It would be nice to have a buffered reader/writer which wouldn't grab
    > > the whole file, using the LWP callbacks, as xhoster suggests... I
    > > haven't seen such a module.

    >
    > Obviously I've got something wrong (or, as ever, I'm incompetent). The
    > server must have means to be told stop-feeding/resume-feeding.


    Yes. See
    http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Flow_control

    > Or (in
    > case I understand networking a least bit) those gigabytes would be
    > buffered in kernel.


    Once the kernel buffers are full, the receiving end instructs the
    sending end to stop sending data.

    Ben

    --
    "Faith has you at a disadvantage, Buffy."
    "'Cause I'm not crazy, or 'cause I don't kill people?"
    "Both, actually."
    []
     
    Ben Morrow, Sep 26, 2008
    #15
  16. Graham Stow

    Ted Zlatanov Guest

    On Fri, 26 Sep 2008 02:55:49 +0100 Ben Morrow <> wrote:

    BM> Quoth Eric Pozharski <>:
    >> Ted Zlatanov <> wrote:
    >> *SKIP*
    >> > Unfortunately, the docs say "The bad news is that the whole file is
    >> > stored in memory after getting it or before putting it. This may cause
    >> > problems if you are dealing with multi-gigabyte files!"

    >>
    >> > It would be nice to have a buffered reader/writer which wouldn't grab
    >> > the whole file, using the LWP callbacks, as xhoster suggests... I
    >> > haven't seen such a module.

    >>
    >> Obviously I've got something wrong (or, as ever, I'm incompetent). The
    >> server must have means to be told stop-feeding/resume-feeding.


    BM> Yes. See
    BM> http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Flow_control
    BM> .

    >> Or (in
    >> case I understand networking a least bit) those gigabytes would be
    >> buffered in kernel.


    BM> Once the kernel buffers are full, the receiving end instructs the
    BM> sending end to stop sending data.

    Also, HTTP 1.1 supports partial transfers of data, so you can open a
    persistent connection and keep requesting small pieces. I'd guess it's
    better than TCP flow control if the goal is to allow random seeks, not
    just sequential reads. Handling errors and chunk boundaries would
    be... let's say "interesting to the right developer." :)
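A sketch of one such partial request with LWP (URL is a placeholder; a cooperating server answers 206 Partial Content, while one that ignores the Range header sends the whole thing with a 200):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Ask for just the first KiB of the resource via an HTTP Range header.
my $resp = $ua->get(
    'http://www.example.com/big.html',
    'Range' => 'bytes=0-1023',
);

# 206 means the server honoured the range; 200 means it sent everything.
print $resp->code, "\n";
print $resp->decoded_content if $resp->is_success;
```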

    Ted
     
    Ted Zlatanov, Sep 26, 2008
    #16
  17. What's worth reading online? (was: Opening files on the web for reading)

    Ben Morrow <> wrote:

    > Quoth Eric Pozharski <>:
    >> Ted Zlatanov <> wrote:
    >> *SKIP*
    >> > Unfortunately, the docs say "The bad news is that the whole file is
    >> > stored in memory after getting it or before putting it. This may
    >> > cause problems if you are dealing with multi-gigabyte files!"

    >>
    >> > It would be nice to have a buffered reader/writer which wouldn't
    >> > grab the whole file, using the LWP callbacks, as xhoster
    >> > suggests... I haven't seen such a module.

    >>
    >> Obviously I've got something wrong (or, as ever, I'm incompetent).
    >> The server must have means to be told stop-feeding/resume-feeding.

    > Yes. See
    > http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Flow_control
    > .
    >> Or (in case I understand networking a least bit) those gigabytes
    >> would be buffered in kernel.

    > Once the kernel buffers are full, the receiving end instructs the
    > sending end to stop sending data.


    Aha, pleased to hear that. What's worse, I've read almost all (or
    all?) of the dead trees I could find. Hence the $Subject.


     
    Eric Pozharski, Sep 26, 2008
    #17