problem reading html stream

Discussion in 'Perl Misc' started by Dave Saville, Jan 14, 2012.

  1. Dave Saville

    Dave Saville Guest

    I have a perl script that reads a, large, html stream (TV program
    data).

    I use IO::Socket, do a "my $socket = new" and then a "while (
    <$socket>)" to fetch the data.

    Now the problem *might* be their end, but it hangs after *exactly*
    180K for about 5 minutes and then completes. Firefox pulls the same
    data in 10s of seconds. Which, to my thinking, would eliminate any
    funnies in libc.

    Any thoughts?

    TIA
    --
    Regards
    Dave Saville
    Dave Saville, Jan 14, 2012
    #1

  2. * Dave Saville wrote in comp.lang.perl.misc:
    >I have a perl script that reads a, large, html stream (TV program
    >data).
    >
    >I use IO::Socket, do a "my $socket = new" and then a "while (
    ><$socket>)" to fetch the data.
    >
    >Now the problem *might* be their end, but it hangs after *exactly*
    >180K for about 5 minutes and then completes. Firefox pulls the same
    >data in 10s of seconds. Which, to my thinking, would eliminate any
    >funnies in libc.


    The error is in what you are not describing, like what <> does in your
    code. By default it looks for newlines and there might be none in the
    stream after a certain point, and the five minutes might simply be the
    timeout where your program gives up waiting for more data.
    --
    Björn Höhrmann · mailto: · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
    Bjoern Hoehrmann, Jan 14, 2012
    #2

  3. Dave Saville

    Dave Saville Guest

    On Sat, 14 Jan 2012 16:56:36 UTC, Bjoern Hoehrmann
    <> wrote:

    <snip>

    > The error is in what you are not describing, like what <> does in your
    > code. By default it looks for newlines and there might be none in the
    > stream after a certain point, and the five minutes might simply be the
    > timeout where your program gives up waiting for more data.


    It parses the input and writes to a file. It's not a timeout as after
    the long wait it carries on to completion. It's processed line by line
    and it is not running out of memory or anything like that. The socket
    is blocked until some more lines eventually arrive.

    One of the URLs I am having problems with is
    xmltv.radiotimes.com/xmltv/94.dat

    Will try an iptrace but I doubt there is any traffic. It is just so
    suspicious that it is *exactly* 180K bytes.

    --
    Regards
    Dave Saville
    Dave Saville, Jan 14, 2012
    #3
  4. * Dave Saville wrote in comp.lang.perl.misc:
    >It parses the input and writes to a file. It's not a timeout as after
    >the long wait it carries on to completion. It's processed line by line
    >and it is not running out of memory or anything like that. The socket
    >is blocked until some more lines eventually arrive.
    >
    >One of the URLs I am having problems with is
    >xmltv.radiotimes.com/xmltv/94.dat
    >
    >Will try an iptrace but I doubt there is any traffic. It is just so
    >suspicious that it is *exactly* 180K bytes.


    My guess is that you are trying to read an HTTP response via IO::Socket
    and that does not work because you are expecting that while(<$socket>)
    knows when it has read the "last line", but there is no such thing in HTTP.
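
    To illustrate the point: an HTTP client usually learns where the body
    ends from the response headers, not from the stream itself. The
    following is only a rough sketch, assuming the server sends a
    Content-Length header and does not use chunked encoding. The name
    read_response is made up for this example, and the in-memory
    filehandle stands in for a real socket.

```perl
use strict;
use warnings;

# Read one HTTP response from $fh: the status line, the headers, and then
# exactly Content-Length bytes of body.
# Returns (status line, hashref of lower-cased headers, body).
# Assumption: the server sends Content-Length (no chunked encoding).
sub read_response {
    my ($fh) = @_;

    my $status = <$fh>;
    die "no status line\n" unless defined $status;
    $status =~ s/\r?\n\z//;

    my %header;
    while (defined(my $line = <$fh>)) {
        $line =~ s/\r?\n\z//;
        last if $line eq '';              # blank line ends the header block
        my ($name, $value) = split /:\s*/, $line, 2;
        $header{lc $name} = $value if defined $value;
    }

    my $length = $header{'content-length'};
    die "no Content-Length header\n" unless defined $length;

    # Keep reading until the announced number of bytes has arrived.
    my $body = '';
    while (length($body) < $length) {
        my $got = read($fh, $body, $length - length($body), length($body));
        die "read failed: $!\n" unless defined $got;
        last if $got == 0;                # peer closed early
    }
    return ($status, \%header, $body);
}

# Demonstration with an in-memory "socket"; the trailing bytes are ignored.
my $raw = "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhelloEXTRA";
open my $fh, '<', \$raw or die "open: $!";
my ($status, $headers, $body) = read_response($fh);
print "$status / $body\n";
```

    With a real socket the same routine returns as soon as the body is
    complete, whether or not the server keeps the connection open.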
    --
    Björn Höhrmann · mailto: · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
    Bjoern Hoehrmann, Jan 14, 2012
    #4
  5. Dave Saville

    Guest

    On Sat, 14 Jan 2012 15:19:19 +0000 (UTC), "Dave Saville"
    <> wrote:

    >I have a perl script that reads a, large, html stream (TV program
    >data).
    >
    >I use IO::Socket, do a "my $socket = new" and then a "while (
    ><$socket>)" to fetch the data.
    >
    >Now the problem *might* be their end, but it hangs after *exactly*
    >180K for about 5 minutes and then completes. Firefox pulls the same
    >data in 10s of seconds. Which, to my thinking, would eliminate any
    >funnies in libc.
    >
    >Any thoughts?
    >
    >TIA


    Perhaps the IO::Socket module is not your best bet.

    I do something similar and I use LWP::Simple. Streams come right in
    at full bandwidth speed.

    Good luck
    , Jan 15, 2012
    #5
  6. Dave Saville

    Dave Saville Guest

    On Sun, 15 Jan 2012 00:28:24 UTC, wrote:

    > On Sat, 14 Jan 2012 15:19:19 +0000 (UTC), "Dave Saville"
    > <> wrote:
    >
    > >I have a perl script that reads a, large, html stream (TV program
    > >data).
    > >
    > >I use IO::Socket, do a "my $socket = new" and then a "while (
    > ><$socket>)" to fetch the data.
    > >
    > >Now the problem *might* be their end, but it hangs after *exactly*
    > >180K for about 5 minutes and then completes. Firefox pulls the same
    > >data in 10s of seconds. Which, to my thinking, would eliminate any
    > >funnies in libc.
    > >
    > >Any thoughts?
    > >
    > >TIA

    >
    > Perhaps the IO::Socket module is not your best bet.
    >
    > I do something similar and I use LWP::Simple. Streams come right in
    > at full bandwidth speed.
    >


    It behaves, or misbehaves, the same way with both Socket and
    IO::Socket, but I suppose it would, the latter being a wrapper for
    the former. Never thought of LWP::Simple as it is not really an HTML
    page, just data.

    The point several of you seem to have missed is that after the hang at
    180K for minutes the stream resumes with no missing data. I ran an
    iptrace and that showed damn all during the hang. I really think it
    must be the server end - which I have nothing to do with. I have also
    run my code against my own server, although not with such a big file,
    but "normal" HTML pages and it works just fine.
    --
    Regards
    Dave Saville
    Dave Saville, Jan 15, 2012
    #6
  7. On 2012-01-15 11:33, Dave Saville <> wrote:
    > On Sun, 15 Jan 2012 00:28:24 UTC, wrote:
    >> On Sat, 14 Jan 2012 15:19:19 +0000 (UTC), "Dave Saville"
    >> <> wrote:
    >>
    >> >I have a perl script that reads a, large, html stream (TV program
    >> >data).
    >> >
    >> >I use IO::Socket, do a "my $socket = new" and then a "while (
    >> ><$socket>)" to fetch the data.
    >> >
    >> >Now the problem *might* be their end, but it hangs after *exactly*
    >> >180K for about 5 minutes and then completes. Firefox pulls the same
    >> >data in 10s of seconds. Which, to my thinking, would eliminate any
    >> >funnies in libc.

    >>
    >> Perhaps the IO::Socket module is not your best bet.
    >>
    >> I do something similar and I use LWP::Simple. Streams come right in
    >> at full bandwidth speed.
    >>

    >
    > It behaves, or misbehaves, with socket and io::socket - but I suppose
    > it would the latter being a wrapper for the former. Never thought of
    > LWP::Simple as it is not really an HTML page - just data.


    Do you use HTTP to get the data or some custom protocol?

    hp


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
    Peter J. Holzer, Jan 15, 2012
    #7
  8. Dave Saville

    Dave Saville Guest

    Re: problem reading html stream SOLVED

    On Sun, 15 Jan 2012 12:16:08 UTC, "Peter J. Holzer"
    <> wrote:

    <snip>

    >
    > Do you use HTTP to get the data or some custom protocol?


    HTTP - But it would appear to be a problem with perl sockets - Someone
    suggested LWP::Simple but that was no good as I needed to process the
    files which are large and the server does not have much RAM. So I used
    LWP::UserAgent to dump straight to a file which I can then post
    process and it works fine. Odd as I would have thought that LWP* would
    use sockets at the bottom layer. Ho hum.
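
    For anyone following along, the download-to-file approach might look
    roughly like the sketch below. It assumes LWP::UserAgent is installed;
    fetch_to_file is just a name picked for this example, and the 60
    second timeout is an arbitrary choice.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Download $url straight into $file without holding the body in memory.
# fetch_to_file is a hypothetical helper name, not part of LWP.
sub fetch_to_file {
    my ($url, $file) = @_;

    my $ua = LWP::UserAgent->new(timeout => 60);

    # ':content_file' asks LWP to write the response body to the named
    # file as it arrives instead of storing it in the response object.
    my $response = $ua->get($url, ':content_file' => $file);

    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response;
}
```

    Calling fetch_to_file('http://xmltv.radiotimes.com/xmltv/94.dat', 'RAW')
    would then leave the data in RAW for post-processing.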

    Thanks for the help guys.
    --
    Regards
    Dave Saville
    Dave Saville, Jan 15, 2012
    #8
  9. Re: problem reading html stream SOLVED

    On 2012-01-15 14:01, Dave Saville <> wrote:
    > On Sun, 15 Jan 2012 12:16:08 UTC, "Peter J. Holzer"
    ><> wrote:
    >
    ><snip>
    >
    >> Do you use HTTP to get the data or some custom protocol?

    >
    > HTTP - But it would appear to be a problem with perl sockets - Someone
    > suggested LWP::Simple but that was no good as I needed to process the
    > files which are large and the server does not have much RAM. So I used
    > LWP::UserAgent to dump straight to a file which I can then post
    > process and it works fine. Odd as I would have thought that LWP* would
    > use sockets at the bottom layer. Ho hum.


    It does. You probably made an error in writing your own HTTP
    implementation.

    hp


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
    Peter J. Holzer, Jan 15, 2012
    #9
  10. Dave Saville

    Dave Saville Guest

    Re: problem reading html stream SOLVED

    On Sun, 15 Jan 2012 15:09:39 UTC, "Peter J. Holzer"
    <> wrote:

    > On 2012-01-15 14:01, Dave Saville <> wrote:
    > > On Sun, 15 Jan 2012 12:16:08 UTC, "Peter J. Holzer"
    > ><> wrote:
    > >
    > ><snip>
    > >
    > >> Do you use HTTP to get the data or some custom protocol?

    > >
    > > HTTP - But it would appear to be a problem with perl sockets - Someone
    > > suggested LWP::Simple but that was no good as I needed to process the
    > > files which are large and the server does not have much RAM. So I used
    > > LWP::UserAgent to dump straight to a file which I can then post
    > > process and it works fine. Odd as I would have thought that LWP* would
    > > use sockets at the bottom layer. Ho hum.

    >
    > It does. You probably made an error in writing your own HTTP
    > implementation.


    That I am willing to believe. Perhaps you would be so kind as to point
    out the error in my code?

    #!/usr/local/bin/perl
    use warnings;
    use strict;
    use Socket;
    open RAW, ">RAW" or die $!;
    my $iaddr = inet_aton('xmltv.radiotimes.com') or die $!;
    socket(SOCK, AF_INET, SOCK_STREAM, getprotobyname('tcp')) or die $!;
    my $paddr = sockaddr_in(80, $iaddr);
    connect(SOCK, $paddr) or die $!;
    send SOCK, "GET /xmltv/94.dat HTTP\/1.1\r\n", 0;
    send SOCK, "Host: xmltv.radiotimes.com\r\n\r\n", 0;
    while ( <SOCK> )
    {
    print RAW $_;
    }
    close SOCK;
    close RAW;

    This hangs for minutes and then completes. I have run the above on two
    different operating systems and they both do exactly the same.

    --
    Regards
    Dave Saville
    Dave Saville, Jan 15, 2012
    #10
  11. Re: problem reading html stream SOLVED

    Dave Saville <> wrote:
    > On Sun, 15 Jan 2012 15:09:39 UTC, "Peter J. Holzer"
    > <> wrote:


    > > On 2012-01-15 14:01, Dave Saville <> wrote:
    > > > On Sun, 15 Jan 2012 12:16:08 UTC, "Peter J. Holzer"
    > > ><> wrote:
    > > >
    > > ><snip>
    > > >
    > > >> Do you use HTTP to get the data or some custom protocol?
    > > >
    > > > HTTP - But it would appear to be a problem with perl sockets - Someone
    > > > suggested LWP::Simple but that was no good as I needed to process the
    > > > files which are large and the server does not have much RAM. So I used
    > > > LWP::UserAgent to dump straight to a file which I can then post
    > > > process and it works fine. Odd as I would have thought that LWP* would
    > > > use sockets at the bottom layer. Ho hum.

    > >
    > > It does. You probably made an error in writing your own HTTP
    > > implementation.


    > That I am willing to believe. Perhaps you would be so kind as to point
    > out the error in my code?


    > #!/usr/local/bin/perl
    > use warnings;
    > use strict;
    > use Socket;
    > open RAW, ">RAW" or die $!;
    > my $iaddr = inet_aton('xmltv.radiotimes.com') or die $!;
    > socket(SOCK, AF_INET, SOCK_STREAM, getprotobyname('tcp')) or die $!;
    > my $paddr = sockaddr_in(80, $iaddr);
    > connect(SOCK, $paddr) or die $!;
    > send SOCK, "GET /xmltv/94.dat HTTP\/1.1\r\n", 0;
    > send SOCK, "Host: xmltv.radiotimes.com\r\n\r\n", 0;
    > while ( <SOCK> )
    > {
    > print RAW $_;
    > }
    > close SOCK;
    > close RAW;
    >
    > This hangs for minutes and then completes. I have run the above on two
    > different operating systems and they both do exactly the same.


    This 180 kB looks suspiciously like the length of the file the
    server sends. And you're using HTTP 1.1, which allows the sender
    to keep the connection open after it has sent a file, waiting
    for the next request unless told otherwise (a "persistent
    connection" is actually the default with HTTP 1.1). So my guess is
    that the server sends the complete file just fine and then waits
    for the next request. But since your loop only ends when the
    connection is closed by the other side, it hangs until the server
    gets bored and closes the connection after a few minutes. So either
    use HTTP 1.0 or send an additional HTTP header with (IIRC)
    "Connection: close\r\n". See also e.g.

    http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html
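
    The header variant could be sketched as follows. This is untested
    against the server in question and uses IO::Socket::INET rather than
    the bare Socket calls from the posted script; build_request and
    fetch_raw are names invented for this example.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Build a minimal HTTP/1.1 GET request that asks the server to close the
# connection once the response has been sent.
sub build_request {
    my ($host, $path) = @_;
    return join "\r\n",
        "GET $path HTTP/1.1",
        "Host: $host",
        "Connection: close",
        '', '';                           # blank line ends the header block
}

# Fetch $path from $host on port 80 and copy the raw response
# (headers and body) to $out_file.
sub fetch_raw {
    my ($host, $path, $out_file) = @_;

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
    ) or die "connect to $host failed: $@\n";

    open my $out, '>', $out_file or die "open $out_file: $!\n";
    binmode $_ for $sock, $out;

    print {$sock} build_request($host, $path);

    # With "Connection: close" the server closes the socket after the last
    # byte, so this loop sees end-of-file promptly instead of waiting for
    # the keep-alive timeout.
    while (my $line = <$sock>) {
        print {$out} $line;
    }
    close $sock;
    close $out or die "close $out_file: $!\n";
}
```

    For the file discussed here the call would be
    fetch_raw('xmltv.radiotimes.com', '/xmltv/94.dat', 'RAW').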

    Regards, Jens
    --
    \ Jens Thoms Toerring ___
    \__________________________ http://toerring.de
    Jens Thoms Toerring, Jan 15, 2012
    #11
  12. Dave Saville

    Dave Saville Guest

    Re: problem reading html stream SOLVED

    On Sun, 15 Jan 2012 17:32:34 UTC, (Jens Thoms Toerring)
    wrote:

    > This 180 kB looks suspiciously like the length of the file the
    > server sends. And you're using HTTP 1.1, which allows the sender
    > to keep the connection open after it has sent a file, waiting
    > for the next request unless told otherwise (a "persistent
    > connection" is actually the default with HTTP 1.1). So my guess is
    > that the server sends the complete file just fine and then waits
    > for the next request. But since your loop only ends when the
    > connection is closed by the other side, it hangs until the server
    > gets bored and closes the connection after a few minutes. So either
    > use HTTP 1.0 or send an additional HTTP header with (IIRC)
    > "Connection: close\r\n". See also e.g.
    >



    Thank you so much Jens, reverting to 1.0 or adding the header both
    work.
    --
    Regards
    Dave Saville
    Dave Saville, Jan 15, 2012
    #12
