Capturing actual browser output in Perl

Discussion in 'Perl Misc' started by digz, May 22, 2009.

  1. digz

    digz Guest

    #!/usr/bin/perl
    use LWP;
    my $browser = LWP::UserAgent->new;
    my $response = $browser->get( "http://lkml.org" );
    print( $response->content );

    In this program I am trying to get the output as the browser displays
    it, not the actual HTML page with all the tags that
    $response->content returns.


    For example, this URL,

    What I want to save in a string is how the browser shows it:

    Last 100 messages Today's messages Yesterday's messages
    Hottest Messages
    LKML.ORG

    NOT

    what the actual HTML content is:

    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <link href="/css/frontpage.css" rel="stylesheet" type="text/css" />

    <title>LKML.ORG - the Linux Kernel Mailing List Archive</title>
    <script type="text/javascript" src="/css/multiline-tooltip.js"></script>
    </head>
    ......
    Is there any easy way to achieve this?

    Thanks

    Digz
     
    digz, May 22, 2009
    #1

  2. digz wrote:
    >#!/usr/bin/perl
    >use LWP;
    >my $browser = LWP::UserAgent->new;
    >my $response = $browser->get( "http://lkml.org" );
    >print( $response->content );
    >
    >In this program I am trying to get the output as the browser displays
    >it , not the actual HTML page with all the tags .., that
    >$response->content returns.


    The way you stated your requirements, your best bet is a screen capture
    tool, because the output of a browser depends not only on the HTML but
    to a large extent on user settings and configuration.
    A different rendering tool would therefore have to use the same
    configuration as the browser and interpret it the same way.

    >For a example , this URL ,
    >
    >What I want to save in a string is how the browser shows it


    But a browser shows a graphic with different fonts, styles, colors,
    layouts, tables, ...
    You cannot save that as a "text string" (unless you incorporate that
    formatting information in the string, of course, but then it is no
    longer plain text).

    >Last 100 messages Today's messages Yesterday's messages
    >Hottest Messages
    >LKML.ORG
    >
    >NOT
    >
    >what the actual HTML content is:
    >.....
    >Is there any easy way to achieve this


    The easiest way to get an approximation of the textual part of the
    display is to use a text-only browser such as Lynx and redirect its
    output to a file (Lynx has the -dump option for that).
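
A minimal sketch of the Lynx approach, driven from Perl. It assumes a `lynx` binary is on the PATH; the helper names `lynx_command` and `render_with_lynx` are made up for illustration. `-dump` writes the rendered page to standard output, and `-nolist` suppresses the numbered list of links that Lynx otherwise appends.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the Lynx command line for a URL (hypothetical helper).
# -dump   : print the rendered page to stdout and exit
# -nolist : omit the numbered link list at the end of the dump
sub lynx_command {
    my ($url) = @_;
    return ('lynx', '-dump', '-nolist', $url);
}

# Run Lynx and return the rendered text as a single string.
sub render_with_lynx {
    my ($url) = @_;

    # List-form pipe open: the URL is never interpreted by a shell.
    open(my $lynx, '-|', lynx_command($url))
        or die "Cannot run lynx: $!";
    my $text = do { local $/; <$lynx> };    # slurp all output
    close($lynx) or die "lynx exited with status $?";
    return $text;
}

# Only touch the network when a URL is given on the command line.
print render_with_lynx($ARGV[0]) if @ARGV;
```

Called as `perl lynxdump.pl http://lkml.org` (the script name is arbitrary), this prints roughly what Lynx would show on screen. The result is Lynx's rendering, not that of any graphical browser.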

    Another way, probably more customizable (what do you intend to do with
    tool tips? Alternate text and captions for graphics? DHTML? How much
    JavaScript do you want to run? ...?) is to run the HTML code through an
    HTML parser and extract those text pieces you are interested in. There
    are several parsers on CPAN.
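
As a rough sketch of the parser route, the following uses HTML::Parser from CPAN to collect the text nodes and drop everything inside `<script>` and `<style>`. The helper name `visible_text` and the sample markup are invented for illustration. It runs no JavaScript and ignores alt text and tool tips, so it only approximates what a browser shows.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Return the text content of an HTML string, one line per text run
# (hypothetical helper, not part of HTML::Parser itself).
sub visible_text {
    my ($html) = @_;
    my @chunks;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # "dtext" hands the handler text with entities already decoded.
        text_h => [ sub { push @chunks, $_[0] }, 'dtext' ],
    );
    $parser->unbroken_text(1);                     # deliver each text run whole
    $parser->ignore_elements(qw(script style));    # skip code and CSS
    $parser->parse($html);
    $parser->eof;

    # Trim each run and drop runs that are only whitespace.
    my @lines;
    for my $chunk (@chunks) {
        $chunk =~ s/^\s+|\s+$//g;
        push @lines, $chunk if length $chunk;
    }
    return join "\n", @lines;
}

# Made-up sample markup, loosely modelled on the page in the question.
my $sample = '<html><head><script>var x = 1;</script></head>'
           . '<body><a href="/last100">Last 100 messages</a> '
           . '<p>Hottest Messages</p><p>LKML.ORG</p></body></html>';
print visible_text($sample), "\n";
```

Handlers for `start_h` and `end_h` could be added to decide per tag what to keep, which is where the customization mentioned above would go.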
     
    Jürgen Exner, May 22, 2009
    #2

  3. digz wrote:
    > #!/usr/bin/perl
    > use LWP;
    > my $browser = LWP::UserAgent->new;
    > my $response = $browser->get( "http://lkml.org" );
    > print( $response->content );
    >
    > In this program I am trying to get the output as the browser displays
    > it , not the actual HTML page with all the tags .., that
    > $response->content returns.


    You may want to check out:

    http://search.cpan.org/dist/html2text/

    http://search.cpan.org/perldoc?HTML::FormatText::Html2text

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, May 22, 2009
    #3
  4. In Dread Ink, the Grave Hand of digz Did Inscribe:

    > In this program I am trying to get the output as the browser displays
    > it , not the actual HTML page with all the tags .., that
    > $response->content returns.


    I was attempting much the same thing a while back, and I think this
    was the closest I came:

    #!/usr/bin/perl
    # perl wahab4.pl

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::parser;
    use HTML::FormatText;
    my ($html, $ascii);
    $html = get("http://www.co-array.com/");
    defined $html
    or die "Can't fetch HTML from http://www.co-array.com/";
    $ascii = HTML::FormatText->new->format(parse_html($html));
    print $ascii;


    C:\MinGW\source>perl wahab4.pl
    Undefined subroutine &main::parse_html called at wahab4.pl line 12.

    I'm having trouble using the modules that are on CPAN. I sure wish every
    module included a bevy of examples.
    --
    Frank

    No Child Left Behind is the most ironically named act, piece of legislation
    since the 1942 Japanese Family Leave Act.
    ~~ Al Franken, in response to the 2004 SOTU address
     
    Franken Sense, May 24, 2009
    #4
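
One likely cause of the "Undefined subroutine &main::parse_html" error above: `parse_html` is exported by the older HTML::Parse module (part of the HTML-Tree distribution), not by HTML::Parser, so nothing in the script defines it. A sketch that avoids it, assuming HTML::TreeBuilder and HTML::FormatText are installed, builds the tree explicitly and hands it to the formatter. The helper name `html_to_text` and the margin values are illustrative choices.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Convert an HTML string to plain text (hypothetical helper).
sub html_to_text {
    my ($html) = @_;

    # Parse the markup into an HTML::Element tree.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Margins are arbitrary; adjust to taste.
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);

    $tree->delete;    # release the tree once formatting is done
    return $text;
}

# Fetch and format a page only when a URL is given on the command line.
if (@ARGV) {
    require LWP::Simple;
    my $html = LWP::Simple::get($ARGV[0]);
    defined $html or die "Can't fetch HTML from $ARGV[0]";
    print html_to_text($html);
}
```

HTML::FormatText lays out headings, paragraphs and lists but skips tables and forms, so pages built largely from tables may come out sparse.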