Capturing actual browser output in Perl

digz

#!/usr/bin/perl
use LWP;
my $browser = LWP::UserAgent->new;
my $response = $browser->get( "http://lkml.org" );
print( $response->content );

In this program I am trying to get the output as the browser displays
it, not the actual HTML page with all the tags that
$response->content returns.

For example, this URL,

What I want to save in a string is how the browser shows it:

Last 100 messages Today's messages Yesterday's messages
Hottest Messages
LKML.ORG

NOT

what the actual HTML content is:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<link href="/css/frontpage.css" rel="stylesheet" type="text/css" />
<title>LKML.ORG - the Linux Kernel Mailing List Archive</title>
<script type="text/javascript" src="/css/multiline-tooltip.js"></script>
</head>
......
Is there any easy way to achieve this?

Thanks

Digz
 
Jürgen Exner

digz said:
#!/usr/bin/perl
use LWP;
my $browser = LWP::UserAgent->new;
my $response = $browser->get( "http://lkml.org" );
print( $response->content );

In this program I am trying to get the output as the browser displays
it , not the actual HTML page with all the tags .., that $response-

The way you stated your requirements, your best bet is a screen capture
tool, because the output of a browser depends not only on the HTML but
to a large part on user settings and configuration.
A different rendering tool would therefore have to use the same
configuration as the browser and interpret it the same way.
For a example , this URL ,

What I want to save in a string is how the browser shows it

But a browser shows a graphic with different fonts, styles, colors,
layouts, tables, ....
You cannot save that as a "text string" (unless you incorporate that
formatting information in the string, of course, but then it is no
longer plain text).
Last 100 messages Today's messages Yesterday's messages
Hottest Messages
LKML.ORG

NOT

what the actual HTML content is:
.....
Is there any easy way to achieve this

The easiest way to get an approximation of the textual part of the
display is to use a text-only browser such as Lynx and redirect its
output to a file (Lynx has an option for that).
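
As a rough illustration of the Lynx route, the sketch below calls Lynx from Perl and captures its text dump in a string. It assumes a `lynx` binary is on the PATH and a Unix-like system; the helper names `lynx_dump_command` and `rendered_text` are made up for this example. `-dump` writes the formatted page to standard output and `-nolist` leaves out the numbered link list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for a text-only dump of $url.
# (Hypothetical helper; -dump and -nolist are standard Lynx options.)
sub lynx_dump_command {
    my ($url) = @_;
    return ('lynx', '-dump', '-nolist', $url);
}

# Run Lynx and return the rendered text as one string.
sub rendered_text {
    my ($url) = @_;
    # List-form pipe open: no shell is involved, so the URL needs no quoting.
    open(my $lynx, '-|', lynx_dump_command($url))
        or die "Cannot run lynx: $!";
    local $/;                     # slurp mode: read everything at once
    my $text = <$lynx>;
    close($lynx) or die "lynx exited with status $?";
    return $text;
}

# Only fetch when a URL is given on the command line.
print rendered_text($ARGV[0]) if @ARGV;
```

Saved under a name of your choosing and run with a URL as its only argument, it prints roughly what Lynx itself would show for that page.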

Another way, probably more customizable (what do you intend to do with
tool tips? Alternate text and captions for graphics? DHTML? How much
JavaScript do you want to run? ...?) is to run the HTML code through an
HTML parser and extract those text pieces you are interested in. There
are several parsers on CPAN.
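
One possible sketch of the parser route, assuming the CPAN module HTML::TreeBuilder (from the HTML-Tree distribution) is installed. It extracts only the link labels, as one example of "the text pieces you are interested in". The function name `link_texts` and the sample fragment, including its href values, are invented for illustration and are not the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return the visible text of every <a> element in $html, in document order.
sub link_texts {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @texts = map { $_->as_trimmed_text } $tree->look_down(_tag => 'a');
    $tree->delete;    # release the parse tree explicitly
    return @texts;
}

# Made-up fragment standing in for a fetched page.
my $sample = <<'HTML';
<html><head><title>LKML.ORG</title>
<script type="text/javascript">var ignored = 1;</script></head>
<body>
<a href="/example/last100">Last 100 messages</a>
<a href="/example/today">Today's messages</a>
<a href="/example/hot">Hottest Messages</a>
</body></html>
HTML

print "$_\n" for link_texts($sample);
```

Run on the sample, this prints the three link labels, one per line. Choosing other tags or attributes in `look_down` is where the customization mentioned above would go.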
 
Franken Sense

In Dread Ink, the Grave Hand of digz Did Inscribe:
In this program I am trying to get the output as the browser displays
it , not the actual HTML page with all the tags .., that
$response->content returns.

I was attempting much the same thing a while back, and I think this
was the closest I came:

#!/usr/bin/perl
# perl wahab4.pl

use strict;
use warnings;
use LWP::Simple;
use HTML::parser;
use HTML::FormatText;
my ($html, $ascii);
$html = get("http://www.co-array.com/");
defined $html
    or die "Can't fetch HTML from http://www.co-array.com/";
$ascii = HTML::FormatText->new->format(parse_html($html));
print $ascii;


C:\MinGW\source>perl wahab4.pl
Undefined subroutine &main::parse_html called at wahab4.pl line 12.

I'm having trouble using the methods that are on CPAN. I sure wish every
module included a bevy of examples.
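
A likely cause of the error above is that HTML::Parser does not export a `parse_html` function; that name comes from the older HTML::Parse module. A minimal sketch of one way around it, assuming HTML::TreeBuilder and HTML::FormatText are installed from CPAN, is to build the tree with `new_from_content` and pass it to the formatter. The helper name `html_to_text` and the margin values are choices made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder;
use HTML::FormatText;

# Convert an HTML string to plain text, roughly as a text browser lays it out.
sub html_to_text {
    my ($html) = @_;
    my $tree      = HTML::TreeBuilder->new_from_content($html);
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text      = $formatter->format($tree);
    $tree->delete;    # release the parse tree explicitly
    return $text;
}

# Only fetch when a URL is given on the command line.
if (@ARGV) {
    my $url  = $ARGV[0];
    my $html = get($url);
    defined $html or die "Can't fetch HTML from $url";
    print html_to_text($html);
}
```

Run with a URL as its only argument, it prints a plain-text approximation of the page. Scripts are not executed and stylesheets are ignored, so the result will still differ from what a graphical browser shows.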
--
Frank

No Child Left Behind is the most ironically named act, piece of legislation
since the 1942 Japanese Family Leave Act.
~~ Al Franken, in response to the 2004 SOTU address
 
