html2text but preserving text links

Felix

The script below tries converting html to text while preserving
textual links. However, with www.news.com it renders many lines all
bunched together as opposed to neatly separated as lynx would do with
the same page (see www.marcfest.com/qxi5/news.cgi to see what I mean).
Anybody know how I can improve the script? Using backticks with lynx
is not an option, btw.

Thank you very much.

Marc

SCRIPT:

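#!/usr/bin/perl

use strict;
use LWP::Simple;
use HTML::TagFilter;

# Fetch the page; get() returns undef on failure.
my $content = get("http://www.news.com");
die "Could not fetch http://www.news.com\n" unless defined $content;

# Keep only a, br, p, script and style tags (with any attributes).
my $tf = HTML::TagFilter->new(
    strip_comments => 1,
    allow => {a=>{'any'},br=>{'any'},p=>{'any'},script=>{'any'},style=>{'any'}},
);
$content = $tf->filter($content);

print $content;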
 
James Willmore

The script below tries converting html to text while preserving
textual links. However, with www.news.com it renders many lines all
bunched together as opposed to neatly separated as lynx would do with
the same page (see www.marcfest.com/qxi5/news.cgi to see what I mean).
Anybody know how I can improve the script? Using backticks with lynx
is not an option, btw.

What exactly are you trying to do? I understand the output doesn't look
the greatest, but I'm not seeing your point. Are you trying to extract
just the links? text?

What command line options are you using with lynx? That may help.

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
World War Three can be averted by adherence to a strictly
enforced dress code!
 
Bob Walton

Felix said:
The script below tries converting html to text while preserving
textual links. However, with www.news.com it renders many lines all
bunched together as opposed to neatly separated as lynx would do with
the same page (see www.marcfest.com/qxi5/news.cgi to see what I mean).
Anybody know how I can improve the script? Using backticks with lynx
is not an option, btw.
....

Sounds like maybe you could:

use Text::Wrap;

to get more presentable text?
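
A minimal sketch of what that could look like, assuming the filtered output has already been reduced to plain text with blank lines between paragraphs. The helper name wrap_paragraphs and the 72-column width are only illustrative choices, not anything from the original script:

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Text::Wrap keeps each line shorter than $columns characters.
$Text::Wrap::columns = 72;

# Hypothetical helper: re-wrap each blank-line-separated paragraph.
sub wrap_paragraphs {
    my ($text) = @_;
    my @wrapped;
    for my $para (split /\n\s*\n/, $text) {
        my $flat = $para;
        $flat =~ s/\s+/ /g;       # collapse runs of whitespace and newlines
        $flat =~ s/^ | $//g;      # trim a leading or trailing space
        next unless length $flat;
        push @wrapped, wrap('', '', $flat);
    }
    return join "\n\n", @wrapped;
}

my $sample = join(' ', ('bunched') x 30) . "\n\n" . "Second paragraph.";
print wrap_paragraphs($sample), "\n";
```

Note that this only re-flows lines that are too long; it does not put back paragraph breaks that were lost when the tags were stripped.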
 
Felix

I figured out a way to do this: keep the existing code, but convert the
div tags into <br> tags before running the filter.
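
A rough sketch of that conversion step, assuming a simple regex substitution is good enough for the page in question. The helper name divs_to_breaks and the sample markup are made up for illustration, and a regex like this will not cope with every piece of malformed HTML:

```perl
use strict;
use warnings;

# Hypothetical helper: replace every opening or closing <div> tag with <br>
# so block boundaries survive a filter that only allows a, br and p.
sub divs_to_breaks {
    my ($html) = @_;
    $html =~ s{</?div\b[^>]*>}{<br>}gi;
    return $html;
}

my $sample = '<div class="story"><a href="/a.html">First</a></div><div>Second</div>';
print divs_to_breaks($sample), "\n";
```

In the script above, this would run on $content just before the $tf->filter($content) call.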

Thanks for getting back to me.

Marc.
 
