html2text but preserving text links

Discussion in 'Perl Misc' started by Felix, Feb 21, 2004.

  1. Felix

    Felix Guest

    The script below tries to convert HTML to text while preserving
    textual links. However, with www.news.com it renders many lines
    bunched together instead of neatly separated, the way lynx renders
    the same page (see www.marcfest.com/qxi5/news.cgi to see what I
    mean). Does anybody know how I can improve the script? Using
    backticks with lynx is not an option, btw.

    Thank you very much.

    Marc

    SCRIPT:

    #!/usr/bin/perl
    use strict;

    use LWP::Simple;
    use HTML::TagFilter;

    # Fetch the page to convert.
    my $content = get('http://www.news.com')
        or die "could not fetch page\n";

    # Strip HTML comments and every tag not listed under allow; the
    # text inside stripped tags is left in place.
    my $tf = HTML::TagFilter->new(
        strip_comments => 1,
        allow          => {
            a      => { 'any' },
            br     => { 'any' },
            p      => { 'any' },
            script => { 'any' },
            style  => { 'any' },
        },
    );
    $content = $tf->filter($content);

    print $content;
     
    Felix, Feb 21, 2004
    #1

  2. On Sat, 21 Feb 2004 06:51:05 -0800, Felix wrote:

    > The script below tries to convert HTML to text while preserving
    > textual links. However, with www.news.com it renders many lines
    > bunched together instead of neatly separated, the way lynx renders
    > the same page (see www.marcfest.com/qxi5/news.cgi to see what I
    > mean). Does anybody know how I can improve the script? Using
    > backticks with lynx is not an option, btw.


    What exactly are you trying to do? I understand the output doesn't
    look the greatest, but I'm not seeing your point. Are you trying to
    extract just the links? Just the text?

    What command-line options are you using with lynx? Knowing that may
    help.

    --
    Jim

    Copyright notice: all code written by the author in this post is
    released under the GPL. http://www.gnu.org/licenses/gpl.txt
    for more information.

    a fortune quote ...
    World War Three can be averted by adherence to a strictly
    enforced dress code!
     
    James Willmore, Feb 21, 2004
    #2

  3. Bob Walton

    Bob Walton Guest

    Felix wrote:

    > The script below tries to convert HTML to text while preserving
    > textual links. However, with www.news.com it renders many lines
    > bunched together instead of neatly separated, the way lynx renders
    > the same page (see www.marcfest.com/qxi5/news.cgi to see what I
    > mean). Does anybody know how I can improve the script? Using
    > backticks with lynx is not an option, btw.

    ....

    Sounds like maybe you could:

    use Text::Wrap;

    to get more presentable text?
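
    For instance, a minimal sketch (assuming your filtered text is in
    $text; the 72-column width is just an example, Text::Wrap's
    default is 76):

    use strict;
    use warnings;
    use Text::Wrap qw(wrap);

    $Text::Wrap::columns = 72;   # wrap at 72 characters

    my $text = "one very long line of filtered output ...";
    # wrap(initial_indent, subsequent_indent, text)
    print wrap('', '', $text);

    The fill() function from the same module will also re-flow text
    that already contains line breaks, paragraph by paragraph.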

    --
    Bob Walton
    Email: http://bwalton.com/cgi-bin/emailbob.pl
     
    Bob Walton, Feb 23, 2004
    #3
  4. Felix

    Felix Guest

    I figured out a way to do this: keep the existing code, but convert
    the <div> tags into <br> tags first (roughly like the substitution
    sketched below).
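
    A sketch of that pre-pass (the exact regexes are an approximation),
    run on $content before the $tf->filter() call:

    # Turn <div> boundaries into <br> so the filtered text keeps its
    # line structure instead of running together.
    $content =~ s/<div\b[^>]*>/<br>/gi;   # opening <div ...> tags
    $content =~ s{</div\s*>}{<br>}gi;     # closing </div> tags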

    Thanks for getting back to me.

    Marc.


    > What command line options are you using with lynx? That may help.
    >
    > --
    > Jim
    >
    > Copyright notice: all code written by the author in this post is
    > released under the GPL. http://www.gnu.org/licenses/gpl.txt
    > for more information.
    >
    > a fortune quote ...
    > World War Three can be averted by adherence to a strictly
    > enforced dress code!
     
    Felix, Feb 23, 2004
    #4
