strip all html but links

Felix Smith · Jan 11, 2004

How would you go about removing all html tags from a Web page's source
code, except for links ? I've been successfully using the function
below to get rid of *all* html tags. But I need to keep links. Any
code you can post to help will be much appreciated.

Felix.

function I've been using:

sub html_to_ascii {
    use HTML::TreeBuilder;
    use HTML::FormatText;
    my $document = $_[0];
    my $html = HTML::TreeBuilder->new();
    $html->parse($document);
    $html->eof();     # signal end of input so the tree is complete
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
    my $return = $formatter->format($html);
    $html->delete();  # free the parse tree
    return $return;
}

A. Sinan Unur · Jan 11, 2004

Felix Smith wrote:

How would you go about removing all html tags from a Web page's source
code, except for links?

See the hanchors example that comes with the HTML::Parser module:

http://search.cpan.org/src/GAAS/HTML-Parser-3.35/eg/
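
For readers without the distribution at hand, here is a minimal sketch in the same spirit. It is not the hanchors script itself; the helper name extract_hrefs and the sample markup are assumptions made for illustration. It registers a start-tag handler with HTML::Parser and collects the href of every <a> tag.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the href of every <a> start tag in $html.
sub extract_hrefs {
    my ($html) = @_;
    my @hrefs;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for each start tag with its lowercased name and attribute hash.
        start_h => [
            sub {
                my ( $tagname, $attr ) = @_;
                push @hrefs, $attr->{href}
                    if $tagname eq 'a' && defined $attr->{href};
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;    # flush any buffered input
    return @hrefs;
}

# Sample input (made up): one real link and one named anchor without href.
print "$_\n"
    for extract_hrefs('<p><a href="http://example.com/">home</a> <a name="top">top</a></p>');
```

Only anchors that carry an href are reported, so the named anchor in the sample is skipped.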

dominix · Jan 11, 2004

Felix said:
How would you go about removing all html tags from a Web page's source
code, except for links ? I've been successfully using the function
below to get rid of *all* html tags. But I need to keep links. Any
code you can post to help will be much appreciated.

use strict;
use HTML::TokeParser::Simple;

# the HTML file name is taken from the command line
my $p = HTML::TokeParser::Simple->new( shift );

while ( my $token = $p->get_token ) {
    # keep the text, and print the href of every <a> start tag
    print $token->as_is if $token->is_text;
    print $token->get_attr('href') if $token->is_start_tag('a');
}

Felix · Jan 11, 2004

Thanks so much for helping with this. Can you tell me how to change
the code below so I can use it via a function called, say,
remove_tags, like this:

$stripped_content = remove_tags($content_with_tags);

Thank you very much again!

dominix · Jan 11, 2004

well, try something like (untested)

use strict;
use HTML::TokeParser::Simple;

sub whatever_you_want_the_name {
    my $html = shift;
    # pass a reference so the string is parsed as HTML, not opened as a file name
    my $p = HTML::TokeParser::Simple->new( \$html );
    my $result = '';
    while ( my $token = $p->get_token ) {
        $result .= $token->as_is if $token->is_text;
        $result .= $token->get_attr('href') if $token->is_start_tag('a');
    }
    return $result;
}
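
The function above appends only the href value of each link. If the goal is to keep the link markup itself, as the original question suggests, a variation along these lines may be closer. This is a sketch under assumptions: the name strip_tags_keep_links and the sample input are hypothetical, and it relies on as_is returning the tag exactly as it appeared in the source.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: drops every tag except <a ...> and </a>.
sub strip_tags_keep_links {
    my ($html) = @_;
    # A scalar reference makes the parser read the string as HTML text
    # instead of treating it as a file name.
    my $p = HTML::TokeParser::Simple->new( \$html );
    my $result = '';
    while ( my $token = $p->get_token ) {
        if ( $token->is_text ) {
            $result .= $token->as_is;    # plain text is kept unchanged
        }
        elsif ( $token->is_start_tag('a') || $token->is_end_tag('a') ) {
            $result .= $token->as_is;    # link tags are kept verbatim
        }
        # every other tag, comment or declaration is dropped
    }
    return $result;
}

# Sample input (made up for illustration).
print strip_tags_keep_links(
    '<p>See <a href="http://example.com/">this <b>page</b></a>.</p>'
), "\n";
```

Usage then matches the call asked for earlier in the thread: $stripped_content = strip_tags_keep_links($content_with_tags);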

Robin · Jan 13, 2004

Felix Smith said:
How would you go about removing all html tags from a Web page's source
code, except for links ? I've been successfully using the function
below to get rid of *all* html tags. But I need to keep links. Any
code you can post to help will be much appreciated.

instead use tr// or s//

Felix.

function I've been using:

sub html_to_ascii {
use HTML::TreeBuilder;
use HTML::FormatText;
$document = $_[0];
$html = HTML::TreeBuilder->new();
$html->parse($document);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
$return = $formatter->format($html);
return $return;
}

that's a little slower than what I mentioned earlier...

Uri Guttman · Jan 13, 2004

R> instead use tr// or s//

ok, explain how you can remove any html with tr///?

and then explain how you can accurately remove html with s///? did you
read the FAQ on this? NOT!

sub html_to_ascii {
use HTML::TreeBuilder;
use HTML::FormatText;
$document = $_[0];
$html = HTML::TreeBuilder->new();
$html->parse($document);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
$return = $formatter->format($html);
return $return;
}


R> that's a little slower than what I mentioned earlier...

and a whole lot more accurate. which is better, wrong and fast or slow
and accurate. remember, your entire programming career is depending on
your answer. think hard. then rethink what you answered above.

uri

Jürgen Exner · Jan 13, 2004

Robin said:
instead use tr// or s//

How come it doesn't surprise me that such idiotic advice is coming from
you?

No, s// is absolutely not the right tool to parse/deal with HTML.

And suggesting tr// is just plain ridiculous. Please show me the code to
remove all HTML tags from a text except links using tr and I will send you a
$100 gift certificate for Barnes & Noble, so that you can buy yourself
some nice Perl books.

jue
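
A short illustration of why s/// goes wrong here. No regex was actually posted in the thread, so the pattern below is an assumption about the kind of one-liner the suggestion implies; the helper name naive_strip and the sample inputs are made up.

```perl
use strict;
use warnings;

# Hypothetical naive stripper: removes anything that looks like <...>.
sub naive_strip {
    my ($html) = @_;
    ( my $text = $html ) =~ s/<[^>]*>//g;
    return $text;
}

# A '>' inside a quoted attribute value ends the match too early,
# so part of the tag leaks into the "text".
print naive_strip('<img src="x.png" alt="a > b">ok'), "\n";

# The same happens with a '>' inside a comment.
print naive_strip('<!-- a > b -->text'), "\n";
```

Both calls leave tag debris in the result, which a real parser such as HTML::Parser avoids.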

Uri Guttman · Jan 13, 2004

JE> How come it doesn't surprise me that such idiotic advice is coming from
JE> you?

JE> And suggesting tr// is just plain ridiculous. Please show me the code to
JE> remove all HTML tags from a text except links using tr and I will send you a
JE> $100 gift certificate for Barnes & Noble, so that you can buy yourself
JE> some nice Perl books.

i will donate to that one. not a great risk

maybe like this:

<very rough pseudo code>

while ( $i < length $html ) {
$char = substr( $html, $i, 1 ) ;

if ( $char =~ tr/<>// ) {

$DIETY knows what code
}
else {

$DIETY knows what state
}
}

ain't tr useful!

uri

Tassilo v. Parseval · Jan 13, 2004

Thus spoke Uri Guttman:

R> instead use tr// or s//

ok, explain how you can remove any html with tr///?

With a state-machine of course. Tss, Uri, don't you know anything?

Tassilo

Uri Guttman · Jan 13, 2004

TvP> Thus spoke Uri Guttman:
TvP> With a state-machine of course. Tss, Uri, don't you know anything?

see my other post

uri
