strip all html but links

Felix Smith

How would you go about removing all HTML tags from a Web page's source
code, except for links? I've been successfully using the function
below to get rid of *all* HTML tags, but I need to keep links. Any
code you can post to help will be much appreciated.

Felix.

function I've been using:

sub html_to_ascii {
    use HTML::TreeBuilder;
    use HTML::FormatText;
    $document = $_[0];
    $html = HTML::TreeBuilder->new();
    $html->parse($document);
    $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
    $return = $formatter->format($html);
    return $return;
}
 
dominix

Felix said:
How would you go about removing all HTML tags from a Web page's source
code, except for links? I've been successfully using the function
below to get rid of *all* HTML tags, but I need to keep links. Any
code you can post to help will be much appreciated.

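use strict;
use warnings;
use HTML::TokeParser::Simple;

# The single argument comes from @ARGV and is treated as a file name.
my $p = HTML::TokeParser::Simple->new( shift );

while ( my $token = $p->get_token ) {
    # Keep plain text exactly as it appears in the source.
    print $token->as_is if $token->is_text;
    # For each <a> start tag, print the link target if there is one.
    next unless $token->is_start_tag( 'a' );
    my $href = $token->get_attr( 'href' );
    print $href if defined $href;
}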
 
Felix

Thanks so much for helping with this. Can you tell me how to change
the code below so I can use it via a function called, say,
remove_tags, like this:

$stripped_content = remove_tags($content_with_tags);

Thank you very much again!
 
dominix

Felix said:
Thanks so much for helping with this. Can you tell me how to change
the code below so I can use it via a function called, say,
remove_tags, like this:

$stripped_content = remove_tags($content_with_tags);

Thank you very much again!

well, try something like (untested)

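use strict;
use warnings;
use HTML::TokeParser::Simple;

sub whatever_you_want_the_name {
    my $html = shift;
    # Pass a reference so the string is parsed as HTML, not opened as a file.
    my $p = HTML::TokeParser::Simple->new( \$html );
    my $result = '';
    while ( my $token = $p->get_token ) {
        $result .= $token->as_is if $token->is_text;
        next unless $token->is_start_tag( 'a' );
        my $href = $token->get_attr( 'href' );
        $result .= $href if defined $href;
    }
    return $result;
}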
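
The function above keeps only the href value of each link. If "keep
links" means leaving the <a ...> and </a> tags themselves in place,
a variant along the following lines might do it. This is a minimal
sketch, not something posted in the thread: the name
strip_tags_keep_links and the sample input are made up for
illustration, and text inside <script> or <style> elements is not
treated specially.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: drops every tag except <a ...> and </a>.
sub strip_tags_keep_links {
    my ($html) = @_;

    # A scalar reference makes the parser read the string itself
    # instead of treating the argument as a file name.
    my $parser = HTML::TokeParser::Simple->new( \$html );
    my $out    = '';

    while ( my $token = $parser->get_token ) {
        if ( $token->is_text ) {
            # Plain text is copied through unchanged.
            $out .= $token->as_is;
        }
        elsif ( $token->is_start_tag('a') || $token->is_end_tag('a') ) {
            # Anchor tags are copied through with their attributes intact.
            $out .= $token->as_is;
        }
        # Every other token (tags, comments, declarations) is dropped.
    }
    return $out;
}

print strip_tags_keep_links(
    '<p>See <a href="http://example.com/">this <b>page</b></a>.</p>'
), "\n";
```

With the sample input, the <p> and <b> tags are dropped while the
anchor and its href stay where they were.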
 
Robin

Felix Smith said:
How would you go about removing all HTML tags from a Web page's source
code, except for links? I've been successfully using the function
below to get rid of *all* HTML tags, but I need to keep links. Any
code you can post to help will be much appreciated.

instead use tr// or s//

Felix Smith said:
function I've been using:

sub html_to_ascii {
    use HTML::TreeBuilder;
    use HTML::FormatText;
    $document = $_[0];
    $html = HTML::TreeBuilder->new();
    $html->parse($document);
    $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
    $return = $formatter->format($html);
    return $return;
}

that's a little slower than what I mentioned earlier...
 
Uri Guttman

R> instead use tr// or s//

ok, explain how you can remove any html with tr///?

and then explain how you can accurately remove html with s///? did you
read the FAQ on this? NOT!

Felix said:
sub html_to_ascii {
    use HTML::TreeBuilder;
    use HTML::FormatText;
    $document = $_[0];
    $html = HTML::TreeBuilder->new();
    $html->parse($document);
    $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
    $return = $formatter->format($html);
    return $return;
}

R> that's a little slower than what I mentioned earlier...

and a whole lot more accurate. which is better, wrong and fast or slow
and accurate. remember, your entire programming career is depending on
your answer. think hard. then rethink what you answered above.

uri
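
To make the "wrong and fast" point concrete, here is a small sketch of
the kind of input that trips up a simple substitution. It is an
illustration only, not code from the thread: the helper name
naive_strip and the sample markup are assumptions. A '>' is legal
inside a quoted attribute value, so a pattern that stops at the first
'>' cuts the tag short.

```perl
use strict;
use warnings;

# Naive approach: delete everything from '<' up to the next '>'.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

# The '>' inside the alt attribute ends the match too early, so the
# rest of the tag is left behind in the output.
my $html = '<img alt="2 > 1" src="x.png">caption';
print naive_strip($html), "\n";
```

Here the substitution removes only '<img alt="2 >' and leaves
' 1" src="x.png">caption' behind, which is the kind of failure a real
HTML parser avoids.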
 
Jürgen Exner

Robin said:
instead use tr// or s//

How come it doesn't surprise me that such idiotic advice is coming from
you?

No, s// is absolutely not the right tool to parse/deal with HTML.

And suggesting tr// is just plain ridiculous. Please show me the code to
remove all HTML tags from a text but links using tr and I will send you a
$100 gift certificate for Barnes & Noble, so that you can buy yourself
some nice Perl books.

jue
 
Uri Guttman

JE> How come it doesn't surprise me that such idiotic advice is coming from
JE> you?

JE> And suggesting tr// is just plain ridiculous. Please show me the code to
JE> remove all HTML tags from a text but links using tr and I will send you a
JE> $100 gift certificate for Barnes & Noble, so that you can buy yourself
JE> some nice Perl books.

i will donate to that one. not a great risk :)

maybe like this:

<very rough pseudo code>

while ( $i < length $html ) {
    $char = substr( $html, $i, 1 ) ;

    if ( $char =~ tr/<>// ) {

        $DEITY knows what code
    }
    else {

        $DEITY knows what state
    }
}

ain't tr useful!

:)

uri
 
Tassilo v. Parseval

Thus spoke Uri Guttman:
R> instead use tr// or s//

ok, explain how you can remove any html with tr///?

With a state-machine of course. Tss, Uri, don't you know anything?

Tassilo
 
Uri Guttman

TvP> Thus spoke Uri Guttman:
TvP> With a state-machine of course. Tss, Uri, don't you know anything?

see my other post :)

uri
 
