strip all html but links

Discussion in 'Perl Misc' started by Felix Smith, Jan 11, 2004.

  1. Felix Smith

    Felix Smith Guest

    How would you go about removing all html tags from a Web page's source
    code, except for links ? I've been successfully using the function
    below to get rid of *all* html tags. But I need to keep links. Any
    code you can post to help will be much appreciated.

    Felix.

    function I've been using:

    sub html_to_ascii {
        use HTML::TreeBuilder;
        use HTML::FormatText;
        my $document = $_[0];
        my $html = HTML::TreeBuilder->new();
        $html->parse($document);
        $html->eof;    # tell the parser the input is complete
        my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
        my $return = $formatter->format($html);
        $html->delete; # free the parse tree
        return $return;
    }
     
    Felix Smith, Jan 11, 2004
    #1

  2. (Felix Smith) wrote in news:901f024b.0401101704.51858e29@posting.google.com:

    > How would you go about removing all html tags from a Web page's source
    > code, except for links?


    See the hanchors example that comes with the HTML::Parser module:

    http://search.cpan.org/src/GAAS/HTML-Parser-3.35/eg/
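
For readers who want to see the handler-based style of HTML::Parser applied to this question, here is a minimal sketch. It is not the hanchors script itself; the function name `strip_but_links` is made up for illustration, and the sketch assumes the page is already held in a string. Attribute values are decoded by the parser and are not re-escaped here, so treat the output as a rough approximation.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: drop every tag except <a href="...">...</a>.
sub strip_but_links {
    my ($html) = @_;
    my $out     = '';
    my $in_link = 0;    # true while inside an <a href=...> we kept

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Copy text through exactly as it appears in the source.
        text_h => [ sub { $out .= $_[0] }, 'text' ],
        # Re-emit opening <a> tags that carry an href; ignore all others.
        start_h => [
            sub {
                my ( $tag, $attr ) = @_;
                if ( $tag eq 'a' && defined $attr->{href} ) {
                    $out .= qq{<a href="$attr->{href}">};
                    $in_link = 1;
                }
            },
            'tagname, attr',
        ],
        # Close a link only if we opened one.
        end_h => [
            sub {
                if ( $_[0] eq 'a' && $in_link ) {
                    $out .= '</a>';
                    $in_link = 0;
                }
            },
            'tagname',
        ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

print strip_but_links(
    '<p>See <a href="http://example.com/">this <b>page</b></a>.</p>'
), "\n";
```

The sample call should print the sentence with only the anchor markup left in place.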


    --
    A. Sinan Unur
    (reverse each component for email address)
     
    A. Sinan Unur, Jan 11, 2004
    #2

  3. dominix

    dominix Guest

    Felix Smith wrote:
    > How would you go about removing all html tags from a Web page's source
    > code, except for links ? I've been successfully using the function
    > below to get rid of *all* html tags. But I need to keep links. Any
    > code you can post to help will be much appreciated.
    >
    > Felix.
    >
    > function I've been using:
    >
    > sub html_to_ascii {
    > use HTML::TreeBuilder;
    > use HTML::FormatText;
    > $document = $_[0];
    > $html = HTML::TreeBuilder->new();
    > $html->parse($document);
    > $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
    > $return = $formatter->format($html);
    > return $return;
    > }



    use strict;
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( shift );

    while ( my $token = $p->get_token ) {
        print $token->as_is if $token->is_text;
        print $token->return_attr->{"href"} if $token->is_start_tag( 'a' );
    }

    --
    dominix
     
    dominix, Jan 11, 2004
    #3
  4. Felix

    Felix Guest

    Thanks so much for helping with this. Can you tell me how to change
    the code below so I can use it via a function called, say,
    remove_tags, like this:

    $stripped_content = remove_tags ($content_with_tags);

    Thank you very much again!

    "dominix" <dominix@> wrote in message news:<4001015a$0$7143
    >
    > use strict;
    > use HTML::TokeParser::Simple;
    > my $p = HTML::TokeParser::Simple->new( shift );
    >
    > while ( my $token = $p->get_token ) {
    > print $token->as_is if $token->is_text;
    > print $token->return_attr->{"href"} if $token->is_start_tag( 'a' )
    > }
     
    Felix, Jan 11, 2004
    #4
  5. dominix

    dominix Guest

    Felix wrote:
    > Thanks so much for helping with this. Can you tell me how to change
    > the code below so I can use it via a function called, say,
    > remove_tags, like this:
    >
    > $stripped_content = remove_tags ($content_with_tags);
    >
    > Thank you very much again!
    >
    > "dominix" <dominix@> wrote in message
    > news:<4001015a$0$7143
    >>
    >> use strict;
    >> use HTML::TokeParser::Simple;
    >> my $p = HTML::TokeParser::Simple->new( shift );
    >>
    >> while ( my $token = $p->get_token ) {
    >> print $token->as_is if $token->is_text;
    >> print $token->return_attr->{"href"} if $token->is_start_tag(
    >> 'a' ) }


    well, try something like (untested)

    use strict;
    use HTML::TokeParser::Simple;

    sub whatever_you_want_the_name {
        my $html = shift;
        # pass a reference: a plain string would be taken as a file name
        my $p = HTML::TokeParser::Simple->new( \$html );
        my $result = '';
        while ( my $token = $p->get_token ) {
            $result .= $token->as_is if $token->is_text;
            $result .= $token->return_attr->{"href"} if $token->is_start_tag( 'a' );
        }
        return $result;
    }
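
Since the original request was to keep the links themselves, one possible variant copies the `<a>` start and end tags through verbatim instead of appending only the href value. This is a sketch under assumptions, not dominix's code: it uses the `remove_tags` name Felix asked for, assumes the HTML arrives as a string (hence the reference passed to `new`), and keeps every `<a>` tag, including ones without an href.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Keep text and <a>...</a> markup, drop every other tag.
sub remove_tags {
    my ($html) = @_;

    # A scalar reference tells the parser this is document text,
    # not the name of a file to open.
    my $p = HTML::TokeParser::Simple->new( \$html );

    my $result = '';
    while ( my $token = $p->get_token ) {
        if (   $token->is_text
            || $token->is_start_tag('a')
            || $token->is_end_tag('a') )
        {
            $result .= $token->as_is;    # original source of the token
        }
    }
    return $result;
}

my $stripped_content = remove_tags('<p>Hello <a href="x.html">there</a></p>');
print "$stripped_content\n";
```

Because `as_is` returns the token exactly as it appeared in the source, the anchor tags keep their original attributes and quoting.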
     
    dominix, Jan 11, 2004
    #5
  6. Robin

    Robin Guest

    "Felix Smith" <> wrote in message
    news:...
    > How would you go about removing all html tags from a Web page's source
    > code, except for links ? I've been successfully using the function
    > below to get rid of *all* html tags. But I need to keep links. Any
    > code you can post to help will be much appreciated.


    instead use tr// or s//

    > Felix.
    >
    > function I've been using:
    >
    > sub html_to_ascii {
    > use HTML::TreeBuilder;
    > use HTML::FormatText;
    > $document = $_[0];
    > $html = HTML::TreeBuilder->new();
    > $html->parse($document);
    > $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
    > $return = $formatter->format($html);
    > return $return;
    > }


    that's a little slower than what I mentioned earlier...


    --
    Regards,
    Robin
     
    Robin, Jan 13, 2004
    #6
  7. Uri Guttman

    Uri Guttman Guest

    >>>>> "R" == Robin <> writes:

    R> "Felix Smith" <> wrote in message
    R> news:...
    >> How would you go about removing all html tags from a Web page's source
    >> code, except for links ? I've been successfully using the function
    >> below to get rid of *all* html tags. But I need to keep links. Any
    >> code you can post to help will be much appreciated.


    R> instead use tr// or s//

    ok, explain how you can remove any html with tr///?

    and then explain how you can accurately remove html with s///? did you
    read the FAQ on this? NOT!
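
To illustrate the objection, here is a small sketch of the kind of one-line substitution people usually reach for, and one input that breaks it. The FAQ entry alluded to is presumably perlfaq9, "How do I remove HTML from a string?". The function name `naive_strip` and the sample markup are invented for this example.

```perl
use strict;
use warnings;

# Naive approach: delete everything from '<' up to the next '>'.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

# Works on simple markup...
print naive_strip('<b>bold</b>'), "\n";

# ...but a '>' inside a quoted attribute value ends the match too early,
# so the tail of the tag leaks into the output.
print naive_strip('<img alt="a > b" src="x.png">caption'), "\n";
```

The second call leaves part of the `<img>` tag in the result, which is the sort of failure a real HTML parser avoids.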

    >> sub html_to_ascii {
    >> use HTML::TreeBuilder;
    >> use HTML::FormatText;
    >> $document = $_[0];
    >> $html = HTML::TreeBuilder->new();
    >> $html->parse($document);
    >> $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
    >> $return = $formatter->format($html);
    >> return $return;
    >> }


    R> that's a little slower than what I mentioned earlier...

    and a whole lot more accurate. which is better, wrong and fast or slow
    and accurate. remember, your entire programming career is depending on
    your answer. think hard. then rethink what you answered above.

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
     
    Uri Guttman, Jan 13, 2004
    #7
  8. Robin wrote:
    > "Felix Smith" <> wrote in message
    > news:...
    >> How would you go about removing all html tags from a Web page's
    >> source code, except for links ? I've been successfully using the
    >> function below to get rid of *all* html tags. But I need to keep
    >> links. Any code you can post to help will be much appreciated.

    >
    > instead use tr// or s//


    How come it doesn't surprise me that such idiotic advice is coming from
    you?

    No, s// is absolutely not the right tool to parse/deal with HTML.

    And suggesting tr// is just plain ridiculous. Please show me the code to
    remove all HTML tags from a text but links using tr and I will send you a
    $100 gift certificate for Barnes and Noble, so that you can buy yourself
    some nice Perl books.

    jue
     
    Jürgen Exner, Jan 13, 2004
    #8
  9. Uri Guttman

    Uri Guttman Guest

    >>>>> "JE" == Jürgen Exner <> writes:

    JE> Robin wrote:
    >>
    >> instead use tr// or s//


    JE> How come it doesn't surprise me that such idiotic advice is coming from
    JE> you?

    JE> And suggesting tr// is just plain ridiculous. Please show me the code to
    JE> remove all HTML tags from a text but links using tr and I will send you a
    JE> $100 gift certificate for Barnes and Noble, so that you can buy yourself
    JE> some nice Perl books.

    i will donate to that one. not a great risk :)

    maybe like this:

    <very rough pseudo code>

    while ( $i < length $html ) {
        $char = substr( $html, $i, 1 ) ;

        if ( $char =~ tr/<>// ) {

            $DEITY knows what code
        }
        else {

            $DEITY knows what state
        }
    }

    ain't tr useful!

    :)

    uri

     
    Uri Guttman, Jan 13, 2004
    #9
  10. Also sprach Uri Guttman:

    >>>>>> "R" == Robin <> writes:

    >
    > R> "Felix Smith" <> wrote in message
    > R> news:...
    > >> How would you go about removing all html tags from a Web page's source
    > >> code, except for links ? I've been successfully using the function
    > >> below to get rid of *all* html tags. But I need to keep links. Any
    > >> code you can post to help will be much appreciated.

    >
    > R> instead use tr// or s//
    >
    > ok, explain how you can remove any html with tr///?


    With a state-machine of course. Tss, Uri, don't you know anything?

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
     
    Tassilo v. Parseval, Jan 13, 2004
    #10
  11. Uri Guttman

    Uri Guttman Guest

    >>>>> "TvP" == Tassilo v Parseval <> writes:

    TvP> Also sprach Uri Guttman:
    >>

    R> instead use tr// or s//
    >>
    >> ok, explain how you can remove any html with tr///?


    TvP> With a state-machine of course. Tss, Uri, don't you know anything?

    see my other post :)

    uri

     
    Uri Guttman, Jan 13, 2004
    #11