Remove all HTML but keep <p> tags

Discussion in 'Perl Misc' started by Rob, Feb 10, 2012.

  1. Rob

    Rob Guest

    I am looking for a perl REGEX statement to remove all the HTML from a
    string except for the <p> tags. It would have to leave the <p> (and
    the </p>) tags but also longer ones such as <p style=...> etc. I
    haven't been able to find anything similar online for this.

    Can anyone help with a suitable REGEX for this? I have tried a few
    things but had no success.

    Any help would be much appreciated.

    Rob
    Rob, Feb 10, 2012
    #1

  2. J. Gleixner

    J. Gleixner Guest

    On 02/10/12 15:24, Rob wrote:
    > I am looking for a perl REGEX statement to remove all the HTML from a
    > string except for the <p> tags. It would have to leave the <p> (and
    > the </p>) tags but also longer ones such as <p style=...> etc. I
    > haven't been able to find anything similar online for this.
    >
    > Can anyone help with a suitable REGEX for this? I have tried a few
    > things but had no success.
    >
    > Any help would be much appreciated.
    >
    > Rob


    Depending on how complex the 'string' is, you probably want to
    avoid a regular expression solution and use a parser, e.g.
    HTML::Parser.

    Read the documentation and take a look at some of the examples
    in the distribution, like hstrip and htext.
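
    A minimal sketch of the parser approach suggested above, assuming the HTML::Parser module from CPAN is installed. The helper name keep_p_tags is made up for illustration. It copies text and <p> start/end tags to the output and drops every other tag; comments vanish because no comment handler is registered.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return $html with every tag removed except <p ...> and </p>.
sub keep_p_tags {
    my ($html) = @_;
    my $out = '';

    # Shared start/end handler: keep the original tag text only for <p> elements.
    my $keep_p = sub {
        my ($tagname, $text) = @_;
        $out .= $text if $tagname eq 'p';
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $keep_p, 'tagname, text' ],
        end_h       => [ $keep_p, 'tagname, text' ],
        text_h      => [ sub { $out .= $_[0] }, 'text' ],   # undecoded source text
    );

    # Assumption: script and style content is not wanted in the result.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;
    return $out;
}

print keep_p_tags('<h1>A test</h1><p style="color:red">Over &amp; out!</p><!-- <p>hidden</p> -->'), "\n";
```

    Text inside elements such as <title> is kept as well, since only the tags themselves are removed.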
    J. Gleixner, Feb 10, 2012
    #2

  3. Rob <> wrote:
    >I am looking for a perl REGEX statement to remove all the HTML from a


    Please see the FAQ and the many, many archived posts why HTML and REGEX
    is not a viable combination.

    >string except for the <p> tags. It would have to leave the <p> (and
    >the </p>) tags but also longer ones such as <p style=...> etc. I
    >haven't been able to find anything similar online for this.
    >
    >Can anyone help with a suitable REGEX for this? I have tried a few
    >things but had no success.


    That is not surprising because it cannot be done for arbitrary HTML. For
    further details please read up on the Chomsky hierarchy of languages.

    jue
    Jürgen Exner, Feb 11, 2012
    #3
  4. [As J. Gleixner has already pointed out, there are HTML parsers
    available for perl - doing this with a regexp is almost certainly not
    the best way to do this]


    On 2012-02-11 00:41, Jürgen Exner <> wrote:
    > Rob <> wrote:
    >>I am looking for a perl REGEX statement to remove all the HTML from a

    >
    > Please see the FAQ and the many, many archived posts why HTML and REGEX
    > is not a viable combination.
    >
    >>string except for the <p> tags. It would have to leave the <p> (and
    >>the </p>) tags but also longer ones such as <p style=...> etc. I
    >>haven't been able to find anything similar online for this.


    What exactly do you mean by "remove all html except <p> tags"?

    What would the result of processing the following (simple) file be?


    <html>
    <head>
    <title>
    A test
    </title>
    </head>
    <body>
    <h1> A test </h1> <h2> for Robs script </h2>
    <p>
    The quick brown fox jumps over the lazy dog.
    </p>
    <table>
    <tr>
    <td>
    <p>
    upper left
    </p>
    <p>
    lower left
    </p>
    </td>
    <td>
    <p>
    right
    </p>
    </td>
    </tr>
    </table>
    <!--
    <p>
    This is not a paragraph
    </p>
    -->
    <p>
    Over &amp; out!
    </p>
    </body>
    </html>


    >>Can anyone help with a suitable REGEX for this? I have tried a few
    >>things but had no success.


    Well, what have you tried?

    Some tips:

    * Start with a formal grammar of what you want to match.
    I usually use some form of BNF.
    * Don't try to write the whole Regexp at once. Use one Regexp
    for every production in your grammar and use variable substitution
    to build more complex regexps (there is a parallel thread about
    matching RFC5322 headers with some examples).
    * Use /x and comments.
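
    The tips above can be sketched as follows. This is only an illustration for simple, well-formed markup, not a general HTML solution: every variable name and the helper strip_except_p are invented here, and constructs such as doctype declarations, CDATA sections or script content are deliberately not handled.

```perl
use strict;
use warnings;

# One small regexp per grammar production, combined by variable interpolation.
my $name      = qr/ [A-Za-z][A-Za-z0-9:-]* /x;       # tag or attribute name
my $quoted    = qr/ "[^"]*" | '[^']*' /x;            # quoted attribute value
my $unquoted  = qr/ [^\s"'>]+ /x;                    # bare attribute value
my $attribute = qr/ $name (?: \s* = \s* (?: $quoted | $unquoted ) )? /x;
my $attrs     = qr/ (?: \s+ $attribute )* \s* \/? /x;
my $p_tag     = qr/ < \/? p $attrs > /xi;            # <p>, <p style=...>, </p>
my $any_tag   = qr/ < \/? $name $attrs > /x;         # any other start or end tag
my $comment   = qr/ <!-- .*? --> /xs;                # comment, may span lines

# Hypothetical helper: delete comments and tags, but put <p> tags back.
sub strip_except_p {
    my ($html) = @_;
    $html =~ s{ $comment | ($p_tag) | $any_tag }
              { defined $1 ? $1 : '' }gex;
    return $html;
}

print strip_except_p(qq{<h1>A test</h1>\n<p style="x">fox</p>\n});
```

    The order of the alternatives matters: comments are tried first so that a <p> inside a comment is removed together with the comment, and $p_tag is tried before $any_tag so that paragraph tags are captured and kept.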


    > That is not surprising because it cannot be done for arbitrary HTML. For
    > further details please read up on the Chomsky hierarchy of languages.


    Care to explain how the difference between regular and context-free
    grammars is relevant to the task at hand? And you know of course that
    Perl regexps are a superset of regular expressions, so that even if the
    task is impossible with a regular expression, it may still be possible
    with a regexp (has anyone tried to prove that regexps are/are not
    equivalent to context-free grammars lately?).
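
    The claim that Perl regexps go beyond regular expressions can be illustrated with recursive subpatterns (available since Perl 5.10). The sketch below matches balanced parentheses, a textbook example of a language that is not regular; the names $balanced and is_balanced are invented for this example.

```perl
use strict;
use warnings;

# (?1) re-enters capture group 1, so the pattern describes arbitrarily
# deep nesting, which no regular expression in the formal sense can do.
my $balanced = qr/
    ^
    (                   # group 1: one balanced parenthesised unit
        \(
        (?: [^()]++     # a run of characters that are not parentheses
          | (?1)        # or a nested balanced unit
        )*
        \)
    )
    $
/x;

# Hypothetical helper: 1 if the whole string is one balanced unit, else 0.
sub is_balanced { return $_[0] =~ $balanced ? 1 : 0 }

print is_balanced('(a(b)c)'), is_balanced('(a(b c'), "\n";
```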

    hp


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
    Peter J. Holzer, Feb 11, 2012
    #4
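  5. # try this
    use strict;
    use warnings;

    # Slurp the whole __DATA__ section into one string.
    my $htm = do { local $/; <DATA> };

    # /s lets . match newlines, so paragraphs spanning several lines are found;
    # \b stops <p from also matching tags such as <pre>.
    while ( $htm =~ /<p\b[^>]*>(.*?)<\/p>/gis ) {
        print "*$1*\n";
    }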

    __DATA__

    <p>Earth</p> blah1 <p style=...>Sun</p> blah1
    <p style=...>Moon</p> blah2 <p>
    Venus
    </p><p style=...>Hermes</p>blah2<p>
    Jupiter</p>
    George Mpouras, Feb 14, 2012
    #5