FAQ 9.4 How do I remove HTML from a string?

Discussion in 'Perl Misc' started by PerlFAQ Server, Apr 10, 2011.

    This is an excerpt from the latest version of perlfaq9.pod, which
    comes with the standard Perl distribution. These postings aim to
    reduce the number of repeated questions as well as allow the community
    to review and update the answers. The latest version of the complete
    perlfaq is at http://faq.perl.org .

    --------------------------------------------------------------------

    9.4: How do I remove HTML from a string?

    The most correct way (albeit not the fastest) is to use "HTML::Parser"
    from CPAN. Another mostly correct way is to use "HTML::FormatText" which
    not only removes HTML but also attempts to do a little simple formatting
    of the resulting plain text.
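
    As a rough illustration of the parser-based route, here is a minimal
    sketch. It assumes HTML::Parser is installed from CPAN; the helper name
    strip_html is made up for this example and is not part of the module.
    It collects only text events (with entities decoded) and skips the
    contents of script and style elements.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# strip_html is a hypothetical helper name, not part of HTML::Parser.
sub strip_html {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the callback text with entities already decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    # Suppress <script> and <style> elements along with their content.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;    # flush any text the parser is still holding back

    return $text;
}

print strip_html(qq{<p>1 &lt; 2 <img src="foo.gif" alt="A > B"> ok</p>\n});
```

    Comments never reach the text handler, so nothing extra is needed to
    drop them. Whitespace is kept exactly as it appears in the source; tidy
    it up afterwards if that matters for your output.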

    Many folks attempt a simple-minded regular expression approach, like
    "s/<.*?>//g", but that fails in many cases because the tags may continue
    over line breaks, they may contain quoted angle-brackets, or HTML
    comments may be present. Plus, folks forget to convert entities such
    as "&lt;".
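
    To make those failure modes concrete, the following sketch runs the
    naive substitution over three small inputs. The helper name naive_strip
    and the sample strings are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# naive_strip is a hypothetical helper wrapping the simple-minded regex.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

my @samples = (
    qq{<img src="foo.gif" alt="A > B">text},    # quoted angle bracket
    qq{<img src="foo.gif"\nalt="A B">text},     # tag spans a line break
    qq{1 &lt; 2},                               # entity left unconverted
);

# Brackets around the output make leftover whitespace visible.
print "[", naive_strip($_), "]\n" for @samples;
```

    The first sample is cut off at the ">" inside the ALT value, leaving
    ' B">text' behind. The second comes back untouched because "." does not
    match a newline without the /s modifier, and the third still contains
    "&lt;".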

    Here's one "simple-minded" approach that works for most files:

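    #!/usr/bin/perl -p0777
    # -p loops over the input and prints it; -0777 slurps each file whole,
    # so tags that span line breaks are matched too.
    s/<(?:[^>'"]*|(['"]).*?\g1)*>//gs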
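
    If you need the same substitution inside a larger program rather than
    as a filter, it can be wrapped in a subroutine. This is only a sketch:
    the name strip_tags is made up here, and the "\g1" backreference syntax
    needs Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_tags is a hypothetical helper name; the regex is the one shown above.
sub strip_tags {
    my ($html) = @_;
    # Inside a tag, consume either runs of ordinary characters or a whole
    # quoted string, so a ">" inside quotes does not end the tag early.
    $html =~ s/<(?:[^>'"]*|(['"]).*?\g1)*>//gs;
    return $html;
}

print strip_tags(qq{<IMG SRC = "foo.gif"\nALT = "A > B">text}), "\n";
```

    Like the filter version, this removes tags only. Entities are left as
    they are, and comments and script content are not treated specially.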

    If you want a more complete solution, see the 3-stage striphtml program
    in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz .

    Here are some tricky cases that you should think about when picking a
    solution:

    <IMG SRC = "foo.gif" ALT = "A > B">

    <IMG SRC = "foo.gif"
    ALT = "A > B">

    <!-- <A comment> -->

    <script>if (a<b && a>c)</script>

    <# Just data #>

    <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

    If HTML comments include other tags, those solutions would also break on
    text like this:

    <!-- This section commented out.
    <B>You can't see me!</B>
    -->
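
    One way around that particular case is to remove comments in a separate
    pass before touching the tags. The sketch below is not the striphtml
    program mentioned earlier, just a two-pass illustration with a made-up
    helper name. It assumes a comment always ends at the first "-->", and
    it still ignores script content and entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_comments_then_tags is a hypothetical helper name for this sketch.
sub strip_comments_then_tags {
    my ($html) = @_;

    # Pass 1: drop comments, including any tags they contain.
    $html =~ s/<!--.*?-->//gs;

    # Pass 2: drop the remaining tags, allowing quoted ">" inside them.
    $html =~ s/<(?:[^>'"]*|(['"]).*?\g1)*>//gs;

    return $html;
}

my $html = "before<!-- This section commented out.\n"
         . "<B>You can't see me!</B>\n"
         . "-->after";

print strip_comments_then_tags($html), "\n";
```

    Doing the passes in the other order would strip the <B> tags first and
    leave the commented-out text visible in the result.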



    --------------------------------------------------------------------

    The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
    are not necessarily experts in every domain where Perl might show up,
    so please include as much information as possible and relevant in any
    corrections. The perlfaq-workers also don't have access to every
    operating system or platform, so please include relevant details for
    corrections to examples that do not work on particular platforms.
    Working code is greatly appreciated.

    If you'd like to help maintain the perlfaq, see the details in
    perlfaq.pod.
     