regex for stripping HTML

Discussion in 'Perl' started by Michael Vilain, Oct 28, 2003.

  1. Originally, I was using

    $value =~ s/<.*>//g;

    to strip HTML tags from a variable. It actually stripped everything
    from the first "<" to the last ">" after the ending tag. I found this
    regex in this group:

    $value =~ s/\<[^\<]+\>//g;

    and I'm trying to parse it out and figure out why it works. First off,
    some questions:

    - why escape the "<"? It's not one of the meta characters that has
    special meaning in a regex.

    - what's the difference between using ".*" to match any string and "+"
    to match a repeat of the character class "[^\<]".

    Just trying to deepen my understanding of regex. It's like whitewash --
    it gets more opaque with multiple coats.

    TIA,

    /MeV/

    --
    DeeDee, don't press that button! DeeDee! NO! Dee...
    Michael Vilain, Oct 28, 2003
    #1

  2. Koncept Guest

    Michael Vilain wrote:

    > Originally, I was using
    >
    > $value =~ s/<.*>//g;
    >
    > to strip HTML tags from a variable. It actually stripped everything
    > from the first "<" to the last ">" after the ending tag. I found this
    > regex in this group:
    >
    > $value =~ s/\<[^\<]+\>//g;
    >
    > and I'm trying to parse it out and figure out why it works. First off,
    > some questions:
    >
    > - why escape the "<"? It's not one of the meta characters that has
    > special meaning in a regex.
    >
    > - what's the difference between using ".*" to match any string and "+"
    > to match a repeat of the character class "[^\<]".
    >
    > Just trying to deepen my understanding of regex. It's like whitewash --
    > it gets more opaque with multiple coats.
    >
    > TIA,
    >
    > /MeV/


    Hello. This is from the Terminal Query:

    $ perldoc -q html

    --
    Koncept
    "Contrary to popular belief, the most dangerous animal is not the lion or
    tiger or even the elephant. The most dangerous animal is a shark riding
    on an elephant, just trampling and eating everything they see." - Jack Handey
    Koncept, Oct 28, 2003
    #2

  3. Eric J. Roode

    Michael Vilain wrote:

    > Originally, I was using
    >
    > $value =~ s/<.*>//g;
    >
    > to strip HTML tags from a variable. It actually stripped everything
    > from the first "<" to the last ">" after the ending tag. I found this
    > regex in this group:
    >
    > $value =~ s/\<[^\<]+\>//g;
    >
    > and I'm trying to parse it out and figure out why it works. First off,
    > some questions:
    >
    > - why escape the "<"? It's not one of the meta characters that has
    > special meaning in a regex.
    >
    > - what's the difference between using ".*" to match any string and "+"
    > to match a repeat of the character class "[^\<]".
    >
    > Just trying to deepen my understanding of regex. It's like whitewash --
    > it gets more opaque with multiple coats.


    Nah, it's not that hard. There's a learning curve, sure, but you'll get
    to the top of it in time.

    First, you are correct about the "<" -- no need to escape it; whoever did
    it wasn't thinking.
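
    A quick way to check this, as a minimal sketch (the sample string and
    variable names are made up for illustration): with "/" as the delimiter,
    the escaped and unescaped forms of the pattern should behave identically.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the original post.
my $html = 'a <b>bold</b> word';

# The same substitution, once with "<" and ">" escaped and once without.
(my $escaped   = $html) =~ s/\<[^\<]+\>//g;
(my $unescaped = $html) =~ s/<[^<]+>//g;

print "escaped:   $escaped\n";
print "unescaped: $unescaped\n";
print "same result\n" if $escaped eq $unescaped;
```

    In Perl, a backslash in front of a non-alphanumeric character just makes
    it literal, so the extra backslashes here are harmless noise.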

    Second, it helps to translate the regex sub-expressions into English
    (assuming English is your native tongue):

    <.*> means: Match a less-than character, followed by as many
    characters as possible, followed by a greater-than character.

    <[^>]+> means: Match a less-than character, followed by as many
    non-greater-than characters as possible, followed by a greater-than
    character.

    See the difference? . matches ANY character; [^>] matches only non-">"
    characters.
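
    To see the two patterns side by side, here is a small sketch; the input
    string is a made-up example, not something from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample input for illustration.
my $html = '<p>Hello, <b>world</b>!</p>';

# Greedy ".*" runs to the LAST ">" in the string, so everything is removed.
(my $greedy = $html) =~ s/<.*>//g;

# "[^>]+" cannot cross a ">", so each tag is matched on its own.
(my $negated = $html) =~ s/<[^>]+>//g;

print "greedy:  '$greedy'\n";    # prints: greedy:  ''
print "negated: '$negated'\n";   # prints: negated: 'Hello, world!'
```

    The first substitution swallows the text along with the tags, which is
    the behaviour described in the original question; the second leaves the
    text between the tags intact.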


    Note that it is not possible in general to process HTML via regular
    expressions (at least, not simple regexes). Consider the following
    snippet of valid HTML:

    <img src="foo.jpg" alt='<<<"cool!">>>' />
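
    As a rough illustration of the problem, the sketch below runs the simple
    tag-stripping substitution over that snippet. The trailing " caption"
    text and the variable names are assumptions added for the example.

```perl
use strict;
use warnings;

# The tag from the post above, followed by some hypothetical text.
my $html = q{<img src="foo.jpg" alt='<<<"cool!">>>' /> caption};

# The match stops at the first ">" inside the alt attribute,
# so only part of the tag is removed.
(my $stripped = $html) =~ s/<[^>]+>//g;

print "'$stripped'\n";   # prints: '>>' /> caption'
```

    The leftover ">>' />" is the tail of the tag that the regex never saw as
    part of it. A real HTML parser from CPAN, such as HTML::Parser, is
    probably the safer route for input like this.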

    --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

    Eric J. Roode, Oct 29, 2003
    #3
  4. you have to escape < because it can be used as a search delimiter

    "Michael Vilain " wrote:

    >Originally, I was using
    >
    > $value =~ s/<.*>//g;
    >
    >to strip HTML tags from a variable. It actually stripped everything
    >from the first "<" to the last ">" after the ending tag. I found this
    >regex in this group:
    >
    > $value =~ s/\<[^\<]+\>//g;
    >
    >and I'm trying to parse it out and figure out why it works. First off,
    >some questions:
    >
    >- why escape the "<"? It's not one of the meta characters that has
    >special meaning in a regex.
    >
    >- what's the difference between using ".*" to match any string and "+"
    >to match a repeat of the character class "[^\<]".
    >
    >Just trying to deepen my understanding of regex. It's like whitewash --
    >it gets more opaque with multiple coats.
    >
    >TIA,
    >
    >/MeV/


    --
    Regards,
    Dov Levenglick
    DOV LEVENGLICK, Oct 30, 2003
    #4
  5. Anno Siegel Guest

    DOV LEVENGLICK wrote in comp.lang.perl.misc:
    > "Michael Vilain " wrote:


    [DOV's top-posting re-arranged]

    > > $value =~ s/\<[^\<]+\>//g;
    > >
    > >and I'm trying to parse it out and figure out why it works. First off,
    > >some questions:
    > >
    > >- why escape the "<"? It's not one of the meta characters that has
    > >special meaning in a regex.

    >
    > you have to escape < because it can be used as a search delimiter


    This is nonsense. What are you talking about? And don't top-post.

    Anno
    Anno Siegel, Oct 30, 2003
    #5