regex for stripping HTML

Discussion in 'Perl Misc' started by Michael Vilain, Oct 28, 2003.

  1. Originally, I was using

    $value =~ s/<.*>//g;

    to strip HTML tags from a variable. It actually stripped everything
    from the first "<" to the last ">" after the ending tag. I found this
    regex in this group:

    $value =~ s/\<[^\<]+\>//g;

    and I'm trying to parse it out and figure out why it works. First off,
    some questions:

    - why escape the "<"? It's not one of the meta characters that has
    special meaning in a regex.

    - what's the difference between using ".*" to match any string and "+"
    to match a repeat of the character class "[^\<]".

    Just trying to deepen my understanding of regex. It's like whitewash --
    it gets more opaque with multiple coats.

    TIA,

    /MeV/

    --
    DeeDee, don't press that button! DeeDee! NO! Dee...
     
    Michael Vilain, Oct 28, 2003
    #1

  2. [not sent to the defunct newsgroup comp.lang.perl]

    "Michael Vilain " wrote:
    > I found this regex in this group:
    >
    > $value =~ s/\<[^\<]+\>//g;


    Then you had some bad luck. ;-)

    This makes sense under certain conditions:

    $value =~ s/<[^>]*>//g;

    But normally you are recommended to use a module instead for parsing
    HTML markup.
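    As a minimal sketch of the s/<[^>]*>//g substitution quoted above: the
    sample string is made up for illustration, and "certain conditions" is
    assumed here to mean simple markup with no ">" inside attribute values,
    comments, or scripts.

```perl
use strict;
use warnings;

# Hypothetical sample input: simple, well-behaved markup only.
my $value = '<p>Hello, <b>world</b>!</p>';

# Remove "<", then any run of characters other than ">", then ">".
(my $stripped = $value) =~ s/<[^>]*>//g;

print "$stripped\n";   # Hello, world!
```

    The copy-then-substitute idiom keeps the original $value intact, which
    is handy when comparing patterns.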

    > - why escape the "<"? It's not one of the meta characters that has
    > special meaning in a regex.


    You are correct.

    > - what's the difference between using ".*" to match any string and
    > "+" to match a repeat of the character class "[^\<]".


    Please study the Perl documentation for regular expressions, for instance:

    http://www.perldoc.com/perl5.8.0/pod/perlretut.html

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Oct 28, 2003
    #2

  3. Matija Papec

    X-Ftn-To: Michael Vilain <>

    "Michael Vilain <>" wrote:
    >Originally, I was using
    >
    > $value =~ s/<.*>//g;
    >
    >to strip HTML tags from a variable. It actually stripped everything
    >from the first "<" to the last ">" after the ending tag. I found this
    >regex in this group:
    >
    > $value =~ s/\<[^\<]+\>//g;
    >
    >and I'm trying to parse it out and figure out why it works. First off,
    >some questions:
    >
    >- why escape the "<"? It's not one of the meta characters that has
    >special meaning in a regex.
    >
    >- what's the difference between using ".*" to match any string and "+"
    >to match a repeat of the character class "[^\<]".


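    /<.*>/g matches everything between the first "<" and the last ">", because
    "*" is greedy. Put a "?" after the "*" (/<.*?>/g) to make it non-greedy.

    /<[^<]+>/g matches a "<", then one or more characters other than "<", then
    a ">", so a match can never run across the start of the next tag.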
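
    A small sketch comparing the three patterns side by side may help. The
    sample strings are invented for illustration; the second one shows a
    case where the negated class and the non-greedy form disagree.

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i>';

(my $greedy = $html) =~ s/<.*>//g;     # one match, first "<" to last ">"
(my $lazy   = $html) =~ s/<.*?>//g;    # each match stops at the nearest ">"
(my $class  = $html) =~ s/<[^<]+>//g;  # a match cannot cross another "<"

print "greedy: '$greedy'\n";  # greedy: ''
print "lazy:   '$lazy'\n";    # lazy:   'bold and italic'
print "class:  '$class'\n";   # class:  'bold and italic'

# [^<] still matches ">", so a stray ">" in the text is swallowed.
my $tricky = '<b>x > y</b>';
(my $lazy2  = $tricky) =~ s/<.*?>//g;
(my $class2 = $tricky) =~ s/<[^<]+>//g;

print "lazy2:  '$lazy2'\n";   # lazy2:  'x > y'
print "class2: '$class2'\n";  # class2: ' y'
```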



    --
    Matija
     
    Matija Papec, Oct 28, 2003
    #3
  4. Ben Morrow

    Gunnar Hjalmarsson <> wrote:
    > But normally you are recommended to use a module instead for parsing
    > HTML markup.


    Or, say, read perldoc -q HTML :).

    Ben

    --
    "The Earth is degenerating these days. Bribery and corruption abound.
    Children no longer mind their parents, every man wants to write a book,
    and it is evident that the end of the world is fast approaching."
    -Assyrian stone tablet, c.2800 BC
     
    Ben Morrow, Oct 28, 2003
    #4
  5. Koncept

    In article <>,
    Michael Vilain <> wrote:

    > Originally, I was using
    >
    > $value =~ s/<.*>//g;
    >
    > to strip HTML tags from a variable. It actually stripped everything
    > from the first "<" to the last ">" after the ending tag. I found this
    > regex in this group:
    >
    > $value =~ s/\<[^\<]+\>//g;
    >
    > and I'm trying to parse it out and figure out why it works. First off,
    > some questions:
    >
    > - why escape the "<"? It's not one of the meta characters that has
    > special meaning in a regex.
    >
    > - what's the difference between using ".*" to match any string and "+"
    > to match a repeat of the character class "[^\<]".
    >
    > Just trying to deepen my understanding of regex. It's like whitewash --
    > it gets more opaque with multiple coats.
    >
    > TIA,
    >
    > /MeV/


    Hello. Try this query at the terminal:

    $ perldoc -q html

    --
    Koncept <<
    "Contrary to popular belief, the most dangerous animal is not the lion or
    tiger or even the elephant. The most dangerous animal is a shark riding
    on an elephant, just trampling and eating everything they see." - Jack Handey
     
    Koncept, Oct 28, 2003
    #5
  6. -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    "Michael Vilain <>" wrote in news:vilain-
    :

    > Originally, I was using
    >
    > $value =~ s/<.*>//g;
    >
    > to strip HTML tags from a variable. It actually stripped everything
    > from the first "<" to the last ">" after the ending tag. I found this
    > regex in this group:
    >
    > $value =~ s/\<[^\<]+\>//g;
    >
    > and I'm trying to parse it out and figure out why it works. First off,
    > some questions:
    >
    > - why escape the "<"? It's not one of the meta characters that has
    > special meaning in a regex.
    >
    > - what's the difference between using ".*" to match any string and "+"
    > to match a repeat of the character class "[^\<]".
    >
    > Just trying to deepen my understanding of regex. It's like whitewash --
    > it gets more opaque with multiple coats.


    Nah, it's not that hard. There's a learning curve, sure, but you'll get
    to the top of it in time.

    First, you are correct about the "<" -- no need to escape it; whoever did
    it wasn't thinking.
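
    A quick way to check this, as a sketch with a made-up sample string:
    run the escaped and unescaped forms of the pattern on the same input
    and compare the results.

```perl
use strict;
use warnings;

my $html = '<em>plain</em> text';

(my $escaped   = $html) =~ s/\<[^\<]+\>//g;  # pattern as found in the group
(my $unescaped = $html) =~ s/<[^<]+>//g;     # same pattern, no backslashes

print "escaped:   $escaped\n";
print "unescaped: $unescaped\n";
print "identical\n" if $escaped eq $unescaped;
```

    In Perl a backslash before a punctuation character just makes it
    literal, so the two forms behave the same here.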

    Second, it helps to translate the regex sub-expressions into English
    (assuming English is your native tongue):

    <.*> means: Match a less-than character, followed by as many
    characters as possible, followed by a greater-than character.

    <[^>]+> means: Match a less-than character, followed by as many non-
    greater-than characters as possible, followed by a greater-than
    character.

    See the difference? . matches ANY character; [^>] matches only non-">"
    characters.


    Note that it is not possible in general to process HTML via regular
    expressions (at least, not simple regexes). Consider the following
    snippet of valid HTML:

    <img src="foo.jpg" alt='<<<"cool!">>>' />
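
    To see what goes wrong, here is a sketch that runs the simple
    negated-class substitution (s/<[^>]*>//g, the variant posted earlier in
    the thread) over that snippet. The variable names are illustrative only.

```perl
use strict;
use warnings;

# The snippet from the post: the alt attribute contains "<" and ">".
my $html = q{<img src="foo.jpg" alt='<<<"cool!">>>' />};

# The match ends at the first ">", which is inside the attribute value.
(my $stripped = $html) =~ s/<[^>]*>//g;

print "$stripped\n";   # >>' />
```

    The tail of the tag is left behind as text, because the pattern has no
    notion of quoted attribute values.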

    --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

     
    Eric J. Roode, Oct 29, 2003
    #6
  7. you have to escape < because it can be used as a search delimiter

    "Michael Vilain " wrote:

    >Originally, I was using
    >
    > $value =~ s/<.*>//g;
    >
    >to strip HTML tags from a variable. It actually stripped everything
    >from the first "<" to the last ">" after the ending tag. I found this
    >regex in this group:
    >
    > $value =~ s/\<[^\<]+\>//g;
    >
    >and I'm trying to parse it out and figure out why it works. First off,
    >some questions:
    >
    >- why escape the "<"? It's not one of the meta characters that has
    >special meaning in a regex.
    >
    >- what's the difference between using ".*" to match any string and "+"
    >to match a repeat of the character class "[^\<]".
    >
    >Just trying to deepen my understanding of regex. It's like whitewash --
    >it gets more opaque with multiple coats.
    >
    >TIA,
    >
    >/MeV/
    >
    >
    >


    --
    Regards,
    Dov Levenglick
     
    DOV LEVENGLICK, Oct 30, 2003
    #7
  8. Anno Siegel

    DOV LEVENGLICK <> wrote in comp.lang.perl.misc:
    > "Michael Vilain " wrote:


    [DOV's top-posting re-arranged]

    > > $value =~ s/\<[^\<]+\>//g;
    > >
    > >and I'm trying to parse it out and figure out why it works. First off,
    > >some questions:
    > >
    > >- why escape the "<"? It's not one of the meta characters that has
    > >special meaning in a regex.

    >
    > you have to escape < because it can be used as a search delimiter


    This is nonsense. What are you talking about? And don't top-post.

    Anno
     
    Anno Siegel, Oct 30, 2003
    #8
  9. On Thu, 30 Oct 2003, DOV LEVENGLICK ...

    Bogosity alerts:

    1:
    Content-Type: multipart/alternative;
    boundary="------------030500060107020504030609"

    2: TOFU-posting

    3: cross-posted without further comment to a dead newsgroup
    comp.lang.perl

    and need I mention the SHOUTED PERSONAL NAME?

    > ... wrote:
    >
    > you have to escape < because it can be used as a search delimiter


    Well, Q.E.D.

    I suppose it's wasted effort to suggest you might get a grasp on your
    material and the conventions of your chosen forum -before- stepping up
    to the plate to offer answers to technical questions?

    If you had been _asking_ a question, then such behaviour *might*
    just be a tad[1] more excusable.

    [1] No pun intended.
     
    Alan J. Flavell, Oct 30, 2003
    #9