regex for stripping HTML

Michael Vilain

Originally, I was using

$value =~ s/<.*>//g;

to strip HTML tags from a variable. It actually stripped everything
from the first "<" to the last ">" after the ending tag. I found this
regex in this group:

$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

- what's the difference between using ".*" to match any string and "+"
to match a repeat of the character class "[^\<]".

Just trying to deepen my understanding of regex. It's like whitewash --
it gets more opaque with multiple coats.

TIA,

/MeV/
 
Koncept

Hello. This is from the Terminal Query:

$ perldoc -q html

Here's one "simple-minded" approach, that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml
program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz .
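
For anyone who wants to try the FAQ pattern on a variable rather than on a whole file, here is a minimal sketch. The sample string in $html is made up for illustration. The pattern is the same one quoted above, only spread out with the /x flag so each piece can carry a comment.

```perl
use strict;
use warnings;

# Hypothetical sample input; note the ">" inside the quoted attribute.
my $html = qq{<p class="a>b">Hello, <b>world</b>!</p>\n};

(my $text = $html) =~ s{
    <                      # opening angle bracket
    (?:
        [^>'"]*            # a run of characters that are neither a closing bracket nor a quote
      |
        (['"]) .*? \1      # or a complete quoted string, with matching quote characters
    )*
    >                      # closing angle bracket
}{}gsx;

print $text;               # prints "Hello, world!" followed by a newline
```

The quoted-string alternative is what lets the match skip over a ">" that sits inside an attribute value. It is still the "simple-minded" approach the FAQ describes, not a real HTML parser.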
 
Eric J. Roode

Michael Vilain said:
Just trying to deepen my understanding of regex. It's like whitewash --
it gets more opaque with multiple coats.

Nah, it's not that hard. There's a learning curve, sure, but you'll get
to the top of it in time.

First, you are correct about the "<" -- no need to escape it; whoever did
it wasn't thinking.
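
A quick way to check this for yourself is sketched below. The sample string is made up. The first two substitutions are the pattern from the original post, with and without the backslashes. The third is an assumption about the only situation where escaping would matter, namely when "<" and ">" are themselves used as the delimiters of s///.

```perl
use strict;
use warnings;

my $s = '<b>bold</b>';   # hypothetical sample input

# With "/" as the delimiter, "<" and ">" are ordinary characters,
# so the escaped and unescaped patterns behave identically.
(my $escaped   = $s) =~ s/\<[^\<]+\>//g;
(my $unescaped = $s) =~ s/<[^<]+>//g;
print "same result\n" if $escaped eq $unescaped;   # both are "bold"

# Only when angle brackets are the delimiters does a literal
# bracket inside the pattern need a backslash.
(my $bracketed = $s) =~ s<\<[^\<]+\>><>g;
print "$bracketed\n";                              # bold
```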

Second, it helps to translate the regex sub-expressions into English
(assuming English is your native tongue):

<.*> means: Match a less-than character, followed by as many
characters as possible, followed by a greater-than character.

<[^>]+> means: Match a less-than character, followed by as many non-
greater-than characters as possible, followed by a greater-than
character.

See the difference? . matches ANY character; [^>] matches only non-">"
characters.
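
To see the difference in action, here is a small sketch with a made-up input string. The third substitution uses the non-greedy ".*?", which was not part of the original question; it is included only for comparison, and unlike the character class it will not cross a newline unless /s is added.

```perl
use strict;
use warnings;

my $value = 'Some <b>bold</b> and <i>italic</i> text';   # hypothetical input

# Greedy dot: runs from the first "<" to the LAST ">" on the line.
(my $greedy = $value) =~ s/<.*>//g;

# Negated class: can never run past the nearest ">".
(my $class = $value) =~ s/<[^>]+>//g;

# Non-greedy dot: stops at the first ">" it can.
(my $lazy = $value) =~ s/<.*?>//g;

print "$greedy\n";   # Some  text
print "$class\n";    # Some bold and italic text
print "$lazy\n";     # Some bold and italic text
```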


Note that it is not possible in general to process HTML via regular
expressions (at least, not simple regexes). Consider the following
snippet of valid HTML:

<img src="foo.jpg" alt='<<<"cool!">>>' />
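
Running the two patterns from this thread against that snippet shows the problem. This is only a sketch: the variable names are made up, and the second pattern is the quote-aware one from the perldoc answer earlier in the thread.

```perl
use strict;
use warnings;

my $html = q{<img src="foo.jpg" alt='<<<"cool!">>>' />};

# The simple pattern stops at the first ">", which is inside the alt text.
(my $simple = $html) =~ s/<[^>]+>//g;
print "simple: [$simple]\n";   # simple: [>>' />]

# The quote-aware pattern skips over quoted attribute values.
(my $faq = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
print "faq:    [$faq]\n";      # faq:    []
```

Even the second pattern only covers quoted attributes; comments, CDATA sections and scripts are further cases where a plain regex can go wrong.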

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
DOV LEVENGLICK

you have to escape < because it can be used as a search delimiter

 
Anno Siegel

DOV LEVENGLICK said:
"Michael Vilain " wrote:

[DOV's top-posting re-arranged]
$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

you have to escape < because it can be used as a search delimiter

This is nonsense. What are you talking about? And don't top-post.

Anno
 
