Regex help

Lord0 · Oct 26, 2005

Hi there,

I thought the following regex would be trivial to implement - ha ha.
Basically I have a string which contains html tags. Some of the tags I
want to keep (<em></em>, <p></p> etc), some I want to remove. I got
*this* far but obviously it doesn't work....

#! /usr/bin/perl -w

use strict;

my $string = "<p><table>This is a test</table></p>";

# would expect "<p>This is a test</p>"

# Tags to keep #
$string =~ s/<\/?(?!(p|ul|li|strong|em)).*?>//g;

print $string."\n";

I know how much you lot like regexes ;-) so any help.....

Cheers

Lord0

Paul Lalli · Oct 26, 2005

Lord0 said:
I thought the following regex would be trivial to implement - ha ha.
Basically I have a string which contains html tags.

Stop.

Regular Expressions, while a greatly powerful tool, are NOT the proper
tool to be used for HTML parsing. Please search Google Groups' archive
of this Usenet group for hundreds of previous posts on this topic.

Instead, use a module specifically designed for parsing HTML. Search
CPAN at http://search.cpan.org for modules such as HTML::Parser or
HTML::TokeParser.
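As a rough illustration of the module-based approach, here is a minimal sketch using the HTML::Parser version 3 handler interface. It assumes HTML::Parser is installed from CPAN (it is not a core module). The sub name keep_tags and the %keep hash are hypothetical names chosen for this example, and the tag list is the one from the original post.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, assumed to be installed

# Hypothetical whitelist, taken from the tags named in the original post.
my %keep = map { $_ => 1 } qw(p ul li strong em);

# keep_tags() is a hypothetical helper: it returns $html with every
# start and end tag removed unless its name is in %keep.
sub keep_tags {
    my ($html) = @_;
    my $out = '';

    # Called for both start and end tags; 'tagname' is the lowercased
    # element name without the slash, 'text' is the tag as written.
    my $tag_handler = sub {
        my ($tagname, $text) = @_;
        $out .= $text if $keep{$tagname};
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $tag_handler, 'tagname, text' ],
        end_h       => [ $tag_handler, 'tagname, text' ],
        # Everything that is not a tag (plain text, comments, and so on)
        # is copied through unchanged.
        default_h   => [ sub { $out .= shift }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;    # flush any buffered text
    return $out;
}

print keep_tags("<p><table>This is a test</table></p>"), "\n";
```

For the sample string from the original post this should print the expected "<p>This is a test</p>". Note that this sketch only drops the tags themselves; the text inside a removed element is kept, which matches what the regex in the post was trying to do.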

Some of the tags I
want to keep (<em></em>, <p></p> etc), some I want to remove. I got
*this* far but obviously it doesn't work....

"doesn't work" is the worst of all possible error descriptions. *How*
does it not work? What results do you get? Is the string not modified
at all? Is too much taken out? Not enough?

#! /usr/bin/perl -w

use strict;

my $string = "<p><table>This is a test</table></p>";

# would expect "<p>This is a test</p>"

# Tags to keep #
$string =~ s/<\/?(?!(p|ul|li|strong|em)).*?>//g;

Don't make things more difficult for yourself by using / as a delimiter
when you know you will be searching for / within the regexp itself.

I *think* your problem here is that the /? outside the look-ahead is
insufficient. When Perl gets to the final tag - </p>, it matches a <,
skips the optional /, then checks to see that the next thing is not one
of those five strings. The next thing is a /, so that check passes.
It then matches everything up to the next >, and replaces the whole
thing with nothing.
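The behaviour described above can be reproduced with a short script. This is only a sketch: strip_original is a hypothetical wrapper name, and the substitution inside it is copied unchanged from the original post.

```perl
use strict;
use warnings;

# strip_original() is a hypothetical wrapper around the substitution
# from the original post, so it can be run on any string.
sub strip_original {
    my ($string) = @_;
    $string =~ s/<\/?(?!(p|ul|li|strong|em)).*?>//g;
    return $string;
}

# At "</p>" the engine first lets /? consume the slash; the lookahead
# then sees "p" and rejects that attempt. It backtracks, lets /? match
# nothing, and now the lookahead sees "/", which is not in the list,
# so the whole closing tag is matched and deleted.
print strip_original("<p><table>This is a test</table></p>"), "\n";
# prints: <p>This is a test
```

The opening <p> survives because there is no slash for /? to give back, so the lookahead has only one position to test and it sees "p" there.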

Moving the check for the optional / to inside the lookahead:
s{<(?!/?(p|ul|li|strong|em)).*?>}{}g;
seems to do what you want, but I still recommend abandoning this
approach in favor of using the correct tool for the job - an HTML
parsing module.
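A quick check of the corrected substitution, as a sketch only. The sub names strip_fixed and strip_bounded are hypothetical. The second variant is not from the thread: it adds a \b word boundary as one possible way to stop "p" from also matching the start of longer tag names such as "pre".

```perl
use strict;
use warnings;

# The substitution exactly as suggested in the post above.
sub strip_fixed {
    my ($string) = @_;
    $string =~ s{<(?!/?(p|ul|li|strong|em)).*?>}{}g;
    return $string;
}

# Assumption for illustration: the same idea with a word boundary after
# the tag name, so that only whole names are treated as "keep".
sub strip_bounded {
    my ($string) = @_;
    $string =~ s{<(?!/?(?:p|ul|li|strong|em)\b).*?>}{}g;
    return $string;
}

print strip_fixed("<p><table>This is a test</table></p>"), "\n";
# prints: <p>This is a test</p>

print strip_fixed("<pre>code</pre>"), "\n";
# prints: <pre>code</pre>   ("p" matches the start of "pre")

print strip_bounded("<pre>code</pre>"), "\n";
# prints: code
```

Even with the boundary, this remains a pattern match on text rather than a parse of the HTML, so the recommendation to use a parsing module still applies.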

Paul Lalli

Lord0 · Oct 26, 2005

Regular Expressions, while a greatly powerful tool, are NOT the proper
tool to be used for HTML parsing. Please search Google Groups' archive
of this Usenet group for hundreds of previous posts on this topic.

Instead, use a module specifically designed for parsing HTML. Search
CPAN at http://search.cpan.org for modules such as HTML::Parser or
HTML::TokeParser.

A few points here:

1) I had considered an HTML parsing module but for the limited
requirement this seemed unnecessary.
2) I know how to search CPAN but given 1) above chose not to

"doesn't work" is the worst of all possible error descriptions.

I agree - sorry

Don't make things more difficult for yourself by using / as a delimiter
when you know you will be searching for / within the regexp itself.

Er.....it wasn't making it more difficult. And yes before you refer me
to Chapter 2 "Roll your own Quotes" in Programming Perl I have read it

Moving the check for the optional / to inside the lookahead:
s{<(?!/?(p|ul|li|strong|em)).*?>}{}g;
seems to do what you want, but I still recommend abandoning this
approach in favor of using the correct tool for the job - an HTML
parsing module.

Excellent! The answer I was looking for. I'm curious as to why I would
want to use an HTML parsing module which no doubt would use multiple
regexes when this "one liner" works?
TMTOWTDI after all....

Stephen Hildrey · Oct 26, 2005

Lord0 said:
Excellent! The answer I was looking for. I'm curious as to why I would
want to use an HTML parsing module which no doubt would use multiple
regexes when this "one liner" works?
TMTOWTDI after all....

There's more than one way to shoot yourself in the foot, too.

Though the one-liner works now, who's to say it will still work in
future when somebody changes the HTML? I love regular expressions as
much as the next Perl coder, but there are times when JWZ's words are true:

"Some people, when confronted with a problem, think 'I know, I'll use
regular expressions'. Now they have two problems."

HTML parsing is usually one of them.

Regards,
Steve

Paul Lalli · Oct 26, 2005

Lord0 said:
A few points here:

1) I had considered an HTML parsing module but for the limited
requirement this seemed unnecessary.

Sure, for the tiny piece of simplistic HTML you posted, this solution
is sufficient.

Er.....it wasn't making it more difficult. And yes before you refer me
to Chapter 2 "Roll your own Quotes" in Programming Perl I have read it

It's more difficult to read. Extraneous \/ sequences always, at least
for me, decrease readability. If it's harder to read, it's harder to
debug and harder to maintain.

Excellent! The answer I was looking for. I'm curious as to why I would
want to use an HTML parsing module which no doubt would use multiple
regexes when this "one liner" works?
TMTOWTDI after all....

Yes, and "just because there's more than one way to do it, does not
mean that all ways are equally cool" or even equally correct. As I
said above, this solution will work for simplistic HTML. But how about
we change your string to something like:
my $string = q{<p><img alt="<CLICK HERE>" />This is a test</p>};
$string =~ s{<(?!/?(p|ul|li|strong|em)).*?>}{}g;

Your regex solution will not handle this case. An HTML Parser would.
Of course, you could then modify your regex to watch out for situations
like this, but then you're going to go into a never-ending cycle of
finding new cases for which it doesn't work, and growing your regex
larger and larger and uglier. It's better to start with the right tool
for the job than it is to waste time editing the wrong tool until it
works perfectly.
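To see concretely what goes wrong with the string above, the substitution can be run as a small script. This is a sketch; strip_fixed is a hypothetical wrapper name around the regex from the thread.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-line substitution from the thread.
sub strip_fixed {
    my ($string) = @_;
    $string =~ s{<(?!/?(p|ul|li|strong|em)).*?>}{}g;
    return $string;
}

my $string = q{<p><img alt="<CLICK HERE>" />This is a test</p>};

# The non-greedy .*? stops at the first ">", which is the one inside
# the alt attribute, so only part of the <img> tag is removed.
print strip_fixed($string), "\n";
# prints: <p>" />This is a test</p>
```

The leftover `" />` is the tail of the img tag: the regex has no notion of quoted attribute values, so it cannot tell that the first ">" does not end the tag.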

Paul Lalli
