Regex help

Lord0

Hi there,

I thought the following regex would be trivial to implement - ha ha.
Basically I have a string which contains html tags. Some of the tags I
want to keep (<em></em>, <p></p> etc), some I want to remove. I got
*this* far but obviously it doesn't work....

#! /usr/bin/perl -w

use strict;

my $string = "<p><table>This is a test</table></p>";

# would expect "<p>This is a test</p>"

# Tags to keep #
$string =~ s/<\/?(?!(p|ul|li|strong|em)).*?>//g;

print $string."\n";


I know how much you lot like regexes ;-) so any help.....

Cheers

Lord0
 
Paul Lalli

Lord0 said:
I thought the following regex would be trivial to implement - ha ha.
Basically I have a string which contains html tags.

Stop.

Regular Expressions, while a greatly powerful tool, are NOT the proper
tool to be used for HTML parsing. Please search Google Groups' archive
of this Usenet group for hundreds of previous posts on this topic.

Instead, use a module specifically designed for parsing HTML. Search
CPAN at http://search.cpan.org, for modules such as HTML::Parser or
HTML::TokeParser.
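
For the string in the original post, a minimal sketch with HTML::Parser might look like the following. The helper name strip_tags and the %keep whitelist are made up for illustration. The handler setup uses the module's version 3 event API, and HTML::Parser has to be installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Tags to keep, taken from the list in the original regex.
my %keep = map { $_ => 1 } qw(p ul li strong em);

# Hypothetical helper: return $html with every tag outside %keep removed.
sub strip_tags {
    my ($html) = @_;
    my $out = '';

    # Shared handler for start and end tags: copy the tag's original
    # source text through only when the tag name is whitelisted.
    my $tag_h = sub {
        my ($tagname, $text) = @_;
        $out .= $text if $keep{$tagname};
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $tag_h, 'tagname, text' ],
        end_h       => [ $tag_h, 'tagname, text' ],
        # Text between tags is always copied through unchanged.
        text_h      => [ sub { $out .= $_[0] }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

print strip_tags('<p><table>This is a test</table></p>'), "\n";
# prints: <p>This is a test</p>
```

Comments and declarations are dropped here simply because no handler is registered for them. Whether that is what you want depends on the input.
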
Some of the tags I
want to keep (<em></em>, <p></p> etc), some I want to remove. I got
*this* far but obviously it doesn't work....

"doesn't work" is the worst of all possible error descriptions. *How*
does it not work? What results do you get? Is the string not modified
at all? Is too much taken out? Not enough?

#! /usr/bin/perl -w

use strict;

my $string = "<p><table>This is a test</table></p>";

# would expect "<p>This is a test</p>"

# Tags to keep #
$string =~ s/<\/?(?!(p|ul|li|strong|em)).*?>//g;

Don't make things more difficult for yourself by using / as a delimiter
when you know you will be searching for / within the regexp itself.

I *think* your problem here is that the /? outside the look-ahead is
insufficient. When Perl gets to the final tag - </p>, it matches a <,
skips the optional /, then checks to see that the next thing is not one
of those five strings. The next thing is a /, so that check passes.
It then matches everything up to the next >, and replaces the whole
thing with nothing.

Moving the check for the optional / to inside the lookahead:
s{<(?!/?(p|ul|li|strong|em)).*?>}{}g;
seems to do what you want, but I still recommend abandoning this
approach in favor of using the correct tool for the job - an HTML
parsing module.
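
A small sketch to see the difference between the two patterns on the string from the original post. The variable names $original and $fixed are just for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<p><table>This is a test</table></p>';

# Original pattern: the optional slash sits outside the lookahead, so for
# </p> the engine can skip the slash and test the lookahead against "/p>".
(my $original = $html) =~ s/<\/?(?!(p|ul|li|strong|em)).*?>//g;

# Suggested pattern: the optional slash is checked inside the lookahead,
# so both <p> and </p> are rejected by the negative assertion.
(my $fixed = $html) =~ s{<(?!/?(p|ul|li|strong|em)).*?>}{}g;

print "original: $original\n";   # original: <p>This is a test
print "fixed:    $fixed\n";      # fixed:    <p>This is a test</p>
```
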

Paul Lalli
 
Lord0

Regular Expressions, while a greatly powerful tool, are NOT the proper
tool to be used for HTML parsing. Please search Google Groups' archive
of this Usenet group for hundreds of previous posts on this topic.
Instead, use a module specifically designed for parsing HTML. Search
CPAN at http://search.cpan.org, for modules such as HTML::Parser or
HTML::TokeParser.

A few points here:

1) I had considered an HTML parsing module but for the limited
requirement this seemed unnecessary.
2) I know how to search CPAN but given 1) above chose not to

"doesn't work" is the worst of all possible error descriptions.

I agree - sorry

Don't make things more difficult for yourself by using / as a delimiter
when you know you will be searching for / within the regexp itself.

Er.....it wasn't making it more difficult. And yes before you refer me
to Chapter 2 "Roll your own Quotes" in Programming Perl I have read it

Moving the check for the optional / to inside the lookahead:
s{<(?!/?(p|ul|li|strong|em)).*?>}{}g;
seems to do what you want, but I still recommend abandoning this
approach in favor of using the correct tool for the job - an HTML
parsing module.

Excellent! The answer I was looking for. I'm curious as to why I would
want to use an HTML parsing module which no doubt would use multiple
regexes when this "one liner" works?
TMTOWTDI after all....
 
Stephen Hildrey

Lord0 said:
Excellent! The answer I was looking for. I'm curious as to why I would
want to use an HTML parsing module which no doubt would use multiple
regexes when this "one liner" works?
TMTOWTDI after all....

There's more than one way to shoot yourself in the foot, too.

Though the one-liner works now, who's to say it will still work in
future when somebody changes the HTML? I love regular expressions as
much as the next Perl coder, but there are times when JWZ's words are true:

"Some people, when confronted with a problem, think 'I know, I'll use
regular expressions'. Now they have two problems."

HTML parsing is usually one of them :)

Regards,
Steve
 
Paul Lalli

Lord0 said:
A few points here:

1) I had considered an HTML parsing module but for the limited
requirement this seemed unnecessary.

Sure, for the tiny piece of simplistic HTML you posted, this solution
is sufficient.

Er.....it wasn't making it more difficult. And yes before you refer me
to Chapter 2 "Roll your own Quotes" in Programming Perl I have read it

It's more difficult to read. Extraneous \/ sequences always, at least
for me, decrease readability. If it's harder to read, it's harder to
debug and harder to maintain.

Excellent! The answer I was looking for. I'm curious as to why I would
want to use an HTML parsing module which no doubt would use multiple
regexes when this "one liner" works?
TMTOWTDI after all....

Yes, and "just because there's more than one way to do it, does not
mean that all ways are equally cool" or even equally correct. As I
said above, this solution will work for simplistic HTML. But how about
we change your string to something like:
my $string = q{<p><img alt="<CLICK HERE>" />This is a test</p>};
$string =~ s{<(?!/?(p|ul|li|strong|em)).*?>}{}g;

Your regex solution will not handle this case. An HTML Parser would.
Of course, you could then modify your regex to watch out for situations
like this, but then you're going to go into a never-ending cycle of
finding new cases for which it doesn't work, and growing your regex
larger and larger and uglier. It's better to start with the right tool
for the job than it is to waste time editing the wrong tool until it
works perfectly.
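
To make that concrete, here is a small sketch that runs the lookahead regex over the <img> string above and over one more input, a <pre> element. The <pre> case is an extra illustration and not something from the original post, and the helper name strip_with_regex is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The lookahead pattern suggested earlier in the thread.
my $re = qr{<(?!/?(p|ul|li|strong|em)).*?>};

# Hypothetical helper: apply the pattern globally and return the result.
sub strip_with_regex {
    my ($html) = @_;
    $html =~ s{$re}{}g;
    return $html;
}

# Case 1: a ">" inside a quoted attribute value ends the match too early,
# so the tail of the <img> tag is left behind in the output.
print strip_with_regex(q{<p><img alt="<CLICK HERE>" />This is a test</p>}), "\n";
# prints: <p>" />This is a test</p>

# Case 2: "p" in the alternation also matches the start of "pre",
# so a tag that is not on the keep list survives untouched.
print strip_with_regex('<pre>code</pre>'), "\n";
# prints: <pre>code</pre>
```

Adding \b after the alternation would take care of the second case but does nothing for the first, which is the kind of patch-by-patch growth described above.
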

Paul Lalli
 
