HTML regex challenge

Max Metral · Jul 24, 2004

I'm matching some ASP.NET code with Perl regexes to do localization.
I'm having trouble with ASP's embedded <% %> blocks and telling them apart
from the end of the HTML tag. The thing I'm matching looks like:

<tag a=b c="d">stuff</tag>

My reg ex is:

<tag([^>]*?)>(.*?)</tag>

Which works fine for the first example. But it doesn't for this:

<tag a=b c="<%foo%>">stuff</tag>

As expected, it stops after %>. Question is, how can I modify the
expression to still get the whole "attribute section" in that single
match... I've tried various back reference constructs, but they don't seem
to do it. The expression fragment I want is "match everything except right
bracket, unless there was a % before the right bracket"...

Hrmph,
--Max

Bob Walton · Jul 24, 2004

Max said:
I'm matching some ASP.net code with some perl regex's to do localization.
I'm having some trouble with asp's embedded use of <% %> and differentiating
it from the html tag... So, the thing I'm matching is like:

<tag a=b c="d">stuff</tag>

My reg ex is:

<tag([^>]*?)>(.*?)</tag>

Which works fine for the first example. But it doesn't for this:

<tag a=b c="<%foo%>">stuff</tag>

As expected, it stops after %>. Question is, how can I modify the
expression to still get the whole "attribute section" in that single
match... I've tried various back reference constructs, but they don't seem
to do it. The expression fragment I want is "match everything except right
bracket, unless there was a % before the right bracket"... ....

--Max

Well, there's really only one way to do it right: Parse the HTML.
There are *bunches* of other cases that can bite you besides the one you
found, and, in general, it is most difficult to handle them all,
particularly in a single regexp. Actually, it is probably difficult to
even know about them all. See:

perldoc HTML::Parser
perldoc -q HTML

The latter document has a few of the possible trip-ups listed.

Tad McClellan · Jul 24, 2004

Max Metral said:
Subject: HTML regex challenge

Parsing arbitrary HTML with a regex is nearly impossible.

You need a Real Parser that knows the HTML grammar.

The expression fragment I want is "match everything except right
bracket, unless there was a % before the right bracket"...

Your problem description will not do the Right Thing for this HTML:

<img src="cool.jpg" alt=">>Cool pic!<<">

after you fix the regex for that case, post it here and we
will show some other HTML that breaks it.

Then after you fix the regex for _that_ case, post the regex
and we'll do it again.

Lather, rinse, repeat.

We can keep that up longer than you can.
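
To make the failure concrete, here is a minimal sketch that runs the attribute part of the original pattern against the img example above. The only change is an assumption made for the demo: "tag" is swapped for "img" and the closing-tag part is dropped, since img has no closing tag.

```perl
use strict;
use warnings;

# The img example from above: the alt attribute contains literal ">".
my $html = q{<img src="cool.jpg" alt=">>Cool pic!<<">};

# Attribute part of the original pattern, adapted to "img" for this demo.
my ($attrs) = $html =~ /<img([^>]*?)>/;

# The capture stops at the first ">" inside the alt value.
print "captured: [$attrs]\n";   # prints: captured: [ src="cool.jpg" alt="]
```

A "skip > when it follows %" rule does not help here, because neither ">" in the alt value is preceded by "%".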

Max Metral · Jul 25, 2004

Understood. To argue my case only slightly more, I'm not parsing arbitrary
HTML, I'm looking for a single tag called "localize" whose contents I then
replace with the contents of an XML entry from a resource file. So
there's never a case where > appears in an attribute of that tag, UNLESS
it's inside an ASP block (<% %>). The attributes of the localize tag are
very restricted, true/false type things, except for the fact that somebody
may need to "bind" one of these true/falses to a function call.

So my latest is:
<localize((?:[^>]*%>[^>])*[^>]*)>(.*?)</localize>

which fixes my original problem, but it's true that that won't handle

<localize visible="<%# x > 5%>">foo</localize>

but that seems fixable and "final", in that that's the only case that could
occur given the allowable values of the tag...

The problem with most HTML parsers is that (shocker) they don't handle
ASP.Net (which isn't HTML)... So rather than modding something big I was
hoping to keep it simple, even if that means constraining the user of the
tag somewhat.
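
For the "x > 5" case, one possible fix is to treat a whole <% ... %> block as a single unit inside the attribute section, so a ">" inside it is never seen as the end of the tag. This is only a sketch under the constraints described above (one localize tag, no ">" in attributes outside ASP blocks, no nested localize tags); parse_localize is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (attribute text, body) for the first
# <localize> element in the text, or an empty list if there is none.
sub parse_localize {
    my ($text) = @_;
    return $text =~ m{
        <localize
        (                  # capture 1: the attribute section
            (?: <% .*? %>  #   a whole ASP block, which may contain ">"
              | [^>]       #   or any single character other than ">"
            )*
        )
        >
        (.*?)              # capture 2: the body, as short as possible
        </localize>
    }xs;
}

my ($attrs, $body) =
    parse_localize(q{<localize visible="<%# x > 5%>">foo</localize>});
print "attrs=[$attrs] body=[$body]\n";
# prints: attrs=[ visible="<%# x > 5%>"] body=[foo]
```

The ASP alternative is listed first so that at every "<%" the engine consumes the block up to the next "%>" before it gets a chance to stop at a ">" inside it. A "%>" inside a string literal within the ASP block would still break this.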

ko · Jul 25, 2004

Max said:
Understood. To argue my case only slightly more, I'm not parsing arbitrary
HTML, I'm looking for a single tag called "localize" whose contents I then
replace with the contents of an XML entry from a resource file. So
there's never a case where > appears in an attribute of that tag, UNLESS
it's inside an ASP block (<% %>). The attributes of the localize tag are
very restricted, true/false type things, except for the fact that somebody
may need to "bind" one of these true/falses to a function call.

So my latest is:
<localize((?:[^>]*%>[^>])*[^>]*)>(.*?)</localize>

which fixes my original problem, but it's true that that won't handle

<localize visible="<%# x > 5%>">foo</localize>

but that seems fixable and "final", in that that's the only case that could
occur given the allowable values of the tag...

Generally, you can't argue against the advice to use an HTML module to
parse HTML.

If you *really* want to use a regex (now that you have explained that
you're looking for a single, *specific* tag), there are modules on
CPAN that make this job easier, one being Regexp::Common:

use strict;
use warnings;
use Regexp::Common qw /balanced/;

my $text = q[<localize visible="<%# x > 5%>">foo</localize>];
(my $changed = $text) =~
s/$RE{balanced}{-begin => '<%'}{-end => '%>'}{-keep}
/changed text/x;
print $changed . "\n";

Also, when replying to someone please keep the content you quote at the
top and your reply on the bottom. Your reply to Tad is an example of
top-posting, which is covered in the group's posting guidelines
available here:

http://mail.augustmail.com/~tadmc/clpmisc.shtml

HTH - keith

Tore Aursand · Jul 25, 2004

My reg ex is:

<tag([^>]*?)>(.*?)</tag>

Which works fine for the first example. But it doesn't for this:

<tag a=b c="<%foo%>">stuff</tag>

Hint: Think right-left, not left-right.
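
One possible reading of the hint (an interpretation, not necessarily what was meant): anchor on the closing tag and work backwards. If the body is assumed to contain no angle brackets, then the body is the bracket-free run just before </tag>, and the ">" immediately before that run is the real end of the opening tag. The sketch below relies on that assumption; split_tag is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (attribute text, body) for the first
# <tag> element. Assumes the body contains neither "<" nor ">".
sub split_tag {
    my ($text) = @_;
    # The body is pinned against </tag>, so the attribute capture is
    # forced to extend until the ">" that directly precedes the body.
    return $text =~ m{<tag\b(.*?)>([^<>]*)</tag>}s;
}

my ($attrs, $body) = split_tag(q{<tag a=b c="<%# x > 5%>">stuff</tag>});
print "attrs=[$attrs] body=[$body]\n";
# prints: attrs=[ a=b c="<%# x > 5%>"] body=[stuff]
```

This no longer needs to know anything about <% %>, but it stops working as soon as the body itself contains markup.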
