HTML regex challenge

Max Metral

I'm matching some ASP.NET code with some Perl regexes to do localization.
I'm having some trouble with ASP's embedded use of <% %> and differentiating
it from the HTML tag. So, the thing I'm matching is like:

<tag a=b c="d">stuff</tag>

My regex is:

<tag([^>]*?)>(.*?)</tag>

Which works fine for the first example. But it doesn't for this:

<tag a=b c="<%foo%>">stuff</tag>

As expected, it stops after %>. The question is, how can I modify the
expression to still get the whole "attribute section" in that single
match? I've tried various backreference constructs, but they don't seem
to do it. The expression fragment I want is "match everything except right
bracket, unless there was a % before the right bracket"...

Hrmph,
--Max
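
For reference, a minimal sketch that reproduces the behaviour described above. The pattern and the two sample strings come from the post; the variable names are only illustrative.

```perl
use strict;
use warnings;

# The pattern from the post, unchanged.
my $re = qr{<tag([^>]*?)>(.*?)</tag>};

my @samples = (
    '<tag a=b c="d">stuff</tag>',
    '<tag a=b c="<%foo%>">stuff</tag>',
);

my @results;
for my $html (@samples) {
    # In list context the match returns the two captured groups.
    if ( my ($attrs, $body) = $html =~ $re ) {
        push @results, [ $attrs, $body ];
        print "attrs=[$attrs] body=[$body]\n";
    }
}
```

For the second sample the attribute capture ends at the > that closes the ASP block, so the rest of the attribute value leaks into the second capture.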
 
Bob Walton

Well, there's really only one way to do it right: Parse the HTML.
There are *bunches* of other cases that can bite you besides the one you
found, and, in general, it is most difficult to handle them all,
particularly in a single regexp. Actually, it is probably difficult to
even know about them all. See:

perldoc HTML::Parser
perldoc -q HTML

The latter document has a few of the possible trip-ups listed.
 
Tad McClellan

Max Metral said:
Subject: HTML regex challenge


Parsing arbitrary HTML with a regex is nearly impossible.

You need a Real Parser that knows the HTML grammar.

The expression fragment I want is "match everything except right
bracket, unless there was a % before the right bracket"...


Your problem description will not do the Right Thing for this HTML:

<img src="cool.jpg" alt=">>Cool pic!<<">

After you fix the regex for that case, post it here and we
will show some other HTML that breaks it.

Then after you fix the regex for _that_ case, post the regex
and we'll do it again.

Lather, rinse, repeat.

We can keep that up longer than you can. :)
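
A small sketch of why the rule as stated trips over that example. It assumes one literal translation of "everything except right bracket, unless there was a % before the right bracket" into a pattern using a lookbehind; this is an illustration, not a pattern anyone in the thread posted.

```perl
use strict;
use warnings;

# Literal reading of the rule: any character except '>',
# or a '>' that directly follows a '%'.
my $attr_section = qr{(?:[^>]|(?<=%)>)*};

# The ASP case from the original question is captured whole...
my ($asp_attrs) = '<tag a=b c="<%foo%>">stuff</tag>' =~ m{<tag($attr_section)>};
print "asp: [$asp_attrs]\n";

# ...but a plain '>' inside a quoted attribute value still ends the match early.
my $img = '<img src="cool.jpg" alt=">>Cool pic!<<">';
my ($img_attrs) = $img =~ m{<img($attr_section)>};
print "img: [$img_attrs]\n";
```

The img capture stops right after alt=", because the first > in the attribute value is not preceded by a %.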
 
Max Metral

Understood. To argue my case only slightly more, I'm not parsing arbitrary
HTML. I'm looking for a single tag called "localize", whose contents I then
replace with the contents of an XML entry from a resource file. So
there's never a case where > appears in an attribute of that tag, UNLESS
it's inside an ASP block (<% %>). The attributes of the localize tag are
very restricted, true/false type things, except for the fact that somebody
may need to "bind" one of these true/falses to a function call.

So my latest is:
<localize((?:[^>]*%>[^>])*[^>]*)>(.*?)</localize>

which fixes my original problem, but it's true that that won't handle

<localize visible="<%# x > 5%>">foo</localize>

but that seems fixable and "final", in that that's the only case that could
occur given the allowable values of the tag...

The problem with most HTML parsers is that (shocker) they don't handle
ASP.NET (which isn't HTML)... So rather than modding something big, I was
hoping to keep it simple, even if that means constraining the user of the
tag somewhat.
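
One way the "fixable" case might be handled, as a sketch only: treat a whole <% ... %> block as a single unit inside the attribute section, so a > inside the block no longer matters. This assumes ASP blocks are never nested and that %> never appears inside the block's own expression; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $localize_re = qr{
    <localize
    (                   # group 1: the attribute section
      (?:
          <% .*? %>     # a whole ASP block, which may contain '>'
        | [^>]          # otherwise any single character except '>'
      )*
    )
    >
    (.*?)               # group 2: the tag contents
    </localize>
}xs;

my @samples = (
    '<localize a="<%foo%>">stuff</localize>',
    '<localize visible="<%# x > 5%>">foo</localize>',
);

my @parsed;
for my $src (@samples) {
    if ( my ($attrs, $body) = $src =~ $localize_re ) {
        push @parsed, [ $attrs, $body ];
        print "attrs=[$attrs] body=[$body]\n";
    }
}
```

A bare > in an ordinary quoted attribute value would still end the attribute section early, which matches the stated restriction that > only ever appears inside an ASP block.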
 
ko

Generally, you can't argue against the advice to use an HTML module to
parse HTML.

If you *really* want to use a regex, (now that you have elaborated
you're looking for a single, *specific* instance) there are modules on
CPAN that make this job easier, one being Regexp::Common:

use strict;
use warnings;
use Regexp::Common qw /balanced/;

my $text = q[<localize visible="<%# x > 5%>">foo</localize>];
(my $changed = $text) =~
s/$RE{balanced}{-begin => '<%'}{-end => '%>'}{-keep}
/changed text/x;
print $changed . "\n";

Also, when replying to someone please keep the content you quote at the
top and your reply on the bottom. Your reply to Tad is an example of
top-posting, which is covered in the group's posting guidelines
available here:

http://mail.augustmail.com/~tadmc/clpmisc.shtml

HTH - keith
 
Tore Aursand

My reg ex is:

<tag([^>]*?)>(.*?)</tag>

Which works fine for the first example. But it doesn't for this:

<tag a=b c="<%foo%>">stuff</tag>

Hint: Think right-left, not left-right.
 
