Let's say I have something like this:
$var = "<font background='#F5F5F5'>Here is some <font
color='#DADADA'>text</font>. Cool, huh?</font>";
I want to remove <font background='#F5F5F5'> and it's matching </
font>, but not the nested font tags.
The nesting can be handled via regex recursion (Perl 5.10 and above)
if you can live with an attribute = (?:"[^<]*?"|'[^<]*?') scenario.
It can still be handled if you can't live with "[^<]*?".
This requires a different strategy of evaluating attr/val in the
loop body upon a sucessful match with the
<(?:font(\s+(?
?:".*?")|(?:'.*?')|(?:[^>]*?))+)\s*(\/?))>
expression, which is guaranteed not to overrun the next markup.
It simply just stores the position or not.
See the below code.
A change scheme with regex might be faster than a tree since all
thats being done is sparce matching with mild validation parsing.
Depends on what you are willing to live with.
If you take out the debug stuff, its really not much code.
-sln
----------------
use strict;
use warnings;
## OP:
## "I want to remove <font background='#F5F5F5'> and it's
## matching </font>, but not the nested font tags."
##
my $debug = 1; # level: 0, 1 or 2
my $xml=<<EOXML;
<data>
start
<font background='#F5F5F5'>
Here is some
<font a>
<font color='#A5A5A5' background='#BABABA'/>
<font background='#DADADA'>
text
<font/>
</font>
Cool,
<font color='#F5F5F5'>
huh?
<font b>
italics
<!--
<font background='#CFCFCF'>
in comment
</font>
-->
<font background='#EFEFEF'>
more
</font>
</font>
</font>
</font>
<font/>
</font>
end
</data>
EOXML
##
my $attr = 'background';
my $open_attr = q{<font\s+[^>]*?(?<=\s)}.$attr.q{\s*=\s*(?:"[^<]*?"|'[^<]*?')[^>]*?(?<!\/)>};
my $close_attr = q{<font\s+[^>]*?(?<=\s)}.$attr.q{\s*=\s*(?:"[^<]*?"|'[^<]*?')[^>]*?\s*\/>};
my $open = q{<font\s*[^>]*?(?<!\/)>};
my $close = q{<\/font\s*>};
my $regx = qr/
(<!(?:\[CDATA\[.*?\]\]|--.*?--|\[[A-Z][A-Z\ ]*\[.*?\]\])>) #1
|
($close_attr) #2
|
( #3
(?: ($open_attr) | $open ) #4
( #5
(?:
(?>
(?:
(?:<!(?:\[CDATA\[.*?\]\]|--.*?--|\[[A-Z][A-Z\ ]*\[.*?\]\])>)
| (?! $open | $close ) .
)+
)
| (?3)
)*
)
($close) #6
)
/xs;
##
my @cleartag;
while ( $xml =~ /$regx/ig )
{
if (defined $1) {
print "---->\$1 = '$1'\n" if $debug > 1;
pos($xml) = $+[1];
}
elsif (defined $2) {
push @cleartag, [$-[2], length $2];
print "---->\$2 = '$2'\n" if $debug > 1;
pos($xml) = $+[2];
}
else {
if (defined $4) {
push @cleartag, [$-[4], length $4];
push @cleartag, [$-[6], length $6];
print "---->\$4 = '$4'\n" if $debug > 1;
print "---->\$6 = '$6'\n" if $debug > 1;
}
pos($xml) = $-[5];
}
}
if (@cleartag)
{
print "\n--- OLD ------------\n$xml\n\n" if $debug;
for my $ref ( sort {$b->[0]<=>$a->[0]} @cleartag )
{
print "offset= $ref->[0], length= $ref->[1]\n" if $debug > 1;
substr $xml, $ref->[0], $ref->[1], ($debug > 1 ? '-' x $ref->[1] : "");
}
print "\n--- NEW (", (@cleartag/2),") -------\n$xml\n\n" if $debug;
}
else {
print "No changes made!\n";
}
print "---------\nDone!\n";
__END__
Output:
--- OLD ------------
<data>
start
<font background='#F5F5F5'>
Here is some
<font a>
<font color='#A5A5A5' background='#BABABA'/>
<font background='#DADADA'>
text
<font/>
</font>
Cool,
<font color='#F5F5F5'>
huh?
<font b>
italics
<!--
<font background='#CFCFCF'>
in comment
</font>
-->
<font background='#EFEFEF'>
more
</font>
</font>
</font>
</font>
<font/>
</font>
end
</data>
--- NEW (3.5) -------
<data>
start
Here is some
<font a>
text
<font/>
Cool,
<font color='#F5F5F5'>
huh?
<font b>
italics
<!--
<font background='#CFCFCF'>
in comment
</font>
-->
more
</font>
</font>
</font>
<font/>
end
</data>