Removing tag + closing tag

jwcarlton

Let's say I have something like this:

$var = "<font background='#F5F5F5'>Here is some <font
color='#DADADA'>text</font>. Cool, huh?</font>";

I want to remove <font background='#F5F5F5'> and its matching </font>,
but not the nested font tags. I'm removing the first part like this:

$var =~ s/<font background='(.*?)'>//gi;

How would I remove the matching </font>, without removing the one for
<font color='#DADADA'>?
 
Jürgen Exner

jwcarlton said:
Let's say I have something like this:

$var = "<font background='#F5F5F5'>Here is some <font
color='#DADADA'>text</font>. Cool, huh?</font>";

I want to remove <font background='#F5F5F5'> and its matching </font>,
but not the nested font tags. I'm removing the first part like this:

$var =~ s/<font background='(.*?)'>//gi;

How would I remove the matching </font>, without removing the one for
<font color='#DADADA'>?

Simple. By using a proper HTML parser (or writing your own).

For further details please see the numerous previous discussions about
this ever-popular, constantly returning, eternal topic.

jue
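
For a sense of what "writing your own" involves at its very smallest, here is a sketch in core Perl that walks the font tags in order and keeps a stack, so each </font> is paired with the <font> that opened it. It is not a real HTML parser: it assumes well-formed input with no comments, CDATA, or self-closing font tags, and the helper name strip_font_with_attr is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every <font ...> tag that carries the given
# attribute, plus the </font> that closes it. All other tags are kept.
sub strip_font_with_attr {
    my ($html, $attr) = @_;
    my @stack;      # one flag per open <font>: true means "drop its close tag"
    my $out = '';

    # split with a capturing group returns the tags as separate pieces
    for my $piece (split m{(</?font\b[^>]*>)}i, $html) {
        if ($piece =~ m{^<font\b}i) {
            my $drop = $piece =~ /\s\Q$attr\E\s*=/i ? 1 : 0;
            push @stack, $drop;
            $out .= $piece unless $drop;
        }
        elsif ($piece =~ m{^</font\b}i) {
            my $drop = @stack ? pop @stack : 0;
            $out .= $piece unless $drop;
        }
        else {
            $out .= $piece;     # ordinary text or any non-font markup
        }
    }
    return $out;
}

my $var = "<font background='#F5F5F5'>Here is some "
        . "<font color='#DADADA'>text</font>. Cool, huh?</font>";
print strip_font_with_attr($var, 'background'), "\n";
# prints: Here is some <font color='#DADADA'>text</font>. Cool, huh?
```

For real-world HTML, with its unclosed tags and odd quoting, an existing CPAN parser such as HTML::Parser remains the safer route.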
 
jwcarlton

Jürgen Exner said:
Simple. By using a proper HTML parser (or writing your own).

For further details please see the numerous previous discussions about
this ever-popular, constantly returning, eternal topic.

jue

I searched for this info before posting, but didn't find anything.
Unfortunately, wading through gallons of spam makes it hard to find
anything.

Can you point me in a better direction? Hints on a subject name,
maybe?

Tad, I guess you got me! LOL TECHNICALLY, that works on my sample,
but doesn't quite solve the overall problem.
 
Jürgen Exner

jwcarlton said:
I searched for this info before posting, but didn't find anything.
Unfortunately, wading through gallons of spam makes it hard to find
anything.

Can you point me in a better direction?

Well, the key word here is Chomsky hierarchy. You cannot parse a
context-free language like HTML using a regular (= finite-state) automaton.
Granted, Perl's REs have extensions which make them more powerful than
normal regular expressions, but it is still A Very Bad Idea(TM) to try
to parse a language as complex as HTML using Perl's REs.

jwcarlton said:
Hints on a subject name, maybe?

How about some pointers to the FAQ instead: perldoc -q HTML
- How do I match XML, HTML, or other nasty, ugly things with a regex?
- How do I remove HTML from a string?

jue
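
To make the problem concrete, here is a small sketch of how the two obvious single-regex attempts go wrong. It uses the OP's sample, plus a second font element added purely for illustration. A plain .*? or .* has no notion of nesting depth, so it pairs the opening tag with either the first or the last </font> it can find.

```perl
use strict;
use warnings;

my $var = "<font background='#F5F5F5'>Here is some "
        . "<font color='#DADADA'>text</font>. Cool, huh?</font>";

# Non-greedy: stops at the FIRST </font>, which belongs to the inner tag.
(my $lazy = $var) =~ s{<font background='[^']*'>(.*?)</font>}{$1}is;
print "$lazy\n";
# prints: Here is some <font color='#DADADA'>text. Cool, huh?</font>

# Greedy: runs to the LAST </font> in the string. That is right for the
# sample alone, but wrong once another font element follows it.
my $two = $var . " <font color='#000000'>more</font>";
(my $greedy = $two) =~ s{<font background='[^']*'>(.*)</font>}{$1}is;
print "$greedy\n";
# prints: Here is some <font color='#DADADA'>text</font>. Cool, huh?</font> <font color='#000000'>more
```

In both cases the output still contains the same number of tags as a correct answer would, but the wrong closing tag has been removed.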
 
jwcarlton

Tad McClellan said:
My (secondary) point was that you have not taken care to make a
good sample.


I don't intend to solve your problems.

I don't normally see any of your posts.  I was "slumming" down in
the killfiled score range (I was bored).

You make your reputation, and then you live with it.

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

Reputation? To my knowledge, I have no enemies here.

Methinks you may just be filtering people that use Google Groups. If
so, no skin off my back; others did a grand job answering, so all you
really contributed was unnecessary BS.
 
sln

jwcarlton said:
Let's say I have something like this:

$var = "<font background='#F5F5F5'>Here is some <font
color='#DADADA'>text</font>. Cool, huh?</font>";

I want to remove <font background='#F5F5F5'> and its matching </font>,
but not the nested font tags.

The nesting can be handled via regex recursion (Perl 5.10 and above)
if you can live with an attribute = (?:"[^<]*?"|'[^<]*?') scenario.

It can still be handled if you can't live with "[^<]*?".
This requires a different strategy: evaluating attr/val in the
loop body upon a successful match with the
<(?:font(\s+(?:(?:".*?")|(?:'.*?')|(?:[^>]*?))+)\s*(\/?))>
expression, which is guaranteed not to overrun the next markup.
It simply stores the position or not.
See the code below.

A change scheme with regex might be faster than a tree, since all
that's being done is sparse matching with mild validation parsing.
It depends on what you are willing to live with.
If you take out the debug stuff, it's really not much code.

-sln
----------------
use strict;
use warnings;

## OP:
## "I want to remove <font background='#F5F5F5'> and it's
## matching </font>, but not the nested font tags."

##
my $debug = 1; # level: 0, 1 or 2

my $xml=<<EOXML;
<data>
start
<font background='#F5F5F5'>
Here is some
<font a>
<font color='#A5A5A5' background='#BABABA'/>
<font background='#DADADA'>
text
<font/>
</font>
Cool,
<font color='#F5F5F5'>
huh?
<font b>
italics
<!--
<font background='#CFCFCF'>
in comment
</font>
-->
<font background='#EFEFEF'>
more
</font>
</font>
</font>
</font>
<font/>
</font>
end
</data>
EOXML

##
my $attr = 'background';
my $open_attr = q{<font\s+[^>]*?(?<=\s)}.$attr.q{\s*=\s*(?:"[^<]*?"|'[^<]*?')[^>]*?(?<!\/)>};
my $close_attr = q{<font\s+[^>]*?(?<=\s)}.$attr.q{\s*=\s*(?:"[^<]*?"|'[^<]*?')[^>]*?\s*\/>};
my $open = q{<font\s*[^>]*?(?<!\/)>};
my $close = q{<\/font\s*>};

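# Capture groups:
#   1  a comment, CDATA section, or other <!...> block (skipped as a unit)
#   2  a self-closing font tag that carries the attribute
#   3  one balanced font element; (?3) recurses into it for nested elements
#   4  the open tag, set only when it carries the attribute
#   5  the element content
#   6  the close tag that balances the open tag
my $regx = qr/
    (<!(?:\[CDATA\[.*?\]\]|--.*?--|\[[A-Z][A-Z\ ]*\[.*?\]\])>) #1
  |
    ($close_attr) #2
  |
    ( #3
        (?: ($open_attr) | $open ) #4
        ( #5
            (?:
                (?>
                    (?:
                        (?:<!(?:\[CDATA\[.*?\]\]|--.*?--|\[[A-Z][A-Z\ ]*\[.*?\]\])>)
                      | (?! $open | $close ) .
                    )+
                )
              | (?3)
            )*
        )
        ($close) #6
    )
/xs;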

##
my @cleartag;

while ( $xml =~ /$regx/ig )
{
    if (defined $1) {
        # comment or CDATA: leave it alone and continue after it
        print "---->\$1 = '$1'\n" if $debug > 1;
        pos($xml) = $+[1];
    }
    elsif (defined $2) {
        # self-closing tag with the attribute: remember it for removal
        push @cleartag, [$-[2], length $2];
        print "---->\$2 = '$2'\n" if $debug > 1;
        pos($xml) = $+[2];
    }
    else {
        if (defined $4) {
            # open tag with the attribute, plus its balancing close tag
            push @cleartag, [$-[4], length $4];
            push @cleartag, [$-[6], length $6];
            print "---->\$4 = '$4'\n" if $debug > 1;
            print "---->\$6 = '$6'\n" if $debug > 1;
        }
        # resume at the start of the content so nested elements are scanned too
        pos($xml) = $-[5];
    }
}

if (@cleartag)
{
    print "\n--- OLD ------------\n$xml\n\n" if $debug;
    # remove from the highest offset down so earlier offsets stay valid
    for my $ref ( sort {$b->[0]<=>$a->[0]} @cleartag )
    {
        print "offset= $ref->[0], length= $ref->[1]\n" if $debug > 1;
        substr $xml, $ref->[0], $ref->[1], ($debug > 1 ? '-' x $ref->[1] : "");
    }
    print "\n--- NEW (", (@cleartag/2),") -------\n$xml\n\n" if $debug;
}
else {
    print "No changes made!\n";
}
print "---------\nDone!\n";

__END__

Output:

--- OLD ------------
<data>
start
<font background='#F5F5F5'>
Here is some
<font a>
<font color='#A5A5A5' background='#BABABA'/>
<font background='#DADADA'>
text
<font/>
</font>
Cool,
<font color='#F5F5F5'>
huh?
<font b>
italics
<!--
<font background='#CFCFCF'>
in comment
</font>
-->
<font background='#EFEFEF'>
more
</font>
</font>
</font>
</font>
<font/>
</font>
end
</data>



--- NEW (3.5) -------
<data>
start

Here is some
<font a>


text
<font/>

Cool,
<font color='#F5F5F5'>
huh?
<font b>
italics
<!--
<font background='#CFCFCF'>
in comment
</font>
-->

more

</font>
</font>
</font>
<font/>

end
</data>
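
The last step of the script above (collect offsets first, then cut from the end of the string backwards) is worth seeing in isolation. A minimal sketch, with a made-up sample string and <b> tags standing in for the font tags:

```perl
use strict;
use warnings;

my $text = "keep <b>one</b> and <b>two</b> here";

# Pass 1: only record where each tag sits; the string is not modified yet.
my @spans;
while ($text =~ m{</?b>}g) {
    push @spans, [ $-[0], $+[0] - $-[0] ];   # [offset, length] of this match
}

# Pass 2: delete from the highest offset down, so earlier offsets stay valid.
for my $span (sort { $b->[0] <=> $a->[0] } @spans) {
    substr($text, $span->[0], $span->[1], '');
}

print "$text\n";
# prints: keep one and two here
```

Deleting from the highest offset down matters because removing an earlier span would shift every offset recorded after it.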
 
Theo van den Heuvel

sln said:
The nesting can be handled via regex recursion (Perl 5.10 and above) .... Loads of regex madness
Done!

The OP is strongly advised to follow the advice that is posted
here every week and use an existing HTML parser instead of attempting
something that can be mathematically proven to be impossible except
in fairly trivial cases. sln's approach only shows how convoluted
and vulnerable the regex attempts need to be. They can never scale
when requirements are added.

Theo van den Heuvel
 
sln

Theo van den Heuvel said:
... use an existing HTML parser instead of doing

Uh, that would be XHTML or XML. Allowing unclosed tags requires
an inside-out nesting strategy when stripping out selected tags.
It's doable.

Theo van den Heuvel said:
something that can be mathematically proven to be impossible except
in fairly trivial cases.

Impossible? You obviously tried the code. Did it work for you?
Let me know if it doesn't.
Nothing trivial about this code. I've just shown how this can be done
with regex recursion. Prove this wrong mathematically!

Theo van den Heuvel said:
They can never scale when requirements are added.

They? There is nothing "general" about the code. It's specific.
Do you actually know what it does?
You speak of "parser" but don't understand the language.
This is not parsing anything other than balanced text
using the recursion engine of Perl 5.10.
Take it up with Larry if there is a problem.

Let's just say nesting can be handled quite well.

-sln
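
For readers who want the recursion idea without the comment handling and debug output, here is a much smaller sketch aimed only at the OP's single sample. It assumes Perl 5.10 or later, single-quoted attribute values, well-formed nesting, and no comments or self-closing tags. The group names body, text and balanced are invented for this example and are not taken from sln's script.

```perl
use strict;
use warnings;

my $var = "<font background='#F5F5F5'>Here is some "
        . "<font color='#DADADA'>text</font>. Cool, huh?</font>";

my $re = qr{
    <font \s+ background \s* = \s* '[^']*' \s* >    # open tag to drop
    (?<body> (?: (?&text) | (?&balanced) )* )       # everything it encloses
    </font \s* >                                    # its own closing tag

    (?(DEFINE)
        # a run of characters that does not start any font tag
        (?<text>     (?> (?: (?! </? font \b ) . )+ ) )
        # one complete nested font element; recurses for deeper nesting
        (?<balanced> <font \b [^>]* > (?: (?&text) | (?&balanced) )* </font \s* > )
    )
}xsi;

# Replace the whole element with just its body.
(my $out = $var) =~ s/$re/$+{body}/;
print "$out\n";
```

Because the body is matched as a sequence of plain text and complete nested elements, the closing tag that follows it can only be the one that balances the opening tag. Font elements with a background attribute nested inside the body are not revisited by this single substitution.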
 
David Canzi

Theo van den Heuvel said:
The OP is strongly advised to follow the advice that is posted
here every week and use an existing HTML parser instead of attempting
something that can be mathematically proven to be impossible except
in fairly trivial cases. sln's approach only shows how convoluted
and vulnerable the regex attempts need to be. They can never scale
when requirements are added.

Theo van den Heuvel

Perl's regular expressions ceased to be regular expressions long
ago and nobody ever got around to coming up with a different name
for them. Perl's regular expressions have grown into a language
within a language which, if not already capable of emulating a
Turing machine, probably soon will be.

Even when people stick to basic features of regular expressions,
i.e. features consistent with equivalence to deterministic finite
automata, regular expressions are hard to understand. When it
becomes possible to write large and complex expressions in this
language within a language, the result is as hard to understand
as APL or dc. And whenever this expression has to be examined
again to debug it or improve it, it will have to be re-understood.
Every time it needs to be maintained it has to be *solved*.

Today's complex Perl regexp is tomorrow's maintenance headache.

People can impress themselves all they want by being Perl regexp
virtuosi, but they shouldn't do it in public.
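
A classic illustration of the first point, sketched here as an aside: the language of n a's followed by exactly n b's is not regular, so no finite automaton can recognise it, yet a Perl 5.10+ pattern with recursion can. The variable name $anbn is just for this example.

```perl
use strict;
use warnings;

# (?1) re-enters group 1, so the group reads:
# an "a", optionally a nested copy of the whole group, then a "b".
my $anbn = qr/\A(a(?1)?b)\z/;

for my $s (qw(ab aaabbb aabbb abab)) {
    printf "%-7s %s\n", $s, ($s =~ $anbn ? "matches" : "does not match");
}
```

Only ab and aaabbb should be reported as matching, since the other two strings do not have equal, properly nested runs of a and b.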
 
sln

David Canzi said:
Today's complex Perl regexp is tomorrow's maintenance headache.

Yet complex patterns are no more easily handled with other,
non-regular-expression methods.

Is it time to limit the complexities of life then?
Maybe take out repetition and variance.

David Canzi said:
People can impress themselves all they want by being Perl regexp
virtuosi, but they shouldn't do it in public.

Why, thank you.

-sln
 
C.DeRykus

David Canzi said:
Perl's regular expressions ceased to be regular expressions long
ago and nobody ever got around to coming up with a different name
for them. Perl's regular expressions have grown into a language
within a language which, if not already capable of emulating a
Turing machine, probably soon will be.

Even when people stick to basic features of regular expressions,
i.e. features consistent with equivalence to deterministic finite
automata, regular expressions are hard to understand. When it
becomes possible to write large and complex expressions in this
language within a language, the result is as hard to understand
as APL or dc. And whenever this expression has to be examined
again to debug it or improve it, it will have to be re-understood.
Every time it needs to be maintained it has to be *solved*.

Today's complex Perl regexp is tomorrow's maintenance headache.

People can impress themselves all they want by being Perl regexp
virtuosi, but they shouldn't do it in public.

I think that's overstated and I especially
disagree with the last sentence. This forum
is for discussing/exploring the Perl language -
capabilities, features, even its quirks. While
a real parser is the right answer, Perl regexes
can now match balanced text - a key part of this
particular task. So a demo is on point and could
be useful in some future problem scenario. It's
fair to be impressed with the power of Perl's
regex capabilities.

Of course, there is considerable complexity to
regexes and they can quickly become unwieldy. But
Perl provides and recommends a way to comment them.
They are not appropriate in every case but they
needn't become maintenance headaches.
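
As a small illustration of that commenting facility, here is the OP's substitution respelled with the /x modifier, which lets whitespace and # comments sit inside the pattern. The sample string is shortened for the example, and [^']* is used in place of the original (.*?).

```perl
use strict;
use warnings;

my $var = "<font background='#F5F5F5'>Here is some text</font>";

# With /x, unescaped whitespace in the pattern is ignored, so the space
# between "font" and "background" has to be written as \s+.
$var =~ s{
    <font           # start of the opening tag
    \s+             # whitespace before the attribute
    background='    # attribute name and opening quote
    [^']*           # the value: anything up to the closing quote
    '>              # closing quote and end of the tag
}{}gix;

print "$var\n";
# prints: Here is some text</font>
```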
 
Jürgen Exner

David Canzi said:
Perl's regular expressions ceased to be regular expressions long
ago and nobody ever got around to coming up with a different name
for them. Perl's regular expressions have grown into a language
within a language which, if not already capable of emulating a
Turing machine, probably soon will be.

Even when people stick to basic features of regular expressions,
i.e. features consistent with equivalence to deterministic finite
automata, regular expressions are hard to understand. When it
becomes possible to write large and complex expressions in this
language within a language, the result is as hard to understand
as APL or dc. And whenever this expression has to be examined
again to debug it or improve it, it will have to be re-understood.
Every time it needs to be maintained it has to be *solved*.

Today's complex Perl regexp is tomorrow's maintenance headache.

People can impress themselves all they want by being Perl regexp
virtuosi, but they shouldn't do it in public.

Amen to that!!!

jue
 
sln

Ben said:
Just for fun, here's a not-quite-complete but otherwise correct (modulo
bugs, obviously) implementation of an XML parser as a single regex (I
omitted PIs and DTDs, since they added about as many productions again).
It's not terribly useful as it stands (it just tells you if a given
string contains a valid XML document or not) but it could be made to
build a parse tree fairly easily using (?{}) (subject to the usual
caveats with that construction).

Thanks for this, Ben.
I like the way you put this together, following the naming of
the W3C XML 1.1 recommendation document
http://www.w3.org/TR/xml11/#NT-AttValue
There are a few missing things, but all in all it's pretty good.

I tried it out and discovered some issues using recursion on it,
measured against the goals of the XML standard.
I'll just note the issues on the sections, then have a full demo
of your regex, modified (below). The focus is mostly on elements and nesting.
Also included is a proof of concept (the small sample code) of element
nesting and recursion.

I haven't checked out everything, but it looks fairly conformant to the
XML 1.1 recommendation. Many test cases would have to be developed.
But, in any case, the Perl 5.10 engine is still lacking, the (?{ code })
construct is perilous, and I hope to see some improvements in version 6.

How you had the patience to put this together is anybody's guess, but thanks.
Btw, I don't hold out a lot of hope on the speed. I could be wrong, and I will
have to check that.

-sln

------------------------
m(
(?&document)

(?(DEFINE)

# Document

(?<document>
(?&prolog) (?&element) (?&Misc)*

needs to be a check for root content here (see code)
)

# Character sets

(?<Char>
[\x9-\xA\xD\x20-\x7E\x85\xA0-\x{D7FF}] |

[\x9] style doesn't work for my version, [\x{9}] does
[\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]
)
(?<S> [\x20\x9\xD\xA]+ )

[snip Names, Comments]
# Prolog

(?<prolog> (?&XMLDecl) (?&Misc)* )
(?<XMLDecl>
<\?xml (?&VersionInfo) (?&EncodingDecl)? (?&S)? \?>

there needs to be more options in the xml declaration body and
not sure but also there might also be said:
)
(?<Misc> (?&Comment) | (?&S) )
(?<Eq> (?&S)? = (?&S)? )

(?<VersionInfo> (?&S) version (?&Eq) (?: '1\.[10]' | "1\.[10]" ) )

(?<EncodingDecl>
(?&S) encoding (?&Eq) (?: "(?&EncName)" | '(?&EncName)' )
)
(?<EncName> [A-Za-z] (?: [A-Za-z0-9._-] )* )

[snip CDATA]
# Element

(?<element> (?&EmptyElemTag) | (?&STag) (?&content) (?&ETag) )
^^^^
This is a big problem. As it is now, any </tag> end tag can satisfy,
or balance, any <tag> start tag. Should there be a *counter* imbalance
later, successful recursive matching will occur (on the whole).
There is a balance, but only between start and end tags; tag names are not checked.
This matches:
<A> content </B>

A *better* scheme is to match with a backreference (see code).

The really bad problem is that if there is no successful match in
the code path (?&ETag), the engine will sit in an infinite backtracking loop
recursing the same element. Consider this:
<A> <B> </A> <C> </C>

This can be fixed with a backreference or'd with some verbs (*COMMIT)(?{ msg })(*FAIL)
(see code).
(?<STag> < (?&Name) (?: (?&S) (?&Attribute) )* (?&S)? > )
(?<Attribute> (?&Name) (?&Eq) (?&AttValue) )
(?<ETag> </ (?&Name) (?&S)? > )

(?<AttValue>
" (?: [^<&"] | (?&Reference) )* " |
' (?: [^<&'] | (?&Reference) )* '
)
[snip content, empty elements, references]

Test of balanced named tags (proof of concept)
----------------------
use strict;
use warnings;

use re 'eval';
##
my $xml=<<EXML;
<aa> textaa
<bb> textbb
<cc> textbb
</cc>
</bb>
end textaa
</aa>
EXML

##
my $regx = qr/
(?:
    ( #1
        # Start tag
        < (?<name> [a-z]+ ) >

        (?{ print "our name is $+{name}\n" })

        # content
        (?:
            (?> (?: (?! <[a-z]+> | <\/[a-z]+> ) .)+ )
          | (?1)
        )*

        (?{ print "we need $+{name}\n" })

        # End tag
        <\/ \k<name> >
        (?{ print "have $+{name}\n" })
    )
    (?{print "\nmatched: \n$^N\n\n" })
  |
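
The listing above breaks off mid-pattern in this copy of the thread. As a separate, self-contained sketch of the idea it demonstrates (recursing into an element while a backreference forces the end tag to repeat the start tag's name), something like the following should work on Perl 5.10 or later. It is not sln's code: the variable $balanced and the sample strings are made up for illustration, and it only handles lowercase tag names without attributes.

```perl
use strict;
use warnings;

my $balanced = qr{
    \A \s*
    (                                   # group 1: one complete element
        < (?<name> [a-z]+ ) >           # start tag; remember its name
        (?:
            (?> [^<]+ )                 # plain text
          | (?1)                        # or a nested element (recursion)
        )*
        </ \k<name> >                   # end tag must repeat the same name
    )
    \s* \z
}xs;

for my $xml ("<aa> t <bb> u </bb> v </aa>",
             "<aa> t </bb>",
             "<aa> <bb> </aa> </bb>") {
    printf "%-28s %s\n", $xml, ($xml =~ $balanced ? "balanced" : "not balanced");
}
```

This relies on the documented behaviour that capture groups set inside a recursive call are not visible to the caller, so after the nested element has matched, \k<name> refers to the outer tag name again. Only the first sample string should be reported as balanced.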
 
