Removing tag + closing tag

jwcarlton · Sep 20, 2010

Let's say I have something like this:

$var = "Here is some text. Cool, huh?";

I want to remove and its matching , but not the nested font tags. I'm removing the first part like
this:

$var =~ s///gi;

How would I remove the matching , without removing the one for
?

Jürgen Exner · Sep 20, 2010

Simple. By using a proper HTML parser (or writing your own).

For further details please see the numerous previous discussions about
this ever-popular, constantly returning, eternal topic.

jue
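
As a rough illustration of the "writing your own" route, here is a minimal sketch that splits the text into tags and text and keeps a stack of open font tags, so each closing tag is paired with the opening tag it belongs to. The sample string, the background attribute and the helper name strip_font_with_attr are assumptions made for the example (the tags in the original post did not survive). It does not cope with comments, CDATA, self-closed tags, or ">" inside attribute values, which is where a real parser such as HTML::Parser earns its keep.

```perl
use strict;
use warnings;

# Hypothetical input: the outer <font> carries a "background" attribute,
# the inner one does not. Neither tag comes from the original post.
my $html = 'Here is <font background="yellow">some '
         . '<font size="2">text</font>. Cool</font>, huh?';

# Remove every <font> whose attributes mention $attr, together with the
# </font> that closes it; other <font> pairs are left alone.
sub strip_font_with_attr {
    my ($text, $attr) = @_;
    my @stack;      # one flag per open <font>: true if that tag was dropped
    my $out = '';

    # split with a capturing group returns the tags as well as the text.
    for my $piece (split /(<[^>]*>)/, $text) {
        if ($piece =~ /^<font\b([^>]*)>$/i) {
            my $attrs = $1;
            my $drop  = $attrs =~ /\b\Q$attr\E\s*=/i ? 1 : 0;
            push @stack, $drop;
            $out .= $piece unless $drop;
        }
        elsif ($piece =~ m{^</font\s*>$}i) {
            # The closing tag inherits the fate of the tag it closes.
            my $drop = @stack ? pop @stack : 0;
            $out .= $piece unless $drop;
        }
        else {
            $out .= $piece;
        }
    }
    return $out;
}

print strip_font_with_attr($html, 'background'), "\n";
# prints: Here is some <font size="2">text</font>. Cool, huh?
```

The stack is the part a plain substitution lacks: it remembers how many font tags are still open, and which of them were selected for removal.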

jwcarlton · Sep 20, 2010

I searched for this info before posting, but didn't find anything.
Unfortunately, wading through gallons of spam makes it hard to find
anything.

Can you point me in a better direction? Hints on a subject name,
maybe?

Tad, I guess you got me! LOL TECHNICALLY, that works on my sample,
but doesn't quite solve the overall problem.

Jürgen Exner · Sep 20, 2010

jwcarlton said:
I searched for this info before posting, but didn't find anything.
Unfortunately, wading through gallons of spam makes it hard to find
anything.

Can you point me in a better direction?

Well, the key word here is Chomsky hierarchy. You cannot parse a
context-free language like HTML using a regular (= finite-state) automaton.
Granted, Perl's REs have extensions which make them more powerful than
normal regular expressions, but still it is A Very Bad Idea(TM) trying
to parse a language as complex as HTML using Perl's REs.

Hints on a subject name, maybe?

How about some pointers to the FAQ instead: perldoc -q HTML
- How do I match XML, HTML, or other nasty, ugly things with a regex?
- How do I remove HTML from a string?

jue
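
To make the point about nesting concrete, here is a small sketch of what goes wrong with a single non-recursive substitution. The markup and attribute names are made up for the example, since the tags in the original question were lost.

```perl
use strict;
use warnings;

# Hypothetical nested markup; the attribute names are illustrative only.
my $var = '<font background="yellow">Here is <font size="2">some</font>'
        . ' text.</font> Cool, huh?';

# Naive attempt: drop the outer opening tag and "its" closing tag in one go.
# The non-greedy (.*?) stops at the FIRST </font>, which belongs to the
# inner tag, because the pattern has no way to count open tags.
(my $naive = $var) =~ s{<font\s+background="[^"]*">(.*?)</font>}{$1}is;

print "$naive\n";
# prints: Here is <font size="2">some text.</font> Cool, huh?
```

The inner closing tag has been removed instead of the outer one, so the inner font element now silently extends over " text.". Making the quantifier greedy only moves the problem: it then fails as soon as two such elements appear side by side.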

jwcarlton · Sep 22, 2010

My (secondary) point was that you have not taken care to make a
good sample.

I don't intend to solve your problems.

I don't normally see any of your posts. I was "slumming" down in
the killfiled score range (I was bored).

You make your reputation, and then you live with it.

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

Reputation? To my knowledge, I have no enemies here.

Methinks you may just be filtering people that use Google Groups. If
so, no skin off my back; others did a grand job answering, so all you
really contributed was unnecessary BS.

sln · Sep 22, 2010

Let's say I have something like this:

$var = "Here is some text. Cool, huh?";

I want to remove and it's matching , but not the nested font tags.

The nesting can be handled via regex recursion (Perl 5.10 and above)
if you can live with an attribute = (?:"[^<]*?"|'[^<]*?') scenario.

It can still be handled if you can't live with "[^<]*?".
This requires a different strategy of evaluating attr/val in the
loop body upon a successful match with the
<(?:font(\s+(?:(?:".*?")|(?:'.*?')|(?:[^>]*?))+)\s*(\/?))>
expression, which is guaranteed not to overrun the next markup.
It simply just stores the position or not.
See the below code.

A change scheme with regex might be faster than a tree since all
that's being done is sparse matching with mild validation parsing.
Depends on what you are willing to live with.
If you take out the debug stuff, it's really not much code.

-sln
----------------
use strict;
use warnings;

## OP:
## "I want to remove and it's
## matching , but not the nested font tags."

##
my $debug = 1; # level: 0, 1 or 2

my $xml=<<EOXML;
<data>
start

Here is some



text


Cool,

huh?

italics


more






end
</data>
EOXML

##
my $attr = 'background';
my $open_attr = q{<font\s+[^>]*?(?<=\s)}.$attr.q{\s*=\s*(?:"[^<]*?"|'[^<]*?')[^>]*?(?<!\/)>};
my $close_attr = q{<font\s+[^>]*?(?<=\s)}.$attr.q{\s*=\s*(?:"[^<]*?"|'[^<]*?')[^>]*?\s*\/>};
my $open = q{<font\s*[^>]*?(?<!\/)>};
my $close = q{<\/font\s*>};

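my $regx = qr/

  (<!(?:\[CDATA\[.*?\]\]|--.*?--|\[[A-Z][A-Z\ ]*\[.*?\]\])>)  #1 CDATA, comment or conditional section
 |
  ($close_attr)                  #2 self-closed font tag carrying the attribute
 |
  (                              #3 one balanced font element (recursion target)
    (?: ($open_attr) | $open )   #4 set only when the opening tag carries the attribute
    (                            #5 element body
      (?:
         (?>
            (?:
               (?:<!(?:\[CDATA\[.*?\]\]|--.*?--|\[[A-Z][A-Z\ ]*\[.*?\]\])>)
             | (?! $open | $close ) .
            )+
         )
       | (?3)
      )*
    )
    ($close)                     #6 the matching closing tag
  )
/xs;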

##
my @cleartag;   # [offset, length] of every tag to delete

while ( $xml =~ /$regx/ig )
{
    if (defined $1) {
        # Comment or CDATA: resume scanning right after it.
        print "---->\$1 = '$1'\n" if $debug > 1;
        pos($xml) = $+[1];
    }
    elsif (defined $2) {
        # Self-closed font tag with the attribute: one span to delete.
        push @cleartag, [$-[2], length $2];
        print "---->\$2 = '$2'\n" if $debug > 1;
        pos($xml) = $+[2];
    }
    else {
        # Balanced font element. Record both tags only when the
        # opening tag carried the attribute.
        if (defined $4) {
            push @cleartag, [$-[4], length $4];
            push @cleartag, [$-[6], length $6];
            print "---->\$4 = '$4'\n" if $debug > 1;
            print "---->\$6 = '$6'\n" if $debug > 1;
        }
        # Rewind to the start of the body so nested elements are scanned too.
        pos($xml) = $-[5];
    }
}

if (@cleartag)
{
    print "\n--- OLD ------------\n$xml\n\n" if $debug;
    # Delete from the highest offset down so earlier offsets stay valid.
    for my $ref ( sort {$b->[0]<=>$a->[0]} @cleartag )
    {
        print "offset= $ref->[0], length= $ref->[1]\n" if $debug > 1;
        substr $xml, $ref->[0], $ref->[1], ($debug > 1 ? '-' x $ref->[1] : "");
    }
    print "\n--- NEW (", (@cleartag/2),") -------\n$xml\n\n" if $debug;
}
else {
    print "No changes made!\n";
}
print "---------\nDone!\n";

__END__

Output:

--- OLD ------------
<data>
start

Here is some



text


Cool,

huh?

italics


more






end
</data>

--- NEW (3.5) -------
<data>
start

Here is some


text


Cool,

huh?

italics


more






end
</data>
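
The deletion step in the code above depends on removing the recorded spans from the highest offset down. A minimal sketch of just that idea, using a made-up string with bracketed markers standing in for the tags:

```perl
use strict;
use warnings;

# Made-up sample: the bracketed markers stand in for tags to delete.
my $text = 'aaa[X]bbb[YY]ccc';

# Record [offset, length] pairs in the order they are found.
# @- and @+ hold the start and end offsets of the last match.
my @spans;
while ($text =~ /\[[A-Z]+\]/g) {
    push @spans, [ $-[0], $+[0] - $-[0] ];
}

# Delete from the highest offset down, so each removal leaves the
# offsets of the spans still to be processed untouched.
for my $span (sort { $b->[0] <=> $a->[0] } @spans) {
    substr($text, $span->[0], $span->[1], '');
}

print "$text\n";   # prints: aaabbbccc
```

Deleting in ascending order instead would shift every later span to the left by the length already removed, so the stored offsets would have to be adjusted after each step.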

Theo van den Heuvel · Sep 23, 2010

The nesting can be handled via regex recursion (Perl 5.10 and above) .... Loads of regex madness
Done!

The OP is strongly recommended to follow the advice that is posted
here every week and use an existing HTML parser instead of doing
something that can be mathematically proven to be impossible except
in fairly trivial cases. Sln's approach only indicates how convoluted
and vulnerable the regex attempts need to be. They can never scale
when requirements are added.

Theo van den Heuvel

sln · Sep 23, 2010

... use an existing HTML parser instead of doing

Uh, that would be xhtml or xml. Allowing un-closed tags requires
an inside-out nesting strategy when stripping out selected tags.
It's doable.

something that can be mathematically proven to be impossible unless
for fairly trivial cases.

Impossible? You obviously tried the code. Work for you?
Let me know if it doesn't.
Nothing trivial about this code. I've just shown how this can be done
with rx recursion. Prove this wrong mathematically!

They can never scale when requirements are added.

They? There is nothing "general" about the code. It's specific.
Do you actually know what it does?
You speak of "parser" but don't understand the language.
This is not parsing anything other than balanced text
using the recursion engine of Perl 5.10.
Take it up with Larry if there is a problem.

Let's just say nesting can be handled quite well.

-sln
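
For readers who have not used the recursion feature being argued about, here is a stripped-down sketch of balanced-text matching with (?1), which needs Perl 5.10 or later. It is not sln's pattern: it ignores attributes, comments and CDATA, and the sample string is invented.

```perl
use strict;
use warnings;
use 5.010;

# Group 1 matches one balanced font element; (?1) re-enters group 1
# wherever a nested font element may appear.
my $balanced = qr{
    (
        <font\b[^>]*>                        # opening tag
        (?:
            (?> (?: (?! </?font\b ) . )+ )   # text that starts no font tag
          |
            (?1)                             # or a nested balanced element
        )*
        </font\s*>                           # the closing tag at this depth
    )
}xis;

# Invented sample with one stray closing tag at the end.
my $sample = 'x <font a="1">one <font>two</font> three</font> y </font>';

if ($sample =~ $balanced) {
    print "$1\n";   # prints: <font a="1">one <font>two</font> three</font>
}
```

The match ends at the closing tag that balances the first opening tag. The inner pair is consumed by the recursive call, and the stray closing tag at the end is left untouched.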

David Canzi · Sep 23, 2010

Perl's regular expressions ceased to be regular expressions long
ago and nobody ever got around to coming up with a different name
for them. Perl's regular expressions have grown into a language
within a language which, if not already capable of emulating a
Turing machine, probably soon will be.

Even when people stick to basic features of regular expressions,
i.e. features consistent with equivalence to deterministic finite
automata, regular expressions are hard to understand. When it
becomes possible to write large and complex expressions in this
language within a language, the result is as hard to understand
as APL or dc. And whenever this expression has to be examined
again to debug it or improve it, it will have to be re-understood.
Every time it needs to be maintained it has to be *solved*.

Today's complex Perl regexp is tomorrow's maintenance headache.

People can impress themselves all they want by being Perl regexp
virtuosi, but they shouldn't do it in public.

sln · Sep 23, 2010

Today's complex Perl regexp is tomorrow's maintenance headache.

Yet, complex patterns are no more easily handled with other
non-regular expression methods.

Is it time to limit the complexities of life then?
Maybe, take out repetition and variance.

People can impress themselves all they want by being Perl regexp
virtuosi, but they shouldn't do it in public.

Why, thank you.

-sln

C.DeRykus · Sep 23, 2010

I think that's overstated and I especially
disagree with the last sentence. This forum
is for discussing/exploring the Perl language -
capabilities, features, even its quirks. While
a real parser is the right answer, Perl regexes
can now match balanced text - a key part of this
particular task. So a demo is on point and could
be useful in some future problem scenario. It's
fair to be impressed with the power of Perl's
regex capabilities.

Of course, there is considerable complexity to
regexes and they can quickly become unwieldy. But
Perl provides and recommends a way to comment them.
They are not appropriate in every case but they
needn't become maintenance headaches.
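
The commenting mechanism referred to is presumably the /x modifier, which lets a pattern contain whitespace and # comments. A small sketch with an invented pattern, written both ways:

```perl
use strict;
use warnings;

# The same pattern twice: once terse, once laid out with /x.
# It pulls the value of a size attribute out of a font tag
# (an invented task, chosen only to have something to comment).
my $terse = qr/<font\s+[^>]*?\bsize\s*=\s*(?:"([^"]*)"|'([^']*)')[^>]*>/i;

my $commented = qr{
    <font \s+                 # tag name, then at least one space
    [^>]*?                    # any earlier attributes, as few as possible
    \b size \s* = \s*         # the attribute being looked for
    (?:
        " ( [^"]* ) "         # group 1: double-quoted value
      |
        ' ( [^']* ) '         # group 2: single-quoted value
    )
    [^>]* >                   # the rest of the tag
}xi;

for my $re ($terse, $commented) {
    if (q{<font color='red' size="3">} =~ $re) {
        my $size = defined $1 ? $1 : $2;
        print "size = $size\n";   # prints "size = 3" for both patterns
    }
}
```

Under /x, whitespace outside a character class is ignored, so a literal space in the pattern has to be written as \s, "\ " or [ ].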

Jürgen Exner · Sep 23, 2010

Amen to that!!!

jue

sln · Sep 26, 2010

Just for fun, here's a not-quite-complete but otherwise correct (modulo
bugs, obviously) implementation of an XML parser as a single regex (I
omitted PIs and DTDs, since they added about as many productions again).
It's not terribly useful as it stands (it just tells you if a given
string contains a valid XML document or not) but it could be made to
build a parse tree fairly easily using (?{}) (subject to the usual
caveats with that construction).

Thanks for this Ben.
I like the way you put this together, following the naming of
the w3c xml 1.1 recommendations document
http://www.w3.org/TR/xml11/#NT-AttValue
There are a few missing things but all in all, it's pretty good.

I tried it out and discovered some issues using recursion on it,
with the xml standard goals.
I'll just note the issues on the sections, then have a full demo
of your regex, modified (below). The focus is on elements and nesting mostly.
Also included is a proof of concept (the small sample code), of element
nesting and recursion.

I haven't checked out everything, but it looks fairly conformant to the
xml 1.1 recommendations. Many test cases would have to be developed.
But, in any case, the Perl 5.10 engine is still lacking, the (?{ code })
is perilous, and I hope to see some improvements in version 6.

How you had the patience to put this together is anybody's guess, but thanks.
Btw, I don't hold out a lot of hope on the speed. I could be wrong, and I will
have to check that.

-sln

------------------------

m(
(?&document)

(?(DEFINE)

# Document

(?<document>
(?&prolog) (?&element) (?&Misc)*

needs to be a check for root content here (see code)

)

# Character sets

(?<Char>
[\x9-\xA\xD\x20-\x7E\x85\xA0-\x{D7FF}] |

[\x9] style doesn't work for my version, [\x{9}] does

[\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]
)
(?<S> [\x20\x9\xD\xA]+ )

[snip Names, Comments]

# Prolog

(?<prolog> (?&XMLDecl) (?&Misc)* )
(?<XMLDecl>
<\?xml (?&VersionInfo) (?&EncodingDecl)? (?&S)? \?>

there needs to be more options in the xml declaration body and

not sure but also there might also be said:
)
(?<Misc> (?&Comment) | (?&S) )
(?<Eq> (?&S)? = (?&S)? )

(?<VersionInfo> (?&S) version (?&Eq) (?: '1\.[10]' | "1\.[10]" ) )

(?<EncodingDecl>
(?&S) encoding (?&Eq) (?: "(?&EncName)" | '(?&EncName)' )
)
(?<EncName> [A-Za-z] (?: [A-Za-z0-9._-] )* )

[snip CDATA]

# Element

(?<element> (?&EmptyElemTag) | (?&STag) (?&content) (?&ETag) )

^^^^
This is a big problem. As it is now, any </tag> end tag can satisfy,
or balance any <tag> start tag. Should there be a *counter* imbalance
later, successful recursive matching will occur (on the whole).
There is a balance, but only of start and end tags; tag names are not matched up.
This matches:
<A> content 

A *better* scheme is to match with a backreference (see code).

The really bad problem is if there is not a successful match in
the code path (?&ETag), the engine will sit in an infinite backtracking loop
recursing the same element. Consider this:
<A> </A> <C> </C>

This can be fixed with backreference or'd with some verbs (*COMMIT)(?{ msg })(*FAIL)
(see code).

(?<STag> < (?&Name) (?: (?&S) (?&Attribute) )* (?&S)? > )
(?<Attribute> (?&Name) (?&Eq) (?&AttValue) )
(?<ETag> </ (?&Name) (?&S)? > )

(?<AttValue>
" (?: [^<&"] | (?&Reference) )* " |
' (?: [^<&'] | (?&Reference) )* '
)

[snip content, empty elements, references]

)xs
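
The tag-name issue noted in the element production can be reproduced with a much smaller grammar in the same (?(DEFINE) ...) style. The sketch below is a cut-down stand-in, not Ben's regex: Name is restricted to ASCII letters, attributes are left out, and the variable name $loose is arbitrary.

```perl
use strict;
use warnings;
use 5.010;

# A toy grammar in the same style: named subpatterns are declared inside
# (?(DEFINE) ...) and invoked with (?&name). Nothing ties the name in the
# end tag to the name in the start tag.
my $loose = qr{
    \A (?&element) \z

    (?(DEFINE)
        (?<element> (?&STag) (?&content) (?&ETag) )
        (?<STag>    <  (?&Name) > )
        (?<ETag>    </ (?&Name) > )
        (?<Name>    [A-Za-z]+ )
        (?<content> (?: [^<]+ | (?&element) )* )
    )
}x;

for my $doc ('<A> content </A>', '<A> content </B>', '<A><B> content </A>') {
    printf "%-22s %s\n", $doc, $doc =~ $loose ? 'matches' : 'no match';
}
```

The first two documents match and the third does not: the grammar counts start and end tags but accepts </B> as the end of <A>, which is the behaviour described in the note on the element production.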

Test of balanced named tags (proof of concept)
----------------------
use strict;
use warnings;

use re 'eval';
##
my $xml=<<EXML;
<aa> textaa
<bb> textbb
<cc> textbb
</cc>
</bb>
end textaa
</aa>
EXML

##
my $regx = qr/
(?:
( #1
# Start tag
< (?<name> [a-z]+ ) >

(?{ print "our name is $+{name}\n" })

# content
(?:
(?> (?: (?! <[a-z]+> | <\/[a-z]+> ) .)+ )
| (?1)
)*

(?{ print "we need $+{name}\n" })

# End tag
<\/ \k<name> >
(?{ print "have $+{name}\n" })
)
(?{print "\nmatched: \n$^N\n\n" })
|
