pattern match problem

Lex · Jun 17, 2004

Hi, I'm stuck with a pattern match thing.

What I actually want a script to do is the following:

look for <pre> and </pre> and erase all the <br> tags that you find within it, no
matter what you find. However: leave the rest! (linebreaks etc.)

But I don't know how to do it properly.

Instead of:

$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

I tried:

$rec{'Text'} =~ s%<pre>((.|\n)*?)<br>((.|\n)*?)</pre>%<pre>$1 $2</pre>%gim;

But that would just erase everything between <pre> and </pre> in the
following example:

 Medische reden WAO-uitkering, in percentages
 <pre>
Turken Marokkanen Nederlanders
 Klachten aan het bewegingsapparaat 36 35 36
 Psychische klachten 23 26 27
 Overig 41 39 37
 </pre>


(still studying 'Programming Perl')

If anybody has a good suggestion...
Thanks for your time reading this.

Lex

Brian McCauley · Jun 17, 2004

Lex said:
Hi, I'm stuck with a pattern match thing.

What I actually want a script to do is the following:

look for <pre> and </pre> and erase all the <br> tags that you find within it, no
matter what you find. However: leave the rest! (linebreaks etc.)

But I don't know how to do it properly.

Use an HTML parser module.

I tried doing this:

$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

Do not attempt to process HTML using just regex - it simply isn't
worth the effort.


Gunnar Hjalmarsson · Jun 18, 2004

Brian said:
Use an HTML parser module.

Do not attempt to process HTML using just regex - it simply isn't
worth the effort.

That's too categorical, IMO. This problem appears to be rather limited,
and under certain conditions, the OP's need may well be served through
something like this:

$rec{'Text'} =~ s{(<pre.*?>.+?</pre>)}{
(my $rest = $1) =~ s/<br.*?>//gis;
$rest
}egis;

To the OP: Please study the FAQ:

perldoc -q "remove HTML"

and consider whether using the s/// operator like above would be
'safe' enough for your case.
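
For anyone who wants to try the nested substitution above in isolation, here is a small self-contained sketch. It rests on a few assumptions: the helper name strip_br_in_pre and the sample text are made up for illustration, and the pattern uses [^>]* rather than .*? inside the tags so that a tag match cannot run past its own closing ">". It is no more robust against arbitrary HTML than the original.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): removes <br> tags that sit
# inside <pre>...</pre> blocks and leaves every other <br> alone.
sub strip_br_in_pre {
    my ($text) = @_;

    # Outer s///e finds each <pre ...>...</pre> block (non-greedy; /s lets
    # "." match newlines). The replacement part is Perl code: it strips
    # the <br> tags from a copy of the block and returns that copy.
    $text =~ s{(<pre\b[^>]*>.*?</pre>)}{
        (my $block = $1) =~ s/<br\b[^>]*>//gi;
        $block;
    }egis;

    return $text;
}

# Made-up sample input with <br> both outside and inside a <pre> block.
my $sample = "Title<br>\n<pre class=\"t\">\nrow one<br>\nrow two<BR />\n</pre>\nFooter<br>\n";
print strip_br_in_pre($sample);
```

The <br> tags after "Title" and "Footer" should survive, while the two inside the <pre> block are removed in a single pass over the string.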

Matt Garrish · Jun 18, 2004

Michal Wojciechowski said:
[...]

look for <pre> and </pre> and erase all the <br> tags that you find
within it, no matter what you find. However: leave the rest!
(linebreaks etc.)
[...]

$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;


The above would work, if it could match overlapping occurrences. One
solution is to use it in a loop, like:

while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}

Two quick things: you want foreach not while, and pre and break tags can
include style definitions etc., so best to check for <br[^>]*>.

foreach (s!<pre[^>]*>(.*?)<br[^>]*>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}

I'd give my vote to Gunnar's method, though, as you could wind up doing many
passes over the file this way before you clear them all out (though what
<br> tags are doing inside <pre> tags eludes me at the moment).

Matt

Joe Smith · Jun 18, 2004

Matt said:
[...]

look for <pre> and </pre> and erase all the <br> tags that you find
within it, no matter what you find. However: leave the rest!
(linebreaks etc.)
[...]

$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;


The above would work, if it could match overlapping occurrences. One
solution is to use it in a loop, like:

while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}


Two quick things: you want foreach not while, and pre and break tags can
include style definitions etc., so best to check for <br[^>]*>.

No, foreach() will remove only the first <br>, not all of them.

The code below prints partial results so that you can see the
loop's actions.

unix% cat temp.pl
$string = "<pre>foo<br>bar<br>baz<br>xyzzy<br>quux</pre>";

$_ = $string;
while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) { print "Part:$_\n";}
print "End while(): $_\n";

$_ = $string;
print "Part:$_\n" while s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig;
print "End 1 while: $_\n";

$_ = $string;
foreach (s!<pre[^>]*>(.*?)<br[^>]*>(.*?)</pre>!<pre>$1 $2</pre>!sig) { print "Part:$_\n";}
print "End foreach: $_\n";

unix% perl temp.pl
Part:<pre>foo bar<br>baz<br>xyzzy<br>quux</pre>
Part:<pre>foo bar baz<br>xyzzy<br>quux</pre>
Part:<pre>foo bar baz xyzzy<br>quux</pre>
Part:<pre>foo bar baz xyzzy quux</pre>
End while(): <pre>foo bar baz xyzzy quux</pre>
Part:<pre>foo bar<br>baz<br>xyzzy<br>quux</pre>
Part:<pre>foo bar baz<br>xyzzy<br>quux</pre>
Part:<pre>foo bar baz xyzzy<br>quux</pre>
Part:<pre>foo bar baz xyzzy quux</pre>
End 1 while: <pre>foo bar baz xyzzy quux</pre>
Part:1
End foreach: <pre>foo bar<br>baz<br>xyzzy<br>quux</pre>

-Joe
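
The repeated-pass idea can also be written with a statement-modifier while that counts its passes. The following is only a sketch under assumptions: the variable names and the test string are made up, and the pattern differs from the ones quoted in the thread in that it uses a negative lookahead so a match cannot run past the end of a <pre> block.

```perl
use strict;
use warnings;

# Made-up test string modelled on the thread's example.
my $text = "<pre>foo<br>bar<br>baz<br>xyzzy<br>quux</pre>";

# Each pass replaces the first remaining <br> of every <pre> block with a
# space. (?:(?!</pre>).)*? stops the match at the end of the block, so
# <br> tags outside <pre> are never touched. The loop ends as soon as
# s/// makes no substitution and therefore returns a false value.
my $passes = 0;
$passes++ while $text =~ s{(<pre\b[^>]*>(?:(?!</pre>).)*?)<br\b[^>]*>}{$1 }gis;

print "passes: $passes\n";   # passes: 4
print "result: $text\n";     # result: <pre>foo bar baz xyzzy quux</pre>
```

Because every pass rescans the string from the start, the number of passes grows with the largest number of <br> tags in any one block. A single s///e pass with an inner substitution, as Gunnar suggested, avoids that.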

Lex · Jun 18, 2004

This problem appears to be rather limited,
and under certain conditions, the OP's need may well be served through
something like this:

$rec{'Text'} =~ s{(<pre.*?>.+?</pre>)}{
(my $rest = $1) =~ s/<br.*?>//gis;
$rest
}egis;

To the OP: Please study the FAQ:

perldoc -q "remove HTML"

and consider whether using the s/// operator like above would be
'safe' enough for your case.

Thanks a lot Gunnar!
It works like a charm.
It would be safe enough for my case, as there is nothing more than <br> tags
within the <pre> and </pre> tags. I've got control over that; it's not
parsing just any HTML file, you see. Just pieces of text from a database.

Lex

Gunnar Hjalmarsson · Jun 18, 2004

Lex said:
Thanks a lot Gunnar!
It works like a charm.
Good.

It would be safe enough for my case, as there is nothing more than
<br> tags within the <pre> and </pre> tags. I've got control over
that; it's not parsing just any HTML file, you see. Just pieces of
text from a database.

That's what I suspected.

Matt Garrish · Jun 18, 2004

Joe Smith said:
Matt said:

[...]

look for <pre> and </pre> and erase all the <br> tags that you find
within it, no matter what you find. However: leave the rest!
(linebreaks etc.)

[...]

$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

The above would work, if it could match overlapping occurrences. One
solution is to use it in a loop, like:

while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}


Two quick things: you want foreach not while, and pre and break tags can
include style definitions etc., so best to check for <br[^>]*>.


No, foreach() will remove only the first <br>, not all of them.

Ugh, that was just bad on my part, especially since I knew he wanted
multiple passes to clear them out. I ran it with tags before
modifying the expression, which is why it looked like it wasn't working (I
was just going to make mention of the html formatting, because there's even
less of a point in using a regex if you aren't going to make sure you
capture as many oddities as you can).

Matt

Lex · Jun 18, 2004

Matt Garrish said:
Ugh, that was just bad on my part, especially since I knew he wanted
multiple passes to clear them out. I ran it with tags before
modifying the expression, which is why it looked like it wasn't working (I
was just going to make mention of the html formatting, because there's even
less of a point in using a regex if you aren't going to make sure you
capture as many oddities as you can).

Well, I know for sure they'll be just <br> and nothing else; they're put
there by my script earlier, replacing \n in plain text, you see...

But I've got it working, thanks to you all.

Lex
