pattern match problem

Lex

Hi, I'm stuck with a pattern match thing.

What I actually want a script to do is the following:

Look for <pre> and </pre> and erase every <br> found between them, no
matter what else is there. However: leave the rest alone (line breaks etc.)!

But I don't know how to do it properly.

Instead of this:

--------------------------------------------------------------------------------
Code
--------------------------------------------------------------------------------

$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

--------------------------------------------------------------------------------


I tried:
--------------------------------------------------------------------------------
Code
--------------------------------------------------------------------------------

$rec{'Text'} =~ s%<pre>((.|\n)*?)<br>((.|\n)*?)</pre>%<pre>$1 $2</pre>%gim;

--------------------------------------------------------------------------------


But that just erases everything between <pre> and </pre> in the following
example:
--------------------------------------------------------------------------------
Code
--------------------------------------------------------------------------------

<br><b>Medische reden WAO-uitkering, in percentages</b>
<br><pre>
Turken Marokkanen Nederlanders
<br>Klachten aan het bewegingsapparaat 36 35 36
<br>Psychische klachten 23 26 27
<br>Overig 41 39 37
<br></pre>

--------------------------------------------------------------------------------
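
One likely reason the second attempt loses text: capture groups are numbered by their opening parenthesis, so each `((.|\n)*?)` contributes two groups. The text after the <br> then lands in $3, while $2 holds only the last character matched before the <br>. A minimal sketch of that numbering, using made-up sample text and hypothetical sub names:

```perl
use strict;
use warnings;

# The second attempt as posted: $2 is the inner (.|\n) group of the first
# part (its last matched character), not the text that follows <br>.
sub second_attempt {
    my ($text) = @_;
    $text =~ s%<pre>((.|\n)*?)<br>((.|\n)*?)</pre>%<pre>$1 $2</pre>%gim;
    return $text;
}

# Same pattern, but the replacement uses $3, the outer group after <br>.
sub with_group_three {
    my ($text) = @_;
    $text =~ s%<pre>((.|\n)*?)<br>((.|\n)*?)</pre>%<pre>$1 $3</pre>%gim;
    return $text;
}

my $sample = "<pre>a\n<br>b\n<br>c</pre>";
print second_attempt($sample),   "\n";   # everything after the first <br> is gone
print with_group_three($sample), "\n";   # text kept, but only the first <br> removed
```

Even with $3, a single pass removes only the first <br> in each block, because the match consumes the whole <pre>...</pre> span.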


(Still studying 'Programming Perl'.)

If anybody has a good suggestion...
Thanks for your time reading this.

Lex
 
Brian McCauley

Lex said:
Hi, I'm stuck with a pattern match thing.

What I actually want a script to do is the following:

look for <pre> and </pre> and erase all the <br> that you find within it, no
matter what you find. However: leave the rest! ( linebreaks etc.)

But I don't know how to do it properly.

Use an HTML parser module.

Lex said:
I tried doing this:

$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

Do not attempt to process HTML using just regex - it simply isn't
worth the effort.

 
Gunnar Hjalmarsson

Brian said:
Use an HTML parser module.


Do not attempt to process HTML using just regex - it simply isn't
worth the effort.

That's too categorical IMO. This problem appears to be rather limited,
and under certain conditions, the OP's need may well be served through
something like this:

$rec{'Text'} =~ s{(<pre.*?>.+?</pre>)}{
    (my $rest = $1) =~ s/<br.*?>//gis;
    $rest
}egis;

To the OP: Please study the FAQ:

perldoc -q "remove HTML"

and consider whether using the s/// operator like above would be
'safe' enough for your case.
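
A sketch of the same nested-substitution idea wrapped in a sub and run on a shortened version of the sample from the first post. The sub name is made up for illustration, and using `[^>]*` instead of `.*?` inside the tags is just one possible tightening, not something the approach requires:

```perl
use strict;
use warnings;

# Hypothetical helper: remove <br ...> tags, but only inside <pre> blocks.
sub strip_br_in_pre {
    my ($text) = @_;
    # Outer s///e matches one whole <pre ...>...</pre> block at a time;
    # the replacement is Perl code that strips the <br> tags from a copy.
    $text =~ s{(<pre\b[^>]*>.*?</pre>)}{
        (my $block = $1) =~ s/<br\b[^>]*>//gi;
        $block
    }egis;
    return $text;
}

my $html = "<br><b>Medische reden</b>\n<br><pre>\nTurken Marokkanen\n"
         . "<br>Overig 41 39 37\n<br></pre>\n<br>tail";
print strip_br_in_pre($html), "\n";
```

The <br> tags outside the <pre> block and all newlines are left untouched; only the two <br> tags inside the block disappear.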
 
Matt Garrish

Michal Wojciechowski said:
[...]
look for <pre> and </pre> and erase all the <br> that you find
within it, no matter what you find. However: leave the rest! (
linebreaks etc.)
[...]

$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

The above would work, if it could match overlapping occurrences. One
solution is to use it in a loop, like:

while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}

Two quick things: you want foreach not while, and pre and break tags can
include style definitions etc., so best to check for <br[^>]*>.

foreach (s!<pre[^>]*>(.*?)<br[^>]*>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}

I'd give my vote to Gunnar's method, though, as you could wind up doing many
passes over the file this way before you clear them all out (though what
<br> tags are doing inside <pre> tags eludes me at the moment).

Matt
 
Joe Smith

Matt said:
[...]

look for <pre> and </pre> and erase all the <br> that you find
within it, no matter what you find. However: leave the rest! (
linebreaks etc.)
[...]


$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

The above would work, if it could match overlapping occurrences. One
solution is to use it in a loop, like:

while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}

Two quick things: you want foreach not while, and pre and break tags can
include style definitions etc., so best to check for <br[^>]*>.

No, foreach() will remove only the first <br>, not all of them.

The code below prints partial results so that you can see the
loop's actions.

unix% cat temp.pl
$string = "<pre>foo<br>bar<br>baz<br>xyzzy<br>quux</pre>";

$_ = $string;
while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) { print "Part:$_\n";}
print "End while(): $_\n";

$_ = $string;
print "Part:$_\n" while s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig;
print "End 1 while: $_\n";

$_ = $string;
foreach (s!<pre[^>]*>(.*?)<br[^>]*>(.*?)</pre>!<pre>$1 $2</pre>!sig) { print "Part:$_\n";}
print "End foreach: $_\n";

unix% perl temp.pl
Part:<pre>foo bar<br>baz<br>xyzzy<br>quux</pre>
Part:<pre>foo bar baz<br>xyzzy<br>quux</pre>
Part:<pre>foo bar baz xyzzy<br>quux</pre>
Part:<pre>foo bar baz xyzzy quux</pre>
End while(): <pre>foo bar baz xyzzy quux</pre>
Part:<pre>foo bar<br>baz<br>xyzzy<br>quux</pre>
Part:<pre>foo bar baz<br>xyzzy<br>quux</pre>
Part:<pre>foo bar baz xyzzy<br>quux</pre>
Part:<pre>foo bar baz xyzzy quux</pre>
End 1 while: <pre>foo bar baz xyzzy quux</pre>
Part:1
End foreach: <pre>foo bar<br>baz<br>xyzzy<br>quux</pre>
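
The "Part:1" line follows from what s/// returns: the number of substitutions made, or a false value when nothing matched. foreach evaluates the substitution once and then loops over that single count, while the while form re-runs it until it stops matching. A small sketch of that, with a hypothetical helper name and a shortened test string:

```perl
use strict;
use warnings;

# Hypothetical helper: repeat the substitution until it no longer matches.
# Returns the cleaned text and the number of passes that changed it.
sub strip_by_looping {
    my ($text) = @_;
    my $passes = 0;
    $passes++ while $text =~ s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig;
    return ($text, $passes);
}

my $string = "<pre>foo<br>bar<br>baz</pre>";

# One s///g call: the first match consumes the whole <pre> block, so
# exactly one substitution happens and the return value is 1.
my $once  = $string;
my $count = ($once =~ s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig);
print "count=$count once=$once\n";

# Looping until the substitution fails clears the remaining <br> as well.
my ($clean, $passes) = strip_by_looping($string);
print "passes=$passes clean=$clean\n";
```

Each pass rescans the block from the start, so the number of passes grows with the number of <br> tags in a block.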

-Joe
 
Lex

Gunnar Hjalmarsson said:
This problem appears to be rather limited,
and under certain conditions, the OP's need may well be served through
something like this:

$rec{'Text'} =~ s{(<pre.*?>.+?</pre>)}{
    (my $rest = $1) =~ s/<br.*?>//gis;
    $rest
}egis;

To the OP: Please study the FAQ:

perldoc -q "remove HTML"

and consider whether using the s/// operator like above would be
'safe' enough for your case.

Thanks a lot Gunnar!
It works like a charm.
It would be safe enough for my case as there is nothing more than <br> tags
within the <pre> and </pre> tags. I've got control over that, it's not
parsing just any html file you see. Just pieces of text from a database.

Lex
 
Gunnar Hjalmarsson

Lex said:
Thanks a lot Gunnar!
It works like a charm.

Good.

Lex said:
It would be safe enough for my case as there is nothing more than
<br> tags within the <pre> and </pre> tags. I've got control over
that, it's not parsing just any html file you see. Just pieces of
text from a database.

That's what I suspected.
 
Matt Garrish

Joe Smith said:
Matt said:
[...]


look for <pre> and </pre> and erase all the <br> that you find
within it, no matter what you find. However: leave the rest! (
linebreaks etc.)

[...]


$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

The above would work, if it could match overlapping occurrences. One
solution is to use it in a loop, like:

while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}

Two quick things: you want foreach not while, and pre and break tags can
include style definitions etc., so best to check for <br[^>]*>.

No, foreach() will remove only the first <br>, not all of them.

Ugh, that was just bad on my part, especially since I knew he wanted
multiple passes to clear them out. I ran it with <br /> tags before
modifying the expression, which is why it looked like it wasn't working (I
was just going to make mention of the html formatting, because there's even
less of a point in using a regex if you aren't going to make sure you
capture as many oddities as you can).

Matt
 
Lex

Matt Garrish said:
Ugh, that was just bad on my part, especially since I knew he wanted
multiple passes to clear them out. I ran it with <br /> tags before
modifying the expression, which is why it looked like it wasn't working (I
was just going to make mention of the html formatting, because there's even
less of a point in using a regex if you aren't going to make sure you
capture as many oddities as you can).

Well, I know for sure they'll be just <br> and nothing else; they're put
there by my script earlier, replacing \n in plain text, you see...

But I've got it working thanks to you all.

Lex
 
