pattern match problem

Discussion in 'Perl Misc' started by Lex, Jun 17, 2004.

  1. Lex

    Lex Guest

    Hi, I'm stuck with a pattern match thing.

    What I actually want a script to do is the following:

    Look for <pre> and </pre> and erase all the <br> that you find within it, no
    matter what you find. However: leave the rest (line breaks etc.)!

    But I don't know how to do it properly.

    Instead of:

    --------------------------------------------------------------------------------
    Code
    --------------------------------------------------------------------------------

    $rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

    --------------------------------------------------------------------------------


    I tried:
    --------------------------------------------------------------------------------
    Code
    --------------------------------------------------------------------------------

    $rec{'Text'} =~ s%<pre>((.|\n)*?)<br>((.|\n)*?)</pre>%<pre>$1 $2</pre>%gim;

    --------------------------------------------------------------------------------


    But that would just erase everything between <pre> and </pre> in the next
    example:
    --------------------------------------------------------------------------------
    Code
    --------------------------------------------------------------------------------

    <br><b>Medische reden WAO-uitkering, in percentages</b>
    <br><pre>
    Turken Marokkanen Nederlanders
    <br>Klachten aan het bewegingsapparaat 36 35 36
    <br>Psychische klachten 23 26 27
    <br>Overig 41 39 37
    <br></pre>

    --------------------------------------------------------------------------------


    (still studying 'programming perl')

    If anybody has a good suggestion...
    Thanks for your time reading this.

    Lex
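
A likely reason the second attempt loses text: capture groups are numbered by their opening parenthesis, so in that pattern the inner (.|\n) is $2 and the text after <br> lands in $3. The sketch below uses a made-up two-line sample (not the WAO table) to contrast that numbering with the /s modifier, which lets . match newlines; /m only changes where ^ and $ match. Either way, one substitution still removes just one <br> per <pre> block.

```perl
use strict;
use warnings;

# Hypothetical sample, shortened for illustration.
my $text = "<pre>\nline one\n<br>line two\n</pre>";

# Second attempt from the post: $2 is the inner (.|\n) group, i.e. only the
# last character matched before <br>; the text after <br> sits in $3.
(my $attempt = $text)
    =~ s%<pre>((.|\n)*?)<br>((.|\n)*?)</pre>%<pre>$1 $2</pre>%gim;

# With /s the dot matches newlines, so two plain groups are enough.
(my $with_s = $text)
    =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gis;

print "attempt: $attempt\n";
print "with /s: $with_s\n";
```

In the first result "line two" is gone; in the second it survives.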
     
    Lex, Jun 17, 2004
    #1

  2. "Lex" <> writes:

    > Hi, I'm stuck with a pattern match thing.
    >
    > What I actually want a script to do is the following:
    >
    > look for <pre> and </pre> and erase all the <br> that you find within it, no
    > matter what you find. However: leave the rest! ( linebreaks etc.)
    >
    > But I don't know how to do it properly.


    Use an HTML parser module.

    > I tried doing this:
    >
    > $rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;


    Do not attempt to process HTML using just regex - it simply isn't
    worth the effort.
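
One way this advice could be followed, sketched with HTML::TokeParser from the CPAN HTML-Parser distribution. That choice is an assumption: nobody in the thread names a specific module, and it is not part of core Perl. The helper name strip_br_in_pre is made up for the sketch. It copies every token through unchanged except <br> start tags seen inside a <pre> element.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module (HTML-Parser distribution), assumed installed

# Hypothetical helper: rebuild the markup token by token, dropping <br>
# start tags while inside <pre>.
sub strip_br_in_pre {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot set up parser";
    my ($out, $in_pre) = ('', 0);

    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {          # start tag: [S, tag, attr, attrseq, text]
            my ($tag, $raw) = @{$token}[1, 4];
            $in_pre++ if $tag eq 'pre';
            next if $in_pre && $tag eq 'br';
            $out .= $raw;
        }
        elsif ($type eq 'E') {       # end tag: [E, tag, text]
            $in_pre-- if $token->[1] eq 'pre' && $in_pre;
            $out .= $token->[2];
        }
        elsif ($type eq 'PI') {      # processing instruction: [PI, token0, text]
            $out .= $token->[2];
        }
        else {                       # T, C and D tokens keep their text at index 1
            $out .= $token->[1];
        }
    }
    return $out;
}

print strip_br_in_pre("<p>a<br>b</p><pre>x<br>y<br />z</pre>"), "\n";
```

The parser lowercases tag names and copes with attributes and <br />, which is the main thing it buys over a regex here.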

    --
    \\ ( )
    . _\\__[oo
    .__/ \\ /\@
    . l___\\
    # ll l\\
    ###LL LL\\
     
    Brian McCauley, Jun 17, 2004
    #2

  3. Brian McCauley wrote:
    > "Lex" <> writes:
    >> What I actually want a script to do is the following:
    >>
    >> look for <pre> and </pre> and erase all the <br> that you find
    >> within it, no matter what you find. However: leave the rest! (
    >> linebreaks etc.)
    >>
    >> But I don't know how to do it properly.

    >
    > Use an HTML parser module.
    >
    >> I tried doing this:
    >>
    >> $rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

    >
    > Do not attempt to process HTML using just regex - it simply isn't
    > worth the effort.


    That's too categorical IMO. This problem appears to be rather limited,
    and under certain conditions, the OP's need may well be served by
    something like this:

    $rec{'Text'} =~ s{(<pre.*?>.+?</pre>)}{
    (my $rest = $1) =~ s/<br.*?>//gis;
    $rest
    }egis;

    To the OP: Please study the FAQ:

    perldoc -q "remove HTML"

    and consider whether using the s/// operator like above would be
    'safe' enough for your case.
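
For illustration, here is the substitution above applied to the sample table from the original post. The /e modifier makes Perl evaluate the replacement as code, so the inner s/// only ever sees the text of one <pre> block. The sub name and the $sample variable are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
# The outer s///e captures one whole <pre> block in $1; the replacement
# code copies it, strips the <br> tags from the copy and returns the copy.
sub strip_br_in_pre {
    my ($text) = @_;
    $text =~ s{(<pre.*?>.+?</pre>)}{
        (my $rest = $1) =~ s/<br.*?>//gis;
        $rest
    }egis;
    return $text;
}

my $sample = join "\n",
    '<br><b>Medische reden WAO-uitkering, in percentages</b>',
    '<br><pre>',
    'Turken Marokkanen Nederlanders',
    '<br>Klachten aan het bewegingsapparaat 36 35 36',
    '<br>Psychische klachten 23 26 27',
    '<br>Overig 41 39 37',
    '<br></pre>';

print strip_br_in_pre($sample), "\n";
```

The two <br> tags before <pre> are left alone, and the newlines inside the block are kept.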

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jun 18, 2004
    #3
  4. Matt Garrish

    Matt Garrish Guest

    "Michal Wojciechowski" <> wrote in message
    news:...
    > "Lex" <> writes:
    >
    > [...]
    >
    > > look for <pre> and </pre> and erase all the <br> that you find
    > > within it, no matter what you find. However: leave the rest! (
    > > linebreaks etc.)

    >
    > [...]
    >
    > > $rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

    >
    > The above would work, if it could match overlapping occurrences. One
    > solution is to use it in a loop, like:
    >
    > while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}
    >


    Two quick things: you want foreach not while, and pre and break tags can
    include style definitions etc., so best to check for <br[^>]*>.

    foreach (s!<pre[^>]*>(.*?)<br[^>]*>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}

    I'd give my vote to Gunnar's method, though, as you could wind up doing many
    passes over the file this way before you clear them all out (though what
    <br> tags are doing inside <pre> tags eludes me at the moment).
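
To see what the [^>]* part buys, the sketch below checks a few made-up tag variants against the plain and the attribute-tolerant pattern. The third pattern, with \b, is an addition that does not come from the thread; it keeps a hypothetical tag such as <break> from matching.

```perl
use strict;
use warnings;

my %pattern = (
    plain    => qr/<br>/i,
    tolerant => qr/<br[^>]*>/i,      # allows attributes and "<br />"
    bounded  => qr/<br\b[^>]*>/i,    # same, but "br" must be the whole tag name
);

# True if the whole string is matched by the given pattern.
sub matches {
    my ($re, $tag) = @_;
    return $tag =~ /\A$re\z/ ? 1 : 0;
}

# Made-up inputs for illustration.
for my $tag ('<br>', '<BR>', '<br />', '<br clear="all">', '<break>') {
    my @hits = grep { matches($pattern{$_}, $tag) } sort keys %pattern;
    printf "%-18s matched by: %s\n", $tag, join(', ', @hits);
}
```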

    Matt
     
    Matt Garrish, Jun 18, 2004
    #4
  5. Joe Smith

    Joe Smith Guest

    Matt Garrish wrote:

    > "Michal Wojciechowski" <> wrote in message
    > news:...
    >
    >>"Lex" <> writes:
    >>
    >>[...]
    >>
    >>
    >>>look for <pre> and </pre> and erase all the <br> that you find
    >>>within it, no matter what you find. However: leave the rest! (
    >>>linebreaks etc.)

    >>
    >>[...]
    >>
    >>
    >>>$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;

    >>
    >>The above would work, if it could match overlapping occurrences. One
    >>solution is to use it in a loop, like:
    >>
    >> while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}

    >
    > Two quick things: you want foreach not while, and pre and break tags can
    > include style definitions etc., so best to check for <br[^>]*>.


    No, foreach() will remove only the first <br>, not all of them.

    The code below prints partial results so that you can see the
    loop's actions.

    unix% cat temp.pl
    $string = "<pre>foo<br>bar<br>baz<br>xyzzy<br>quux</pre>";

    $_ = $string;
    while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) { print "Part:$_\n";}
    print "End while(): $_\n";

    $_ = $string;
    print "Part:$_\n" while s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig;
    print "End 1 while: $_\n";

    $_ = $string;
    foreach (s!<pre[^>]*>(.*?)<br[^>]*>(.*?)</pre>!<pre>$1 $2</pre>!sig) { print "Part:$_\n";}
    print "End foreach: $_\n";

    unix% perl temp.pl
    Part:<pre>foo bar<br>baz<br>xyzzy<br>quux</pre>
    Part:<pre>foo bar baz<br>xyzzy<br>quux</pre>
    Part:<pre>foo bar baz xyzzy<br>quux</pre>
    Part:<pre>foo bar baz xyzzy quux</pre>
    End while(): <pre>foo bar baz xyzzy quux</pre>
    Part:<pre>foo bar<br>baz<br>xyzzy<br>quux</pre>
    Part:<pre>foo bar baz<br>xyzzy<br>quux</pre>
    Part:<pre>foo bar baz xyzzy<br>quux</pre>
    Part:<pre>foo bar baz xyzzy quux</pre>
    End 1 while: <pre>foo bar baz xyzzy quux</pre>
    Part:1
    End foreach: <pre>foo bar<br>baz<br>xyzzy<br>quux</pre>

    -Joe
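
The difference comes from what s/// returns: the number of substitutions made, or a false value when nothing matched. while re-evaluates that on every pass, whereas foreach evaluates the substitution once and then loops over its single return value. A small sketch, reusing the string above, that counts the passes and compares the result with a single-pass nested substitution in the style of Gunnar's (the variable names are made up):

```perl
use strict;
use warnings;

my $string = "<pre>foo<br>bar<br>baz<br>xyzzy<br>quux</pre>";

# Looping approach: each pass removes one <br> and rescans from the start.
my $looped = $string;
my $passes = 0;
$passes++ while $looped =~ s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig;

# Nested approach: one outer pass; the inner s///g handles every <br>
# in the captured block.
my $nested = $string;
my $blocks = $nested =~ s{(<pre>.*?</pre>)}{
    my $block = $1;
    $block =~ s/<br>/ /gi;
    $block;
}egis;

print "looped, $passes passes: $looped\n";
print "nested, $blocks block:  $nested\n";
```

Both end with the same string; the loop needs one pass per <br>, which is the cost Matt mentions.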
     
    Joe Smith, Jun 18, 2004
    #5
  6. Lex

    Lex Guest

    "Gunnar Hjalmarsson" <> wrote in message
    news:...

    > This problem appears to be rather limited,
    > and under certain conditions, the OP's need may well be served through
    > something like this:
    >
    > $rec{'Text'} =~ s{(<pre.*?>.+?</pre>)}{
    > (my $rest = $1) =~ s/<br.*?>//gis;
    > $rest
    > }egis;
    >
    > To the OP: Please study the FAQ:
    >
    > perldoc -q "remove HTML"
    >
    > and consider whether using the s/// operator like above would be
    > 'safe' enough for your case.
    >


    Thanks a lot Gunnar!
    It works like a charm.
    It would be safe enough for my case as there is nothing more than <br> tags
    within the <pre> and </pre> tags. I've got control over that, it's not
    parsing just any html file you see. Just pieces of text from a database.

    Lex
     
    Lex, Jun 18, 2004
    #6
  7. Lex wrote:
    > Gunnar Hjalmarsson wrote:
    >>
    >> $rec{'Text'} =~ s{(<pre.*?>.+?</pre>)}{
    >> (my $rest = $1) =~ s/<br.*?>//gis;
    >> $rest
    >> }egis;

    >
    > Thanks a lot Gunnar!
    > It works like a charm.


    Good.

    > It would be safe enough for my case as there is nothing more than
    > <br> tags within the <pre> and </pre> tags. I've got control over
    > that, it's not parsing just any html file you see. Just pieces of
    > text from a database.


    That's what I suspected.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jun 18, 2004
    #7
  8. Matt Garrish

    Matt Garrish Guest

    "Joe Smith" <> wrote in message
    news:v1xAc.62123$eu.27793@attbi_s02...
    > Matt Garrish wrote:
    >
    > > "Michal Wojciechowski" <> wrote in message
    > > news:...
    > >
    > >>"Lex" <> writes:
    > >>
    > >>[...]
    > >>
    > >>
    > >>>look for <pre> and </pre> and erase all the <br> that you find
    > >>>within it, no matter what you find. However: leave the rest! (
    > >>>linebreaks etc.)
    > >>
    > >>[...]
    > >>
    > >>
    > >>>$rec{'Text'} =~ s%<pre>(.*?)<br>(.*?)</pre>%<pre>$1 $2</pre>%gim;
    > >>
    > >>The above would work, if it could match overlapping occurrences. One
    > >>solution is to use it in a loop, like:
    > >>
    > >> while (s!<pre>(.*?)<br>(.*?)</pre>!<pre>$1 $2</pre>!sig) {}

    > >
    > > Two quick things: you want foreach not while, and pre and break tags can
    > > include style definitions etc., so best to check for <br[^>]*>.

    >
    > No, foreach() will remove only the first <br>, not all of them.
    >


    Ugh, that was just bad on my part, especially since I knew he wanted
    multiple passes to clear them out. I ran it with <br /> tags before
    modifying the expression, which is why it looked like it wasn't working (I
    was just going to make mention of the html formatting, because there's even
    less of a point in using a regex if you aren't going to make sure you
    capture as many oddities as you can).

    Matt
     
    Matt Garrish, Jun 18, 2004
    #8
  9. Lex

    Lex Guest

    "Matt Garrish" <> wrote in message
    news:F2CAc.39052$...
    >
    >
    > Ugh, that was just bad on my part, especially since I knew he wanted
    > multiple passes to clear them out. I ran it with <br /> tags before
    > modifying the expression, which is why it looked like it wasn't working (I
    > was just going to make mention of the html formatting, because there's even
    > less of a point in using a regex if you aren't going to make sure you
    > capture as many oddities as you can).
    >

    Well, I know for sure they'll be just <br> and nothing else, they're put
    there by my script earlier, replacing \n in plain text you see...

    But I've got it working thanks to you all.

    Lex
     
    Lex, Jun 18, 2004
    #9
