<p>(.*)</p> Doesn't Work

Discussion in 'Perl Misc' started by Howard Best, Jun 14, 2006.

  1. Howard Best

    Howard Best Guest

    When trying to match HTML paragraphs using Perl:

    1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
    paragraph is on more than one line?

    2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
    the first <p and the last </p> in the file.

    2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
    there's a <b>...</b>, etc. within the paragraph?

    What is the solution?
     
    Howard Best, Jun 14, 2006
    #1

  2. Brian Wakem

    Brian Wakem Guest

    Howard Best wrote:

    > When trying to match HTML paragraphs using Perl:
    >
    > 1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
    > paragraph is on more than one line?
    >
    > 2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
    > the first <p and the last </p> in the file.



    Put a ? after the *


    > 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
    > there's a <b>...</b>, etc. within the paragraph?



    You can't have 2 number 2's!


    --
    Brian Wakem
    Email: http://homepage.ntlworld.com/b.wakem/myemail.png
     
    Brian Wakem, Jun 14, 2006
    #2

  3. Paul Lalli

    Paul Lalli Guest

    Howard Best wrote:
    > When trying to match HTML paragraphs using Perl:


    ... you should be using a module specifically designed for HTML
    parsing, like, for example, HTML::Parser.

    Regular expressions are simply not up to the task.

    >
    > 1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
    > paragraph is on more than one line?


    Then you'd have to either put *all* the text into $buffer, or set up
    markers as you're going through all the lines - one to find the opening
    <p>, one to find the closing </p>.
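
    The marker approach could be sketched roughly like this. It is only an
    illustration: the sample lines and the names @lines, $in_para and
    $current are made up here, and it collects whole lines, so it assumes
    a paragraph starts and ends on line boundaries.

```perl
use strict;
use warnings;

# Hypothetical sample input, already split into lines.
my @lines = (
    "<h1>Title</h1>\n",
    "<p>First paragraph,\n",
    "spread over two lines.</p>\n",
    "<p>Second one.</p>\n",
);

my @paragraphs;     # completed paragraphs
my $in_para = 0;    # marker: are we between <p and </p>?
my $current = '';   # text collected for the paragraph that is open

for my $line (@lines) {
    # Opening marker: a <p tag seen while no paragraph is open.
    $in_para = 1 if !$in_para && $line =~ /<p\b/i;
    next unless $in_para;

    $current .= $line;

    # Closing marker: store the paragraph and reset the state.
    if ($line =~ m{</p>}i) {
        push @paragraphs, $current;
        ($in_para, $current) = (0, '');
    }
}

print "[$_]" for @paragraphs;
```

    Slurping the whole file into one string is usually simpler; the flag
    only earns its keep when the input is too big to hold in memory.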

    Btw, what do you think the above is doing? You're saying to find the
    text between <p and </p>, and to replace the whole match with that text
    wrapped in <p> and </p>. Since $1 starts right after "<p", it begins
    with the ">" of the opening tag, so for a plain paragraph that would
    produce:
    <p>>This is my paragraph</p>
    What is the point of such a thing?

    > 2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
    > the first <p and the last </p> in the file.


    No, it matches the first <p, and the () capture EVERYTHING that it can
    and still allow the pattern to succeed, because you told the pattern to
    be greedy. If you want it to be non-greedy, add a ? after the *
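
    A small sketch of the difference, using a made-up two-paragraph string
    ($html, $greedy and @lazy are names chosen for this example). The m{}
    delimiters are used only so the / in </p> needs no escaping.

```perl
use strict;
use warnings;

my $html = "<p>one</p>\n<p>two <b>bold</b></p>\n";

# Greedy: .* grabs as much as it can, then backs off to the LAST </p>.
my ($greedy) = $html =~ m{<p>(.*)</p>}s;

# Non-greedy: .*? stops at the FIRST </p> that lets the match succeed,
# so with /g each paragraph is captured separately.
my @lazy = $html =~ m{<p>(.*?)</p>}sg;

print "greedy: [$greedy]\n";
print "lazy:   [$_]\n" for @lazy;
```

    Note that the non-greedy version copes with the nested <b>...</b>,
    but it would still be fooled by things like a </p> inside a comment.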

    > 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
    > there's a <b>...</b>, etc. within the paragraph?


    Yup. Don't do that.

    >
    > What is the solution?


    To use a module that is made for parsing HTML, like HTML::Parser.

    Paul Lalli
     
    Paul Lalli, Jun 14, 2006
    #3
  4. Guest

    Guest


    > When trying to match HTML paragraphs using Perl:


    I was just doing the same thing..

    Note: I'm using the output of Win32::IE::Mechanize, and it reorders the
    original HTML, so I'd suggest always printing the variable before you
    =~ it (thanks for the tips, Bart & Gleixner!)


    > 1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
    > paragraph is on more than one line?
    >


    Use the match modifier s.

    > 2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
    > the first <p and the last </p> in the file.


    .* matching is greedy by default. There's afaik a switch to ungreedify
    it.


    > 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
    > there's a <b>...</b>, etc. within the paragraph?


    To get just the para you could try other things such as HTML
    Treebuilder. Works well, but memory hungry.

    > What is the solution?


    42, of course ;-)

    Here's what I used in a similar situation:

    print "\n ==\n\tContent of VV page: $content\n\n";
    $content =~ m/navbar(.*)<\/TABLE><BR>/ism;
    print "I think tbl is approx:\n $1\n";
    $tbl=$1;
    my @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/img;
    $infostr = join "\n", @info_to_keep[1 .. $#info_to_keep];
    print "Found Valid Values:\n$infostr \nSkipped Value: $info_to_keep[0]\n\n";


    In the code above, rather than find the 'exact' html table, I opted for
    'pseudo-semantic' (ie, unique) strings to cut the search space down.

    I am looking for rows within an html table. So first I =~ out an
    approximate chunk of text containing the table (without bothering about
    precise start and end tags).

    s is for matching .* across \n's -- note that by default it doesn't.
    g matches multiple times, and the result is returned in list context.

    m is for multi-line matching, not sure if s is necessary when m is
    present.
     
    , Jun 14, 2006
    #4
  5. Howard Best

    Howard Best Guest

    Brian Wakem wrote:

    > Put a ? after the *


    Thanks, Brian. That did it! Here's a portion of the code that I used to
    test it:

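    open(my $in, '<', $filename) or die "Can't open \"$filename\": $!.\n";
    my $buffer = do { local $/; <$in> };   # slurp the whole file into one string
    close($in);
    # Cut the first remaining paragraph out of $buffer on each pass and print it.
    # OUT is an output filehandle opened elsewhere in the program.
    while ($buffer =~ s/(<p.*?<\/p>)//s)
    {
        print OUT "\n*****************\n$1\n*****************\n";
    }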

    > > 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
    > > there's a <b>...</b>, etc. within the paragraph?

    >
    >
    > You can't have 2 number 2's!


    Sorry about that. It's that ol' senility kicking in!
     
    Howard Best, Jun 14, 2006
    #5
  6. Howard Best

    Howard Best Guest

    Paul Lalli wrote:
    > ... you should be using a module specifically designed for HTML
    > parsing, like, for example, HTML::Parser.


    Thanks, Paul. I'll check it out.

    Howard
     
    Howard Best, Jun 14, 2006
    #6
  7. Tad McClellan

    Tad McClellan Guest

    <> wrote:
    >
    >> When trying to match HTML paragraphs using Perl:

    >
    > I was just doing the same thing..




    > $content =~ m/navbar(.*)<\/TABLE><BR>/ism;



    m//m affects the meaning of ^ and $; it is useless when
    your pattern does not use those anchors.


    > my @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/img;



    There is a module specifically for prying the data out of HTML tables:

    use HTML::TableExtract;


    > s is for matching .* across \n's



    Actually, m//s makes dot match a newline (whether the dot is asterisked or not).

    > g matches multiple times, and the result is returned in list context.



    The "g" modifier has absolutely no connection with the context that
    the m// operator is in!

    It is the assignment (=) that puts the m// in list context, not
    the "g" modifier.
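
    To illustrate, here is a sketch with a made-up row of cells ($row and
    the other variable names are invented for this example). The same
    pattern is used in list context with and without "g", and in scalar
    context with "g":

```perl
use strict;
use warnings;

my $row = "<TD>a</TD><TD>b</TD><TD>c</TD>";

# List assignment gives list context; with /g every match's capture is returned.
my @all = $row =~ m{<TD>(.*?)</TD>}g;

# Still list context, but without /g only the first match's capture is returned.
my ($first) = $row =~ m{<TD>(.*?)</TD>};

# Scalar context (the while condition) with /g: one match per evaluation,
# each starting where the previous one left off.
my @stepwise;
while ($row =~ m{<TD>(.*?)</TD>}g) {
    push @stepwise, $1;
}

print "all:      @all\n";
print "first:    $first\n";
print "stepwise: @stepwise\n";
```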


    > m is for multi-line matching, not sure if s is necessary when m is
    > present.



    They do different things, so the presence of one has nothing
    to do with the other.

    If you want dot to match a newline use "s".

    If you want ^ and $ to match "lines" rather than "strings", use "m".
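
    A quick sketch showing the two modifiers independently, on a made-up
    two-line string ($text and the result variables are names chosen for
    this example):

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /s the dot will not cross the newline; with /s it will.
my $dot_plain = ($text =~ /first.*second/)  ? 1 : 0;
my $dot_s     = ($text =~ /first.*second/s) ? 1 : 0;

# Without /m the ^ anchors only at the start of the string;
# with /m it also anchors right after each embedded newline.
my $caret_plain = ($text =~ /^second/)  ? 1 : 0;
my $caret_m     = ($text =~ /^second/m) ? 1 : 0;

print "dot:   plain=$dot_plain s=$dot_s\n";
print "caret: plain=$caret_plain m=$caret_m\n";
```

    The first pair changes only when "s" is added and the second pair only
    when "m" is added, which is why neither modifier can stand in for the
    other.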


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Jun 14, 2006
    #7