<p>(.*)</p> Doesn't Work

Howard Best

When trying to match HTML paragraphs using Perl:

1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
paragraph is on more than one line?

2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
the first <p and the last </p> in the file.

2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
there's a <b>...</b>, etc. within the paragraph?

What is the solution?
 
Brian Wakem

Howard said:
When trying to match HTML paragraphs using Perl:

1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
paragraph is on more than one line?

2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
the first <p and the last </p> in the file.


Put a ? after the *

2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
there's a <b>...</b>, etc. within the paragraph?


You can't have 2 number 2's!
 
Paul Lalli

Howard said:
When trying to match HTML paragraphs using Perl:

... you should be using a module specifically designed for HTML
parsing, like, for example, HTML::Parser.

Regular expressions are simply not up to the task.

1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
paragraph is on more than one line?

Then you'd have to either put *all* the text into $buffer, or set up
markers as you're going through all the lines - one to find the opening
<p>, one to find the closing </p>.
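
The "markers" approach could look roughly like the sketch below. It is only an illustration: the sample lines are made up, and it assumes that each line holds at most one opening <p and one closing </p>.

```perl
use strict;
use warnings;

# Hypothetical input: the lines of an HTML file, as reading a filehandle
# in list context would return them.
my @lines = (
    "<p>First paragraph,\n",
    "continued on a second line.</p>\n",
    "<hr>\n",
    "<p>Second paragraph.</p>\n",
);

my @paragraphs;   # completed paragraphs
my $current;      # defined only while we are between <p and </p>

for my $line (@lines) {
    # Marker 1: an opening <p starts a new paragraph.
    $current = '' if !defined $current && $line =~ /<p\b/i;
    next unless defined $current;

    $current .= $line;

    # Marker 2: a closing </p> ends it.
    if ($line =~ m{</p>}i) {
        push @paragraphs, $current;
        undef $current;
    }
}

print "---\n$_" for @paragraphs;
```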

Btw, what do you think the above is doing? You're saying to find all
instances of text between <p and </p>, and to add <p> and </p> tags
around it? So that would produce:
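
For example (a sketch with a made-up sample string, not taken from the post), everything after <p lands in $1, so the rest of the opening tag ends up outside the new <p>:

```perl
use strict;
use warnings;

# Made-up sample input. The {} delimiters avoid having to escape the / in </p>.
my $buffer = '<p class="intro">Hello</p>';
$buffer =~ s{<p(.*)</p>}{<p>$1</p>}g;

print "$buffer\n";    # prints: <p> class="intro">Hello</p>
```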
2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
the first <p and the last </p> in the file.

No, it matches the first <p, and the () capture EVERYTHING that it can
and still allow the pattern to succeed, because you told the pattern to
be greedy. If you want it to be non-greedy, add a ? after the *
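
A small sketch of the difference, using a made-up two-paragraph string (the variable names are illustrative only):

```perl
use strict;
use warnings;

# Made-up sample: two paragraphs; the first spans two lines and contains <b>.
my $html = "<p>One <b>bold</b>\nword.</p>\n<p>Two.</p>\n";

# Greedy: .* runs to the LAST </p> in the string, so only one match is found.
my @greedy = $html =~ m{<p>(.*)</p>}sg;

# Non-greedy: .*? stops at the FIRST </p> it can, giving one match per paragraph.
my @lazy = $html =~ m{<p>(.*?)</p>}sg;

print scalar(@greedy), " greedy match(es)\n";       # 1
print scalar(@lazy),   " non-greedy match(es)\n";   # 2
print "[$_]\n" for @lazy;
```

The non-greedy version also copes with the <b>...</b> inside the first paragraph, but it is still only pattern matching on text, not HTML parsing.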
2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
there's a <b>...</b>, etc. within the paragraph?

Yup. Don't do that.

What is the solution?

To use a module that is made for parsing HTML, like HTML::Parser.
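
A minimal sketch of what that might look like with HTML::Parser (a CPAN module, not part of core Perl), using its version 3 handler interface. The sample HTML and the variable names are made up; the handlers simply collect the text found inside each <p> element, whatever inline tags it contains.

```perl
use strict;
use warnings;
use HTML::Parser;

# Made-up sample input.
my $html = "<p>First <b>bold</b>\nparagraph.</p>\n<p>Second one.</p>\n";

my @paragraphs;      # collected text of each <p> element
my $in_p = 0;        # true while we are inside <p> ... </p>

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for every start tag; tag names arrive in lower case.
    start_h => [ sub {
        my ($tag) = @_;
        if ($tag eq 'p') { $in_p = 1; push @paragraphs, ''; }
    }, 'tagname' ],
    # Called for every end tag.
    end_h => [ sub {
        my ($tag) = @_;
        $in_p = 0 if $tag eq 'p';
    }, 'tagname' ],
    # Called for text, with entities decoded; may arrive in several chunks.
    text_h => [ sub {
        my ($text) = @_;
        $paragraphs[-1] .= $text if $in_p;
    }, 'dtext' ],
);

$parser->parse($html);
$parser->eof;

print "[$_]\n" for @paragraphs;
```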

Paul Lalli
 
nsb_tsd

When trying to match HTML paragraphs using Perl:

I was just doing the same thing..

Note: I'm using the output of Win32::IE::Mechanize, and it reorders the
original HTML, so I'd suggest always printing the variable before you
=~ it (thanks for the tips, Bart & Gleixner!)

1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
paragraph is on more than one line?

Use the match modifier s.

2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
the first <p and the last </p> in the file.

.* matching is greedy by default. There's afaik a switch to ungreedify
it.

2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
there's a <b>...</b>, etc. within the paragraph?

To get just the para you could try other things such as
HTML::TreeBuilder. Works well, but memory hungry.

What is the solution?

42, of course ;-)

Here's what I used in a similar situation:

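print "\n ==\n\tContent of VV page: $content\n\n";
# Grab an approximate chunk of text containing the table.
$content =~ m/navbar(.*)<\/TABLE><BR>/ism
    or die "Could not find the table chunk\n";
my $tbl = $1;
print "I think tbl is approx:\n $tbl\n";
# Collect the contents of every <TD> cell in that chunk.
my @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/img;
my $infostr = join "\n", @info_to_keep[1 .. $#info_to_keep];
print "Found Valid Values:\n$infostr \nSkipped Value: $info_to_keep[0]\n\n";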


In the code above, rather than find the 'exact' html table, I opted for
'pseudo-semantic' (ie, unique) strings to cut the search space down.

I am looking for rows within an html table. So first I =~ out an
approximate chunk of text containing the table (without bothering about
precise start and end tags).

s is for matching .* across \n's -- note that by default it doesn't.
g matches multiple times, and the result is returned in list context.

m is for multi-line matching, not sure if s is necessary when m is
present.
 
Howard Best

Brian said:
Put a ? after the *

Thanks, Brian. That did it! Here's a portion of the code that I used to
test it:

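open(my $in, '<', $filename) or die "Can't open \"$filename\": $!.\n";
my $buffer = do { local $/; <$in> };   # slurp the whole file into one string
close($in);
# Remove and print one paragraph at a time (OUT is opened elsewhere).
while ($buffer =~ s/(<p.*?<\/p>)//s)
{
print OUT "\n*****************\n$1\n*****************\n";
}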
2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
there's a <b>...</b>, etc. within the paragraph?


You can't have 2 number 2's!

Sorry about that. It's that ol' senility kicking in!
 
Howard Best

Paul said:
... you should be using a module specifically designed for HTML
parsing, like, for example, HTML::Parser.

Thanks, Paul. I'll check it out.

Howard
 
Tad McClellan

I was just doing the same thing..


$content =~ m/navbar(.*)<\/TABLE><BR>/ism;


m//m affects the meaning of ^ and $; it is useless when
your pattern does not use those anchors.

my @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/img;


There is a module specifically for prying the data out of HTML tables:

use HTML::TableExtract;
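
A rough sketch of how that module (from CPAN, not core Perl) might be used to pull out the cells. The sample table is made up, and the constructor is called without constraints on the assumption that every table in the input is wanted:

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up sample table.
my $html = <<'HTML';
<TABLE>
<TR><TD>skipped</TD><TD>alpha</TD></TR>
<TR><TD>beta</TD><TD>gamma</TD></TR>
</TABLE>
HTML

my $te = HTML::TableExtract->new;   # no constraints: consider every table
$te->parse($html);

my @cells;
for my $table ($te->tables) {
    for my $row ($table->rows) {    # each row is an array reference
        push @cells, @$row;
    }
}

print join("\n", @cells), "\n";
```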

s is for matching .* across \n's


Actually, m//s makes dot match a newline (whether the dot is asterisked or not).

g matches multiple times, and the result is returned in list context.


The "g" modifier has absolutely no connection with the context that
the m// operator is in!

It is the assignment (=) that puts the m// in list context, not
the "g" modifier.
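
A short sketch of how context and "g" combine (the sample string and variable names are made up):

```perl
use strict;
use warnings;

my $str = "a1 b22 c333";

# List context WITHOUT /g: returns the captures of the first match only.
my ($first) = $str =~ /(\d+)/;

# List context WITH /g: returns the captures of every match.
my @all = $str =~ /(\d+)/g;

# Scalar context WITH /g: each evaluation finds the next match.
my @seen;
while ($str =~ /(\d+)/g) {
    push @seen, $1;
}

print "first: $first\n";   # first: 1
print "all:   @all\n";     # all:   1 22 333
print "seen:  @seen\n";    # seen:  1 22 333
```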

m is for multi-line matching, not sure if s is necessary when m is
present.


They do different things, so the presence of one has nothing
to do with the other.

If you want dot to match a newline use "s".

If you want ^ and $ to match "lines" rather than "strings", use "m".
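
The two modifiers can be seen side by side in a small sketch (the sample text is made up):

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# /s: dot also matches a newline.
my $dot_plain = $text =~ /first.*second/  ? 1 : 0;   # 0: dot stops at \n
my $dot_s     = $text =~ /first.*second/s ? 1 : 0;   # 1

# /m: ^ and $ match at the start and end of each line, not just the string.
my $anchor_plain = $text =~ /^second/  ? 1 : 0;      # 0
my $anchor_m     = $text =~ /^second/m ? 1 : 0;      # 1

print "dot: $dot_plain $dot_s, anchor: $anchor_plain $anchor_m\n";
```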
 
