<p>(.*)</p> Doesn't Work

Howard Best · Jun 14, 2006

When trying to match HTML paragraphs using Perl:

1. $buffer=~s/<p(.*)/$1/g; doesn't work because what if the
paragraph is on more than one line?

2. $buffer=~s/<p(.*)/$1/sg doesn't work because it matches
the first in the file.

2. $buffer=~s/<p([^<]*)/$1/sg doesn't work because what if
there's a ..., etc. within the paragraph?

What is the solution?

Brian Wakem · Jun 14, 2006

Howard said:
When trying to match HTML paragraphs using Perl:

1. $buffer=~s/<p(.*)/$1/g; doesn't work because what if the
paragraph is on more than one line?

2. $buffer=~s/<p(.*)/$1/sg doesn't work because it matches
the first in the file.

Put a ? after the *

2. $buffer=~s/<p([^<]*)/$1/sg doesn't work because what if
there's a ..., etc. within the paragraph?

You can't have 2 number 2's!

Paul Lalli · Jun 14, 2006

Howard said:
When trying to match HTML paragraphs using Perl:

.... you should be using a module specifically designed for HTML
parsing, like, for example, HTML::Parser.

Regular expressions are simply not up to the task.

1. $buffer=~s/<p(.*)/$1/g; doesn't work because what if the
paragraph is on more than one line?

Then you'd have to either put *all* the text into $buffer, or set up
markers as you're going through all the lines - one to find the opening
<p>, one to find the closing </p>.
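
As a rough illustration of the first option (getting all the text into $buffer in one go), here is a minimal sketch. The sample data and the in-memory filehandle are assumptions made so the snippet runs on its own; with a real file you would open the filename instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of an HTML file.
my $data = "<p>one\ntwo</p>\n<p>three</p>\n";

# Open a read handle on the string itself so the example needs no file on disk.
open my $fh, '<', \$data or die "Can't open in-memory handle: $!";

# With $/ (the input record separator) undefined, <$fh> returns
# everything up to end of file as a single string.
my $buffer = do { local $/; <$fh> };
close $fh;

print length($buffer), " characters read\n";
```

Once the whole document is in one scalar, a pattern with the s modifier can see paragraphs that span several lines.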

Btw, what do you think the above is doing? You're saying to find all
instances of text between , and to add and tags
around it? So that would produce:

2. $buffer=~s/<p(.*)/$1/sg doesn't work because it matches
the first in the file.

No, it matches the first <p, and the () capture EVERYTHING that it can
and still allow the pattern to succeed, because you told the pattern to
be greedy. If you want it to be non-greedy, add a ? after the *
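
To make the greedy versus non-greedy difference concrete, here is a small sketch. The sample string is made up for illustration and is not the original poster's data.

```perl
use strict;
use warnings;

# Hypothetical sample input: two paragraphs on separate lines.
my $html = "<p>first</p>\n<p>second</p>";

# Greedy: .* grabs as much as it can and gives back only enough for
# the final </p> to match, so one capture spans both paragraphs.
my ($greedy) = $html =~ m{<p>(.*)</p>}s;

# Non-greedy: .*? stops at the earliest </p> that lets the match succeed.
my ($lazy) = $html =~ m{<p>(.*?)</p>}s;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy capture runs from "first" through "second", while the non-greedy one holds only "first".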

2. $buffer=~s/<p([^<]*)/$1/sg doesn't work because what if
there's a ..., etc. within the paragraph?

Yup. Don't do that.

What is the solution?

To use a module that is made for parsing HTML, like HTML::Parser.
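
For anyone who wants a starting point, here is one possible sketch of collecting paragraph text with HTML::Parser (version 3 API, installed from CPAN). The sample HTML, the variable names, and the approach of tracking an "inside a paragraph" flag are assumptions for illustration, not something taken from the posts above.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample input: a nested tag and a line break inside a paragraph.
my $html = "<p>one <b>bold</b>\nword</p><div>skip</div><p class=\"x\">two</p>";

my @paragraphs;   # text of each paragraph, in document order
my $in_p = 0;     # true while the parser is between <p> and </p>

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start tags: open a new paragraph slot when we see <p ...>.
    start_h => [ sub {
        my ($tag) = @_;
        if ($tag eq 'p') { $in_p = 1; push @paragraphs, ''; }
    }, 'tagname' ],
    # End tags: leave paragraph mode on </p>.
    end_h => [ sub {
        my ($tag) = @_;
        $in_p = 0 if $tag eq 'p';
    }, 'tagname' ],
    # Text: append entity-decoded text to the current paragraph.
    text_h => [ sub {
        my ($text) = @_;
        $paragraphs[-1] .= $text if $in_p;
    }, 'dtext' ],
);

$parser->parse($html);
$parser->eof;

print "[$_]\n" for @paragraphs;
```

Note that this sketch relies on explicit </p> tags. HTML::Parser reports the tags it sees and does not infer omitted end tags, so documents that leave </p> out would need extra handling.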

Paul Lalli

nsb_tsd · Jun 14, 2006

When trying to match HTML paragraphs using Perl:

I was just doing the same thing..

Note: I'm using the output of Win32::IE::Mechanize, and it reorders the
original HTML, so I'd suggest always printing the variable before you
=~ it (thanks for the tips, Bart & Gleixner!)

1. $buffer=~s/<p(.*)/$1/g; doesn't work because what if the
paragraph is on more than one line?

Use the match modifier s.

2. $buffer=~s/<p(.*)/$1/sg doesn't work because it matches
the first in the file.

.* matching is greedy by default. There's afaik a switch to ungreedify
it.

2. $buffer=~s/<p([^<]*)/$1/sg doesn't work because what if
there's a ..., etc. within the paragraph?

To get just the para you could try other things such as HTML
Treebuilder. Works well, but memory hungry.

What is the solution?

42, of course ;-)

Here's what I used in a similar situation:

print "\n ==\n\tContent of VV page: $content\n\n";
$content =~ m/navbar(.*)<\/TABLE> /ism;
print "I think tbl is approx:\n $1\n";
$tbl=$1;
my @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/img;
$infostr = join "\n", @info_to_keep[1 .. $#info_to_keep];
print "Found Valid Values:\n$infostr \nSkipped Value: $info_to_keep[0]\n\n";

In the code above, rather than find the 'exact' html table, I opted for
'pseudo-semantic' (ie, unique) strings to cut the search space down.

I am looking for rows within an html table. So first I =~ out an
approximate chunk of text containing the table (without bothering about
precise start and end tags).

s is for matching .* across \n's -- note that by default it doesn't.
g matches multiple times, and the result is returned in list context.

m is for multi-line matching, not sure if s is necessary when m is
present.

Howard Best · Jun 14, 2006

Brian said:
Put a ? after the *

Thanks, Brian. That did it! Here's a portion of the code that I used to
test it:

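open(IN,$filename) or die "Can't open \"$filename\": $!.\n";
@buffer=<IN>;
close(IN);
$buffer=join('',@buffer);
# Remove one <p...>...</p> block per pass: .*? is non-greedy, and the
# s modifier lets . match newlines inside a paragraph.
while($buffer=~s/(<p.*?<\/p>)//s)
{
    print OUT "\n*****************\n$1\n*****************\n";
}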

2. $buffer=~s/<p([^<]*)/$1/sg doesn't work because what if
there's a ..., etc. within the paragraph?

You can't have 2 number 2's!

Sorry about that. It's that ol' senility kicking in!

Howard Best · Jun 14, 2006

Paul said:
... you should be using a module specifically designed for HTML
parsing, like, for example, HTML::Parser.

Thanks, Paul. I'll check it out.

Howard

Tad McClellan · Jun 14, 2006

I was just doing the same thing..

$content =~ m/navbar(.*)<\/TABLE> /ism;

m//m affects the meaning of ^ and $, it is useless when
your pattern does not use those anchors.

my @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/img;

There is a module specifically for prying the data out of HTML tables:

use HTML::TableExtract;

s is for matching .* across \n's

Actually, m//s makes dot match a newline (whether the dot is asterisked or not).

g matches multiple times, and the result is returned in list context.

The "g" modifier has absolutely no connection with the context that
the m// operator is in!

It is the assignment (=) that puts the m// in list context, not
the "g" modifier.
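
A short sketch of how context and "g" interact, using a made-up table row rather than the poster's page:

```perl
use strict;
use warnings;

# Hypothetical sample row.
my $row = "<TD>a</TD><TD>b</TD><TD>c</TD>";

# List context (from the list assignment) with g: every capture at once.
my @all = $row =~ m/<TD>(.*?)<\/TD>/ig;

# List context without g: only the captures of the first match.
my @first = $row =~ m/<TD>(.*?)<\/TD>/i;

# Scalar context (the while condition) with g: one match per
# evaluation, each resuming where the previous one ended.
my @stepwise;
while ($row =~ m/<TD>(.*?)<\/TD>/ig) {
    push @stepwise, $1;
}

print "all:      @all\n";
print "first:    @first\n";
print "stepwise: @stepwise\n";
```

The same pattern with the same "g" modifier behaves differently in the first and third cases purely because of the context it is evaluated in.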

m is for multi-line matching, not sure if s is necessary when m is
present.

They do different things, so the presence of one has nothing
to do with the other.

If you want dot to match a newline use "s".

If you want ^ and $ to match "lines" rather than "strings", use "m".
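
To see the two modifiers side by side, here is a minimal sketch on a made-up three-line string:

```perl
use strict;
use warnings;

# Hypothetical three-line string.
my $text = "alpha\nbeta\ngamma";

# "s" changes what dot matches: without it, dot stops at a newline.
my ($no_s)   = $text =~ /(a.*)/;    # captures only the first line
my ($with_s) = $text =~ /(a.*)/s;   # captures the whole string

# "m" changes where ^ matches: without it, only at the start of the
# string; with it, also right after each embedded newline.
my @no_m   = $text =~ /^(\w+)/g;    # one word
my @with_m = $text =~ /^(\w+)/mg;   # one word per line

print "no s:   [$no_s]\n";
print "with s: [$with_s]\n";
print "no m:   @no_m\n";
print "with m: @with_m\n";
```

Neither pattern's result changes if the other modifier is added, which is the sense in which the two are independent.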
