When trying to match HTML paragraphs using Perl:
I was just doing the same thing..
Note: I'm using the output of Win32::IE::Mechanize, and it reorders the
original HTML, so I'd suggest always printing the variable before you
=~ it (thanks for the tips, Bart & Gleixner!)
1. $buffer=~s/<p(.*)<\/p>/<p>$1<\/p>/g; doesn't work because what if the
paragraph is on more than one line?
Use the /s match modifier, which lets . match newlines too.
2. $buffer=~s/<p(.*)<\/p>/<p>$1<\/p>/sg doesn't work because it matches
the first <p and the last </p> in the file.
.* matching is greedy by default. Append ? to the quantifier (.*?) to
make it non-greedy.
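Here's a small sketch of the greedy vs. non-greedy difference. The sample string and the variable names are made up for illustration; only the .* vs .*? behaviour is the point.

```perl
use strict;
use warnings;

# Hypothetical sample input: two paragraphs, the first spanning two lines.
my $buffer = qq{<p class="a">first\nparagraph</p>\n<p>second</p>\n};

# Greedy: .* runs on to the LAST </p>, so both paragraphs end up in one match.
my @greedy = $buffer =~ m{<p\b[^>]*>(.*)</p>}sg;

# Non-greedy: .*? stops at the first </p> that lets the match succeed.
my @lazy = $buffer =~ m{<p\b[^>]*>(.*?)</p>}sg;

print scalar(@greedy), " greedy match(es)\n";
print scalar(@lazy),   " non-greedy match(es)\n";
```

Using m{...} instead of m/.../ avoids having to escape the / in </p>.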
3. $buffer=~s/<p([^<]*)<\/p>/<p>$1<\/p>/sg doesn't work because what if
there's a <b>...</b>, etc. within the paragraph?
To get just the paragraph you could try other tools such as
HTML::TreeBuilder. It works well, but it is memory-hungry.
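For what it's worth, a minimal sketch of the parser route. It assumes HTML::TreeBuilder is installed from CPAN (it is not a core module); the sample HTML is made up.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module (HTML-Tree distribution), not core Perl

# Hypothetical sample input with inline markup inside a paragraph.
my $html = <<'HTML';
<html><body>
<p class="x">First <b>bold</b>
paragraph.</p>
<p>Second paragraph.</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context, look_down returns every matching element in document order.
my @paras = $tree->look_down( _tag => 'p' );

# Collect the text of each paragraph with the tags stripped.
my @texts = map { $_->as_text } @paras;

print "PARA: $_\n" for @texts;

$tree->delete;    # release the tree explicitly; it can be large
```

Nested tags like <b> are handled by the parser, so none of the three regex problems above come up.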
42, of course ;-)
Here's what I used in a similar situation:
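print "\n ==\n\tContent of VV page: $content\n\n";
# Cut out an approximate chunk between two landmark strings;
# stop if the landmarks are missing rather than reusing a stale $1.
my ($tbl) = $content =~ m/navbar(.*)<\/TABLE><BR>/ism
    or die "Could not find the table chunk\n";
print "I think tbl is approx:\n $tbl\n";
# In list context, /g returns the capture from every match.
my @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/img;
my $infostr = join "\n", @info_to_keep[1 .. $#info_to_keep];
print "Found Valid Values:\n$infostr \nSkipped Value: $info_to_keep[0]\n\n";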
In the code above, rather than find the 'exact' html table, I opted for
'pseudo-semantic' (ie, unique) strings to cut the search space down.
I am looking for rows within an html table. So first I =~ out an
approximate chunk of text containing the table (without bothering about
precise start and end tags).
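A self-contained sketch of that two-stage idea, in case anyone wants to run it. The page content is invented; only the "navbar" and "</TABLE><BR>" landmarks come from my real case.

```perl
use strict;
use warnings;

# Hypothetical page content standing in for the real one.
my $content = <<'HTML';
<DIV id="navbar">menu</DIV>
<TABLE>
<TR><TD>Header</TD></TR>
<TR><TD>alpha</TD></TR>
<TR><TD>beta</TD></TR>
</TABLE><BR>
<P>footer</P>
HTML

my @info_to_keep;

# Stage 1: grab the approximate chunk between the two landmark strings.
if ( $content =~ m/navbar(.*?)<\/TABLE><BR>/is ) {
    my $tbl = $1;

    # Stage 2: pull every cell out of that smaller chunk.
    @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/isg;
}
else {
    warn "landmark strings not found\n";
}

# The first cell is a header here, so it is split off from the values.
my ( $skipped, @values ) = @info_to_keep;
my $infostr = join "\n", @values;

print "Found Valid Values:\n$infostr\n";
print "Skipped Value: $skipped\n" if defined $skipped;
```

Checking that the first match succeeded matters: if it fails, $1 still holds whatever an earlier successful match left there.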
s makes . match \n as well, so .* can span lines -- by default it can't.
g matches repeatedly; in list context it returns the captures from every
match.
m only changes ^ and $ so that they match at line boundaries. It does not
affect ., so s is still needed when m is present.
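A quick way to check what each modifier does on its own. The test string and variable names are just for illustration.

```perl
use strict;
use warnings;

my $text = "one\ntwo";

# "." and newlines: only /s lets .* cross the line break.
my $dot_plain = $text =~ /one.*two/  ? 1 : 0;
my $dot_m     = $text =~ /one.*two/m ? 1 : 0;
my $dot_s     = $text =~ /one.*two/s ? 1 : 0;

# Anchors: only /m lets ^ and $ match at the inner line boundary.
my $anchor_plain = $text =~ /^two$/  ? 1 : 0;
my $anchor_m     = $text =~ /^two$/m ? 1 : 0;

# /g in list context returns every match.
my @words = $text =~ /(\w+)/g;

print "dot:    plain=$dot_plain m=$dot_m s=$dot_s\n";
print "anchor: plain=$anchor_plain m=$anchor_m\n";
print "words:  @words\n";
```

So the two modifiers are independent, and a pattern can need either one, both, or neither.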