question on processing HTML with a regex

cuneyt

Hi,

I would like to process an HTML file in the form

<tr>
row1
</tr>
<tr>
row2
</tr>

The snippet I wrote is

while( $HtmlSource =~ m{<tr>(.*)</tr>}sg ) {
my $r = $1;
print "found: $r\n";
}

But when I run this I get
found:
row1
</tr>
<tr>
row2

How can I modify the regex so that it is not so greedy and pulls one
<tr></tr> pair at a time?

Thanks a lot

Cuneyt
 
Paul Lalli

cuneyt said:
Hi,

I would like to process an HTML file in the form

<tr>
row1
</tr>
<tr>
row2
</tr>

The snippet I wrote is

while( $HtmlSource =~ m{<tr>(.*)</tr>}sg ) {
my $r = $1;
print "found: $r\n";
}

You should not, in general, be processing HTML with regular
expressions. You should be using an HTML parser. There are many
available on CPAN. I recommend HTML::TokeParser.

But when I run this I get
found:
row1
</tr>
<tr>
row2

How can I modify the regex so that it is not so greedy and pulls one
<tr></tr> pair at a time?

You need to read a decent tutorial on regular expressions, as this is a
very basic question. The answer is that you need a ? after the * (that
is, .*? instead of .*), which makes the match non-greedy, but if you
didn't already know that, you *really* need to read:
perldoc perlretut

Hope this helps,
Paul Lalli
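
For readers following along, here is a minimal sketch of the non-greedy fix described above. The sample string and the variable names ($html_source, $row, @rows) are assumptions made for illustration; only the added ? comes from the answer itself. It still shares the usual weaknesses of regex-based HTML handling (attributes on the tag, nested tables, comments), so treat it as a demonstration of greediness rather than a robust solution.

```perl
use strict;
use warnings;

# Sample input in the same shape as the question (assumed for illustration).
my $html_source = "<tr>\nrow1\n</tr>\n<tr>\nrow2\n</tr>\n";

my @rows;

# .*? is the non-greedy form of .* and stops at the first </tr> it can reach.
# /s lets . match newlines; /g resumes after the previous match each time.
while ( $html_source =~ m{<tr>(.*?)</tr>}sg ) {
    my $row = $1;
    push @rows, $row;
    print "found: $row\n";
}
```

Each captured row still contains the newlines that surround the text inside the tags, so trim it if only the text is wanted.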
 
John Bokma

cuneyt said:
while( $HtmlSource =~ m{<tr>(.*)</tr>}sg ) {

I recommend not using CamelCase in Perl; use underscores instead, e.g.
$html_source.

my $r = $1;

What is $r? You might be asking yourself that question after a month or
so.

print "found: $r\n";
}

As for using a regexp, it's way too fragile in too many cases. Have a
look at HTML::TreeBuilder:

use strict;
use warnings;

use HTML::TreeBuilder;

# $html_source is assumed to already hold the HTML text
my $tree = HTML::TreeBuilder->new_from_content( $html_source );
my @tr_elements = $tree->look_down( _tag => 'tr' );
for my $tr_element ( @tr_elements ) {
    # ...
}


http://johnbokma.com/perl/

has several HTML::TreeBuilder examples.
 
cuneyt

I started to use HTML::TokeParser and what a relief! It makes parsing a
breeze even for a regex newbie like me.

Thanks for all the comments

Cuneyt
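
The poster did not share the final code, so the following is only a sketch of what an HTML::TokeParser version might look like for the input in the question. The sample string and the names $parser, $row_text, and @rows are assumptions; get_tag and get_trimmed_text are documented HTML::TokeParser methods.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample input in the same shape as the question (assumed for illustration).
my $html_source = "<tr>\nrow1\n</tr>\n<tr>\nrow2\n</tr>\n";

# Passing a reference to a scalar makes the parser read from that string.
my $parser = HTML::TokeParser->new( \$html_source )
    or die "Cannot create parser";

my @rows;

# get_tag('tr') skips ahead to the next <tr> start tag, or returns
# undef when the document is exhausted.
while ( $parser->get_tag('tr') ) {
    # Collect the text up to the matching end tag, with surrounding
    # whitespace trimmed.
    my $row_text = $parser->get_trimmed_text('/tr');
    push @rows, $row_text;
    print "found: $row_text\n";
}
```

Unlike the regex, this keeps working when the tr tags carry attributes or differ in case.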
 
Tad McClellan

Paul Lalli said:
cuneyt wrote:
You should not, in general, be processing HTML with regular
expressions. You should be using an HTML parser. There are many
available on CPAN. I recommend HTML::TokeParser.


And if you only need the data that is in tables,
then HTML::TableExtract is Very Nice.
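
As a rough illustration of that suggestion, the sketch below pulls every row out of every table in a document. The input is hypothetical: it wraps the rows in a table element and td cells, which the question's sample did not have, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical input: a complete table with <td> cells.
my $html_source = <<'HTML';
<table>
<tr><td>row1</td><td>a</td></tr>
<tr><td>row2</td><td>b</td></tr>
</table>
HTML

# With no headers, depth, or count constraints, every table is extracted.
my $extractor = HTML::TableExtract->new;
$extractor->parse($html_source);
$extractor->eof;    # signal end of input to the underlying parser

my @rows;
for my $table ( $extractor->tables ) {
    for my $row ( $table->rows ) {
        # Each row is an array reference holding the cell contents.
        push @rows, [ @$row ];
        print "found: ", join( ", ", @$row ), "\n";
    }
}
```

Passing headers => [...] to the constructor restricts extraction to tables with matching column headings, which is often the more practical way to pick out one table from a larger page.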
 
Mumia W. (on aioe)

On Sat, 16 Dec 2006 02:56:26 GMT, (e-mail address removed) wrote:

And the funny thing is my friends the regulars here call this
CRAP code !!!

HAHAHAHAAAAAAAAA

God is here my friend.....

robic0

Now my stomach hurts from ten continuous minutes of uncontrollable
laughter.

Robic0, you actually /believe/ your program is good code.

I can't ... I can't ... ROTFLOL ....

I can't imagine what's going on in the brain of someone who would think
that.

If I've gotten a hernia I'm sending you the bill. Don't worry, however;
I won't be able to see your response to it. Buh bye.
 
