question on processing HTML with a regex

cuneyt · Dec 13, 2006

Hi,

I would like to process an HTML file in the form

<tr>
row1
</tr>
<tr>
row2
</tr>

The snippet I wrote is

while( $HtmlSource =~ m{<tr>(.*)</tr>}sg ) {
    my $r = $1;
    print "found: $r\n";
}

But when I run this I get
found:
row1
</tr>
<tr>
row2

How can I modify the regex so that it is not so greedy and pulls one
<tr></tr> pair at a time?

Thanks a lot

Cuneyt

xhoster · Dec 13, 2006

cuneyt said:
How can I modify the regex so that it is not so greedy and pulls one
<tr></tr> pair at a time?

perldoc perlre, search for "greedy".

Xho

Paul Lalli · Dec 13, 2006

cuneyt said:
Hi,

I would like to process an HTML file in the form

<tr>
row1
</tr>
<tr>
row2
</tr>

The snippet I wrote is

while( $HtmlSource =~ m{<tr>(.*)</tr>}sg ) {
    my $r = $1;
    print "found: $r\n";
}

You should not, in general, be processing HTML with regular
expressions. You should be using an HTML parser. There are many
available on CPAN. I recommend HTML::TokeParser.
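
For illustration, here is a minimal sketch of how the rows from the question might be pulled out with HTML::TokeParser. The sample string and the names $parser and @rows are assumptions made up for this example, not part of the original post.

```perl
use strict;
use warnings;

use HTML::TokeParser;

# Sample input copied from the question (assumed to be held in a string).
my $html_source = "<tr>\nrow1\n</tr>\n<tr>\nrow2\n</tr>\n";

# Passing a reference makes the parser read from the string, not a file.
my $parser = HTML::TokeParser->new( \$html_source )
    or die "Cannot create parser";

my @rows;

# get_tag('tr') skips ahead to the next <tr> start tag and
# returns undef once the input is exhausted.
while ( my $tag = $parser->get_tag('tr') ) {
    # Text up to the closing </tr>, with surrounding whitespace trimmed.
    my $text = $parser->get_trimmed_text('/tr');
    push @rows, $text;
    print "found: $text\n";
}
```

Because the parser works on tokens rather than on a pattern, attributes on the tag or unusual line breaks should not change which rows are found.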

But when I run this I get
found:
row1
</tr>
<tr>
row2

How can I modify the regex so that it is not so greedy and pulls one
<tr></tr> pair at a time?

You need to read a decent tutorial on regular expressions, as this is a
very basic question. The answer is that you need a ? after the *, but
if you didn't already know that, you *really* need to read:
perldoc perlretut
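
As a sketch of that fix, the loop from the question with a ? added after the * could look like the following. The sample string and the @rows array are assumptions added so the snippet runs on its own.

```perl
use strict;
use warnings;

# Sample input copied from the question (assumed to be held in a string).
my $html_source = "<tr>\nrow1\n</tr>\n<tr>\nrow2\n</tr>\n";

my @rows;

# .*? is the non-greedy form of .* and stops at the first </tr> it can reach;
# /s lets . match newlines, /g continues from the end of the previous match.
while ( $html_source =~ m{<tr>(.*?)</tr>}sg ) {
    push @rows, $1;
    print "found: $1\n";
}
```

Each captured row still includes the newlines around the text, since the pattern keeps everything between the two tags.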

Hope this helps,
Paul Lalli

John Bokma · Dec 13, 2006

cuneyt said:
while( $HtmlSource =~ m{<tr>(.*)</tr>}sg ) {

I recommend not using CamelCase in Perl; use underscores instead, e.g.
$html_source.

my $r = $1;

What is $r? You might be asking yourself that question after a month or
so.

print "found: $r\n";
}

As for using regexp, it's way too fragile in too many cases. Have a look at
HTML::TreeBuilder:

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content( $html_source );
my @tr_elements = $tree->look_down( _tag => 'tr' );
for my $tr_element ( @tr_elements ) {
    # ...
}

http://johnbokma.com/perl/

has several HTML::TreeBuilder examples.

cuneyt · Dec 13, 2006

I started to use HTML::TokeParser and what a relief! Makes parsing a
breeze even for a regex newbie like me.

Thanks for all the comments

Cuneyt

Tad McClellan · Dec 14, 2006

Paul Lalli said:
cuneyt wrote:

You should not, in general, be processing HTML with regular
expressions. You should be using an HTML parser. There are many
available on CPAN. I recommend HTML::TokeParser.

And if you only need the data that is in tables,
then HTML::TableExtract is Very Nice.
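
A rough sketch of what that might look like follows. The sample table, including its <table> and <td> tags, is invented for this example (the question's input had neither), and @cells is a hypothetical name.

```perl
use strict;
use warnings;

use HTML::TableExtract;

# Hypothetical sample: a complete table with one cell per row.
my $html_source =
    '<table><tr><td>row1</td></tr><tr><td>row2</td></tr></table>';

# With no constraints given, every table in the document is extracted.
my $extractor = HTML::TableExtract->new();
$extractor->parse($html_source);

my @cells;
for my $table ( $extractor->tables ) {
    # Each row comes back as an array reference of cell contents.
    for my $row ( $table->rows ) {
        push @cells, @$row;
        print "found: ", join( ', ', @$row ), "\n";
    }
}
```

The constructor also accepts constraints such as headers => [...] to pick out a specific table, which may be more useful on real pages.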

Mumia W. (on aioe) · Dec 16, 2006

On Sat, 16 Dec 2006 02:56:26 GMT, (e-mail address removed) wrote:

And the funny thing is my friends the regulars here call this
CRAP code !!!

HAHAHAHAAAAAAAAA

God is here my friend.....

robic0

Now my stomach hurts from ten continuous minutes of uncontrollable
laughter.

Robic0, you actually /believe/ your program is good code.

I can't ... I can't ... ROTFLOL ....

I can't imagine what's going on in the brain of someone who would think
that.

If I've gotten a hernia I'm sending you the bill. Don't worry, however;
I won't be able to see your response to it. Buh bye.
