regex problem

Andrea Petersen · Jan 3, 2007

Hello,

I'm new to regexes and have been working on the following problem for 6 hours now.
(damn)

<body>

<div>MATCH
<b>ONE</b></div>

<p> foobar </p>

<div>MATCH <b>TWO</b></div>

</body>

I want to extract the two strings between <div> and </div>. The new line in
the first div-tag drives me crazy.

Can somebody help me out with this? I'm looking for a regular expression
that matches these tags.

regards,
Andrea

Paul Lalli · Jan 3, 2007

Andrea said:
I'm new to regexes and have been working on the following problem for 6 hours now.
(damn)

<body>

<div>MATCH
<b>ONE</b></div>

<p> foobar </p>

<div>MATCH <b>TWO</b></div>

</body>

I want to extract the two strings between <div> and </div>. The new line in
the first div-tag drives me crazy.

Can somebody help me out with this? I'm looking for a regular expression
that matches these tags.

Since you don't show what you're doing, it's impossible for us to show
you what you're doing wrong. Please post a short-but-complete script
that demonstrates the error. Have you read the Posting Guidelines that
are posted here twice a week?

I'll take a stab in the dark and guess that you are processing the file
line-by-line, but that you're trying to use a pattern match to find a
pattern that spans more than one line. That obviously won't work.

You need to either: 1) Read the entire file into one big scalar, or 2)
(preferably) use a module that was actually designed for HTML parsing,
rather than rolling your own.

1) Several options: (a) read the file into an array and join() it
together with the empty string; (b) set $/ to undef before reading, and
do one read to a scalar; or (c) use the File::Slurp module
2) Several options, including HTML::Parser and HTML::TokeParser. I
recommend the latter. Here's an example using that module:

#!/opt2/perl/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
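use Data::Dumper;

my $p = HTML::TokeParser->new(\*DATA);
# Skip ahead to each opening <div> tag in turn.
while (my $token = $p->get_tag("div")) {
    my $val = '';
    # Collect every token up to the next </div>.
    while (my $token = $p->get_token()) {
        last if $token->[-1] eq '</div>';
        if ($token->[0] =~ /^[SECD]$/) {
            # Start tag, end tag, comment, declaration: source text is the last element.
            $val .= $token->[-1];
        } elsif ($token->[0] eq 'T') {
            # Plain text: the text is the second element.
            $val .= $token->[1];
        }
    }
    print "Found: '$val'\n";
}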

__DATA__
<body>
<div>MATCH
<b>ONE</b></div>
<p> foobar </p>
<div>MATCH <b>TWO</b></div>
</body>

This outputs:
Found: 'MATCH
<b>ONE</b>'
Found: 'MATCH <b>TWO</b>'
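
For option 1(b) above, here is a minimal sketch of the difference between reading line by line and slurping. It is only an illustration: the sample HTML is held in a string and opened as an in-memory filehandle so the script is self-contained, and the variable names ($by_line, $whole, $content) are made up for this example. A real script would open a file by name.

```perl
use strict;
use warnings;

# Sample input from the original post, kept in a string so the sketch
# needs no external file (an assumption made for illustration only).
my $html = <<'END_HTML';
<body>
<div>MATCH
<b>ONE</b></div>
<p> foobar </p>
<div>MATCH <b>TWO</b></div>
</body>
END_HTML

# Line by line: each read returns one line, so a <div> that spans
# two lines is never seen as a whole.
open my $by_line, '<', \$html or die "Cannot open in-memory file: $!";
my $line_hits = 0;
while (my $line = <$by_line>) {
    $line_hits++ if $line =~ m{<div>.*?</div>};
}
close $by_line;

# Slurp: with $/ undefined, a single read returns the entire file.
open my $whole, '<', \$html or die "Cannot open in-memory file: $!";
my $content = do { local $/; <$whole> };
close $whole;

# Count all matches; /s lets . match newlines, /g finds every occurrence.
my $slurp_hits = () = $content =~ m{<div>.*?</div>}sg;

print "line-by-line: $line_hits, slurped: $slurp_hits\n";
```

Using local inside the do block restores $/ afterwards, so later line-based reads in the same program behave normally.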

Hope this helps,
Paul Lalli

Andrea Petersen · Jan 3, 2007

My problem:

<div>[a-z0-9\,\<\>\/ ]*</div>

does not match the first div-tag. (but matches the second)

Paul Lalli · Jan 3, 2007

Andrea said:
My problem:

<div>[a-z0-9\,\<\>\/ ]*</div>

Why are you limiting it to only those characters? Why are you assuming
people can't type any punctuation between the div tags?

Further, why are you backslashing the comma, less-than, greater-than,
and forward slash in the character class? None of them need to be
escaped. (The slash might, if you are using slashes as your regexp
delimiter, but if that were the case, you'd need to escape the one in
the </div> as well.)

Andrea said:
does not match the first div-tag. (but matches the second)

Of course not. You didn't tell it to match newlines.
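
A small sketch of that point, using the first div from the original post. The /i flag is an assumption added here, because the sample text is upper case while the class only lists a-z. The backslashes inside the class are dropped since they are not needed.

```perl
use strict;
use warnings;

my $first_div = "<div>MATCH\n<b>ONE</b></div>";

# The class has no newline in it, so the match stops at the line break
# after MATCH and the closing </div> is never reached.
my $without_nl = $first_div =~ m{<div>[a-z0-9,<>/ ]*</div>}i ? 1 : 0;

# Same class with \n added: the newline is now an allowed character.
my $with_nl = $first_div =~ m{<div>[\na-z0-9,<>/ ]*</div>}i ? 1 : 0;

print "without newline in class: $without_nl\n";
print "with newline in class:    $with_nl\n";
```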

Andrea said:
<div>[\n a-z0-9\,\<\>\/]*</div>

matches everything between the first <div> and the very last </div>

Right. You told it to match as much of that stuff as possible. If you
want quantifiers (like the *) to be non-greedy, add a ? after them.
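
To make the greedy versus non-greedy difference concrete, here is a hedged sketch run against the sample data from the original post. The capture groups and the /i flag are additions for illustration (the sample text is upper case while the class lists a-z).

```perl
use strict;
use warnings;

my $html = "<div>MATCH\n<b>ONE</b></div>\n<p> foobar </p>\n<div>MATCH <b>TWO</b></div>";

# Greedy *: takes as many class characters as possible, then backs off
# only as far as the LAST </div> in the string.
my ($greedy) = $html =~ m{<div>([\n a-z0-9,<>/]*)</div>}i;

# Non-greedy *?: takes as few as possible, so it stops at the FIRST </div>.
my ($lazy) = $html =~ m{<div>([\n a-z0-9,<>/]*?)</div>}i;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy capture runs through the <p> element and into the second div; the non-greedy one contains only the first div's contents.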

Andrea said:
my stomach hurts. regexes are very unhealthy.

No, regexps are perfectly healthy. It's your usage of them that's
causing your stomachache. As I and someone else have both now told
you, Regexps are not meant to parse HTML. Even when you make the
change above, just because it "works" for your given case is no
guarantee that it will work for any other data.

Paul Lalli

Andrea Petersen · Jan 3, 2007

Paul Lalli said:
Right. You told it to match as much of that stuff as possible. If you
want quantifiers (like the *) to be non-greedy, add a ? after them.

<div>[\n\r a-z0-9\,\<\>\/]*?</div> [added the ?]

does the job perfectly. Thank you *very* much.

Andrea

Uri Guttman · Jan 3, 2007

Andrea Petersen said:
<div>[\n\r a-z0-9\,\<\>\/]*?</div> [added the ?]

does the job perfectly. Thank you *very* much.

no, it is FAR from perfect. you would be better for it if you realized
this. delusions of regex perfection will make more than your stomach
hurt. wait until the div crosses lines or has other chars in it. there
are too many ways to count to break that regex.
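
A few ways to break it, as a rough sketch. The inputs below are hypothetical examples invented for this illustration, not data from the thread, and the /i flag is an assumption (the sample text was upper case while the class lists a-z).

```perl
use strict;
use warnings;

# The pattern quoted above, minus the unneeded backslashes.
my $re = qr{<div>[\n\r a-z0-9,<>/]*?</div>}i;

# Hypothetical inputs: one that fits the class, three that do not.
my @cases = (
    [ plain       => "<div>MATCH <b>TWO</b></div>" ],
    [ punctuation => "<div>Hello, world!</div>" ],
    [ attribute   => qq{<div class="note">text</div>} ],
    [ entity      => "<div>fish &amp; chips</div>" ],
);

my %matched;
for my $case (@cases) {
    my ($name, $input) = @$case;
    $matched{$name} = $input =~ $re ? 1 : 0;
    printf "%-12s %s\n", $name, $matched{$name} ? 'matches' : 'no match';
}
```

Only the plain case matches: the exclamation mark, the attribute on the opening tag, and the & and ; of the entity all fall outside what the pattern allows.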

uri

Paul · Jan 19, 2007

If you insist on using a regex, the following should work:

my @strings = $string =~ m|<div>(.+?)</div>|sg;
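
Here is that line in a small runnable sketch, with the sample data from the original post assigned to $string by hand (in a real script it would come from the slurped file). The /s flag lets . match newlines, and /g in list context returns the captured group from every match.

```perl
use strict;
use warnings;

my $string = "<body>\n<div>MATCH\n<b>ONE</b></div>\n<p> foobar </p>\n<div>MATCH <b>TWO</b></div>\n</body>\n";

# One list element per <div>...</div> pair, holding the text in between.
my @strings = $string =~ m|<div>(.+?)</div>|sg;

print "Found: '$_'\n" for @strings;
```

Note that .+? requires at least one character, so an empty <div></div> would not be captured as an empty string, and a <div> with attributes would not match at all.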
