regex problem

  • Thread starter Andrea Petersen

Andrea Petersen

Hello,

I'm new to regex and have been working on the following problem for 6 hours now.
(damn)


<body>

<div>MATCH
<b>ONE</b></div>

<p> foobar </p>

<div>MATCH <b>TWO</b></div>

</body>


I want to extract the two strings between <div> and </div>. The new line in
the first div-tag drives me crazy.

Can somebody help me out with this? I'm searching for the regular expression
to match these tags.

regards,
Andrea
 

Paul Lalli

Andrea said:
> I want to extract the two strings between <div> and </div>. The new line in
> the first div-tag drives me crazy.

Since you don't show what you're doing, it's impossible for us to show
you what you're doing wrong. Please post a short-but-complete script
that demonstrates the error. Have you read the Posting Guidelines that
are posted here twice a week?

I'll take a stab in the dark and guess that you are processing the file
line-by-line, but that you're trying to use a pattern match to find a
pattern that spans more than one line. That obviously won't work.

You need to either: 1) Read the entire file into one big scalar, or 2)
(preferably) use a module that was actually designed for HTML parsing,
rather than rolling your own.

1) Several options: (a) read the file into an array and join() it
together with the empty string; (b) set $/ to undef before reading, and
do one read into a scalar; or (c) use the File::Slurp module.
2) Several options, including HTML::Parser and HTML::TokeParser. I
recommend the latter. Here's an example using that module:

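#!/opt2/perl/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Parse the HTML that follows __DATA__ at the end of this script.
my $p = HTML::TokeParser->new(\*DATA);

# get_tag("div") skips ahead to the next <div> start tag.
while ($p->get_tag("div")) {
    my $val = '';
    while (my $token = $p->get_token()) {
        # Stop at the end tag of the div, whatever its exact source text.
        last if $token->[0] eq 'E' && $token->[1] eq 'div';
        if ($token->[0] =~ /^[SECD]$/) {
            # Start/end tags, comments, declarations: source text is the last element.
            $val .= $token->[-1];
        } elsif ($token->[0] eq 'T') {
            # Text tokens keep their text in element 1.
            $val .= $token->[1];
        }
    }
    print "Found: '$val'\n";
}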

__DATA__
<body>
<div>MATCH
<b>ONE</b></div>
<p> foobar </p>
<div>MATCH <b>TWO</b></div>
</body>


This outputs:
Found: 'MATCH
<b>ONE</b>'
Found: 'MATCH <b>TWO</b>'
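
For comparison, here is a minimal sketch of option 1(b), reading everything into one scalar. The sample is held in a string and opened as an in-memory file so the script is self-contained, and the pattern is only an illustration (an assumption on my part, not what the original poster was running). It should report one match for the line-by-line loop and two for the slurped string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input from the original question, held in a string so the
# script is self-contained (a real script would open a file instead).
my $html = "<body>\n<div>MATCH\n<b>ONE</b></div>\n<p> foobar </p>\n"
         . "<div>MATCH <b>TWO</b></div>\n</body>\n";

# Illustrative pattern only: non-greedy, with /s so "." also matches "\n".
my $re = qr{<div>(.*?)</div>}s;

# Line-by-line: each read returns one line, so a <div> that spans
# two lines is never seen whole.
open my $fh, '<', \$html or die "Cannot open in-memory file: $!";
my @by_line;
while (my $line = <$fh>) {
    push @by_line, $1 if $line =~ $re;
}
close $fh;

# Slurp: with $/ undefined, a single read returns the entire file.
open $fh, '<', \$html or die "Cannot open in-memory file: $!";
my $content = do { local $/; <$fh> };
close $fh;
my @slurped = $content =~ /$re/g;

printf "line-by-line: %d match(es)\n", scalar @by_line;
printf "slurped:      %d match(es)\n", scalar @slurped;
```

Using "local $/" inside a do block keeps the change to the input record separator from leaking into the rest of the program.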

Hope this helps,
Paul Lalli
 

Andrea Petersen

My problem:

<div>[a-z0-9\,\<\>\/ ]*</div>

does not match the first div-tag. (but matches the second)
 

Paul Lalli

Andrea said:
> My problem:
>
> <div>[a-z0-9\,\<\>\/ ]*</div>

Why are you limiting it to only those characters? Why are you assuming
people can't type any punctuation between the div tags?

Further, why are you backslashing the comma, less-than, greater-than,
and forward slash in the character class? None of them need to be
escaped. (The slash might, if you are using slashes as your regexp
delimiter, but if that were the case, you'd need to escape the one in
the </div> as well.)

> does not match the first div-tag. (but matches the second)

Of course not. You didn't tell it to match newlines.

> <div>[\n a-z0-9\,\<\>\/]*</div>
>
> matches everything between the first <div> and the very last </div>

Right. You told it to match as much of that stuff as possible. If you
want quantifiers (like the *) to be non-greedy, add a ? after them.

> my stomach hurts. regexs are very unhealthy.

No, regexps are perfectly healthy. It's your usage of them that's
causing your stomachache. As I and someone else have both now told
you, regexps are not meant to parse HTML. Even when you make the
change above, just because it "works" for your given case is no
guarantee that it will work for any other data.
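
To see the greedy/non-greedy difference on the sample data, here is a small sketch. It uses the character class from the thread without the unneeded backslashes. The /i flag is an assumption: the class lists only a-z while the sample text is upper case, so some form of case-insensitive matching must have been in effect.

```perl
use strict;
use warnings;

my $html = "<div>MATCH\n<b>ONE</b></div>\n<p> foobar </p>\n"
         . "<div>MATCH <b>TWO</b></div>\n";

# Greedy: * grabs as much as it can, so the match runs from the
# first <div> to the very last </div>.
my ($greedy) = $html =~ m{<div>([\n a-z0-9,<>/]*)</div>}i;

# Non-greedy: *? stops at the first </div> it can reach.
my ($lazy) = $html =~ m{<div>([\n a-z0-9,<>/]*?)</div>}i;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy capture swallows the <p> element and the second div as well; the non-greedy one stops after the first </b>.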

Paul Lalli
 

Andrea Petersen

> Right. You told it to match as much of that stuff as possible. If you
> want quantifiers (like the *) to be non-greedy, add a ? after them.

<div>[\n\r a-z0-9\,\<\>\/]*?</div> [added the ?]

does the job perfectly. Thank you *very* much.

Andrea
 

Uri Guttman

"AP" == Andrea Petersen said:

AP> <div>[\n a-z0-9\,\<\>\/]*</div>
AP>
AP> matches everything between the first <div> and the very last </div>

Paul Lalli replied:
> Right. You told it to match as much of that stuff as possible. If you
> want quantifiers (like the *) to be non-greedy, add a ? after them.

AP> <div>[\n\r a-z0-9\,\<\>\/]*?</div> [added the ?]

AP> does the job perfectly. Thank you *very* much.

no, it is FAR from perfect. you would be better for it if you realized
this. delusions of regex perfection will make more than your stomach
hurt. wait until the div crosses lines or has other chars in it. there
are too many ways to break that regex to count.
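
a few of those ways, as a rough sketch. the pattern is the one quoted above minus the needless backslashes; the /i flag and the extra test strings are assumptions made up for illustration.

```perl
use strict;
use warnings;

# The "perfect" pattern; /i is assumed so that the upper-case sample matches.
my $re = qr{<div>[\n\r a-z0-9,<>/]*?</div>}i;

my @cases = (
    "<div>MATCH <b>TWO</b></div>",        # the original sample
    "<div>Hello, world!</div>",           # "!" is not in the character class
    "<div class=\"note\">text</div>",     # attribute on the opening tag
);

my @results;
for my $html (@cases) {
    my $ok = $html =~ $re ? 1 : 0;
    push @results, $ok;
    printf "%-8s %s\n", ($ok ? "match" : "NO match"), $html;
}
```

only the first case matches. ordinary punctuation or a single attribute is enough to make the other two fail.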

uri
 

Paul

If you insist on using a regex, the following should work:

my @strings = $string =~ m|<div>(.+?)</div>|sg;
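
A quick way to try it against the sample from the first post. This is only a sketch: it assumes $string already holds the whole document in one scalar, and that the div tags carry no attributes.

```perl
use strict;
use warnings;

# Whole document in one scalar (assumed; in practice, slurp the file).
my $string = "<body>\n<div>MATCH\n<b>ONE</b></div>\n<p> foobar </p>\n"
           . "<div>MATCH <b>TWO</b></div>\n</body>\n";

# /s lets "." match newlines; /g in list context returns every capture.
my @strings = $string =~ m|<div>(.+?)</div>|sg;

print "Found: '$_'\n" for @strings;
```

Note that (.+?) needs at least one character, so an empty <div></div> pair would not end the match where you might expect.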
 
