Andrea said:
I'm new to using regex and have been working on the following problem for 6 hours now.
(damn)
<body>
<div>MATCH
<b>ONE</b></div>
<p> foobar </p>
<div>MATCH <b>TWO</b></div>
</body>
I want to extract the two strings between <div> and </div>. The new line in
the first div-tag drives me crazy.
Can somebody help me out with this? I'm searching for the regular expression
to match these tags.
Since you don't show what you're doing, it's impossible for us to show
you what you're doing wrong. Please post a short-but-complete script
that demonstrates the error. Have you read the Posting Guidelines that
are posted here twice a week?
I'll take a stab in the dark and guess that you are processing the file
line-by-line, but that you're trying to use a pattern match to find a
pattern that spans more than one line. That obviously won't work.
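To illustrate that guess, here is a minimal sketch of what I assume your script is doing. The in-memory filehandle and the variable names are mine, standing in for your real file; your actual code may well differ.

```perl
use strict;
use warnings;

# Sample input from the question, used here in place of a real file.
my $sample = <<'END_HTML';
<body>
<div>MATCH
<b>ONE</b></div>
<p> foobar </p>
<div>MATCH <b>TWO</b></div>
</body>
END_HTML

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @found;
while (my $line = <$fh>) {
    # $line holds a single line, so a <div> whose </div> sits on a
    # later line can never match here.
    push @found, $1 if $line =~ m{<div>(.*?)</div>};
}
close $fh;

print "Found: '$_'\n" for @found;
```

Only the second div is found, because the first one is split across two reads.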
You need to either: 1) Read the entire file into one big scalar, or 2)
(preferably) use a module that was actually designed for HTML parsing,
rather than rolling your own.
1) Several options: (a) read the file into an array and join() it
together with the empty string; (b) set $/ to undef before reading, and
do one read to a scalar; or (c) use the File::Slurp module
2) Several options, including HTML::Parser and HTML::TokeParser. I
recommend the latter. Here's an example using that module:
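#!/opt2/perl/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use Data::Dumper;
my $p = HTML::TokeParser->new(\*DATA);
# Skip ahead to each opening <div> tag in turn.
while (my $token = $p->get_tag("div")){
    my $val = '';
    # Collect everything up to the closing </div>.
    while (my $token = $p->get_token()){
        last if $token->[-1] eq '</div>';
        if ($token->[0] =~ /^[SECD]$/) {
            # Start tag, end tag, comment, declaration: raw text is the last element.
            $val .= $token->[-1];
        } elsif ($token->[0] eq 'T') {
            # Plain text: the text itself is element 1.
            $val .= $token->[1];
        }
    }
    print "Found: '$val'\n";
}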
__DATA__
<body>
<div>MATCH
<b>ONE</b></div>
<p> foobar </p>
<div>MATCH <b>TWO</b></div>
</body>
This outputs:
Found: 'MATCH
<b>ONE</b>'
Found: 'MATCH <b>TWO</b>'
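For comparison, option 1(b) could look roughly like the sketch below. The helper name extract_divs is made up for illustration, the in-memory filehandle stands in for your real file, and the pattern assumes non-nested <div> tags without attributes, as in your sample. For anything messier than that, the parser approach above is the safer bet.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between each <div> and </div>.
# Assumes non-nested <div> tags with no attributes.
sub extract_divs {
    my ($html) = @_;
    # /s lets . match newlines; .*? is non-greedy, so each match stops
    # at the first </div>; /g in list context returns every capture.
    return $html =~ m{<div>(.*?)</div>}gs;
}

# Sample input from the question, used here in place of a real file.
my $sample = <<'END_HTML';
<body>
<div>MATCH
<b>ONE</b></div>
<p> foobar </p>
<div>MATCH <b>TWO</b></div>
</body>
END_HTML

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
# With $/ undefined, a single read returns the entire input.
my $html = do { local $/; <$fh> };
close $fh;

my @found = extract_divs($html);
print "Found: '$_'\n" for @found;
```

The /s modifier is what deals with the newline inside the first div. Without it, . stops at the line break even when the whole file is in one scalar.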
Hope this helps,
Paul Lalli