small regexp problem

David Morel

Hi all,

It's been a while since I've used regular expressions, and I'd like a
bit of help.

I have a string of html --- $html. What I want to do is to isolate the
substrings that are in between a particular tag... say between <b> and
</b>.

So if $html = "asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>", I
would like to somehow get "foo" and "bar" into an array.

This seems like it would be easy with the appropriate regexp.

Thanks!
 
Kris Jenkins

David said:
Hi all,

It's been a while since I've used regular expressions, and I'd like a
bit of help.

I have a string of html --- $html. What I want to do is to isolate the
substrings that are in between a particular tag... say between <b> and
</b>.

So if $html = "asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>", I
would like to somehow get "foo" and "bar" into an array.

This seems like it would be easy with the appropriate regexp.

Thanks!

Have a look at the document HTML::Tree::Scanning, under the "Scanning
HTML Trees" heading. It gives a couple of recipes, suggests why regexes
may be a little fragile for this task, and why HTML::TreeBuilder _might_
be better*. (You can get it by installing HTML::Tree from CPAN.)

* For a given value of 'better'.
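
For the example string in the question, a minimal HTML::TreeBuilder
sketch might look like the following. It assumes HTML::Tree is installed
and that every <b> element is wanted, whatever its attributes; the
variable names are only illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = 'asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>';

# Parse the fragment into a tree of HTML::Element objects.
my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down in list context returns every element with the given tag;
# as_text gives the text content of each one.
my @bold = map { $_->as_text } $tree->look_down( _tag => 'b' );

# Free the tree explicitly once the text has been extracted.
$tree->delete;

print "$_\n" for @bold;    # prints "foo" then "bar"
```

The parser, not a pattern, decides where each element starts and ends,
so attributes and comments should not confuse it.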

Kris
 
Gunnar Hjalmarsson

David said:
I have a string of html --- $html. What I want to do is to isolate
the substrings that are in between a particular tag... say between
<b> and </b>.

So if $html = "asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>",
I would like to somehow get "foo" and "bar" into an array.

This seems like it would be easy with the appropriate regexp.

It rather seems like you should explore one of the modules for parsing
HTML, such as HTML::Parser.
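
As a rough sketch of the module route, here is one way it could be done
with the HTML::Parser version 3 event API. It assumes HTML::Parser is
installed; the handler logic and the names @bold and $depth are my own
illustration, not something taken from the docs.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = 'asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>';

my @bold;         # one entry per outermost <b> element
my $depth = 0;    # how many <b> elements are currently open

my $parser = HTML::Parser->new(
    api_version => 3,
    # 'tagname' passes the lower-cased element name to the handler.
    start_h => [ sub {
        return unless $_[0] eq 'b';
        push @bold, '' if $depth++ == 0;
    }, 'tagname' ],
    end_h => [ sub {
        $depth-- if $_[0] eq 'b' && $depth > 0;
    }, 'tagname' ],
    # 'dtext' is entity-decoded text; it may arrive in several chunks,
    # so append to the current entry instead of assigning.
    text_h => [ sub {
        $bold[-1] .= $_[0] if $depth > 0;
    }, 'dtext' ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @bold;    # prints "foo" then "bar"
```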

But I still had to play with a regex... I used the one from

perldoc -q "remove HTML"

as a starting point for writing a sub, that captures the substrings in
a reference to a hash of arrays:

sub extract {
    my ($html, $elements) = @_;    # $html is a reference to the HTML string
    my %substrings;
    for my $elem (@$elements) {
        # Opening tag (attributes may hold a quoted '>'), content, closing tag.
        while ( $$html =~ m{
            <\s*($elem)\b(?:[^>'"]*|(['"]).*?\2)*>
            (.+?)
            <\s*/\s*$elem\s*>}gisx ) {
            # Key on $elem rather than $1: with /i, $1 could be 'B' or 'b'.
            push @{$substrings{$elem}}, $3;
        }
    }
    return \%substrings;
}

my $html = <<HTML;
asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>
<a href="http://search.cpan.org/">search.cpan.org</a>
HTML

my $substrings = extract( \$html, [ qw/a b/ ] );

for ( sort keys %$substrings ) {    # sorted so the output order is predictable
    print "Element: $_\n";
    for ( @{ $substrings->{$_} } ) {
        print " $_\n";
    }
    print "\n";
}

Outputs:
Element: a
 search.cpan.org

Element: b
 foo
 bar
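
For the exact string in the question, a much shorter sketch would also
do. It relies on a match with /g in list context returning every
capture, and it assumes bare <b> tags with no attributes, no nesting
and no comments in the input.

```perl
use strict;
use warnings;

my $html = 'asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>';

# In list context, m//g returns the captured group from every match.
# /i accepts <B> as well; /s lets . match newlines inside the element.
my @bold = $html =~ m{<b>(.*?)</b>}gis;

print "$_\n" for @bold;    # prints "foo" then "bar"
```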
 
Tad McClellan

David Morel said:
I have a string of html
This seems like it would be easy with the appropriate regexp.


Then you weren't paying attention when you read the Perl FAQs
about HTML. :)


perldoc -q HTML

How do I remove HTML from a string?


wherein there are examples that make it hard rather than easy.
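
To make the difficulty concrete, here is a small sketch with input I
made up for the purpose (it is not one of the FAQ's examples). A
plausible-looking pattern picks up text from inside a comment and is
thrown off by a '>' inside a quoted attribute value.

```perl
use strict;
use warnings;

# Hypothetical input: one <b> inside a comment, one real <b>, and one
# <b> whose attribute value contains a '>' character.
my $html = '<!-- <b>old</b> --> <b>new</b> <b title="a > b">tricky</b>';

# Naive pattern: opening tag with optional attributes, then content.
my @bold = $html =~ m{<b\b[^>]*>(.*?)</b>}gis;

print "[$_]\n" for @bold;
# prints:
# [old]
# [new]
# [ b">tricky]
```

The commented-out element is returned as if it were live markup, and
the third capture contains part of the attribute.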

Use a module that understands HTML data when you need to
process HTML data.
 
