small regexp problem

David Morel · Dec 30, 2003

Hi all,

It's been a while since I've used regular expressions, and I'd like a
bit of help.

I have a string of HTML, $html. What I want to do is isolate the
substrings that sit between a particular pair of tags, say between <b>
and </b>.

So if $html = "asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>", I
would like to somehow get "foo" and "bar" into an array.

This seems like it would be easy with the appropriate regexp.

Thanks!

Kris Jenkins · Dec 30, 2003

Have a look at the document HTML::Tree::Scanning, under the "Scanning
HTML Trees" heading. It gives a couple of recipes, suggests why regexes
may be a little fragile for this task, and why HTML::TreeBuilder _might_
be better*. (You can get it by CPANing HTML::Tree.)

* For a given value of 'better'.

Kris
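
For readers who want to see what the HTML::TreeBuilder route might look like, here is a minimal sketch. The sample string and the variable names are assumptions made for illustration, not taken from the document Kris mentions, and it assumes the HTML::Tree distribution is installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # installed as part of the HTML::Tree distribution

# Sample input, assumed for illustration.
my $html = 'asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>';

# Parse the string into a tree of HTML::Element objects.
my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context look_down returns every matching element;
# as_text gives the text content of each one.
my @bold = map { $_->as_text } $tree->look_down( _tag => 'b' );

# Free the tree explicitly once it is no longer needed.
$tree->delete;

print "$_\n" for @bold;
```

Because the parser builds a full tree, attributes, upper-case tags and nested markup are handled by the module rather than by a hand-written pattern.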

Gunnar Hjalmarsson · Dec 30, 2003

David said:
I have a string of html --- $html. What I want to do is to isolate
the substrings that are in between a particular tag... say between
<b> and </b>.

So if $html = "asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>",
I would like to somehow get "foo" and "bar" into an array.

This seems like it would be easy with the appropriate regexp.

It rather seems like you should explore one of the modules for parsing
HTML, such as HTML::Parser.
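
A rough sketch of the HTML::Parser approach, assuming the version 3 handler API and that the module is installed. The handler names (on_start, on_end, on_text), the sample string and the depth counter are all hypothetical choices for this example, not something from the posts above.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, version 3 API

# Sample input, assumed for illustration.
my $html = 'asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>';

my @bold;         # one entry per outermost <b> element
my $depth = 0;    # greater than 0 while inside a <b> element

# Open a new entry at the outermost <b>; nested <b> tags only bump the depth.
sub on_start {
    my ($tag) = @_;
    return unless $tag eq 'b';
    push @bold, '' if $depth == 0;
    $depth++;
}

sub on_end {
    my ($tag) = @_;
    $depth-- if $tag eq 'b' && $depth > 0;
}

# Text may arrive in several chunks, so append rather than assign.
sub on_text {
    my ($text) = @_;
    $bold[-1] .= $text if $depth > 0;
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&on_start, 'tagname' ],
    end_h       => [ \&on_end,   'tagname' ],
    text_h      => [ \&on_text,  'dtext' ],
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @bold;
```

The event-driven style avoids building a whole tree, at the cost of tracking state (here, the depth counter) by hand.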

But I still had to play with a regex... I used the one from

perldoc -q "remove HTML"

as a starting point for writing a sub, that captures the substrings in
a reference to a hash of arrays:

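sub extract {
    my ($html, $elements) = @_;    # $html is a reference to the HTML string
    my %substrings;                # element name => array of captured contents
    for my $elem (@$elements) {
        while ( $$html =~ m{
            <\s*($elem)\b(?:[^>'"]*|(['"]).*?\2)*>    # opening tag, attributes allowed
            (.+?)                                     # content, non-greedy
            <\s*/\s*$elem\s*>}gisx ) {
            push @{$substrings{$1}}, $3;
        }
    }
    return \%substrings;
}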

my $html = <<HTML;
asdf asdf asdf <b>foo</b> asdf asdf asdf <b>bar</b>
<a href="http://search.cpan.org/">search.cpan.org</a>
HTML

my $substrings = extract( \$html, [ qw/a b/ ] );

for ( sort keys %$substrings ) {
    print "Element: $_\n";
    for ( @{ $substrings->{$_} } ) {
        print " $_\n";
    }
    print "\n";
}

Outputs:
Element: a
search.cpan.org

Element: b
foo
bar

Tad McClellan · Dec 30, 2003

David Morel said:
I have a string of html

This seems like it would be easy with the appropriate regexp.

Then you weren't paying attention when you read the Perl FAQs
about HTML.

perldoc -q HTML

How do I remove HTML from a string?

wherein there are examples that make it hard rather than easy.

Use a module that understands HTML data when you need to
process HTML data.
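
To illustrate the kind of trouble the FAQ warns about, here is a small sketch in core Perl. The naive pattern and both sample strings are assumptions made up for this example; they are not the FAQ's own examples.

```perl
use strict;
use warnings;

# Naive pattern: assumes the tags are written exactly as <b> and </b>.
my $naive = qr{<b>(.*?)</b>}s;

# Tidy input: the pattern finds both substrings.
my $simple      = 'asdf <b>foo</b> asdf <b>bar</b>';
my @from_simple = $simple =~ /$naive/g;

# The same content with an attribute, an upper-case tag,
# and a commented-out element.
my $messy      = '<b class="x">foo</b> <B>bar</B> <!-- <b>old</b> -->';
my @from_messy = $messy =~ /$naive/g;

print "simple: @from_simple\n";    # prints "simple: foo bar"
print "messy: @from_messy\n";      # prints "messy: old"
```

On the messy input the pattern misses both real elements and picks up the one inside the comment, which is the sort of case an HTML-aware module is meant to handle.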
