How to capture repeated subpatterns?


Martin Larsen


One thing that puzzles me is whether it is possible to capture each
repeated subpattern.

Here is a simple regex that will match a string consisting of individual
words separated by single spaces:

(?:(\w+) ?)+

If the subject is "three little mice" the regex will match the string,
and the subpattern match ($1) would be "mice". That is, the last
subpattern match will be returned.
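The behaviour described above can be checked with a few lines of Perl. This is only a minimal sketch; the variable names `$subject` and `$last` are made up for the illustration.

```perl
use strict;
use warnings;

my $subject = 'three little mice';
my $last;

# The group (\w+) is entered three times, but $1 is a single slot:
# every new iteration overwrites what the previous one stored.
if ($subject =~ /(?:(\w+) ?)+/) {
    $last = $1;
    print "\$1 = $last\n";    # prints: $1 = mice
}
```

Only the final iteration's text survives in `$1`, which is why the question of capturing every repetition arises at all.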

But what if I wanted to extract all submatches?
So that $1 = three, $2 = little and $3 = mice

It could be done by repeating an expression like this:

(?:(\w+) ?)?(?:(\w+) ?)?(?:(\w+) ?)?(?:(\w+) ?)?

as many times as needed. Each "(?:(\w+) ?)?" will match one optional
word, separated by a space from the next word.

But that is not an elegant solution!

So my question is: Is it in fact possible to repeat AND capture subpatterns?


Gunnar Hjalmarsson

Martin said:
But what if I wanted to extract all submatches?
So that $1 = three, $2 = little and $3 = mice

You don't need to use the dollardigit variables.

my @matches = 'three little mice' =~ /(\w+) ?/g;

Tad McClellan

Mirco Wahab said:
my @words = $text =~ /\b(\w+)\b/g; # <== /g repeats matches

I'm pretty sure that this is equivalent:

my @words = $text =~ /(\w+)/g;

Isn't it?
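A quick way to convince yourself is to run both patterns over the same text and compare the lists. This is a sketch with a made-up sample string. Because `\w+` is greedy and the `/g` scan moves left to right, each match should already start and end on a word boundary, so the explicit `\b` appears to add nothing here.

```perl
use strict;
use warnings;

my $text = 'three little mice, and a cat!';    # sample text (assumption)

my @with_b    = $text =~ /\b(\w+)\b/g;    # with explicit word boundaries
my @without_b = $text =~ /(\w+)/g;        # greedy \w+ on its own

print "with \\b:    @with_b\n";
print "without \\b: @without_b\n";
print "same result\n" if "@with_b" eq "@without_b";
```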

Martin Larsen

Thanks for your responses, but you both misunderstand what I aim at.

I simplified the expression to make it more clear, but I realize that it
does the opposite by confusing you into thinking that all I need is to
capture all the words.

That is not the case at all.

In fact, it is a matter of extracting parameters from an expression. So
let me try again, this time with some html code.

Here is an image declaration:

<img src="mypic.jpg" class="myclass" alt="description" title="some text">

I need to match the full expression and also to capture the individual
attributes.

The declaration shown could be matched by

<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

That gives:

submatch 1 = src="mypic.jpg"
submatch 2 = class="myclass"
submatch 3 = alt="description"
submatch 4 = title="some text"

Fine, but what if the html code had yet another attribute, like:

<img src="mypic.jpg" class="myclass" alt="description" title="some text" border="0">

Then the expression wouldn't match anymore. Of course I could just add a
bunch of the subpatterns to the expression: since I have made them
optional, the regex will match even if there are fewer attributes
than subpatterns.

For example, this would match up to 8 attributes:
<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?
*(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

But as you see, it is certainly not elegant.

Well, then ... if it was just a matter of matching an arbitrary number
of attributes, the regex would simply be:

<img(?: *(\w+="[^"]*"))+>

But the only submatch would then be border="0". Only the last submatch
is captured.

Instead I need to extract ALL the parameters individually, so the
million dollar question is:

Is it possible to define a regex using repeated subpatterns in such a
way that all subpatterns can be captured individually?

The real usage is a little more complicated than matching html
attributes, and yes: I could use a parser etc!

But I would really like to know if my quest is possible!
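For what it is worth, Perl does have a way to record every iteration of a repeated group inside a single pattern: an embedded code block `(?{ ... })` can push `$^N` (the text of the most recently closed group) onto an array. The sketch below assumes a reasonably modern Perl (5.18 or later); the names `@captures` and `@found` are invented for the example. If the engine backtracks, the block may also run for paths that are later abandoned, so treat this as an illustration rather than a robust solution; perlre describes using `local` to cope with that.

```perl
use strict;
use warnings;

our @captures;    # package variable, visible inside the embedded code block

my $html = '<img src="mypic.jpg" class="myclass" alt="description" '
         . 'title="some text" border="0">';

my @found;
@captures = ();
if ($html =~ /<img(?: *(\w+="[^"]*")(?{ push @captures, $^N }))+>/) {
    # $1 still holds only the last attribute; @captures holds every one
    # that the code block saw on the way through the repetition.
    @found = @captures;
}

print "$_\n" for @found;
```

For this particular input the match succeeds without backtracking over a completed attribute, so `@found` should contain the five attributes in order.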


Martin Larsen

Thanks for your responses.

Now, my example was simplified for clarity, but I am still unsure if
your suggestions can accomplish what I need.

In fact, it is a matter of extracting parameters from an expression. So
let me try again, this time with some html code.

Here is an image declaration:

<img src="mypic.jpg" class="myclass" alt="description" title="some text">

I need to match the full expression and also to capture the individual
attributes.

The declaration shown could be matched by

<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

That gives:

submatch 1 = src="mypic.jpg"
submatch 2 = class="myclass"
submatch 3 = alt="description"
submatch 4 = title="some text"

Fine, but what if the html code had yet another attribute, like:

<img src="mypic.jpg" class="myclass" alt="description" title="some text" border="0">

Then the expression wouldn't match anymore. Of course I could just add a
bunch of the subpatterns to the expression: since I have made them
optional, the regex will match even if there are fewer attributes
than subpatterns.

For example, this would match up to 8 attributes:
<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?
*(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

But as you see, it is certainly not elegant.

Well, then ... if it was just a matter of matching an arbitrary number
of attributes, the regex would simply be:

<img(?: *(\w+="[^"]*"))+>

But the only submatch would then be border="0". Only the last submatch
is captured.

Instead I need to extract ALL the parameters individually.

Is this also possible using the /g modifier, and if so, how?
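One way the `/g` modifier can do both jobs (validate the whole tag and collect each attribute) is to walk the string with `\G`-anchored matches in scalar context, using `/c` so that a failed match does not reset the position. The following is a sketch only; `@attrs` and `$ok` are names invented here, and the pattern is just as restrictive as the ones above (double-quoted values, spaces as separators).

```perl
use strict;
use warnings;

my $html = '<img src="mypic.jpg" class="myclass" alt="description" '
         . 'title="some text" border="0">';

my @attrs;
my $ok = 0;

if ($html =~ /\G<img/gc) {                          # tag name at the current position
    while ($html =~ /\G +(\w+="[^"]*")/gc) {        # one attribute per iteration
        push @attrs, $1;
    }
    $ok = 1 if $html =~ /\G *>/gc;                  # the tag must close right here
}
@attrs = () unless $ok;                             # discard partial results

print "$_\n" for @attrs;
```

Each successful match leaves `pos($html)` just past the attribute, so the next `\G` picks up exactly where the previous one stopped, and the final check ensures nothing unexpected sits before the closing `>`.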


Tad McClellan

Martin Larsen said:
In fact, it is a matter of extracting parameters from an expression. So
let me try again, this time with some html code.

HTML does not have expressions, perhaps because it is not a
programming language?

Here is an image declaration:

<img src="mypic.jpg" class="myclass" alt="description" title="some text">

I need to match the full expression and also to capture the individual
attributes.

The declaration shown could be matched by

Fine, but what if the html code had yet another attribute, like:


Or what if it had exactly the same attributes, as in:

<img src='mypic.jpg' class='myclass' alt='description' title='some text'>

Or what if it is not an image tag at all?

<img src="mypic.jpg" class="myclass" alt="description" title="some text">

Then it will match when it shouldn't.

Then the expression wouldn't match anymore. Of course I could just add a
bunch of the subpatterns to the expression,

But as you see, it is certainly not elegant.

Using a regular expression to parse an arbitrary context free
language, such as HTML, is almost always a mistake that will result
in fragile and perhaps incorrect implementations.

It is only feasible if you can place many restrictions on what
your HTML data can contain.

Instead I need to extract ALL the parameters individually.

Consider doing a proper parse using a module that knows how
to parse HTML.


Martin Larsen wrote:
Thanks for your responses, but you both misunderstand what I aim at.

I simplified the expression to make it more clear, but I realize that
it does the opposite by confusing you into thinking that all I need
is to capture all the words.

That is not the case at all.

In fact, it is a matter of extracting parameters from an expression.
So let me try again, this time with some html code.

Here is an image declaration:

<img src="mypic.jpg" class="myclass" alt="description" title="some text">

I need to match the full expression and also to capture the individual
attributes.

The declaration shown could be matched by

<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

That gives:

submatch 1 = src="mypic.jpg"
submatch 2 = class="myclass"
submatch 3 = alt="description"
submatch 4 = title="some text"

Fine, but what if the html code had yet another attribute, like:

<img src="mypic.jpg" class="myclass" alt="description" title="some
text" border="0">

Then the expression wouldn't match anymore. Of course I could just
add a bunch of the subpatterns to the expression: since I have made
them optional, the regex will match even if there are fewer
attributes than subpatterns.

For example, this would match up to 8 attributes:
<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?
*(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

But as you see, it is certainly not elegant.

Well, then ... if it was just a matter of matching an arbitrary number
of attributes, the regex would simply be:

<img(?: *(\w+="[^"]*"))+>

But the only submatch would then be border="0". Only the last submatch
is captured.

Instead I need to extract ALL the parameters individually, so the
million dollar question is:

Is it possible to define a regex using repeated subpatterns in such a
way that all subpatterns can be captured individually?

The real usage is a little more complicated than matching html
attributes, and yes: I could use a parser etc!

But I would really like to know if my quest is possible!

That's a lot of words for such a simple question. The real answer is:
use a parser. And since you write you could, you should.

Read the following aloud, with a good pause at each comma:
To match both the full text, and to capture repeated subpatterns, by
using regular expressions, just use two regular expressions, with the
second depending on the (success of the) first:

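use warnings ;
use strict ;

# An attribute value is a double-quoted or a single-quoted string.
my $re_val = qr/"[^"]*"|'[^']*'/ ;
# An attribute is a name, optionally followed by =value.
my $re_attr = qr/\w+(?:=${re_val})?/ ;
# print "re_attr:{${re_attr}}\n\n" ;

my $html = join '', <DATA> ;
print "html:{$html}\n" ;

# First regex: match the whole tag, capturing its name and the rest.
if ($html =~ m/^\s*<(\w+)([^>]*)>/) {
    my $tag = $1 ;
    # Second regex: /g in list context returns every attribute.
    my @attrs = ($2 =~ m/\s+(${re_attr})/g) ;

    print "tag:{$tag}\n" ;
    print "{$_}\n" for @attrs ;
}

__DATA__
<img src="mypic.jpg" class="myclass" alt="description"
title="some text" border="0" one='test' two>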
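If the attribute names and values are wanted separately, the same two-step idea can fill a hash instead of a list. This is a sketch under the same restrictions as the code above (no `>` inside values, no escaped quotes); `%attr`, `$tag` and `$rest` are names chosen for the illustration, and a bare attribute such as `two` is stored with an empty string as its value.

```perl
use strict;
use warnings;

my $html = q{<img src="mypic.jpg" class="myclass" alt="description" }
         . q{title="some text" border="0" one='test' two>};

my ($tag, %attr);

# Step 1: match the whole tag, keeping its name and the attribute text.
if ($html =~ m/^\s*<(\w+)([^>]*)>/) {
    ($tag, my $rest) = ($1, $2);

    # Step 2: walk the attribute text; $2 is a double-quoted value,
    # $3 a single-quoted one, and neither is set for a bare attribute.
    while ($rest =~ m/\s+(\w+)(?:=(?:"([^"]*)"|'([^']*)'))?/g) {
        $attr{$1} = defined $2 ? $2 : defined $3 ? $3 : '';
    }
}

print "tag: $tag\n";
print "$_ => '$attr{$_}'\n" for sort keys %attr;
```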
