How to capture repeated subpatterns?

Martin Larsen

Hi,

One thing that puzzles me is whether it is possible to capture each
repeated subpattern.

Here is a simple regex that will match a string consisting of individual
words:

(?:(\w+) ?)+

If the subject is "three little mice" the regex will match the string,
and the subpattern match ($1) would be "mice". That is, the last
subpattern match will be returned.
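
A minimal Perl sketch of this behaviour (the variable names are illustrative, not from the post): the repeated group has a single capture slot, so each iteration overwrites it and only the final word is left in $1.

```perl
use strict;
use warnings;

my $subject = 'three little mice';
my $last_word;

# The group (\w+) is entered three times, but it has only one slot ($1),
# which every iteration overwrites.
if ($subject =~ /\A(?:(\w+) ?)+\z/) {
    $last_word = $1;
}

print "last capture: $last_word\n";   # last capture: mice
```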

But what if I wanted to extract all submatches?
So that $1 = three, $2 = little and $3 = mice

It could be done by repeating an expression like this:

(?:(\w+) ?)?(?:(\w+) ?)?(?:(\w+) ?)?(?:(\w+) ?)?

as many times as needed. Each "(?:(\w+) ?)?" will match one optional
word, separated by a space from the next word.
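
As a sketch, the repeated expression does not have to be typed out by hand: Perl's string repetition operator x can build it. The limit of four slots and the variable names are assumptions for illustration. Slots that match nothing come back as undef and are filtered out.

```perl
use strict;
use warnings;

my $max_words = 4;                             # assumed upper limit
my $pattern   = '(?:(\w+) ?)?' x $max_words;   # four optional word slots

my $subject = 'three little mice';

# In list context the match returns $1..$4; unused slots are undef.
my @words = grep { defined } $subject =~ /\A$pattern\z/;

print scalar(@words), " words: @words\n";      # 3 words: three little mice
```

The fixed upper limit remains: a subject with more words than slots will not match at all.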

But that is not an elegant solution!

So my question is: Is it in fact possible to repeat AND capture subpatterns?

Martin
 
Gunnar Hjalmarsson

Martin said:
But what if I wanted to extract all submatches?
So that $1 = three, $2 = little and $3 = mice

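You don't need the dollar-digit variables ($1, $2, ...). In list context,
a match with the /g modifier returns the captures from every successive
match: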

my @matches = 'three little mice' =~ /(\w+) ?/g;
 
Tad McClellan

Mirco Wahab said:
my @words = $text =~ /\b(\w+)\b/g; # <== /g repeats matches


I'm pretty sure that this is equivalent:

my @words = $text =~ /(\w+)/g;

Isn't it?
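
A quick check on one sample string (the sample text is an assumption, and one sample does not prove the general case). The reasoning is that \w+ is greedy and each /g match resumes where the previous one ended, so every match already starts and ends at a word boundary and the explicit \b anchors add nothing here.

```perl
use strict;
use warnings;

my $text = 'three little mice, and 2 more_mice!';

my @with_anchors    = $text =~ /\b(\w+)\b/g;
my @without_anchors = $text =~ /(\w+)/g;

print "with \\b:    @with_anchors\n";
print "without \\b: @without_anchors\n";
print "same list\n" if "@with_anchors" eq "@without_anchors";
```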
 
Martin Larsen

Thanks for your responses, but you both misunderstand what I aim at.

I simplified the expression to make it more clear, but I realize that it
does the opposite by confusing you into thinking that all I need is to
capture all the words.

That is not the case at all.

In fact, it is a matter of extracting parameters from an expression. So
let me try again, this time with some html code.

Here is an image declaration:

<img src="mypic.jpg" class="myclass" alt="description" title="some text">

I need to match the full expression and also to capture the individual
attributes.

The declaration shown could be matched by

<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

That gives:

submatch 1 = src="mypic.jpg"
submatch 2 = class="myclass"
submatch 3 = alt="description"
submatch 4 = title="some text"

Fine, but what if the html code had yet another attribute, like:

<img src="mypic.jpg" class="myclass" alt="description" title="some text"
border="0">

Then the expression would no longer match. Of course I could just add
more subpatterns to the expression: since they are optional, the regex
will still match when there are fewer attributes than subpatterns.

For example, this would match up to 8 attributes:
<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?
*(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

But as you see, it is certainly not elegant.

Well, then ... if it was just a matter of matching an arbitrary number
of attributes, the regex would simply be:

<img(?: *(\w+="[^"]*"))+>

But the only submatch would then be border="0". Only the last submatch
is captured.
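
Both observations can be reproduced with a short Perl sketch. The variable names are illustrative, and the four-slot pattern is assembled with the x operator rather than typed out, which is an assumption about convenience rather than part of the original expression.

```perl
use strict;
use warnings;

my $tag = '<img src="mypic.jpg" class="myclass" alt="description" '
        . 'title="some text" border="0">';

# Rebuild the four-slot pattern: <img, then four optional attributes, then >.
my $attr       = '(\w+="[^"]*")';
my $four_slots = '<img' . (" *$attr?" x 4) . '>';

# Five attributes do not fit into four optional slots.
my $fixed_matches = $tag =~ /$four_slots/ ? 'yes' : 'no';

# The repeated group matches, but keeps only its last capture.
my ($last_attr) = $tag =~ /<img(?: *(\w+="[^"]*"))+>/;

print "four slots match: $fixed_matches\n";   # four slots match: no
print "repeated group:   $last_attr\n";       # repeated group:   border="0"
```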

Instead I need to extract ALL the parameters individually, so the
million dollar question is:

Is it possible to define a regex using repeated subpatterns in such a
way that all subpatterns can be captured individually?

The real usage is a little more complicated than matching html
attributes, and yes: I could use a parser etc!

But I would really like to know if my quest is possible!

Martin
 
Martin Larsen

Thanks for your responses.

Now, my example was simplified for clarity, but I am still unsure if
your suggestions can accomplish what I need.

In fact, it is a matter of extracting parameters from an expression. So
let me try again, this time with some html code.

Here is an image declaration:

<img src="mypic.jpg" class="myclass" alt="description" title="some text">

I need to match the full expression and also to capture the individual
attributes.

The declaration shown could be matched by

<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

That gives:

submatch 1 = src="mypic.jpg"
submatch 2 = class="myclass"
submatch 3 = alt="description"
submatch 4 = title="some text"

Fine, but what if the html code had yet another attribute, like:

<img src="mypic.jpg" class="myclass" alt="description" title="some text"
border="0">

Then the expression would no longer match. Of course I could just add
more subpatterns to the expression: since they are optional, the regex
will still match when there are fewer attributes than subpatterns.

For example, this would match up to 8 attributes:
<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?
*(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

But as you see, it is certainly not elegant.

Well, then ... if it was just a matter of matching an arbitrary number
of attributes, the regex would simply be:

<img(?: *(\w+="[^"]*"))+>

But the only submatch would then be border="0". Only the last submatch
is captured.

Instead I need to extract ALL the parameters individually.

Is this also possible using the /g modifier, and if so, how?

Thanks,
Martin
 
Tad McClellan

Martin Larsen said:
> In fact, it is a matter of extracting parameters from an expression. So
> let me try again, this time with some html code.

HTML does not have expressions, perhaps because it is not a
programming language?

> Here is an image declaration:
>
> <img src="mypic.jpg" class="myclass" alt="description" title="some text">
>
> I need to match the full expression and also to capture the individual
> attributes.
>
> The declaration shown could be matched by
> [snip]
>
> Fine, but what if the html code had yet another attribute, like:
>
> [snip]

Or what if it had exactly the same attributes, as in:

<img src='mypic.jpg' class='myclass' alt='description' title='some text'>

Or what if it is not an image tag at all?

<!--
<img src="mypic.jpg" class="myclass" alt="description" title="some text">
-->

Then it will match when it shouldn't.

> Then the expression wouldn't match anymore. Of course I could just add a
> bunch of the subpatterns to the expression,
> [snip]
>
> But as you see, it is certainly not elegant.

Using a regular expression to parse an arbitrary context-free
language, such as HTML, is almost always a mistake that will result
in fragile and perhaps incorrect implementations.

It is only feasible if you can place many restrictions on what
your HTML data can contain.

> Instead I need to extract ALL the parameters individually.

Consider doing a proper parse using a module that knows how
to parse HTML.
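
The two failure cases above can be checked directly. A small sketch, with hypothetical variable names, that runs the "any number of attributes" pattern from earlier in the thread against both inputs:

```perl
use strict;
use warnings;

# The "any number of attributes" pattern from earlier in the thread, unchanged.
my $img_re = qr/<img(?: *(\w+="[^"]*"))+>/;

my $single_quoted =
    q{<img src='mypic.jpg' class='myclass' alt='description' title='some text'>};
my $commented_out =
    qq{<!--\n<img src="mypic.jpg" class="myclass" alt="description" title="some text">\n-->};

my $matches_single_quoted = $single_quoted =~ $img_re ? 1 : 0;
my $matches_commented_out = $commented_out =~ $img_re ? 1 : 0;

print "single-quoted tag matched: $matches_single_quoted\n";   # 0: a valid tag is missed
print "commented-out tag matched: $matches_commented_out\n";   # 1: a non-tag is accepted
```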
 
Dr.Ruud

Martin Larsen wrote:
Thanks for your responses, but you both misunderstand what I aim at.

I simplified the expression to make it more clear, but I realize that
it does the opposite by confusing you into thinking that all I need
is to capture all the words.

That is not the case at all.

In fact, it is a matter of extracting parameters from an expression.
So let me try again, this time with some html code.

Here is an image declaration:

<img src="mypic.jpg" class="myclass" alt="description" title="some
text">

I need to match the full expression and also to capture the individual
attributes.

The declaration shown could be matched by

<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

That gives:

submatch 1 = src="mypic.jpg"
submatch 2 = class="myclass"
submatch 3 = alt="description"
submatch 4 = title="some text"

Fine, but what if the html code had yet another attribute, like:

<img src="mypic.jpg" class="myclass" alt="description" title="some
text" border="0">

Then the expression would no longer match. Of course I could just add
more subpatterns to the expression: since they are optional, the regex
will still match when there are fewer attributes than subpatterns.

For example, this would match up to 8 attributes:
<img *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?
*(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")? *(\w+="[^"]*")?>

But as you see, it is certainly not elegant.

Well, then ... if it was just a matter of matching an arbitrary number
of attributes, the regex would simply be:

<img(?: *(\w+="[^"]*"))+>

But the only submatch would then be border="0". Only the last submatch
is captured.

Instead I need to extract ALL the parameters individually, so the
million dollar question is:

Is it possible to define a regex using repeated subpatterns in such a
way that all subpatterns can be captured individually?

The real usage is a little more complicated than matching html
attributes, and yes: I could use a parser etc!

But I would really like to know if my quest is possible!

That's a lot of words for such a simple question. The real answer is:
use a parser. And since you write you could, you should.

Read the following aloud, with a good pause at each comma:
To match both the full text, and to capture repeated subpatterns, by
using regular expressions, just use two regular expressions, with the
second depending on the (success of the) first:


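#!/usr/bin/perl
use warnings ;
use strict ;

# An attribute value: a double-quoted or a single-quoted string.
my $re_val = qr/"[^"]*"|'[^']*'/ ;
# An attribute: a name, optionally followed by =value.
my $re_attr = qr/\w+(?:=${re_val})?/ ;
# print "re_attr:{${re_attr}}\n\n" ;

my $html = join '', <DATA> ;
print "html:{$html}\n" ;

# First regex: match the whole tag, capturing its name and its raw
# attribute text.
if ($html =~ m/^\s*<(\w+)([^>]*)>/)
{
    my $tag = $1 ;
    # Second regex: with /g in list context, collect every attribute.
    my @attrs = ($2 =~ m/\s+(${re_attr})/g) ;

    print "tag:{$tag}\n" ;
    print "{$_}\n" for @attrs ;
}

__DATA__
<img src="mypic.jpg" class="myclass" alt="description"
title="some text" border="0" one='test' two>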
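
For comparison, here is a different technique from the two-regex approach above: a sketch that collects the attributes in one pass by using \G to pin each /g iteration to the end of the previous match. It is not from the thread, and the pattern and variable names are assumptions. It only handles double-quoted name="value" attributes on an img tag, and unlike the first regex above it never checks for the closing >.

```perl
use strict;
use warnings;

my $html = '<img src="mypic.jpg" class="myclass" alt="description" '
         . 'title="some text" border="0">';

# First iteration: the match must begin at "<img".
# Later iterations: \G pins the match to where the previous one ended;
# (?!\A) stops \G from matching at the very start of the string.
my @attrs = $html =~ /(?:<img|\G(?!\A))\s+(\w+="[^"]*")/g;

print "$_\n" for @attrs;
```

Because every iteration must continue exactly where the last one stopped, the loop ends at the first thing that is not whitespace followed by an attribute.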
 
