Checking how many items have been captured in a pattern match

  • Thread starter niall.macpherson

niall.macpherson

I know this should be a fairly simple question, but I have been
searching for a while and can't find an obvious answer.

When testing a pattern, I have tended to use the method shown in METHOD 1
in the code below, i.e. use the temporary variables $1, $2, $3 etc.

METHOD 2 seems better from my point of view, as it avoids the
code being littered with lots of $1, $2, $3 ... variables. However,
the multiple defined() calls make it look a bit unwieldy.

Am I missing an obviously more elegant way of checking that all three
values have been captured? I do not want to put the results in an
array, as I need the variable names to be meaningful to other people
(although obviously a hash may be possible).

I am only interested in a full match, i.e. all 3 (or however many)
values captured.

use strict;
use warnings;

my $teststr = '123 456 abcd';
my ($var1, $var2, $var3);

### METHOD 1
if ($teststr =~ m/(\d+)\s+(\d+)\s+(\w*)/)
{
    ($var1, $var2, $var3) = ($1, $2, $3);
    print STDERR "\nMETHOD 1", ' var1 = ', $var1, ' var2 = ', $var2,
        ' var3 = ', $var3, "\n";
}
else
{
    print STDERR "\nMETHOD 1 No Match\n";
}

### METHOD 2
($var1, $var2, $var3) = $teststr =~ m/(\d+)\s+(\d+)\s+(\w*)/;
if (defined($var1) && defined($var2) && defined($var3))
{
    print STDERR "\nMETHOD 2", ' var1 = ', $var1, ' var2 = ', $var2,
        ' var3 = ', $var3, "\n";
}
else
{
    print STDERR "\nMETHOD 2 No Match\n";
}
 

John W. Krahn

You could do it like this:

if ( 3 == ( ( $var1, $var2, $var3 ) = $teststr =~ /(\d+)\s+(\d+)\s+(\w*)/ ) )
{
print STDERR "\nMETHOD 3 var1 = $var1 var2 = $var2 var3 = $var3\n";
}



John
 

Anno Siegel

I know this should be a fairly simple question but I have been
searching for a while and can't find an obvious answer.

When testing a pattern, I have tended to use the method shown in METHOD 1
in the code below, i.e. use the temporary variables $1, $2, $3 etc.

METHOD 2 seems better from my point of view, as it avoids the
code being littered with lots of $1, $2, $3 ... variables.

Right. Unless you need the behavior of matches in scalar context
(with /g, for instance), catching captures in list context is much
preferable.
However
the multiple defined() calls make it look a bit unwieldy.

You have the same problem with both methods. If the match in list context
returns an undefined value, the corresponding $n variable would also
be undefined. If an undefined match is possible, you'll have to test
either way. A little more compact (untested):

unless ( grep !defined, $var1, $var2, $var3 ) {
    # they're all defined
}

On the other hand, most captures can't be undefined after a successful
match. The only case that comes to mind is when a pair of parens is part
of an alternative that wasn't used in the match. There may be others.

In your case, when the regex matches, all captures will be defined, though
some (well, one, $var3) may be empty.

So, as a rule, look at your regex and identify the captures that
can possibly come back undefined. Then test only those. In the
concrete case, you have nothing to test.

Anno
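
A minimal sketch of the rule above, not from the thread itself: the helper name undefined_captures and the alternation pattern are made up for illustration, and the optional-group pattern is the one suggested later in the discussion. It reports which captures come back undefined after a successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 1-based positions of captures that came
# back undefined, or an empty list if the pattern did not match at all.
sub undefined_captures {
    my ($str, $re) = @_;
    my @caps = $str =~ $re;
    return grep { !defined $caps[$_ - 1] } 1 .. @caps;
}

# Pattern from the original post: no optional parts, so nothing is undefined.
my @a = undefined_captures('123 456 abcd', qr/(\d+)\s+(\d+)\s+(\w*)/);

# Alternation: only the branch that actually matched sets its capture.
my @b = undefined_captures('cat', qr/(dog)|(cat)/);

# Optional group that was skipped: its capture stays undefined.
my @c = undefined_captures('123 456', qr/(\d+)\s+(\d+)(?:\s+(\w*))?/);

print "original pattern, undefined positions: @a\n";
print "alternation, undefined positions: @b\n";
print "optional group, undefined positions: @c\n";
```

Only the captures that sit inside an alternative or an optional part show up in the result, which is why the original pattern needs no defined() tests once the match has succeeded.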
 

Anno Siegel

John W. Krahn said:
You could do it like this:

if ( 3 == ( ( $var1, $var2, $var3 ) = $teststr =~ /(\d+)\s+(\d+)\s+(\w*)/ ) )
{
print STDERR "\nMETHOD 3 var1 = $var1 var2 = $var2 var3 = $var3\n";
}

No, John, that's wrong. The match with three captures will always return
three values, defined or undefined. The test will not indicate any
undefined values. Try "123 456" against /(\d+)\s+(\d+)(?:\s+(\w*))?/.

unless ( grep !defined, ( $var1, ...

should work.

Anno
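
A small sketch of what the "3 ==" test actually measures, assuming a made-up helper name (count_and_third) and the optional-group pattern suggested above. A list assignment in scalar context yields the number of values on its right-hand side, so the count says whether the regex matched, not whether every capture is defined.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of values the match produced and
# a flag saying whether the third capture is defined.
sub count_and_third {
    my ($str, $re) = @_;
    my ($var1, $var2, $var3);
    # Scalar value of a list assignment: number of right-hand-side elements.
    my $count = ( ( $var1, $var2, $var3 ) = $str =~ $re );
    return ( $count, defined $var3 ? 1 : 0 );
}

my $re = qr/(\d+)\s+(\d+)(?:\s+(\w*))?/;

my ($n_full,  $def_full)  = count_and_third('123 456 abcd', $re);
my ($n_short, $def_short) = count_and_third('123 456',      $re);
my ($n_fail,  $def_fail)  = count_and_third('no digits',    $re);

print "full:  count=$n_full third_defined=$def_full\n";
print "short: count=$n_short third_defined=$def_short\n";
print "fail:  count=$n_fail third_defined=$def_fail\n";
```

The short input still gives a count of 3 even though the third capture is undefined; only the failed match gives 0.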
 

A. Sinan Unur

Anno Siegel wrote:
John W. Krahn said:
You could do it like this:

if ( 3 == ( ( $var1, $var2, $var3 ) = $teststr =~ /(\d+)\s+(\d+)\s+(\w*)/ ) ) {
    print STDERR "\nMETHOD 3 var1 = $var1 var2 = $var2 var3 = $var3\n";
}

No, John, that's wrong. The match with three captures will always
return three values, defined or undefined. The test will not indicate
any undefined values. Try "123 456" against
/(\d+)\s+(\d+)(?:\s+(\w*))?/.

unless ( grep !defined, ( $var1, ...

should work.

Mere mortals such as myself might be more comfortable without the
double-negation:

if ( 3 == grep { defined } ($var1, ...

;-) (desperately looking for something useful to say).


Also, won't (\w*) set $3 to the empty string rather than an undefined
value? So, maybe something like this:

#!/usr/bin/perl

use strict;
use warnings;

while (my $s = <DATA>) {
my @matched = ( $s =~ m{ \A (\d{3}) \s+ (\d{3}) \s* (\w*) }x );
if (3 == grep { defined and $_ ne q{} } @matched) {
my ($var1, $var2, $var3) = @matched;
...
}
}
__END__
111 222
333 444 this
111 222

But, then again, if the match succeeded, only $var3 may be empty. So,
why test $var1 and $var2 (in the original code)?

#!/usr/bin/perl

use strict;
use warnings;

while (my $s = <DATA>) {
if ( $s =~ m{ \A (\d{3}) \s+ (\d{3}) \s* (\w*) }x
and $3 ne q{} ) {
my ($var1, $var2, $var3) = ($1, $2, $3);
# do something
}
}

__END__
111 222
333 444 this
111 222





Sinan
 

Anno Siegel

A. Sinan Unur said:
Anno Siegel wrote:
[snip]
No, John, that's wrong. The match with three captures will always
return three values, defined or undefined. The test will not indicate
any undefined values. Try "123 456" against
/(\d+)\s+(\d+)(?:\s+(\w*))?/.

unless ( grep !defined, ( $var1, ...

should work.

Mere mortals such as myself might be more comfortable without the
double-negation:

if ( 3 == grep { defined } ($var1, ...

;-) (desperately looking for something useful to say).


Also, won't (\w*) set $3 to the empty string rather than an undefined
value? So, maybe something like this:

Yes, the original example (/(\d+)\s+(\d+)\s+(\w*)/) would never
return an undefined capture if it matched at all. Captures that
can be undefined are not all that common; they happen when a pair
of parentheses is in an optional part of the regex.

#!/usr/bin/perl

use strict;
use warnings;

while (my $s = <DATA>) {
my @matched = ( $s =~ m{ \A (\d{3}) \s+ (\d{3}) \s* (\w*) }x );
if (3 == grep { defined and $_ ne q{} } @matched) {
my ($var1, $var2, $var3) = @matched;
...
}
}
__END__
111 222
333 444 this
111 222

The OP's question was about defined-ness, not about empty matches,
and I think that was deliberate. Empty matches are much more
common.

But, then again, if the match succeeded, only $var3 may be empty. So,
why test $var1 and $var2 (in the original code)?

#!/usr/bin/perl

use strict;
use warnings;

while (my $s = <DATA>) {
if ( $s =~ m{ \A (\d{3}) \s+ (\d{3}) \s* (\w*) }x
and $3 ne q{} ) {
my ($var1, $var2, $var3) = ($1, $2, $3);
# do something
}
}

__END__
111 222
333 444 this
111 222

Right. It usually pays to look at the individual captures and only
test those that *can* return the Wrong Thing, whatever that is in the
particular case.

Anno
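
To make the undefined-versus-empty distinction concrete, here is a hedged sketch using the \A (\d{3}) \s+ (\d{3}) \s* (\w*) pattern quoted above. The helper name classify_third and the input strings are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: says what happened to the third capture.
sub classify_third {
    my ($str) = @_;
    my @m = $str =~ m{ \A (\d{3}) \s+ (\d{3}) \s* (\w*) }x;
    return 'no match'  unless @m;
    return 'undefined' unless defined $m[2];   # cannot happen with this pattern
    return 'empty'     if $m[2] eq q{};
    return 'value';
}

my $r_empty = classify_third("111 222\n");
my $r_value = classify_third("333 444 this\n");
my $r_fail  = classify_third("abc\n");

print "111 222      -> $r_empty\n";
print "333 444 this -> $r_value\n";
print "abc          -> $r_fail\n";
```

With this pattern the only capture worth testing is the third, and the test that matters is for the empty string rather than for definedness.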
 

A. Sinan Unur

Anno Siegel wrote:

Yes, the original example (/(\d+)\s+(\d+)\s+(\w*)/) would never
return an undefined capture if it matched at all. Captures that
can be undefined are not all that common, they happen when a pair
of parentheses is in an optional part of the regex.

Of course ... It is just too early here I guess.

Thanks.

Sinan
 

niall.macpherson

Anno said:
In your case, when the regex matches, all captures will be defined, though
some (well, one, $var3) may be empty.

So, as a rule, look at your regex and identify the captures that
can possibly come back undefined. Then test only those. In the
concrete case, you have nothing to test.

Anno

I think what I was missing was the fundamental point that if the line
I am processing matches the regex, then all the captures will be
defined. Obviously the example I gave was very simple. If one of the
captured values happens to be an empty string, then that is something I
just have to check. Therefore METHOD 2 seems fine to me, since the
undef checks are not necessary, as you pointed out.

In the real world I am parsing a log file which contains a lot of
varied SQL statements. I want to identify those which are inserting /
updating into a particular table and capture and analyse the values.

So, a more realistic example of what I am trying to do is as follows,
which now works using METHOD 2:

-----------------------------------------------------------------------------------------------------------------------
use strict;
use warnings;

while (<DATA>)
{
    my $str = $_; ## Since in real life value will be in variable not $_

    ## METHOD 2
    print STDERR "\n", 'METHOD 2 ', $str;
    my ($fids2, $vals2) =
        $str =~ /INSERT\s*INTO\s*bond\s*
                 \((.*?)\)  # Non greedy match for first set parens
                 .*?        # Any other stuff up to the next open paren non greedy
                 \((.*)\)/x # Greedy match for second set parens
        ;
    if (defined($fids2) && defined($vals2))
    {
        ## We got a match
        print STDERR "\nValues are\n", $fids2, "\n", $vals2;
    }
    else
    {
        print STDERR "\n", 'Values are not defined';
    }
}
__END__
INSERT INTO bond (a,b,c,d,e,f,g) VALUES (1,2,3,4,5,6,7);
INSERT INTO bond (a,b,c,d,e,f,g) VALUES rubbish;
UPDATE issue SET (a,b,c) = (3,4,5);
--------------------------------------------------------------------------------------------------------------------------

One final question here: if my regexp had a large number of captures,
would there be any overhead using this method if the match failed late
on? Since less than 1% of the lines I am searching actually match
the pattern, I would like to keep overhead to a minimum.

I can see from the above example that the second test

INSERT INTO bond (a,b,c,d,e,f,g) VALUES rubbish;

causes both $fids2 and $vals2 to be undefined, so I assume there is
minimal overhead.
 

DJ Stunks

<snip>
Am I missing an obviously more elegant way of checking that all three
values have been captured ? I do not want to put the results in an
array as I need the variable names to be meaningful to other people
(although obviusly a hash may be possible).

I am only interested in a full match , i.e all 3 (or however many)
values captured .

<methods snipped>

um, none of your capturing parentheses had ?'s, so am I missing
something? or:

C:\tmp>cat tmp.pl
#!/usr/bin/perl

use strict;
use warnings;

while (<DATA>) {
print "Line $.";
if ( my ($var1,$var2) = m{ (\w+) : (\w+) }x ) {
print " matched.\n";
} else {
print " didn't match.\n";
}
}

__END__
this:match
this nomatch

C:\tmp>tmp.pl
Line 1 matched.
Line 2 didn't match.

?

-jp
 

Xicheng

I would use a temporary array like:
---------------------
while (<DATA>)
{
    my @tmp = /INSERT\s*INTO\s*bond\s*
               \((.*?)\)  # Non greedy match for first set parens
               .*?        # Any other stuff up to the next open paren non greedy
               \((.*)\)/x # Greedy match for second set parens
    ;
    if (@tmp == 2) {
        my ($fids2, $vals2) = @tmp;
        # do something on $fids2 and $vals2
        print STDERR "\nValues are\n", $fids2, "\n", $vals2;
    } else {
        print STDERR "\n", 'Values are not defined';
    }
}
__END__
INSERT INTO bond (a,b,c,d,e,f,g) VALUES (1,2,3,4,5,6,7);
INSERT INTO bond (a,b,c,d,e,f,g) VALUES rubbish;
UPDATE issue SET (a,b,c) = (3,4,5);

Best,
Xicheng
 

Anno Siegel

Like your former example, the regex doesn't contain any optional
parts, so if it matches at all both captures will be defined.
You could write

if ( my ($fids2, $vals2) = $str =~ /.../ ) {
# can assume that $fids2 and $vals2 are defined here
}

You can, of course, test definedness to see *if* the regex has
matched (testing any one of $fids2 or $vals2 would do), but that's
a little roundabout.
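
The elided /.../ above could be filled in along these lines. This is only a sketch under stated assumptions: the pattern and sample lines are copied from the earlier post, while the names $insert_re and parse_insert are hypothetical.

```perl
use strict;
use warnings;

# Pattern copied from the earlier post; under x the layout and comments are ignored.
my $insert_re = qr/INSERT\s*INTO\s*bond\s*
                   \((.*?)\)   # first parenthesised list, non-greedy
                   .*?         # anything up to the next open paren
                   \((.*)\)    # second parenthesised list, greedy
                  /x;

# Hypothetical helper: returns (fields, values) on a match, empty list otherwise.
sub parse_insert {
    my ($str) = @_;
    if ( my ($fids, $vals) = $str =~ $insert_re ) {
        # The regex has no optional parts, so both captures are defined here.
        return ($fids, $vals);
    }
    return;
}

for my $line (
    'INSERT INTO bond (a,b,c,d,e,f,g) VALUES (1,2,3,4,5,6,7);',
    'INSERT INTO bond (a,b,c,d,e,f,g) VALUES rubbish;',
    'UPDATE issue SET (a,b,c) = (3,4,5);',
) {
    my @parsed = parse_insert($line);
    print @parsed ? "fields=$parsed[0] values=$parsed[1]\n" : "no match\n";
}
```

The condition is the list assignment itself, so no separate defined() test is needed to detect a failed match.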

One final question here , if my regexp had a large number of captures
would there be any overhead using this method if the match failed late
on ? Since less than 1% of the lines I am searching for actually match
the pattern I would like to keep overhead to a minimum.

What is "this method"? Extracting captures in list context? I
don't think it has any impact in case of a failed match, since there
are no captures to extract.

I can see from the above example the second test

INSERT INTO bond (a,b,c,d,e,f,g) VALUES rubbish;

causes both $fids2 and $vals2 to be undefined so I assume there is
minimal overhead.

It causes the regex to not match. $fids2 and $vals2 are never assigned
to in this case. They are *not* assigned undefined values extracted
from the match operation. That can only happen with a regex that has
optional parts. [1]

Anno

[1] Well, you could also have a capture inside a negative lookahead or
lookbehind. That will never return anything *but* undef, so it's
unlikely to happen in real code.
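
A short sketch of the two cases described above, with made-up inputs and variable names. The first shows a capture inside a negative lookahead, which stays undefined even when the match succeeds. The second shows a failed match in list context, which returns an empty list; the variables on the left end up undefined because of that empty list, not because the match handed back undefined captures.

```perl
use strict;
use warnings;

# Case 1: capture inside a negative lookahead. The match can only succeed
# when the lookahead's contents fail, so the capture is never set.
my @look = 'ab' =~ /a(?!(x))b/;
print 'lookahead: ', scalar(@look), ' value(s), first is ',
    ( defined $look[0] ? 'defined' : 'undefined' ), "\n";

# Case 2: failed match in list context returns the empty list.
my @none = 'INSERT INTO bond (a,b) VALUES rubbish;' =~ /\((.*?)\).*?\((.*)\)/;
print 'failed match: ', scalar(@none), " value(s)\n";

# Assigning that empty list to two scalars leaves both undefined,
# whatever they held before.
my ($fids, $vals) = ('old', 'old');
($fids, $vals) = 'INSERT INTO bond (a,b) VALUES rubbish;' =~ /\((.*?)\).*?\((.*)\)/;
print 'after failed match: fids is ', ( defined $fids ? 'defined' : 'undefined' ),
    ', vals is ', ( defined $vals ? 'defined' : 'undefined' ), "\n";
```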
 
