Loop over regexp groups

January Weiner · Nov 13, 2006

Hello,

I am matching a regexp with an a priori unknown number of groups. I would
like to loop over all groups that were matched. For example:

/(\w+)\s(\w+)/ ;
#or
/(\w+)\s(\w+)\s(\w+)/ ;
# or something else

@groups = ...???

for( @groups ) {
    process_match( $_ ) ;
}

Of course, the above example is simplifying reality and could be replaced
by split(). Here are more details on the problem:

I am processing protein sequence files in the FASTA format. Depending on
the database, the FASTA headers may look like this:

O81231 (Q81999) Dehydrogenase alpha subunit

or like this:

O81231 123 Q81999

or

gi|O81231||li|Q81999

or, possibly,

O81231; synonyms: Q81999, P89812, O77781

or, basically, anything else. As you might guess, I'm interested in the
"O81231" or "Q81999" part. The idea is that my utility can take an
optional "regexp" string that matches the type of headers that are found in
a given database; while looping through the database, the regexp is
matched, and entries are made for any of the synonymous identifiers found
in one header.

Currently, I am assuming that I will not find more than four synonyms, and
I do the following:

for( $1, $2, $3, $4 ) {
    last unless $_ ;
    process_match( $_ ) ;
}

...which is, of course, crap.

Thanks in advance,
January

P.S. No, ([A-Z]\d{5}) would not match any identifier; the id format can
differ as well. Sometimes it is HBA_HUMAN.

--

Dr.Ruud · Nov 13, 2006

January Weiner wrote:

I am matching a regexp with an a priori unknown number of groups. I
would like to loop over all groups that were matched.

Use the g-modifier, see perlre.
Or use split + grep.
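
A minimal sketch of the split + grep idea, assuming for illustration only that an identifier is one capital letter followed by five digits (the original poster notes that real IDs can look different). The header string is one of the examples from the question; the variable names are made up here.

```perl
use strict;
use warnings;

# Example header taken from the question.
my $header = 'O81231; synonyms: Q81999, P89812, O77781';

# Assumed ID shape: one capital letter followed by five digits.
my $id_re = qr/^[A-Z]\d{5}$/;

# Split on runs of non-word characters, then keep only ID-like tokens.
my @ids = grep { $_ =~ $id_re } split /\W+/, $header;

print "$_\n" for @ids;
```

Anything that does not look like an ID (here the word "synonyms") is dropped by the grep, so the split pattern can stay crude.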

micmath · Nov 13, 2006

I am matching a regexp with an a priori unknown number of groups. I would
like to loop over all groups that were matched. For example:

/(\w+)\s(\w+)/ ;
#or
/(\w+)\s(\w+)\s(\w+)/ ;
# or something else

@groups = ...???

for( @groups ) {
    process_match( $_ ) ;
}

use strict;
use warnings;

my %styles = (
    style1 => qr/([A-Z]\d{5})/,          # one capital letter plus five digits
    style2 => qr/([A-Z]{3}_[A-Z]{5})/,   # e.g. HBA_HUMAN
);

my $header1 = "O81231 (Q81999) Dehydrogenase alpha subunit";
my $header2 = "O81231 (HBA_HUMAN) Dehydrogenase alpha subunit";

# Return the first substring of $header captured by $style.
sub get_id {
    my ($header, $style) = @_;
    my ($id) = $header =~ m/$style/;
    return $id;
}

print get_id($header1, $styles{style1}), "\n"; # prints O81231 (the leftmost match)
print get_id($header2, $styles{style2}), "\n"; # prints HBA_HUMAN

__END__

I'm not sure I entirely understand your question, but if you want to
store regular expressions in a structure you can loop over, you just
need the qr// operator. If I'm off base, just clarify what you mean and
I'll try again, but I hope that helps!
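
One way to read "a structure you can loop over" is an ordered list of qr// patterns that are tried in turn. The sketch below assumes the two ID shapes used in this thread; the helper name first_match is hypothetical.

```perl
use strict;
use warnings;

# Ordered list of candidate patterns (both shapes are assumptions
# borrowed from the examples in this thread).
my @styles = (
    qr/([A-Z]{3}_[A-Z]{5})/,   # e.g. HBA_HUMAN
    qr/([A-Z]\d{5})/,          # e.g. O81231
);

# Return the captures of the first pattern that matches, or an empty list.
sub first_match {
    my ($header) = @_;
    for my $re (@styles) {
        my @captures = $header =~ $re;
        return @captures if @captures;
    }
    return;
}

my @a = first_match('O81231 (HBA_HUMAN) Dehydrogenase alpha subunit');
my @b = first_match('O81231 (Q81999) Dehydrogenase alpha subunit');
print "@a\n";   # HBA_HUMAN
print "@b\n";   # O81231
```

An array is used rather than a hash so that the order in which the patterns are tried is predictable.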

Regards,
Michael
http://www.perlcircus.org/

anno4000 · Nov 13, 2006

January Weiner said:
Hello,

I am matching a regexp with an a priori unknown number of groups. I would
like to loop over all groups that were matched. For example:

/(\w+)\s(\w+)/ ;
#or
/(\w+)\s(\w+)\s(\w+)/ ;
# or something else

@groups = ...???

Very easy. Assuming the regex (with captures) in $re, and the string to
match in $_ (untested):

my @groups = m/$re/;

A regex in list context returns all its captures.

for( @groups ) {
    process_match( $_ ) ;
}

Right on. Even

process_match( $_) for m/$re/;

would work.
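
For illustration, here is a small sketch of the list-context match applied to headers like the ones in the question. The three patterns are made up for this example, and process_match() is only a stand-in that collects what it is given.

```perl
use strict;
use warnings;

# Stand-in for the poster's process_match(); here it just collects the IDs.
my @seen;
sub process_match { push @seen, $_[0] }

# Hypothetical header/pattern pairs with two and three groups.
my @cases = (
    [ 'O81231 (Q81999) Dehydrogenase alpha subunit', qr/^(\w+)\s+\((\w+)\)/ ],
    [ 'gi|O81231||li|Q81999',                        qr/^gi\|(\w+)\|\|li\|(\w+)/ ],
    [ 'O81231; synonyms: Q81999, P89812',            qr/^(\w+); synonyms: (\w+), (\w+)/ ],
);

for my $case (@cases) {
    my ($header, $re) = @$case;
    # List context: the match returns every capture, however many there are.
    my @groups = $header =~ $re;
    # A group that did not take part in the match comes back as undef.
    process_match($_) for grep { defined } @groups;
}

print "@seen\n";
```

The loop body never mentions $1, $2, ..., so the same code handles a pattern with any number of groups. If the match fails, @groups is empty and nothing is processed.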

Anno

--

January Weiner · Nov 13, 2006

I'm not sure I entirely understand your question, but if you want to
store regular expressions in a structure you can loop over, you just
need the qr// operator. If I'm off base, just clarify what you mean and
I'll try again, but I hope that helps!

Sorry, I think I did not make it clear. Assume the following:

- you have a regular expression
- the regular expression contains an unknown number of groups enclosed in
parentheses
- you would like to print these groups, one by one.

If you know exactly that there are two groups, you can do the following:

$a =~ /(one) (two)/ ;

print "group one: $1\n" ;
print "group two: $2\n" ;

My question is: what can I do if I do not know the number of the groups?
For example, the regexp can be
/(one) (two)/

or it can be
/(one) (two) (three)/

or even
/(one) (two) (three) (four)/

My question rephrased: how can I loop through the automatic variables $1
... $n, where n is the number of groups in the regexp?
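
Taken literally, the groups of the last successful match can be walked by number using the built-in offset arrays @- and @+ described in perlvar. The following is only a sketch; the string and the names $str and @groups are made up for the example.

```perl
use strict;
use warnings;

my $str = 'one two three';

my @groups;
if ($str =~ /(one) (two) (three)/) {
    # $#+ is the number of groups in the last successful match;
    # $-[$i] and $+[$i] are the start and end offsets of group $i.
    for my $i (1 .. $#+) {
        # A group that did not participate has undefined offsets.
        push @groups, defined $-[$i]
            ? substr($str, $-[$i], $+[$i] - $-[$i])
            : undef;
    }
}

print "group $_: $groups[$_ - 1]\n" for 1 .. @groups;
```

This reconstructs each $n from its offsets, so no symbolic references to the numbered variables are needed.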

Regards,
j.

--

January Weiner · Nov 13, 2006

Very easy. Assuming the regex (with captures) in $re, and the string to
match in $_ (untested):

my @groups = m/$re/;

A regex in list context returns all its captures.

Yes! That's it. Thank you so much. (very intuitive, when you think of it!)

j.

--

Mumia W. (reading news) · Nov 13, 2006

This

my @ids = /([[:upper:]\d]{3,})/g;

is a possibility.
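
As a rough check, this is what that pattern picks out of the four example headers from the question. The surrounding loop and the hash %ids_for are additions for illustration.

```perl
use strict;
use warnings;

# The example headers from the question.
my @headers = (
    'O81231 (Q81999) Dehydrogenase alpha subunit',
    'O81231 123 Q81999',
    'gi|O81231||li|Q81999',
    'O81231; synonyms: Q81999, P89812, O77781',
);

my %ids_for;
for my $header (@headers) {
    # /g in list context returns the capture from every match.
    my @ids = $header =~ /([[:upper:]\d]{3,})/g;
    $ids_for{$header} = \@ids;
    print "$header => @ids\n";
}
```

Two caveats follow from the character class: the bare number 123 in the second header is returned as if it were an ID, and because the underscore is not in the class, an identifier such as HBA_HUMAN would come back as two pieces, HBA and HUMAN.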

Dr.Ruud · Nov 13, 2006

anno4000 wrote:

Very easy. Assuming the regex (with captures) in $re, and the string
to match in $_ (untested):

my @groups = m/$re/;

A regex in list context returns all its captures.

I think he meant to have only one (multi-format) capture in $re, in which
case the g-modifier is missing.

$ perl -wle'
$_ = "a b c";
@_ = /([a-z])/;
print "@_"
'
a

$ perl -wle'
$_ = "a b c";
@_ = /([a-z])/g;
print "@_"
'
a b c

anno4000 · Nov 14, 2006

January Weiner said:
Yes! That's it. Thank you so much. (very intuitive, when you think of it!)

It is. The behavior varies slightly with whether the regex has captures
and/or the /g modifier, but the variations usually do what you mean.
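
The variations can be seen side by side in a short sketch. The test string and variable names are invented for the example; the behavior shown is the list-context behavior of the match operator as documented in perlop.

```perl
use strict;
use warnings;

my $str = 'a1 b2 c3';

my @no_capture   = $str =~ /[a-z]\d/;        # (1): just a true value
my @captures     = $str =~ /([a-z])(\d)/;    # ('a', '1'): groups of the first match
my @all_matches  = $str =~ /[a-z]\d/g;       # ('a1', 'b2', 'c3'): every whole match
my @all_captures = $str =~ /([a-z])(\d)/g;   # ('a', '1', 'b', '2', 'c', '3')
my @failed       = $str =~ /(x)/;            # (): empty list when the match fails

print "no capture:     @no_capture\n";
print "captures:       @captures\n";
print "all matches:    @all_matches\n";
print "all captures:   @all_captures\n";
print "failed (count): ", scalar(@failed), "\n";
```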

In fact, list assignment is the preferred method of accessing regex
captures. You avoid the special package variables $1, $2, ... and
their scoping issues. You can give the captures meaningful names,
individually or collectively. And, (your case), you don't have to
know in advance how many captures there are.

The only case where you can't avoid $1 etc. is when you need the
behavior of /g in scalar context and have captures.
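
A small sketch of that remaining case, assuming an ID is one capital letter plus five digits (an assumption for this example only). Each pass through the loop is one scalar-context /g match, so the capture is read from $1.

```perl
use strict;
use warnings;

my $header = 'O81231; synonyms: Q81999, P89812, O77781';

my @found;
# In scalar context, each /g match resumes where the previous one ended,
# so the capture has to be read from $1 inside the loop body.
while ($header =~ /([A-Z]\d{5})/g) {
    # pos() is the offset just past the current match.
    push @found, [ $1, pos($header) ];
}

printf "%s ends at offset %d\n", @$_ for @found;
```

This form is useful when something has to happen between matches, for example inspecting pos() as above; when only the captured strings are wanted, the list-context form is shorter.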

Anno
