Loop over regexp groups

January Weiner

Hello,

I am matching a regexp with an a priori unknown number of groups. I would
like to loop over all groups that were matched. For example:

/(\w+)\s(\w+)/ ;
#or
/(\w+)\s(\w+)\s(\w+)/ ;
# or something else

@groups = ...???

for( @groups ) {
    process_match( $_ ) ;
}

Of course, the above example is simplifying reality and could be replaced
by split(). Here are more details on the problem:

I am processing protein sequence files in the FASTA format. Depending on
the database, the FASTA headers may look like this:

O81231 (Q81999) Dehydrogenase alpha subunit

or like this:

O81231 123 Q81999

or

gi|O81231||li|Q81999

or, possibly,

O81231; synonyms: Q81999, P89812, O77781

or, basically, anything else. As you might guess, I'm interested in the
"O81231" or "Q81999" part. The idea is that my utility can take an
optional "regexp" string that matches the type of headers that are found in
a given database; while looping through the database, the regexp is
matched, and entries are made for any of the synonymous identifiers found
in one header.

Currently, I am assuming that I will not find more than four synonyms, and
I do the following:

for( $1, $2, $3, $4 ) {
    last unless $_ ;
    process_match( $_ ) ;
}

...which is, of course, crap.

Thanks in advance,
January

P.S. No, ([A-Z]\d{5}) would not match every identifier; the id format can
differ as well. Sometimes it is HBA_HUMAN.

--
 
Dr.Ruud

January Weiner wrote:
I am matching a regexp with an a priori unknown number of groups. I
would like to loop over all groups that were matched.

Use the g-modifier, see perlre.
Or use split + grep.
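
A minimal sketch of both suggestions, applied to one of the sample headers from the question. The single-capture pattern (one uppercase letter plus five digits) is only an assumption for illustration; as the P.S. in the question notes, real id formats differ.

```perl
use strict;
use warnings;

my $header = "O81231; synonyms: Q81999, P89812, O77781";

# Assumed id format: one uppercase letter followed by five digits.
# With /g in list context, a single capture group is returned once per match.
my @ids_g = $header =~ /\b([A-Z]\d{5})\b/g;

# Alternative: split on runs of non-word characters,
# then keep only the tokens that look like ids.
my @ids_split = grep { /^[A-Z]\d{5}$/ } split /\W+/, $header;

print "@ids_g\n";
print "@ids_split\n";
```

Both lists should hold the same four identifiers here. The /g form needs only one group in the pattern, no matter how many synonyms a header carries.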
 
micmath

use strict;
use warnings;

# One precompiled pattern per header style.
my %styles = (
    style1 => qr/([A-Z]\d{5})/,
    style2 => qr/([A-Z]{3}_[A-Z]{5})/,
);

my $header1 = "O81231 (Q81999) Dehydrogenase alpha subunit";
my $header2 = "O81231 (HBA_HUMAN) Dehydrogenase alpha subunit";

# Return the first substring of $header captured by $style.
sub get_id {
    my ($header, $style) = @_;
    my ($id) = $header =~ m/$style/;
    return $id;
}

print get_id($header1, $styles{style1}), "\n"; # prints O81231 (the leftmost match)
print get_id($header2, $styles{style2}), "\n"; # prints HBA_HUMAN

__END__

I'm not sure I entirely understand your question, but if you want to
store regular expressions in a structure you can loop over, you just
need the qr// operator. If I'm off base, just clarify what you mean and
I'll try again, but I hope that helps! :)

Regards,
Michael
http://www.perlcircus.org/
 
anno4000

January Weiner said:
Hello,

I am matching a regexp with an a priori unknown number of groups. I would
like to loop over all groups that were matched. For example:

/(\w+)\s(\w+)/ ;
#or
/(\w+)\s(\w+)\s(\w+)/ ;
# or something else

@groups = ...???

Very easy. Assuming the regex (with captures) in $re, and the string to
match in $_ (untested):

my @groups = m/$re/;

A regex in list context returns all its captures.

January Weiner said:
for( @groups ) {
    process_match( $_ ) ;
}

Right on. Even

process_match( $_) for m/$re/;

would work.
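
A short sketch of the same idea with two patterns that have different numbers of groups. The sample strings and the collecting version of process_match() are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for process_match(); it just collects the values.
my @seen;
sub process_match { push @seen, $_[0] }

my @cases = (
    [ qr/(\w+)\s(\w+)/,        "alpha beta"       ],
    [ qr/(\w+)\s(\w+)\s(\w+)/, "alpha beta gamma" ],
);

for my $case (@cases) {
    my ($re, $string) = @$case;
    # In list context a successful match returns ($1, $2, ...),
    # however many groups the pattern happens to have.
    my @groups = $string =~ $re;
    process_match($_) for @groups;
}

print "@seen\n";
```

Some caveats: a failed match returns an empty list, a pattern with no parentheses at all returns (1) on success, and a group that did not take part in the match (for example inside an alternation) comes back as undef, so a grep { defined } may be needed.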

Anno

 
January Weiner

micmath said:
I'm not sure I entirely understand your question, but if you want to
store regular expressions in a structure you can loop over, you just
need the qr// operator. If I'm off base, just clarify what you mean and
I'll try again, but I hope that helps! :)

Sorry, I think I did not make it clear. Assume the following:

- you have a regular expression
- the regular expression contains an unknown number of groups enclosed in
parentheses
- you would like to print these groups, one by one.


If you know exactly that there are two groups, you can do the following:

$a =~ /(one) (two)/ ;

print "group one: $1\n" ;
print "group two: $2\n" ;

My question is: what can I do if I do not know the number of the groups?
For example, the regexp can be
/(one) (two)/

or it can be
/(one) (two) (three)/

or even
/(one) (two) (three) (four)/

My question rephrased: how can I loop through the automatic variables
$1 ... $n, where n is the number of groups in the regexp?

Regards,
j.

--
 
Mumia W. (reading news)

This

my @ids = /([[:upper:]\d]{3,})/g;

is a possibility.
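
To see what that pattern picks up, here is a small sketch that runs it over the four sample headers from the question. The @results array is only there for illustration.

```perl
use strict;
use warnings;

my @headers = (
    "O81231 (Q81999) Dehydrogenase alpha subunit",
    "O81231 123 Q81999",
    "gi|O81231||li|Q81999",
    "O81231; synonyms: Q81999, P89812, O77781",
);

my @results;
for my $header (@headers) {
    # Every run of three or more uppercase letters or digits counts as an id.
    my @ids = $header =~ /([[:upper:]\d]{3,})/g;
    push @results, join(",", @ids);
}

print "$_\n" for @results;
```

Note that the pattern is deliberately loose: it also returns the "123" in the second header, and it would split an id like HBA_HUMAN into HBA and HUMAN because the underscore is not in the character class. Adding _ to the class, or filtering the results afterwards, may be needed depending on the database.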
 
Dr.Ruud

anno4000 wrote:
Very easy. Assuming the regex (with captures) in $re, and the string
to match in $_ (untested):

my @groups = m/$re/;

A regex in list context returns all its captures.


I think he meant to have only one (multi-format) capture in $re, so I am
missing the g-modifier.

$ perl -wle'
$_ = "a b c";
@_ = /([a-z])/;
print "@_"
'
a

$ perl -wle'
$_ = "a b c";
@_ = /([a-z])/g;
print "@_"
'
a b c
 
anno4000

January Weiner said:
Yes! That's it. Thank you so much. (very intuitive, when you think of it!)

It is. The behavior varies slightly with whether the regex has captures
and/or the /g modifier, but the variations usually do what you mean.

In fact, list assignment is the preferred method of accessing regex
captures. You avoid the special package variables $1, $2, ... and
their scoping issues. You can give the captures meaningful names,
individually or collectively. And, (your case), you don't have to
know in advance how many captures there are.
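
As a sketch of the naming point, assuming the first header style from the original question (the variable names and the pattern are illustrative only):

```perl
use strict;
use warnings;

my $header = "O81231 (Q81999) Dehydrogenase alpha subunit";

# The captures go straight into lexical variables with meaningful names,
# so $1, $2 and $3 are never touched directly.
my ($primary, $synonym, $description) =
    $header =~ /^(\S+)\s+\(([^)]+)\)\s+(.*)$/;

print "$primary / $synonym / $description\n";
```

If the match fails, all three variables are left undefined, which is easy to test for.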

The only case where you can't avoid $1 etc. is when you need the
behavior of /g in scalar context and have captures.
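
A sketch of that remaining case, again with an assumed id format of one uppercase letter plus five digits. Here something is done at each match position, so the match runs in scalar context and the capture is read from $1.

```perl
use strict;
use warnings;

my $header = "O81231; synonyms: Q81999, P89812, O77781";

my @found;
# In scalar context, /g finds one match per iteration and remembers
# where it stopped, so the capture must be read from $1 inside the loop.
while ($header =~ /\b([A-Z]\d{5})\b/g) {
    # $-[1] is the start offset of group 1 in the most recent match.
    push @found, sprintf("%s at offset %d", $1, $-[1]);
}

print "$_\n" for @found;
```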

Anno
 
