Loop over regexp groups

Discussion in 'Perl Misc' started by January Weiner, Nov 13, 2006.

  1. Hello,

    I am matching a regexp with an a priori unknown number of groups. I would
    like to loop over all groups that were matched. For example:

    /(\w+)\s(\w+)/ ;
    #or
    /(\w+)\s(\w+)\s(\w+)/ ;
    # or something else

    @groups = ...???

    for( @groups ) {
    process_match( $_ ) ;
    }

    Of course, the above example is simplifying reality and could be replaced
    by split(). Here are more details on the problem:

    I am processing protein sequence files in the FASTA format. Depending on
    the database, the FASTA headers may look like this:

    >O81231 (Q81999) Dehydrogenase alpha subunit


    or like this

    > O81231 123 Q81999


    or

    >gi|O81231||li|Q81999


    or, possibly,

    >O81231; synonyms: Q81999, P89812, O77781


    or, basically, anything else. As you might guess, I'm interested in the
    "O81231" or "Q81999" part. The idea is that my utility can take an
    optional "regexp" string that matches the type of headers that are found in
    a given database; while looping through the database, the regexp is
    matched, and entries are made for any of the synonymous identifiers found
    in one header.

    Currently, I am assuming that I will not find more than four synonyms, and
    I do the following:

    for( $1, $2, $3, $4 ) {
    last unless $_ ;
    process_match( $_ ) ;
    }

    ...which is, of course, crap.

    Thanks in advance,
    January

    P.S. No, ([A-Z]\d{5}) would not match every identifier; the id format can
    differ as well. Sometimes it is HBA_HUMAN.

    --
    January Weiner, Nov 13, 2006
    #1

  2. January Weiner

    Dr.Ruud Guest

    January Weiner wrote:

    > I am matching a regexp with an a priori unknown number of groups. I
    > would like to loop over all groups that were matched.


    Use the g-modifier, see perlre.
    Or use split + grep.
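
    A minimal sketch of both suggestions, not a definitive recipe. It assumes,
    purely for illustration, that an identifier is one uppercase letter
    followed by five digits; the original poster points out that other formats
    exist. The sample header is copied from the question, and the variable
    names are made up here.

```perl
use strict;
use warnings;

# Sample header taken from the examples in the question.
my $header = 'O81231; synonyms: Q81999, P89812, O77781';

# Assumed ID shape (illustration only): one uppercase letter plus five digits.
my $id = qr/[A-Z]\d{5}/;

# Option 1: with /g in list context, the match returns every capture it finds.
my @by_global = $header =~ /($id)/g;

# Option 2: split on runs of non-word characters, keep only ID-shaped fields.
my @by_split = grep { /\A$id\z/ } split /\W+/, $header;

print "@by_global\n";
print "@by_split\n";
```

    Both arrays end up holding the same four identifiers for this header. The
    split variant depends on the separators being non-word characters, so it
    would need adjusting for other header layouts.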

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Nov 13, 2006
    #2

  3. January Weiner

    Guest

    On Nov 13, 12:49 pm, January Weiner <> wrote:
    > I am matching a regexp with an a priori unknown number of groups. I would
    > like to loop over all groups that were matched. For example:
    >
    > /(\w+)\s(\w+)/ ;
    > #or
    > /(\w+)\s(\w+)\s(\w+)/ ;
    > # or something else
    >
    > @groups = ...???
    >
    > for( @groups ) {
    > process_match( $_ ) ;
    > }



    use strict;
    use warnings;

    my %styles = (
        style1 => qr/([A-Z]\d{5})/,
        style2 => qr/([A-Z]{3}_[A-Z]{5})/,
    );

    my $header1 = "O81231 (Q81999) Dehydrogenase alpha subunit";
    my $header2 = "O81231 (HBA_HUMAN) Dehydrogenase alpha subunit";

    # Return the first capture of $style in $header (undef if there is no match).
    sub get_id {
        my ($header, $style) = @_;
        my ($id) = $header =~ m/$style/;
        return $id;
    }

    print get_id($header1, $styles{style1}), "\n"; # prints O81231 (the first match)
    print get_id($header2, $styles{style2}), "\n"; # prints HBA_HUMAN

    __END__

    I'm not sure I entirely understand your question, but if you want to
    store regular expressions in a structure you can loop over, you just
    need the qr// operator. If I'm off base, just clarify what you mean and
    I'll try again, but I hope that helps! :)
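
    To illustrate the "structure you can loop over" idea, here is a rough
    sketch that tries a list of qr// patterns in turn and returns whatever the
    first matching one captured. The two patterns and the name get_ids are
    assumptions made up for this example; they are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical header styles; each may capture a different number of IDs.
my @styles = (
    qr/^(\w+) \((\w+)\)/,          # e.g. "O81231 (Q81999) Dehydrogenase ..."
    qr/^gi\|(\w+)\|\|li\|(\w+)/,   # e.g. "gi|O81231||li|Q81999"
);

# Return the captures of the first style that matches, or an empty list.
sub get_ids {
    my ($header) = @_;
    for my $style (@styles) {
        my @ids = $header =~ $style;   # list context: all captures
        return @ids if @ids;
    }
    return;
}

print join(',', get_ids('O81231 (Q81999) Dehydrogenase alpha subunit')), "\n";
print join(',', get_ids('gi|O81231||li|Q81999')), "\n";
```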

    Regards,
    Michael
    http://www.perlcircus.org/
    , Nov 13, 2006
    #3
  4. January Weiner

    -berlin.de Guest

    January Weiner <> wrote in comp.lang.perl.misc:
    > Hello,
    >
    > I am matching a regexp with an a priori unknown number of groups. I would
    > like to loop over all groups that were matched. For example:
    >
    > /(\w+)\s(\w+)/ ;
    > #or
    > /(\w+)\s(\w+)\s(\w+)/ ;
    > # or something else
    >
    > @groups = ...???


    Very easy. Assuming the regex (with captures) in $re, and the string to
    match in $_ (untested):

    my @groups = m/$re/;

    A successful match in list context returns all its captures; a failed
    match returns the empty list.

    > for( @groups ) {
    > process_match( $_ ) ;
    > }


    Right on. Even

    process_match( $_) for m/$re/;

    would work.
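
    A small self-contained sketch of the same idea, assuming the pattern
    arrives as a plain string (as a user-supplied option would). The
    process_match() here is only a stand-in that records its argument, and the
    sample headers and patterns are invented for the example.

```perl
use strict;
use warnings;

# Stand-in for the poster's process_match(); it only collects its argument.
my @seen;
sub process_match { push @seen, $_[0] }

# Hypothetical header/pattern pairs with two and three capture groups.
my @cases = (
    [ 'O81231 Q81999',        '(\w+)\s(\w+)'        ],
    [ 'O81231 Q81999 P89812', '(\w+)\s(\w+)\s(\w+)' ],
);

for my $case (@cases) {
    my ($header, $re) = @$case;

    # List context: one element per capture group, empty list on failure.
    my @groups = $header =~ m/$re/;

    # Groups that took no part in the match come back as undef; skip them.
    process_match($_) for grep { defined } @groups;
}

print "@seen\n";
```

    One caveat: if the pattern contains no capture groups at all, a successful
    match in list context returns the single value 1, so a user-supplied
    pattern without parentheses would pass 1 to process_match().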

    Anno


    -berlin.de, Nov 13, 2006
    #4
  5. wrote:
    > I'm not sure I entirely understand your question, but if you want to
    > store regular expressions in a structure you can loop over, you just
    > need the qr// operator. If I'm off base, just clarify what you mean and
    > I'll try again, but I hope that helps! :)


    Sorry, I think I did not get it clear. Assume the following:

    - you have a regular expression
    - the regular expression contains an unknown number of groups enclosed in
    parentheses
    - you would like to print these groups, one by one.


    If you know exactly that there are two groups, you can do the following:

    $a =~ /(one) (two)/ ;

    print "group one: $1\n" ;
    print "group two: $2\n" ;

    My question is: what can I do if I do not know the number of the groups?
    For example, the regexp can be
    /(one) (two)/

    or it can be
    /(one) (two) (three)/

    or even
    /(one) (two) (three) (four)/

    My question rephrased: how can I loop through the automatic variables $1
    .... $n, where n is the number of groups in the regexp?

    Regards,
    j.

    --
    January Weiner, Nov 13, 2006
    #5
  6. -berlin.de wrote:
    > Very easy. Assuming the regex (with captures) in $re, and the string to
    > match in $_ (untested):


    > my @groups = m/$re/;


    > A regex in list context returns all its captures.


    Yes! That's it. Thank you so much. (very intuitive, when you think of it!)

    j.

    --
    January Weiner, Nov 13, 2006
    #6
  7. On 11/13/2006 06:49 AM, January Weiner wrote:
    > Hello,
    >
    > I am matching a regexp with an a priori unknown number of groups. I would
    > like to loop over all groups that were matched. For example:
    >
    > /(\w+)\s(\w+)/ ;
    > #or
    > /(\w+)\s(\w+)\s(\w+)/ ;
    > # or something else
    >
    > @groups = ...???
    >
    > for( @groups ) {
    > process_match( $_ ) ;
    > }
    >
    > Of course, the above example is simplifying reality and could be replaced
    > by split(). Here are more details on the problem:
    >
    > I am processing protein sequence files in the FASTA format. Depending on
    > the database, the FASTA headers may look like that:
    >
    > >O81231 (Q81999) Dehydrogenase alpha subunit
    >
    > or like that
    >
    > > O81231 123 Q81999
    >
    > or
    >
    > >gi|O81231||li|Q81999
    >
    > or, possibly,
    >
    > >O81231; synonyms: Q81999, P89812, O77781

    >
    > or, basically, anything else. As you might guess, I'm interested in the
    > "O81231" or "Q81999" part. The idea is that my utility can take an
    > optional "regexp" string that matches the type of headers that are found in
    > a given database; while looping through the database, the regexp is
    > matched, and entries are made for any of the synonymous identifiers found
    > in one header.
    >
    > Currently, I am assuming that I will not find more than four synonyms, and
    > I do the following:
    >
    > for( $1, $2, $3, $4 ) {
    > last unless $_ ;
    > process_match( $_ ) ;
    > }
    >
    > ...which is, of course, crap.
    >
    > Thanks in advance,
    > January
    >
    > P.S. No, ([A-Z]\d{5}) would not match every identifier; the id format can
    > differ as well. Sometimes it is HBA_HUMAN.
    >


    This

    my @ids = /([[:upper:]\d]{3,})/g;

    is a possibility.
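
    As a rough check of what that pattern picks up, the sketch below runs it
    over three of the headers from the question plus one made-up HBA_HUMAN
    header. The helper name ids_in is an assumption introduced here.

```perl
use strict;
use warnings;

# Return every run of three or more uppercase letters or digits in a header.
sub ids_in {
    my ($header) = @_;
    return $header =~ /([[:upper:]\d]{3,})/g;
}

my @headers = (
    '>O81231 (Q81999) Dehydrogenase alpha subunit',
    '>gi|O81231||li|Q81999',
    '>O81231; synonyms: Q81999, P89812, O77781',
    '>HBA_HUMAN hemoglobin',    # hypothetical header, not from the thread
);

for my $header (@headers) {
    my @ids = ids_in($header);
    print "$header => @ids\n";
}
```

    Two limitations show up quickly. The character class does not include the
    underscore, so HBA_HUMAN comes back as two pieces (HBA and HUMAN), and a
    purely numeric field such as the 123 in "> O81231 123 Q81999" would be
    returned as if it were an identifier.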


    --
    Mumia W. (reading news), Nov 13, 2006
    #7
  8. January Weiner

    Dr.Ruud Guest

    -berlin.de wrote:

    > Very easy. Assuming the regex (with captures) in $re, and the string
    > to match in $_ (untested):
    >
    > my @groups = m/$re/;
    >
    > A regex in list context returns all its captures.



    I think he meant to have only one (multi-format) capture in $re, so I am
    missing the g-modifier.

    $ perl -wle'
    $_ = "a b c";
    @_ = /([a-z])/;
    print "@_"
    '
    a

    $ perl -wle'
    $_ = "a b c";
    @_ = /([a-z])/g;
    print "@_"
    '
    a b c

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Nov 13, 2006
    #8
  9. January Weiner

    -berlin.de Guest

    January Weiner <> wrote in comp.lang.perl.misc:
    > -berlin.de wrote:
    > > Very easy. Assuming the regex (with captures) in $re, and the string to
    > > match in $_ (untested):

    >
    > > my @groups = m/$re/;

    >
    > > A regex in list context returns all its captures.

    >
    > Yes! That's it. Thank you so much. (very intuitive, when you think of it!)


    It is. The behavior varies slightly with whether the regex has captures
    and/or the /g modifier, but the variations usually do what you mean.
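
    For reference, the four combinations can be sketched like this. The test
    string and variable names are invented for the example.

```perl
use strict;
use warnings;

my $s = 'a1 b2 c3';

# No captures, no /g: a successful match returns the single value 1.
my @plain = $s =~ /[a-z]\d/;

# Captures, no /g: the captures of the first match ($1, $2, ...).
my @captures = $s =~ /([a-z])(\d)/;

# No captures, /g: every matched substring.
my @all_matches = $s =~ /[a-z]\d/g;

# Captures and /g: the captures of every match, flattened into one list.
my @all_captures = $s =~ /([a-z])(\d)/g;

print "plain:        @plain\n";
print "captures:     @captures\n";
print "all matches:  @all_matches\n";
print "all captures: @all_captures\n";
```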

    In fact, list assignment is the preferred method of accessing regex
    captures. You avoid the special package variables $1, $2, ... and
    their scoping issues. You can give the captures meaningful names,
    individually or collectively. And, (your case), you don't have to
    know in advance how many captures there are.

    The only case where you can't avoid $1 etc. is when you need the
    behavior of /g in scalar context and have captures.
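
    A short sketch of that scalar-context case: each pass through the loop
    finds the next match, so $1 is the only way to see the current capture.
    The ID pattern (one uppercase letter plus five digits) is an assumption
    for illustration, and the header is one of the samples from the question.

```perl
use strict;
use warnings;

my $header = 'O81231; synonyms: Q81999, P89812, O77781';

my @found;
# In scalar context, /g advances through the string one match per iteration.
while ($header =~ /([A-Z]\d{5})/g) {
    # $1 holds the current capture; pos() is the offset where this match ended.
    push @found, { id => $1, end => pos($header) };
}

print "$_->{id} ends at offset $_->{end}\n" for @found;
```

    This form is useful when something has to happen between matches, for
    example recording the position of each one, which the list-context form
    cannot do.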

    Anno
    -berlin.de, Nov 14, 2006
    #9
