balanced paren regex's

Discussion in 'Perl Misc' started by ivowel@gmail.com, Jun 17, 2006.

  1. Guest

    [posted earlier in perl.modules, but no answer.]

    dear perl users: I want to write a function that extracts "ordinary"
    subroutines from perl code. (an equivalent task is extracting all
    macros from a latex file.) I am not trying to be too clever. let's
    presume I can recognize subs because subs and only subs always start at
    the first character of a line and are not anonymous. a sub is followed
    by a name and can contain nested expressions.

    I can do plain pattern matching to find the first occurance of the
    first sub: '^sub \w+'. but now I am stuck. I need to continue
    on with a Text::Balanced expression right after, and after the
    text::balanced is done, continue on with my regex search (\G).

    my $text=
    "
    { text }
    expressions
    sub a {
    if (1==1) { print "true"; } # if string or comment could contain
    unbalanced paren, even better
    }
    more expressions
    sub b {
    if (0) { print "false" }
    }
    ";

    and I want to call a subroutine getnextsub() that first will return
    "sub a {\n if (1==1) { print \"true\"; } # if string or
    comment could contain unbalanced paren, even better\n }"
    and on the next call will return
    "sub b {\n if (0) { print \"false\" }\n }";

    this must be a fairly common problem (e.g., extracting tex macro
    arguments, etc.), but short of mimicking C in writing very low-level
    paren counter subroutines on individual characters, I cannot see how to
    solve this. I do understand the issue of how to treat nested subs, but
    this right now is only a secondary concern.

    help appreciated.

    sincerely, /iaw
     
    , Jun 17, 2006
    #1
    1. Advertising

  2. Dr.Ruud Guest

    schreef:

    > I want to write a function that extracts "ordinary"
    > subroutines from perl code.


    Search CPAN on 'balanced' or on 'parse'.

    See also
    http://search.cpan.org/search?module=PPI

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jun 17, 2006
    #2
    1. Advertising

  3. Xicheng Jia Guest

    wrote:
    [snip]
    > and I want to call a subroutine getnextsub() that first will return
    > "sub a {\n if (1==1) { print \"true\"; } # if string or
    > comment could contain unbalanced paren, even better\n }"
    > and on the next call will return
    > "sub b {\n if (0) { print \"false\" }\n }";
    >
    > this must be a fairly common problem (e.g., extracting tex macro
    > arguments, etc.), but short of mimicking C in writing very low-level
    > paren counter subroutines on individual characters, I cannot see how to
    > solve this. I do understand the issue of how to treat nested subs, but
    > this right now is only a secondary concern.


    Here is one way you may use(the iterator way from the book HOP):
    #########################
    use strict;
    use warnings;

    my $text= <<'END_TEST';
    { text }
    expressions
    sub a {
    if (1==1) { print "true"; } # if string or comment could contain
    unbalanced paren, even better
    }
    more expressions
    sub b {
    if (0) { print "false" }
    }
    safdsf
    END_TEST

    local our $n;
    # pattern to track embedded braces
    my $pattern = qr/
    (?> (?{$n = 0})
    (?:
    [^{}]
    |
    \{ (?{$n++})
    |
    \} (?(?{$n != 0}) (?{$n--}) | (?!) )
    )*
    )(?(?{$n != 0})(?!))
    /x;

    # set the iterator
    my $it = getnextsub($text);
    my $count = 0;

    # loop through the text and print out all functions
    while (my $next_sub = $it->()) {
    print "Subroutine-", ++$count, " is:\n${next_sub}\n\n";
    }

    # subroutine to set the iteratior
    sub getnextsub {
    my $text = shift;
    return sub {
    my $sub_def;
    if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
    $sub_def = $1;
    }
    $sub_def;
    }
    }
    ####################################
     
    Xicheng Jia, Jun 17, 2006
    #3
  4. Ben Morrow Guest

    Quoth :
    >
    > [posted earlier in perl.modules, but no answer.]
    >
    > dear perl users: I want to write a function that extracts "ordinary"
    > subroutines from perl code. (an equivalent task is extracting all
    > macros from a latex file.) I am not trying to be too clever. let's
    > presume I can recognize subs because subs and only subs always start at
    > the first character of a line and are not anonymous. a sub is followed
    > by a name and can contain nested expressions.
    >
    > I can do plain pattern matching to find the first occurance of the
    > first sub: '^sub \w+'. but now I am stuck. I need to continue
    > on with a Text::Balanced expression right after, and after the
    > text::balanced is done, continue on with my regex search (\G).


    You mentioned Text::Balanced; how does extract_codeblock not do what you
    want?

    Ben

    --
    The cosmos, at best, is like a rubbish heap scattered at random.
    Heraclitus
     
    Ben Morrow, Jun 17, 2006
    #4
  5. Xicheng Jia Guest

    Xicheng Jia wrote:
    > wrote:
    > [snip]
    > > and I want to call a subroutine getnextsub() that first will return
    > > "sub a {\n if (1==1) { print \"true\"; } # if string or
    > > comment could contain unbalanced paren, even better\n }"
    > > and on the next call will return
    > > "sub b {\n if (0) { print \"false\" }\n }";
    > >
    > > this must be a fairly common problem (e.g., extracting tex macro
    > > arguments, etc.), but short of mimicking C in writing very low-level
    > > paren counter subroutines on individual characters, I cannot see how to
    > > solve this. I do understand the issue of how to treat nested subs, but
    > > this right now is only a secondary concern.

    >
    > Here is one way you may use(the iterator way from the book HOP):
    > #########################
    > use strict;
    > use warnings;
    >
    > my $text= <<'END_TEST';
    > { text }
    > expressions
    > sub a {
    > if (1==1) { print "true"; } # if string or comment could contain
    > unbalanced paren, even better
    > }
    > more expressions
    > sub b {
    > if (0) { print "false" }
    > }
    > safdsf
    > END_TEST
    >
    > local our $n;
    > # pattern to track embedded braces
    > my $pattern = qr/
    > (?> (?{$n = 0})
    > (?:
    > [^{}]
    > |
    > \{ (?{$n++})
    > |
    > \} (?(?{$n != 0}) (?{$n--}) | (?!) )
    > )*
    > )(?(?{$n != 0})(?!))
    > /x;
    >
    > # set the iterator
    > my $it = getnextsub($text);
    > my $count = 0;
    >
    > # loop through the text and print out all functions
    > while (my $next_sub = $it->()) {
    > print "Subroutine-", ++$count, " is:\n${next_sub}\n\n";
    > }
    >
    > # subroutine to set the iteratior
    > sub getnextsub {
    > my $text = shift;
    > return sub {
    > my $sub_def;
    > if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
    > $sub_def = $1;
    > }
    > $sub_def;
    > }
    > }
    > ####################################


    BTW. you can make the subroutine "getnextsub" to skip any number of
    function definitions.
    #################
    sub getnextsub {
    my $text = shift;
    return sub {
    my $num_subs = shift || 1;
    my ($sub_def, $cnt);
    while ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
    if (++$cnt == $num_subs) {
    $sub_def = $1;
    last;
    }
    }
    return $sub_def || "undefined\n";
    }
    }
    ################
    # if you have subroutine definitions a, b, c, d, e, f
    # and in that order, then

    $it = getnextsub($text); # set the iterator
    $next_sub = $it->(); #get sub a {...}
    $next_sub = $it->(3); #get sub d {...}
    $next_sub = $it->(); #get sub e {...}
    $next_sub = $it->(2); #return "undefined"
    $it = getnextsub($text); # reset the iterator

    Xicheng :)
     
    Xicheng Jia, Jun 18, 2006
    #5
  6. Xicheng Jia Guest

    Xicheng Jia wrote:
    > Xicheng Jia wrote:
    > > wrote:
    > > [snip]
    > > > and I want to call a subroutine getnextsub() that first will return
    > > > "sub a {\n if (1==1) { print \"true\"; } # if string or
    > > > comment could contain unbalanced paren, even better\n }"
    > > > and on the next call will return
    > > > "sub b {\n if (0) { print \"false\" }\n }";
    > > >
    > > > this must be a fairly common problem (e.g., extracting tex macro
    > > > arguments, etc.), but short of mimicking C in writing very low-level
    > > > paren counter subroutines on individual characters, I cannot see how to
    > > > solve this. I do understand the issue of how to treat nested subs, but
    > > > this right now is only a secondary concern.

    > >
    > > Here is one way you may use(the iterator way from the book HOP):
    > > #########################
    > > use strict;
    > > use warnings;
    > >
    > > my $text= <<'END_TEST';
    > > { text }
    > > expressions
    > > sub a {
    > > if (1==1) { print "true"; } # if string or comment could contain
    > > unbalanced paren, even better
    > > }
    > > more expressions
    > > sub b {
    > > if (0) { print "false" }
    > > }
    > > safdsf
    > > END_TEST
    > >
    > > local our $n;
    > > # pattern to track embedded braces
    > > my $pattern = qr/
    > > (?> (?{$n = 0})
    > > (?:
    > > [^{}]
    > > |
    > > \{ (?{$n++})
    > > |
    > > \} (?(?{$n != 0}) (?{$n--}) | (?!) )
    > > )*
    > > )(?(?{$n != 0})(?!))
    > > /x;
    > >
    > > # set the iterator
    > > my $it = getnextsub($text);
    > > my $count = 0;
    > >
    > > # loop through the text and print out all functions
    > > while (my $next_sub = $it->()) {
    > > print "Subroutine-", ++$count, " is:\n${next_sub}\n\n";
    > > }
    > >
    > > # subroutine to set the iteratior
    > > sub getnextsub {
    > > my $text = shift;
    > > return sub {
    > > my $sub_def;


    => > if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {

    should change from \s* to \s+, and \w+ is enough to replace \S+

    if ($text =~ s/(sub\s+\w+\s*{$pattern})//) {

    Xicheng
     
    Xicheng Jia, Jun 18, 2006
    #6
  7. Xicheng Jia Guest

    Xicheng Jia wrote:
    > Xicheng Jia wrote:

    [snip]
    > => > if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
    >
    > should change from \s* to \s+, and \w+ is enough to replace \S+
    >
    > if ($text =~ s/(sub\s+\w+\s*{$pattern})//) {
    >

    one more modification:

    if ($text =~ s/.*?(sub\s+\w+\s*{$pattern})//) {

    Xicheng
     
    Xicheng Jia, Jun 18, 2006
    #7
  8. Guest

    thank you very much. regards, /iaw
     
    , Jun 18, 2006
    #8
  9. Guest

    I am truly becoming greedy now. is there a good/clever way to keep
    track on which lineno the match was made on (i.e., how many \n occurred
    before)? [something similar to $., but in connection with a text
    match.]

    regards,

    /iaw
     
    , Jun 18, 2006
    #9
  10. Xicheng Jia Guest

    wrote:
    > I am truly becoming greedy now. is there a good/clever way to keep
    > track on which lineno the match was made on (i.e., how many \n occurred
    > before)? [something similar to $., but in connection with a text
    > match.]


    sure you can. the key for this method (you might want to read [1] for
    more introduction about iteration) is how to use "closure" in Perl
    subroutines. I revised the previous subroutine again and fixed some
    bugs. :

    1) 's' modifier is added in the s/// expression, otherwise .*? can not
    match multiple lines;
    2) capture two parts: $1, and $2, and use something like $1 =~ tr/\n//;
    to count the number of newlines in a substring.
    3) "return" statement is revised so that you can use the iterator in a
    while loop;
    4) two variables added: $line_num to count newlines in the whole
    matched text block. $lineno is the line_number containing the keyword
    'sub' of your function declaration...

    ###################################
    sub getnextsub {
    my $text = shift;
    my $line_num = 0;
    return sub {
    my $num_subs = shift || 1;
    my ($sub_def, $cnt) = ("", 0);
    while ($text =~ s/(.*?(sub\s*\S+\s*{$pattern}))//s) {
    $line_num += ($1 =~ tr/\n//);
    if (++$cnt == $num_subs) {
    $sub_def = $2;
    my $lineno = $line_num + 1 - ($sub_def =~ tr/\n//);
    print "line_number is $lineno\n";
    last;
    }
    }
    print "undefined\n" if not $sub_def;
    return $sub_def;
    }
    } # end of getnextsub #
    ###################################

    don't know where you want the line numbers to go, so just print them
    out.

    Good luck
    Xicheng

    [1] "Higher-Order Perl: Transforming Programs with Programs", by M.J.
    Dominus.
     
    Xicheng Jia, Jun 18, 2006
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Curts
    Replies:
    1
    Views:
    340
    Lee Fesperman
    Aug 21, 2003
  2. Josuan
    Replies:
    0
    Views:
    296
    Josuan
    Jun 2, 2008
  3. Paul McGuire

    Re: regex for balanced parentheses?

    Paul McGuire, Jun 12, 2008, in forum: Python
    Replies:
    2
    Views:
    597
    Tim Arnold
    Jun 12, 2008
  4. C. J. Clegg

    Whitespace before opening paren in function call?

    C. J. Clegg, Mar 7, 2009, in forum: C Programming
    Replies:
    2
    Views:
    540
    Kaz Kylheku
    Mar 7, 2009
  5. iporter
    Replies:
    4
    Views:
    193
Loading...

Share This Page