balanced paren regex's

ivowel · Jun 17, 2006

[posted earlier in perl.modules, but no answer.]

dear perl users: I want to write a function that extracts "ordinary"
subroutines from perl code. (an equivalent task is extracting all
macros from a latex file.) I am not trying to be too clever. let's
presume I can recognize subs because subs and only subs always start at
the first character of a line and are not anonymous. a sub is followed
by a name and can contain nested expressions.

I can do plain pattern matching to find the first occurance of the
first sub: '^sub \w+'. but now I am stuck. I need to continue
on with a Text::Balanced expression right after, and after the
text::balanced is done, continue on with my regex search (\G).

my $text=
"
{ text }
expressions
sub a {
if (1==1) { print "true"; } # if string or comment could contain
unbalanced paren, even better
}
more expressions
sub b {
if (0) { print "false" }
}
";

and I want to call a subroutine getnextsub() that first will return
"sub a {\n if (1==1) { print \"true\"; } # if string or
comment could contain unbalanced paren, even better\n }"
and on the next call will return
"sub b {\n if (0) { print \"false\" }\n }";

this must be a fairly common problem (e.g., extracting tex macro
arguments, etc.), but short of mimicking C in writing very low-level
paren counter subroutines on individual characters, I cannot see how to
solve this. I do understand the issue of how to treat nested subs, but
this right now is only a secondary concern.

help appreciated.

sincerely, /iaw

Dr.Ruud · Jun 17, 2006

(e-mail address removed) schreef:

I want to write a function that extracts "ordinary"
subroutines from perl code.

Search CPAN on 'balanced' or on 'parse'.

See also
http://search.cpan.org/search?module=PPI

Xicheng Jia · Jun 17, 2006

and I want to call a subroutine getnextsub() that first will return
"sub a {\n if (1==1) { print \"true\"; } # if string or
comment could contain unbalanced paren, even better\n }"
and on the next call will return
"sub b {\n if (0) { print \"false\" }\n }";

this must be a fairly common problem (e.g., extracting tex macro
arguments, etc.), but short of mimicking C in writing very low-level
paren counter subroutines on individual characters, I cannot see how to
solve this. I do understand the issue of how to treat nested subs, but
this right now is only a secondary concern.

Here is one way you may use(the iterator way from the book HOP):
#########################
use strict;
use warnings;

my $text= <<'END_TEST';
{ text }
expressions
sub a {
if (1==1) { print "true"; } # if string or comment could contain
unbalanced paren, even better
}
more expressions
sub b {
if (0) { print "false" }
}
safdsf
END_TEST

local our $n;
# pattern to track embedded braces
my $pattern = qr/
(?> (?{$n = 0})
(?:
[^{}]
|
\{ (?{$n++})
|
\} (?(?{$n != 0}) (?{$n--}) | (?!) )
)*
)(?(?{$n != 0})(?!))
/x;

# set the iterator
my $it = getnextsub($text);
my $count = 0;

# loop through the text and print out all functions
while (my $next_sub = $it->()) {
print "Subroutine-", ++$count, " is:\n${next_sub}\n\n";
}

# subroutine to set the iteratior
sub getnextsub {
my $text = shift;
return sub {
my $sub_def;
if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
$sub_def = $1;
}
$sub_def;
}
}
####################################

Ben Morrow · Jun 17, 2006

Quoth (e-mail address removed):

[posted earlier in perl.modules, but no answer.]

dear perl users: I want to write a function that extracts "ordinary"
subroutines from perl code. (an equivalent task is extracting all
macros from a latex file.) I am not trying to be too clever. let's
presume I can recognize subs because subs and only subs always start at
the first character of a line and are not anonymous. a sub is followed
by a name and can contain nested expressions.

I can do plain pattern matching to find the first occurance of the
first sub: '^sub \w+'. but now I am stuck. I need to continue
on with a Text::Balanced expression right after, and after the
text::balanced is done, continue on with my regex search (\G).

You mentioned Text::Balanced; how does extract_codeblock not do what you
want?

Ben

Xicheng Jia · Jun 18, 2006

Xicheng said:
and I want to call a subroutine getnextsub() that first will return
"sub a {\n if (1==1) { print \"true\"; } # if string or
comment could contain unbalanced paren, even better\n }"
and on the next call will return
"sub b {\n if (0) { print \"false\" }\n }";

this must be a fairly common problem (e.g., extracting tex macro
arguments, etc.), but short of mimicking C in writing very low-level
paren counter subroutines on individual characters, I cannot see how to
solve this. I do understand the issue of how to treat nested subs, but
this right now is only a secondary concern.

Click to expand...

Here is one way you may use(the iterator way from the book HOP):
#########################
use strict;
use warnings;

my $text= <<'END_TEST';
{ text }
expressions
sub a {
if (1==1) { print "true"; } # if string or comment could contain
unbalanced paren, even better
}
more expressions
sub b {
if (0) { print "false" }
}
safdsf
END_TEST

local our $n;
# pattern to track embedded braces
my $pattern = qr/
(?> (?{$n = 0})
(?:
[^{}]
|
\{ (?{$n++})
|
\} (?(?{$n != 0}) (?{$n--}) | (?!) )
)*
)(?(?{$n != 0})(?!))
/x;

# set the iterator
my $it = getnextsub($text);
my $count = 0;

# loop through the text and print out all functions
while (my $next_sub = $it->()) {
print "Subroutine-", ++$count, " is:\n${next_sub}\n\n";
}

# subroutine to set the iteratior
sub getnextsub {
my $text = shift;
return sub {
my $sub_def;
if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
$sub_def = $1;
}
$sub_def;
}
}
####################################

BTW. you can make the subroutine "getnextsub" to skip any number of
function definitions.
#################
sub getnextsub {
my $text = shift;
return sub {
my $num_subs = shift || 1;
my ($sub_def, $cnt);
while ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
if (++$cnt == $num_subs) {
$sub_def = $1;
last;
}
}
return $sub_def || "undefined\n";
}
}
################
# if you have subroutine definitions a, b, c, d, e, f
# and in that order, then

$it = getnextsub($text); # set the iterator
$next_sub = $it->(); #get sub a {...}
$next_sub = $it->(3); #get sub d {...}
$next_sub = $it->(); #get sub e {...}
$next_sub = $it->(2); #return "undefined"
$it = getnextsub($text); # reset the iterator

Xicheng

Xicheng Jia · Jun 18, 2006

Xicheng said:
Xicheng said:

and I want to call a subroutine getnextsub() that first will return
"sub a {\n if (1==1) { print \"true\"; } # if string or
comment could contain unbalanced paren, even better\n }"
and on the next call will return
"sub b {\n if (0) { print \"false\" }\n }";

this must be a fairly common problem (e.g., extracting tex macro
arguments, etc.), but short of mimicking C in writing very low-level
paren counter subroutines on individual characters, I cannot see how to
solve this. I do understand the issue of how to treat nested subs, but
this right now is only a secondary concern.

Click to expand...

Here is one way you may use(the iterator way from the book HOP):
#########################
use strict;
use warnings;

my $text= <<'END_TEST';
{ text }
expressions
sub a {
if (1==1) { print "true"; } # if string or comment could contain
unbalanced paren, even better
}
more expressions
sub b {
if (0) { print "false" }
}
safdsf
END_TEST

local our $n;
# pattern to track embedded braces
my $pattern = qr/
(?> (?{$n = 0})
(?:
[^{}]
|
\{ (?{$n++})
|
\} (?(?{$n != 0}) (?{$n--}) | (?!) )
)*
)(?(?{$n != 0})(?!))
/x;

# set the iterator
my $it = getnextsub($text);
my $count = 0;

# loop through the text and print out all functions
while (my $next_sub = $it->()) {
print "Subroutine-", ++$count, " is:\n${next_sub}\n\n";
}

# subroutine to set the iteratior
sub getnextsub {
my $text = shift;
return sub {
my $sub_def;

Click to expand...

=> > if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {

should change from \s* to \s+, and \w+ is enough to replace \S+

if ($text =~ s/(sub\s+\w+\s*{$pattern})//) {

Xicheng

Xicheng Jia · Jun 18, 2006

Xicheng said:
Xicheng Jia wrote: [snip]
=> > if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {

should change from \s* to \s+, and \w+ is enough to replace \S+

if ($text =~ s/(sub\s+\w+\s*{$pattern})//) {

one more modification:

if ($text =~ s/.*?(sub\s+\w+\s*{$pattern})//) {

Xicheng

ivowel · Jun 18, 2006

thank you very much. regards, /iaw

ivowel · Jun 18, 2006

I am truly becoming greedy now. is there a good/clever way to keep
track on which lineno the match was made on (i.e., how many \n occurred
before)? [something similar to $., but in connection with a text
match.]

regards,

/iaw

Xicheng Jia · Jun 18, 2006

I am truly becoming greedy now. is there a good/clever way to keep
track on which lineno the match was made on (i.e., how many \n occurred
before)? [something similar to $., but in connection with a text
match.]

sure you can. the key for this method (you might want to read [1] for
more introduction about iteration) is how to use "closure" in Perl
subroutines. I revised the previous subroutine again and fixed some
bugs. :

1) 's' modifier is added in the s/// expression, otherwise .*? can not
match multiple lines;
2) capture two parts: $1, and $2, and use something like $1 =~ tr/\n//;
to count the number of newlines in a substring.
3) "return" statement is revised so that you can use the iterator in a
while loop;
4) two variables added: $line_num to count newlines in the whole
matched text block. $lineno is the line_number containing the keyword
'sub' of your function declaration...

###################################
sub getnextsub {
my $text = shift;
my $line_num = 0;
return sub {
my $num_subs = shift || 1;
my ($sub_def, $cnt) = ("", 0);
while ($text =~ s/(.*?(sub\s*\S+\s*{$pattern}))//s) {
$line_num += ($1 =~ tr/\n//);
if (++$cnt == $num_subs) {
$sub_def = $2;
my $lineno = $line_num + 1 - ($sub_def =~ tr/\n//);
print "line_number is $lineno\n";
last;
}
}
print "undefined\n" if not $sub_def;
return $sub_def;
}
} # end of getnextsub #
###################################

don't know where you want the line numbers to go, so just print them
out.

Good luck
Xicheng

[1] "Higher-Order Perl: Transforming Programs with Programs", by M.J.
Dominus.

FAQ 6.12 Can I use Perl regular expressions to match balanced text?	0	Jan 9, 2011
Tic Tac Toe Game	2	Mar 10, 2024
ChatBot	4	Jan 19, 2021
matching balanced parens	3	Dec 26, 2003
Addition and substraction of polynomials is working fine but the multiplication isn't; what's wrong with my code	1	Nov 22, 2022
Detecting sub/call mismatches?	3	Nov 21, 2012
Survey details won't go through using php, ajax, Mysql	0	Oct 26, 2023
Why is Python telling me variable is local not global?	3	Sep 2, 2023

balanced paren regex's

ivowel

Dr.Ruud

Xicheng Jia

Ben Morrow

Xicheng Jia

Xicheng Jia

Xicheng Jia

ivowel

ivowel

Xicheng Jia

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads