balanced paren regex's

I

ivowel

[posted earlier in perl.modules, but no answer.]

dear perl users: I want to write a function that extracts "ordinary"
subroutines from perl code. (an equivalent task is extracting all
macros from a latex file.) I am not trying to be too clever. let's
presume I can recognize subs because subs and only subs always start at
the first character of a line and are not anonymous. a sub is followed
by a name and can contain nested expressions.

I can do plain pattern matching to find the first occurance of the
first sub: '^sub \w+'. but now I am stuck. I need to continue
on with a Text::Balanced expression right after, and after the
text::balanced is done, continue on with my regex search (\G).

my $text=
"
{ text }
expressions
sub a {
if (1==1) { print "true"; } # if string or comment could contain
unbalanced paren, even better
}
more expressions
sub b {
if (0) { print "false" }
}
";

and I want to call a subroutine getnextsub() that first will return
"sub a {\n if (1==1) { print \"true\"; } # if string or
comment could contain unbalanced paren, even better\n }"
and on the next call will return
"sub b {\n if (0) { print \"false\" }\n }";

this must be a fairly common problem (e.g., extracting tex macro
arguments, etc.), but short of mimicking C in writing very low-level
paren counter subroutines on individual characters, I cannot see how to
solve this. I do understand the issue of how to treat nested subs, but
this right now is only a secondary concern.

help appreciated.

sincerely, /iaw
 
X

Xicheng Jia

and I want to call a subroutine getnextsub() that first will return
"sub a {\n if (1==1) { print \"true\"; } # if string or
comment could contain unbalanced paren, even better\n }"
and on the next call will return
"sub b {\n if (0) { print \"false\" }\n }";

this must be a fairly common problem (e.g., extracting tex macro
arguments, etc.), but short of mimicking C in writing very low-level
paren counter subroutines on individual characters, I cannot see how to
solve this. I do understand the issue of how to treat nested subs, but
this right now is only a secondary concern.

Here is one way you may use(the iterator way from the book HOP):
#########################
use strict;
use warnings;

my $text= <<'END_TEST';
{ text }
expressions
sub a {
if (1==1) { print "true"; } # if string or comment could contain
unbalanced paren, even better
}
more expressions
sub b {
if (0) { print "false" }
}
safdsf
END_TEST

local our $n;
# pattern to track embedded braces
my $pattern = qr/
(?> (?{$n = 0})
(?:
[^{}]
|
\{ (?{$n++})
|
\} (?(?{$n != 0}) (?{$n--}) | (?!) )
)*
)(?(?{$n != 0})(?!))
/x;

# set the iterator
my $it = getnextsub($text);
my $count = 0;

# loop through the text and print out all functions
while (my $next_sub = $it->()) {
print "Subroutine-", ++$count, " is:\n${next_sub}\n\n";
}

# subroutine to set the iteratior
sub getnextsub {
my $text = shift;
return sub {
my $sub_def;
if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
$sub_def = $1;
}
$sub_def;
}
}
####################################
 
B

Ben Morrow

Quoth (e-mail address removed):
[posted earlier in perl.modules, but no answer.]

dear perl users: I want to write a function that extracts "ordinary"
subroutines from perl code. (an equivalent task is extracting all
macros from a latex file.) I am not trying to be too clever. let's
presume I can recognize subs because subs and only subs always start at
the first character of a line and are not anonymous. a sub is followed
by a name and can contain nested expressions.

I can do plain pattern matching to find the first occurance of the
first sub: '^sub \w+'. but now I am stuck. I need to continue
on with a Text::Balanced expression right after, and after the
text::balanced is done, continue on with my regex search (\G).

You mentioned Text::Balanced; how does extract_codeblock not do what you
want?

Ben
 
X

Xicheng Jia

Xicheng said:
and I want to call a subroutine getnextsub() that first will return
"sub a {\n if (1==1) { print \"true\"; } # if string or
comment could contain unbalanced paren, even better\n }"
and on the next call will return
"sub b {\n if (0) { print \"false\" }\n }";

this must be a fairly common problem (e.g., extracting tex macro
arguments, etc.), but short of mimicking C in writing very low-level
paren counter subroutines on individual characters, I cannot see how to
solve this. I do understand the issue of how to treat nested subs, but
this right now is only a secondary concern.

Here is one way you may use(the iterator way from the book HOP):
#########################
use strict;
use warnings;

my $text= <<'END_TEST';
{ text }
expressions
sub a {
if (1==1) { print "true"; } # if string or comment could contain
unbalanced paren, even better
}
more expressions
sub b {
if (0) { print "false" }
}
safdsf
END_TEST

local our $n;
# pattern to track embedded braces
my $pattern = qr/
(?> (?{$n = 0})
(?:
[^{}]
|
\{ (?{$n++})
|
\} (?(?{$n != 0}) (?{$n--}) | (?!) )
)*
)(?(?{$n != 0})(?!))
/x;

# set the iterator
my $it = getnextsub($text);
my $count = 0;

# loop through the text and print out all functions
while (my $next_sub = $it->()) {
print "Subroutine-", ++$count, " is:\n${next_sub}\n\n";
}

# subroutine to set the iteratior
sub getnextsub {
my $text = shift;
return sub {
my $sub_def;
if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
$sub_def = $1;
}
$sub_def;
}
}
####################################

BTW. you can make the subroutine "getnextsub" to skip any number of
function definitions.
#################
sub getnextsub {
my $text = shift;
return sub {
my $num_subs = shift || 1;
my ($sub_def, $cnt);
while ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
if (++$cnt == $num_subs) {
$sub_def = $1;
last;
}
}
return $sub_def || "undefined\n";
}
}
################
# if you have subroutine definitions a, b, c, d, e, f
# and in that order, then

$it = getnextsub($text); # set the iterator
$next_sub = $it->(); #get sub a {...}
$next_sub = $it->(3); #get sub d {...}
$next_sub = $it->(); #get sub e {...}
$next_sub = $it->(2); #return "undefined"
$it = getnextsub($text); # reset the iterator

Xicheng :)
 
X

Xicheng Jia

Xicheng said:
Xicheng said:
and I want to call a subroutine getnextsub() that first will return
"sub a {\n if (1==1) { print \"true\"; } # if string or
comment could contain unbalanced paren, even better\n }"
and on the next call will return
"sub b {\n if (0) { print \"false\" }\n }";

this must be a fairly common problem (e.g., extracting tex macro
arguments, etc.), but short of mimicking C in writing very low-level
paren counter subroutines on individual characters, I cannot see how to
solve this. I do understand the issue of how to treat nested subs, but
this right now is only a secondary concern.

Here is one way you may use(the iterator way from the book HOP):
#########################
use strict;
use warnings;

my $text= <<'END_TEST';
{ text }
expressions
sub a {
if (1==1) { print "true"; } # if string or comment could contain
unbalanced paren, even better
}
more expressions
sub b {
if (0) { print "false" }
}
safdsf
END_TEST

local our $n;
# pattern to track embedded braces
my $pattern = qr/
(?> (?{$n = 0})
(?:
[^{}]
|
\{ (?{$n++})
|
\} (?(?{$n != 0}) (?{$n--}) | (?!) )
)*
)(?(?{$n != 0})(?!))
/x;

# set the iterator
my $it = getnextsub($text);
my $count = 0;

# loop through the text and print out all functions
while (my $next_sub = $it->()) {
print "Subroutine-", ++$count, " is:\n${next_sub}\n\n";
}

# subroutine to set the iteratior
sub getnextsub {
my $text = shift;
return sub {
my $sub_def;

=> > if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {

should change from \s* to \s+, and \w+ is enough to replace \S+

if ($text =~ s/(sub\s+\w+\s*{$pattern})//) {

Xicheng
 
X

Xicheng Jia

Xicheng said:
Xicheng Jia wrote: [snip]
=> > if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {

should change from \s* to \s+, and \w+ is enough to replace \S+

if ($text =~ s/(sub\s+\w+\s*{$pattern})//) {
one more modification:

if ($text =~ s/.*?(sub\s+\w+\s*{$pattern})//) {

Xicheng
 
I

ivowel

I am truly becoming greedy now. is there a good/clever way to keep
track on which lineno the match was made on (i.e., how many \n occurred
before)? [something similar to $., but in connection with a text
match.]

regards,

/iaw
 
X

Xicheng Jia

I am truly becoming greedy now. is there a good/clever way to keep
track on which lineno the match was made on (i.e., how many \n occurred
before)? [something similar to $., but in connection with a text
match.]

sure you can. the key for this method (you might want to read [1] for
more introduction about iteration) is how to use "closure" in Perl
subroutines. I revised the previous subroutine again and fixed some
bugs. :

1) 's' modifier is added in the s/// expression, otherwise .*? can not
match multiple lines;
2) capture two parts: $1, and $2, and use something like $1 =~ tr/\n//;
to count the number of newlines in a substring.
3) "return" statement is revised so that you can use the iterator in a
while loop;
4) two variables added: $line_num to count newlines in the whole
matched text block. $lineno is the line_number containing the keyword
'sub' of your function declaration...

###################################
sub getnextsub {
my $text = shift;
my $line_num = 0;
return sub {
my $num_subs = shift || 1;
my ($sub_def, $cnt) = ("", 0);
while ($text =~ s/(.*?(sub\s*\S+\s*{$pattern}))//s) {
$line_num += ($1 =~ tr/\n//);
if (++$cnt == $num_subs) {
$sub_def = $2;
my $lineno = $line_num + 1 - ($sub_def =~ tr/\n//);
print "line_number is $lineno\n";
last;
}
}
print "undefined\n" if not $sub_def;
return $sub_def;
}
} # end of getnextsub #
###################################

don't know where you want the line numbers to go, so just print them
out.

Good luck
Xicheng

[1] "Higher-Order Perl: Transforming Programs with Programs", by M.J.
Dominus.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,007
Latest member
obedient dusk

Latest Threads

Top