recursive brace matching with Ruby regexp

J

Jason Sweat

I wanted to learn Ruby, so I picked a small task of trying to write a
command line script to parse PHP classes and shell out some unit test
cases. I have it working for the most part, but I ran across a
problem trying to use Ruby regexp to find a set of matching curly
braces.

Please forgive the intrusion of this PHP code onto the list, but I
wanted to give you the flavor of what I am attempting to do, that can
be easily done with recursive regular expression available in the Perl
compatiable regexp engine.

<php>
$test = <<<EOS
/* some stuff */
class foo {
public \$var;
public function __construct() {}
public function bar() {
if (false) {
}
}

}
// some other stuff
EOS;

$re = <<<EOS
~(class\s+\w+\s+({((?>[^{}]+)|(?2))*}))~xms
EOS;
preg_match($re, $test, $match);
echo "your class matched:\n", $match[1];

</php>

Now it appears the regexp engine in Ruby does not support recursion
(at least in Ruby ruby 1.8.2 (2004-07-29) [i686-linux] that I am
working on, and with what I know how to test), thus the only
workaround I found was very ugly, model the nesting of braces to a
fixed depth, i.e.

open = '\{'
close = '\}'
other = '[^\{\}]*'
l1 = other+open+other+close+other
l2 = other+open+'('+l1 +')+'+other+close+other
l3 = other+open+'('+l2 +')+'+other+close+other
l4 = other+open+'('+l3 +')+'+other+close+other
l5 = other+open+'('+l4 +')+'+other+close+other
re = Regexp.new('class\s+'+@name+'\s+'+open+'((?:'+l5+')|(?:'+l4+')|(?:'+l3+')|(?:'+l2+')|(?:'+l1+')|(?:'+other+
'))+'+close, 'ixm')

This code did work, but sometimes timed out on valid real classes.

I expect I am probably missing some facet of Ruby that eaily allows me
to next regexp inside of the Ruby code in some fasion to achieve the
result I am looking for, but how to do so eludes me. Can anyone
provide some insight for me on this situation?

Thanks,

Regards,
Jason
 
J

James Edward Gray II

I expect I am probably missing some facet of Ruby that eaily allows me
to next regexp inside of the Ruby code in some fasion to achieve the
result I am looking for, but how to do so eludes me. Can anyone
provide some insight for me on this situation?

Ruby's regex engine doesn't yet support this, but will in future
versions.

Hope that helps.

James Edward Gray II
 
B

Bill Kelly

From: "James Edward Gray II said:
Ruby's regex engine doesn't yet support this, but will in future
versions.

Just for fun,


dat = <<-"END_TEXT"
/* some stuff */
class foo {
public \$var;
public function __construct() {}
public function bar() {
if (false) {
}
}

}
// some other stuff
END_TEXT

repl = []
true while dat.gsub!(/\([^(){}]*\)|\{[^(){}]*\}/) do |str|
repl.push(str)
"\0#{repl.length - 1}\0"
end
dat.scan(/class\s+\w+\s+\0\d+\0/) do
match = $&
true while match.gsub!(/\0(\d+)\0/) { repl[$1.to_i] }
puts "Your class matched:", match
end



The above matches arbitrarily deeply nested ( ) and { }
blocks... but all the "classes" it matches have to be
at toplevel, can't be inside a { } or ( ) themselves...

I don't think it would be hard to modify to handle nested
classes, but I haven't thought it through...


In case it's not obvious how it works... It replaces,
from innermost to outermost, () and {} spans with a
token, and saves the original span in an array.

When it finds a class match followed by one of these
tokens, it expands it back out to its original
representation.


Regards,

Bill
 
M

Mark Hubbart

Hi,

I wanted to learn Ruby, so I picked a small task of trying to write a
command line script to parse PHP classes and shell out some unit test
cases. I have it working for the most part, but I ran across a
problem trying to use Ruby regexp to find a set of matching curly
braces.

Please forgive the intrusion of this PHP code onto the list, but I
wanted to give you the flavor of what I am attempting to do, that can
be easily done with recursive regular expression available in the Perl
compatiable regexp engine.

<php>
$test = <<<EOS
/* some stuff */
class foo {
public \$var;
public function __construct() {}
public function bar() {
if (false) {
}
}

}
// some other stuff
EOS;

$re = <<<EOS
~(class\s+\w+\s+({((?>[^{}]+)|(?2))*}))~xms
EOS;
preg_match($re, $test, $match);
echo "your class matched:\n", $match[1];

</php>

Now it appears the regexp engine in Ruby does not support recursion
(at least in Ruby ruby 1.8.2 (2004-07-29) [i686-linux] that I am
working on, and with what I know how to test), thus the only
workaround I found was very ugly, model the nesting of braces to a
fixed depth, i.e.

open = '\{'
close = '\}'
other = '[^\{\}]*'
l1 = other+open+other+close+other
l2 = other+open+'('+l1 +')+'+other+close+other
l3 = other+open+'('+l2 +')+'+other+close+other
l4 = other+open+'('+l3 +')+'+other+close+other
l5 = other+open+'('+l4 +')+'+other+close+other
re = Regexp.new('class\s+'+@name+'\s+'+open+'((?:'+l5+')|(?:'+l4+')|(?:'+l3+')|(?:'+l2+')|(?:'+l1+')|(?:'+other+
'))+'+close, 'ixm')

This code did work, but sometimes timed out on valid real classes.

I expect I am probably missing some facet of Ruby that eaily allows me
to next regexp inside of the Ruby code in some fasion to achieve the
result I am looking for, but how to do so eludes me. Can anyone
provide some insight for me on this situation?

Regular expressions, by all standard definitions, aren't recursive.
Perl's regexen have been extended to allow it, but it really isn't
considered a standard regex feature. You might try using a simple
tokenizer... here's a quick attempt:

def parse(code)
chunks = []
loop do
chunks << text.slice!(/\A.*?(?=[{}])/m) # match start of string to
before next bracket
bracket = text.slice! 0
chunks << parse(text) if bracket == ?{
return chunks if bracket == ?}
return chunks if text.size == 0
end
end

This returns a recursive array that holds all the text chunks around
the brackets. here's some sample code (can't remember exactly what php
looks like ATM) and sample output:

PHP code:

class Foo {
def bar(){
if true{
do_stuff()
}else{
do_nothing()
}
clean_up()
}
}


Then, after recursively stripping whitespace from strings, the
pretty-printed array:

["class Foo",
["def bar()",
["if true", ["do_stuff()"], "else", ["do_nothing()"], "clean_up()"],
""],
nil]

uh, yeah, I know it drops off the last piece of text (reads it as nil)
but I don't want to figure that out just yet. Dinner calls :)

HTH,
Mark
 
J

Jason Sweat

repl = []
true while dat.gsub!(/\([^(){}]*\)|\{[^(){}]*\}/) do |str|
repl.push(str)
"\0#{repl.length - 1}\0"
end
dat.scan(/class\s+\w+\s+\0\d+\0/) do
match = $&
true while match.gsub!(/\0(\d+)\0/) { repl[$1.to_i] }
puts "Your class matched:", match
end

The above matches arbitrarily deeply nested ( ) and { }
blocks... but all the "classes" it matches have to be
at toplevel, can't be inside a { } or ( ) themselves...

In case it's not obvious how it works... It replaces,
from innermost to outermost, () and {} spans with a
token, and saves the original span in an array.

Thanks Bill,

I think I see how that works. I will play with it and see if it
solves my problem. I knew there had to be a much cleaner way than
what I was doing :)

Regards,
Jason
 
J

James Edward Gray II

Regular expressions, by all standard definitions, aren't recursive.
Perl's regexen have been extended to allow it, but it really isn't
considered a standard regex feature.

Of course, you are correct. However, recursive "Regular Expressions"
are becoming fairly common place now. Ruby's next regex engine will
support them as well.

Regular Expression has evolved considerably from the original
definition, with recursive capabilities being just another change in a
long line of added usefulness. Does that really means they cease to be
Regular Expressions? Longhorn will still be Windows, right?

I think of it more in terms of supersets or, for a programming slant,
subclassing. The spirit of Regular Expressions, patterns to
locate/breakdown text, is intact, I believe. They're just even more
handy now.

Just my two cents.

James Edward Gray II

P.S. Proving whether or not Perl 6 changes will still be "Regular
Expression" is left as an exercise for the reader. ;)
 
M

Mark Hubbart

Hi,

Of course, you are correct. However, recursive "Regular Expressions"
are becoming fairly common place now. Ruby's next regex engine will
support them as well.

Are you saying Oniguruma *will* support it, or *does* support it? It
would be nice if it already does :D
Regular Expression has evolved considerably from the original
definition, with recursive capabilities being just another change in a
long line of added usefulness. Does that really means they cease to be
Regular Expressions? Longhorn will still be Windows, right?

I think of it more in terms of supersets or, for a programming slant,
subclassing. The spirit of Regular Expressions, patterns to
locate/breakdown text, is intact, I believe. They're just even more
handy now.

My statement wasn't supposed to insinuate that adding recursion is
bad, or turns regular expressions into something else; I was just
pointing out that (last time I checked) you can't *expect* to be able
to use recursion in regexen in a particular language. Perl's regexen
are *more* than regexen; you can even embed perl code in them. Also,
I'm pretty sure the recursive matching wasn't in Perl when I last used
it, a couple years back.
Just my two cents.

James Edward Gray II

P.S. Proving whether or not Perl 6 changes will still be "Regular
Expression" is left as an exercise for the reader. ;)

:)

cheers,
Mark
 
J

Jos Backus


A-1. Syntax depend options

+ ONIG_SYNTAX_RUBY
(?m): dot(.) match newline

+ ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
(?s): dot(.) match newline
(?m): ^ match after newline, $ match before newline

Any idea why Ruby was chosen to behave differently from Perl here?

--
Jos Backus _/ _/_/_/ Sunnyvale, CA
_/ _/ _/
_/ _/_/_/
_/ _/ _/ _/
jos at catnook.com _/_/ _/_/_/ require 'std/disclaimer'
 
E

Eivind Eklund

Of course, you are correct. However, recursive "Regular Expressions"
are becoming fairly common place now. Ruby's next regex engine will
support them as well.

Regular Expression has evolved considerably from the original
definition, with recursive capabilities being just another change in a
long line of added usefulness. Does that really means they cease to be
Regular Expressions?

Yes.

It very specifically mean that they stop being regular expressions,
because the "regular" actually has a specific meaning (coming from
regular sets/context free language theory), and has the nice property
of compiling to a Deterministic Finite Automation with only linear
increase in size of the DFA compared to the regular expression.

The true regular expressions consists of ^, $, characters, *, and (|)
(alternation). Most of the "original" extensions compile cleanly to
this, and can still be considered regular. When you start using
backrefs in matching or recursion, the expression is no longer regular
(which, among other things, will almost certainly force your poor
regexp engine into Non-deterministic Finite Automation mode - unless
you've got exponential amounts of RAM, of course ;-)

http://en.wikipedia.org/wiki/Regular_expression gives a bit more background.

Eivind.
 
G

gabriele renzi

Mark Hubbart ha scritto:
Hi,

I wanted to learn Ruby, so I picked a small task of trying to write a
command line script to parse PHP classes and shell out some unit test
cases. I have it working for the most part, but I ran across a
problem trying to use Ruby regexp to find a set of matching curly
braces.

Please forgive the intrusion of this PHP code onto the list, but I
wanted to give you the flavor of what I am attempting to do, that can
be easily done with recursive regular expression available in the Perl
compatiable regexp engine.

<php>
$test = <<<EOS
/* some stuff */
class foo {
public \$var;
public function __construct() {}
public function bar() {
if (false) {
}
}

}
// some other stuff
EOS;

$re = <<<EOS
~(class\s+\w+\s+({((?>[^{}]+)|(?2))*}))~xms
EOS;
preg_match($re, $test, $match);
echo "your class matched:\n", $match[1];

</php>

Now it appears the regexp engine in Ruby does not support recursion
(at least in Ruby ruby 1.8.2 (2004-07-29) [i686-linux] that I am
working on, and with what I know how to test), thus the only
workaround I found was very ugly, model the nesting of braces to a
fixed depth, i.e.

open = '\{'
close = '\}'
other = '[^\{\}]*'
l1 = other+open+other+close+other
l2 = other+open+'('+l1 +')+'+other+close+other
l3 = other+open+'('+l2 +')+'+other+close+other
l4 = other+open+'('+l3 +')+'+other+close+other
l5 = other+open+'('+l4 +')+'+other+close+other
re = Regexp.new('class\s+'+@name+'\s+'+open+'((?:'+l5+')|(?:'+l4+')|(?:'+l3+')|(?:'+l2+')|(?:'+l1+')|(?:'+other+
'))+'+close, 'ixm')

This code did work, but sometimes timed out on valid real classes.

I expect I am probably missing some facet of Ruby that eaily allows me
to next regexp inside of the Ruby code in some fasion to achieve the
result I am looking for, but how to do so eludes me. Can anyone
provide some insight for me on this situation?


Regular expressions, by all standard definitions, aren't recursive.
Perl's regexen have been extended to allow it, but it really isn't
considered a standard regex feature. You might try using a simple
tokenizer... here's a quick attempt:

I wonder if someone ever looked at the ABNF module on raa. It is a tool
to generate a parser for a grammar in augmented backus naur form... the
parser is a Regexp object, so it seem thgere is some general way to
handle recursive parsing needs..
 
J

Jason Sweat

Indeed it does! I just tried it out. And named subexpressions work too
:) many new features since I last checked it out:

http://www.geocities.jp/kosako1/oniguruma/doc/RE.txt

Thanks for pointing it out.

Great!! Just comiled ruby from cvs (ruby 1.9.0 (2004-11-04)
[i686-linux]) and was able to match using:

dat = <<-"END_TEXT"
/* some stuff */
class foo {
public $var;
public function __construct() {}
public function bar() {
if (false) {
}
}

}
// some other stuff
END_TEXT

dat.match(/(class\s+\w+\s*(\{(?:[^\{\}]*|\g<2>)*\}))/)
print $1

Thank you very much for your help!!

Regards,
Jason
 
J

James Edward Gray II

Yes.

It very specifically mean that they stop being regular expressions,

Okay Eivind, what would you prefer we call Regular Expression from now
on?

James Edward Gray II

P.S. You better copy Jeffrey E. F. Friedl in you answer since his
"Mastering Regular Expressions" covers considerably more NFA than DFA.
 
A

Ara.T.Howard

Okay Eivind, what would you prefer we call Regular Expression from now
on?

James Edward Gray II

P.S. You better copy Jeffrey E. F. Friedl in you answer since his
"Mastering Regular Expressions" covers considerably more NFA than DFA.

but both of those recognize only regular languages (those one can describe
with a regular expression). it seems this new type must be called

'context-free expressions'

which would be the next class i think - that is unless this new type does
considerably more.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| When you do something, you should burn yourself completely, like a good
| bonfire, leaving no trace of yourself. --Shunryu Suzuki
===============================================================================
 
G

gabriele renzi

Matt Maycock ha scritto:
Ruby Automata? Irregular Expressions? Irregardless Expressions?
Syntax-Restricted-State-Machines (SRSM)? Susan/Jacob/Drew (depending
on whether your language makes them female, male, or neuter)?

pattern matching :)
 
N

Nikolai Weibull

It very specifically mean that they stop being regular expressions,
because the "regular" actually has a specific meaning (coming from
regular sets/context free language theory), and has the nice property
of compiling to a Deterministic Finite Automation with only linear
increase in size of the DFA compared to the regular expression.

Well, that's not quite true. They compile to NFAs that are linear in
size of the input regular expression. The DFA constructed from the
resulting NFA can be exponential in the size of that input. This is, of
course, only the case for pathological cases, but DFAs are considerably
larger than NFAs, even after minimizing them.
The true regular expressions consists of ^, $, characters, *, and (|)
(alternation).

Well, not really true either. True regular expressions don't include
zero-width assertions, such as ^ and $. They include characters from
some alphabet, * - Kleene closure, . - concatenation (which is implicit
in most syntaxes), and | - alternation.
Most of the "original" extensions compile cleanly to this, and can
still be considered regular.
Yes.

When you start using backrefs in matching or recursion, the expression
is no longer regular (which, among other things, will almost certainly
force your poor regexp engine into Non-deterministic Finite Automation
mode -

And backtracking mode as well. Regular Expressions With Backreferences
are an NP-complete problem and have none of the nice properties of
regular expressions.
unless you've got exponential amounts of RAM, of course ;-)

?

All in all, backreferencing is an addition that doesn't really add very
much yet ruins a lot of the whole deal with regular expressions. When
people now start to try to add recursion to regular expressions (which
would have been nice if they could have easily been included - but they
can't) one has to shiver a bit. If you wan't recursion, use something
suitable - like context-free languages; don't expect to solve everything
with only one tool.
nikolai
 
N

Nikolai Weibull

Okay Eivind, what would you prefer we call Regular Expression from now
on?

Call them regular expressions and stop using them for context-free
language matching.
nikolai
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,767
Messages
2,569,572
Members
45,045
Latest member
DRCM

Latest Threads

Top