recursive brace matching with Ruby regexp

Discussion in 'Ruby' started by Jason Sweat, Nov 5, 2004.

  1. Jason Sweat

    Jason Sweat Guest

    I wanted to learn Ruby, so I picked a small task of trying to write a
    command line script to parse PHP classes and shell out some unit test
    cases. I have it working for the most part, but I ran across a
    problem trying to use Ruby regexp to find a set of matching curly
    braces.

    Please forgive the intrusion of this PHP code onto the list, but I
    wanted to give you the flavor of what I am attempting to do, that can
    be easily done with recursive regular expression available in the Perl
    compatiable regexp engine.

    <php>
    $test = <<<EOS
    /* some stuff */
    class foo {
    public \$var;
    public function __construct() {}
    public function bar() {
    if (false) {
    }
    }

    }
    // some other stuff
    EOS;

    $re = <<<EOS
    ~(class\s+\w+\s+({((?>[^{}]+)|(?2))*}))~xms
    EOS;
    preg_match($re, $test, $match);
    echo "your class matched:\n", $match[1];

    </php>

    Now it appears the regexp engine in Ruby does not support recursion
    (at least in Ruby ruby 1.8.2 (2004-07-29) [i686-linux] that I am
    working on, and with what I know how to test), thus the only
    workaround I found was very ugly, model the nesting of braces to a
    fixed depth, i.e.

    open = '\{'
    close = '\}'
    other = '[^\{\}]*'
    l1 = other+open+other+close+other
    l2 = other+open+'('+l1 +')+'+other+close+other
    l3 = other+open+'('+l2 +')+'+other+close+other
    l4 = other+open+'('+l3 +')+'+other+close+other
    l5 = other+open+'('+l4 +')+'+other+close+other
    re = Regexp.new('class\s+'[email protected]+'\s+'+open+'((?:'+l5+')|(?:'+l4+')|(?:'+l3+')|(?:'+l2+')|(?:'+l1+')|(?:'+other+
    '))+'+close, 'ixm')

    This code did work, but sometimes timed out on valid real classes.

    I expect I am probably missing some facet of Ruby that eaily allows me
    to next regexp inside of the Ruby code in some fasion to achieve the
    result I am looking for, but how to do so eludes me. Can anyone
    provide some insight for me on this situation?

    Thanks,

    Regards,
    Jason
     
    Jason Sweat, Nov 5, 2004
    #1
    1. Advertisements

  2. Ruby's regex engine doesn't yet support this, but will in future
    versions.

    Hope that helps.

    James Edward Gray II
     
    James Edward Gray II, Nov 5, 2004
    #2
    1. Advertisements

  3. Jason Sweat

    Bill Kelly Guest

    Just for fun,


    dat = <<-"END_TEXT"
    /* some stuff */
    class foo {
    public \$var;
    public function __construct() {}
    public function bar() {
    if (false) {
    }
    }

    }
    // some other stuff
    END_TEXT

    repl = []
    true while dat.gsub!(/\([^(){}]*\)|\{[^(){}]*\}/) do |str|
    repl.push(str)
    "\0#{repl.length - 1}\0"
    end
    dat.scan(/class\s+\w+\s+\0\d+\0/) do
    match = $&
    true while match.gsub!(/\0(\d+)\0/) { repl[$1.to_i] }
    puts "Your class matched:", match
    end



    The above matches arbitrarily deeply nested ( ) and { }
    blocks... but all the "classes" it matches have to be
    at toplevel, can't be inside a { } or ( ) themselves...

    I don't think it would be hard to modify to handle nested
    classes, but I haven't thought it through...


    In case it's not obvious how it works... It replaces,
    from innermost to outermost, () and {} spans with a
    token, and saves the original span in an array.

    When it finds a class match followed by one of these
    tokens, it expands it back out to its original
    representation.


    Regards,

    Bill
     
    Bill Kelly, Nov 5, 2004
    #3
  4. Jason Sweat

    Mark Hubbart Guest

    Hi,

    Regular expressions, by all standard definitions, aren't recursive.
    Perl's regexen have been extended to allow it, but it really isn't
    considered a standard regex feature. You might try using a simple
    tokenizer... here's a quick attempt:

    def parse(code)
    chunks = []
    loop do
    chunks << text.slice!(/\A.*?(?=[{}])/m) # match start of string to
    before next bracket
    bracket = text.slice! 0
    chunks << parse(text) if bracket == ?{
    return chunks if bracket == ?}
    return chunks if text.size == 0
    end
    end

    This returns a recursive array that holds all the text chunks around
    the brackets. here's some sample code (can't remember exactly what php
    looks like ATM) and sample output:

    PHP code:

    class Foo {
    def bar(){
    if true{
    do_stuff()
    }else{
    do_nothing()
    }
    clean_up()
    }
    }


    Then, after recursively stripping whitespace from strings, the
    pretty-printed array:

    ["class Foo",
    ["def bar()",
    ["if true", ["do_stuff()"], "else", ["do_nothing()"], "clean_up()"],
    ""],
    nil]

    uh, yeah, I know it drops off the last piece of text (reads it as nil)
    but I don't want to figure that out just yet. Dinner calls :)

    HTH,
    Mark
     
    Mark Hubbart, Nov 5, 2004
    #4
  5. Jason Sweat

    Jason Sweat Guest

    Thanks Bill,

    I think I see how that works. I will play with it and see if it
    solves my problem. I knew there had to be a much cleaner way than
    what I was doing :)

    Regards,
    Jason
     
    Jason Sweat, Nov 5, 2004
    #5
  6. Of course, you are correct. However, recursive "Regular Expressions"
    are becoming fairly common place now. Ruby's next regex engine will
    support them as well.

    Regular Expression has evolved considerably from the original
    definition, with recursive capabilities being just another change in a
    long line of added usefulness. Does that really means they cease to be
    Regular Expressions? Longhorn will still be Windows, right?

    I think of it more in terms of supersets or, for a programming slant,
    subclassing. The spirit of Regular Expressions, patterns to
    locate/breakdown text, is intact, I believe. They're just even more
    handy now.

    Just my two cents.

    James Edward Gray II

    P.S. Proving whether or not Perl 6 changes will still be "Regular
    Expression" is left as an exercise for the reader. ;)
     
    James Edward Gray II, Nov 5, 2004
    #6
  7. Jason Sweat

    Mark Hubbart Guest

    Hi,

    Are you saying Oniguruma *will* support it, or *does* support it? It
    would be nice if it already does :D
    My statement wasn't supposed to insinuate that adding recursion is
    bad, or turns regular expressions into something else; I was just
    pointing out that (last time I checked) you can't *expect* to be able
    to use recursion in regexen in a particular language. Perl's regexen
    are *more* than regexen; you can even embed perl code in them. Also,
    I'm pretty sure the recursive matching wasn't in Perl when I last used
    it, a couple years back.
    :)

    cheers,
    Mark
     
    Mark Hubbart, Nov 5, 2004
    #7
  8. Sure does.

    James Edward Gray II
     
    James Edward Gray II, Nov 5, 2004
    #8
  9. Jason Sweat

    Mark Hubbart Guest

    Mark Hubbart, Nov 5, 2004
    #9
  10. Jason Sweat

    Jos Backus Guest

    A-1. Syntax depend options

    + ONIG_SYNTAX_RUBY
    (?m): dot(.) match newline

    + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
    (?s): dot(.) match newline
    (?m): ^ match after newline, $ match before newline

    Any idea why Ruby was chosen to behave differently from Perl here?

    --
    Jos Backus _/ _/_/_/ Sunnyvale, CA
    _/ _/ _/
    _/ _/_/_/
    _/ _/ _/ _/
    jos at catnook.com _/_/ _/_/_/ require 'std/disclaimer'
     
    Jos Backus, Nov 5, 2004
    #10
  11. Yes.

    It very specifically mean that they stop being regular expressions,
    because the "regular" actually has a specific meaning (coming from
    regular sets/context free language theory), and has the nice property
    of compiling to a Deterministic Finite Automation with only linear
    increase in size of the DFA compared to the regular expression.

    The true regular expressions consists of ^, $, characters, *, and (|)
    (alternation). Most of the "original" extensions compile cleanly to
    this, and can still be considered regular. When you start using
    backrefs in matching or recursion, the expression is no longer regular
    (which, among other things, will almost certainly force your poor
    regexp engine into Non-deterministic Finite Automation mode - unless
    you've got exponential amounts of RAM, of course ;-)

    http://en.wikipedia.org/wiki/Regular_expression gives a bit more background.

    Eivind.
     
    Eivind Eklund, Nov 5, 2004
    #11
  12. Mark Hubbart ha scritto:
    I wonder if someone ever looked at the ABNF module on raa. It is a tool
    to generate a parser for a grammar in augmented backus naur form... the
    parser is a Regexp object, so it seem thgere is some general way to
    handle recursive parsing needs..
     
    gabriele renzi, Nov 5, 2004
    #12
  13. Jason Sweat

    Jason Sweat Guest

    Great!! Just comiled ruby from cvs (ruby 1.9.0 (2004-11-04)
    [i686-linux]) and was able to match using:

    dat = <<-"END_TEXT"
    /* some stuff */
    class foo {
    public $var;
    public function __construct() {}
    public function bar() {
    if (false) {
    }
    }

    }
    // some other stuff
    END_TEXT

    dat.match(/(class\s+\w+\s*(\{(?:[^\{\}]*|\g<2>)*\}))/)
    print $1

    Thank you very much for your help!!

    Regards,
    Jason
     
    Jason Sweat, Nov 5, 2004
    #13
  14. Okay Eivind, what would you prefer we call Regular Expression from now
    on?

    James Edward Gray II

    P.S. You better copy Jeffrey E. F. Friedl in you answer since his
    "Mastering Regular Expressions" covers considerably more NFA than DFA.
     
    James Edward Gray II, Nov 5, 2004
    #14
  15. Jason Sweat

    Ara.T.Howard Guest

    but both of those recognize only regular languages (those one can describe
    with a regular expression). it seems this new type must be called

    'context-free expressions'

    which would be the next class i think - that is unless this new type does
    considerably more.

    -a
    --
    ===============================================================================
    | EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
    | PHONE :: 303.497.6469
    | When you do something, you should burn yourself completely, like a good
    | bonfire, leaving no trace of yourself. --Shunryu Suzuki
    ===============================================================================
     
    Ara.T.Howard, Nov 5, 2004
    #15
  16. Matt Maycock ha scritto:
    pattern matching :)
     
    gabriele renzi, Nov 5, 2004
    #16
  17. Well, that's not quite true. They compile to NFAs that are linear in
    size of the input regular expression. The DFA constructed from the
    resulting NFA can be exponential in the size of that input. This is, of
    course, only the case for pathological cases, but DFAs are considerably
    larger than NFAs, even after minimizing them.
    Well, not really true either. True regular expressions don't include
    zero-width assertions, such as ^ and $. They include characters from
    some alphabet, * - Kleene closure, . - concatenation (which is implicit
    in most syntaxes), and | - alternation.
    And backtracking mode as well. Regular Expressions With Backreferences
    are an NP-complete problem and have none of the nice properties of
    regular expressions.
    ?

    All in all, backreferencing is an addition that doesn't really add very
    much yet ruins a lot of the whole deal with regular expressions. When
    people now start to try to add recursion to regular expressions (which
    would have been nice if they could have easily been included - but they
    can't) one has to shiver a bit. If you wan't recursion, use something
    suitable - like context-free languages; don't expect to solve everything
    with only one tool.
    nikolai
     
    Nikolai Weibull, Nov 6, 2004
    #17
  18. Call them regular expressions and stop using them for context-free
    language matching.
    nikolai
     
    Nikolai Weibull, Nov 6, 2004
    #18
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.