Regexp Question: Checking for [joe][/joe] pairs

Discussion in 'Ruby' started by Joe Peck, Dec 20, 2006.

  1. Joe Peck

    Joe Peck Guest

    Hey, I've got some text in @x and want there to be at least 1 and at
    most 3 [joe][/joe] pairs, each having at least one character between the
    beginning [joe] and the ending [/joe].

    This is what I have now, and it seems to sometimes work, and sometimes
    not.

    @x.match(/(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/)
     
    Joe Peck, Dec 20, 2006
    #1
    1. Advertisements

  2. Why are you doing /[\s\d\w]+?/? Just use /.+?/.

    Dan
     
    Daniel Finnie, Dec 20, 2006
    #2
    1. Advertisements

  3. Joe Peck

    dblack Guest

    Hi Joe --

    require 'test/unit'

    class JoeTest < Test::Unit::TestCase
    def setup
    @re = /(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/
    end

    def test_ok
    assert("[joe]abc[/joe]".match(@re))
    end

    def test_broken
    # ??? <--- fill in the blank :)
    end
    end


    David

    --
    Q. What's a good holiday present for the serious Rails developer?
    A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black)
    aka The Ruby book for Rails developers!
    Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
    A. Ruby Power and Light, LLC (http://www.rubypal.com)
     
    dblack, Dec 20, 2006
    #3
  4. Joe Peck

    Joe Peck Guest

    Good point. I was using .+? earlier, but thought that might be part of
    my problem. It seems to accept @x even if it contains more than 3
    [joe][/joe] pairs.
     
    Joe Peck, Dec 20, 2006
    #4
  5. Joe Peck

    dblack Guest

    Hi --

    \d is part of \w, so [\s\w] would be OK. But . is very different. It
    does not include newline (by default), and *does* include punctuation.


    David

    --
    Q. What's a good holiday present for the serious Rails developer?
    A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black)
    aka The Ruby book for Rails developers!
    Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
    A. Ruby Power and Light, LLC (http://www.rubypal.com)
     
    dblack, Dec 20, 2006
    #5
  6. Joe Peck

    dblack Guest

    Hi --

    That's because {1,3} doesn't mean there can't be another. Usually
    you'd anchor it or surround it with something else, like:

    /Xa{1,3}X/

    so it can't keep repeating.


    David

    --
    Q. What's a good holiday present for the serious Rails developer?
    A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black)
    aka The Ruby book for Rails developers!
    Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
    A. Ruby Power and Light, LLC (http://www.rubypal.com)
     
    dblack, Dec 20, 2006
    #6
  7. @x = "[joe] [/joe] [joe][/joe] [joe] foo [/joe]"
    count = @x.scan(/\[joe\](.*?)\[\/joe\]/m).flatten.
    reject{|s| ""==s}.size
    p count
     
    William James, Dec 20, 2006
    #7
  8. The problem is that the Regexp is not anchored to the start and ends of
    the string.

    /^(?!\[joe\])*.?(\[joe\].+?\[\\joe\]){1,3}(?!\[joe\])*.?$/

     
    Daniel Finnie, Dec 20, 2006
    #8
  9. Joe Peck

    Joe Peck Guest

    The problem is I don't want it to accept things like:
    "[joe] hello [joe] how are [/joe] you"
    where there are two opening tags before a closing tag is reached.
    Similarly, I don't want to accept something like:
    "hey [joe] it's hot today[/joe] where [joe] is the ac"
    where there is one correct pair but then an opening tag without a
    closing one.
     
    Joe Peck, Dec 21, 2006
    #9
  10. I missed the beginning of this thread, but if I recall correctly from my
    course on formal languages, this sort if thing can't be done with
    regular expressions.

    Regular expressions can be used to test whether a string belongs to a
    certain regular language, which is a subset of all possible languages
    (where a language is a set of strings). Regular expressions are
    equivalent to finite state automata in this respect. Since a finite
    state automata can only be in a finite number of states. You'd like to
    match a possibly infinitely large number of [joe][/joe] pairs. The FSA
    would need a new state for every extra [joe] it reads to remember it
    still needs to consume a matching [/joe] for it.

    If this sounds like Chinese, just remember regexpes aren't keen on
    matching this sort of stuff. Stacks on the other hand seem to be custom
    designed for these purposes.

    A.
     
    Arne Brasseur, Dec 21, 2006
    #10
  11. Joe Peck

    Joe Peck Guest

    Regular expressions can be used to test whether a string belongs to a
    It doesn't sound like Chinese :)

    If wouldn't have to be an infinite amount of states. Let's say these
    are the states:

    State 1 - no [joe] yet. If finds [joe], goes to state 2. If finds
    [/joe], fails.
    State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
    in this state, then fails. If it finds [/joe], increments count by 1
    and moves to state 1.

    If count goes above 3, fails.

    But maybe I'll use something besides a regexp, although I thought there
    would be a pretty easy way to do it.
     
    Joe Peck, Dec 21, 2006
    #11
  12. Joe Peck

    dblack Guest

    Hi --

    If you're using scan, though, doesn't that mean that you're not really
    trying to match one string to the regex, but rather a series of
    strings? That means that the state machine gets completely restarted
    as the scan method goes along the string. I think that's a different
    situation. You're not really saying: match all the pairs; you're
    saying: find the first substring that has a matching pair, then
    discard it and don't worry about backtracking through it; etc.


    David

    --
    Q. What's a good holiday present for the serious Rails developer?
    A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black)
    aka The Ruby book for Rails developers!
    Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
    A. Ruby Power and Light, LLC (http://www.rubypal.com)
     
    dblack, Dec 21, 2006
    #12
  13. To my knowledge, you can't do this with Ruby's current regexp engine,
    though it is possible with Perl and .NET. Both of those languages
    support something roughly analogous to a stack, within the expression.
    I don't think Ruby 1.8's regexp engine is powerful enough to handle
    this, but I would be happy to be proven wrong.

    It's worth remembering that what we call 'regular expressions' these
    days don't actually match the formal definition of that term, and are
    much more powerful in some ways.
     
    Wilson Bilkovich, Dec 21, 2006
    #13
  14. If a regular expression can't do it, does that mean we can't use
    a regular expression?

    No. We'll still use a regexp and add some code to help it.

    If all the pairs are matched, then after partitioning and zipping
    we wind up with the original pairs.

    [
    "ok [joe] ok [/joe] right",
    "ok [joe] [/joe] [joe] foo [/joe]",
    "bad [joe] [/joe] foo [joe]",
    "bad [joe] [/joe] foo [/joe]",
    "bad [joe] [joe] foo [/joe]",
    "bad [joe] [joe] ",
    "bad [/joe] [joe] ",
    "bad [/joe] [/joe] "
    ].each { |s|
    ary = s.scan( %r{\[/?joe\]} )
    p ary
    if ary == ary.partition{|t| "[joe]"==t}.inject{|a,b| a.zip(b)
    }.flatten
    puts "good\n"
    else
    puts "bad\n"
    end
    }

    --- output -----
    ["[joe]", "[/joe]"]
    good
    ["[joe]", "[/joe]", "[joe]", "[/joe]"]
    good
    ["[joe]", "[/joe]", "[joe]"]
    bad
    ["[joe]", "[/joe]", "[/joe]"]
    bad
    ["[joe]", "[joe]", "[/joe]"]
    bad
    ["[joe]", "[joe]"]
    bad
    ["[/joe]", "[joe]"]
    bad
    ["[/joe]", "[/joe]"]
    bad
     
    William James, Dec 21, 2006
    #14

  15. If I am reading your specs correctly:

    /^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/


    cheers,
    andrew
     
    Andrew Johnson, Dec 22, 2006
    #15
  16. [
    "good [joe] [/joe] [joe] [/joe]",
    "bad [/joe] [joe] [/joe]",
    "bad [joe] [/joe] [/joe]"
    ].each { |s|
    p s =~
    /^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/
    }
     
    William James, Dec 22, 2006
    #16
  17. Quite right:

    [
    "good [joe] [/joe] [joe] [/joe]",
    "bad [/joe] [joe] [/joe]",
    "bad [joe] [/joe] [/joe]"
    ].each { |s|
    p s if s =~
    /^(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*$/
    }

    cheers
    andrew
     
    Andrew Johnson, Dec 24, 2006
    #17
  18. Yours is faster for very short strings; longer strings allow the array
    method to pull ahead.

    require 'benchmark'

    $n = 10_000
    $strings = [
    "good [joe] Wasn't that what he was seeking? [/joe]
    [joe] Can't you see that? [/joe]",
    "bad was Peck's boy [/joe] [joe] But he'll never know. [/joe]",
    "bad to the bone [joe] Or will he?! [/joe] mish mash mush
    Marching on Tom Tidler's ground fatigues me. [/joe]",
    "bad: too many [joe] [/joe] [joe] [/joe] [joe] [/joe] [joe] [/joe]",
    "bad: too few"
    ]

    def regexp
    $regexp_good = 0
    $n.times{ $strings.each { |s|
    $regexp_good += 1 if s =~
    /\A(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*\Z/m
    } }
    end
    def array
    $array_good = 0
    $n.times{ $strings.each { |s|
    ary = s.scan( %r{\[/?joe\]} )
    if [2,4,6].include?(ary.size) and
    ary == ary.partition{|t| "[joe]"==t}.inject{|a,b| a.zip(b)}.
    flatten
    $array_good += 1
    end
    } }
    end

    Benchmark.bmbm do |x|
    x.report("regexp") { regexp }
    x.report("array") { array }
    end

    puts ; p $regexp_good, $array_good

    Rehearsal ------------------------------------------
    regexp 6.870000 0.000000 6.870000 ( 7.391000)
    array 2.653000 0.000000 2.653000 ( 2.874000)
    --------------------------------- total: 9.523000sec

    user system total real
    regexp 6.940000 0.000000 6.940000 ( 7.441000)
    array 2.634000 0.000000 2.634000 ( 2.854000)

    10000
    10000
     
    William James, Dec 25, 2006
    #18
  19. William James wrote:
    [snip]

    The regex engine makes a difference in this case -- ruby1.8.5 + the
    oniguruma engine gives:


    Rehearsal ------------------------------------------
    regexp 2.740000 0.010000 2.750000 ( 2.960385)
    array 3.120000 0.000000 3.120000 ( 3.120808)
    --------------------------------- total: 5.870000sec

    user system total real
    regexp 2.750000 0.000000 2.750000 ( 2.743031)
    array 3.140000 0.000000 3.140000 ( 3.137098)

    10000
    10000


    cheers,
    andrew
     
    Andrew Johnson, Dec 25, 2006
    #19
  20. Oh, yeah? Try this on for size, Oni!

    require 'benchmark'

    $n = 10_000
    $strings = [
    %Q{
    good [joe] Wasn't that what he was seeking? [/joe]
    [joe] Can't you see that? [/joe]
    Early one June morning in 1872 I murdered my father---an act
    which made a deep impression on me at the time.

    We know better the needs of ourselves than of others. To
    serve oneself is economy of administration.

    self-evident, adj. Evident to one's self and to nobody else.

    senate, n. A body of elderly gentlemen charged with high
    duties and misdemeanors.
    [joe]
    "Thou wretch! -- thou vixen! -- thou shrew!" said I to my
    wife on the morning after our wedding; "thou witch! -- thou
    hag! -- thou whippersnapper -- thou sink of iniquity! --
    thou fiery-faced quintessence of all that is abominable! --
    thou -- thou--" here standing upon tiptoe, seizing her by the
    throat, and placing my mouth close to her ear, I was [/joe]
    preparing to launch forth a new and more decided epithet of
    opprobrium, which should not fail, if ejaculated, to
    convince her of her insignificance, when to my extreme
    horror and astonishment I discovered that I had lost my
    breath
    .
    },
    "bad was Peck's boy [/joe] [joe] But he'll never know. [/joe]",
    "bad to the bone [joe] Or will he?! [/joe] mish mash mush
    Marching on Tom Tidler's ground fatigues me. [/joe]",
    "bad: too many [joe] [/joe] [joe] [/joe] [joe] [/joe] [joe] [/joe]",
    "bad: too few"
    ]

    def regexp
    $regexp_good = 0
    $n.times{ $strings.each { |s|
    $regexp_good += 1 if s =~
    /\A(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*\Z/m
    } }
    end
    def array
    $array_good = 0
    $n.times{ $strings.each { |s|
    ary = s.scan( %r{\[/?joe\]} )
    if [2,4,6].include?(ary.size) and
    ary == ary.partition{|t| "[joe]"==t}.inject{|a,b| a.zip(b)}.
    flatten
    $array_good += 1
    end
    } }
    end

    Benchmark.bmbm do |x|
    x.report("regexp") { regexp }
    x.report("array") { array }
    end

    puts ; p $regexp_good, $array_good

    Rehearsal ------------------------------------------
    regexp 29.432000 3.354000 32.786000 ( 36.513000)
    array 3.225000 0.000000 3.225000 ( 3.485000)
    -------------------------------- total: 36.011000sec

    user system total real
    regexp 29.382000 3.656000 33.038000 ( 36.793000)
    array 3.195000 0.000000 3.195000 ( 3.525000)

    10000
    10000
     
    William James, Dec 25, 2006
    #20
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.