Regexp Question: Checking for [joe][/joe] pairs

J

Joe Peck

Hey, I've got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/)
 
D

dblack

Hi Joe --

Hey, I've got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/)

require 'test/unit'

class JoeTest < Test::Unit::TestCase
def setup
@re = /(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/
end

def test_ok
assert("[joe]abc[/joe]".match(@re))
end

def test_broken
# ??? <--- fill in the blank :)
end
end


David

--
Q. What's a good holiday present for the serious Rails developer?
A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black)
aka The Ruby book for Rails developers!
Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
A. Ruby Power and Light, LLC (http://www.rubypal.com)
 
J

Joe Peck

Daniel said:
Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Good point. I was using .+? earlier, but thought that might be part of
my problem. It seems to accept @x even if it contains more than 3
[joe][/joe] pairs.
 
D

dblack

Hi --

Joe said:
Hey, I've got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/)
Why are you doing /[\s\d\w]+?/? Just use /.+?/.

\d is part of \w, so [\s\w] would be OK. But . is very different. It
does not include newline (by default), and *does* include punctuation.


David

--
Q. What's a good holiday present for the serious Rails developer?
A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black)
aka The Ruby book for Rails developers!
Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
A. Ruby Power and Light, LLC (http://www.rubypal.com)
 
D

dblack

Hi --

Daniel said:
Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Good point. I was using .+? earlier, but thought that might be part of
my problem. It seems to accept @x even if it contains more than 3
[joe][/joe] pairs.

That's because {1,3} doesn't mean there can't be another. Usually
you'd anchor it or surround it with something else, like:

/Xa{1,3}X/

so it can't keep repeating.


David

--
Q. What's a good holiday present for the serious Rails developer?
A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black)
aka The Ruby book for Rails developers!
Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
A. Ruby Power and Light, LLC (http://www.rubypal.com)
 
W

William James

Joe said:
Hey, I've got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/)

@x = "[joe] [/joe] [joe][/joe] [joe] foo [/joe]"
count = @x.scan(/\[joe\](.*?)\[\/joe\]/m).flatten.
reject{|s| ""==s}.size
p count
 
D

Daniel Finnie

The problem is that the Regexp is not anchored to the start and ends of
the string.

/^(?!\[joe\])*.?(\[joe\].+?\[\\joe\]){1,3}(?!\[joe\])*.?$/

Joe said:
Daniel said:
Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Good point. I was using .+? earlier, but thought that might be part of
my problem. It seems to accept @x even if it contains more than 3
[joe][/joe] pairs.
 
J

Joe Peck

The problem is I don't want it to accept things like:
"[joe] hello [joe] how are [/joe] you"
where there are two opening tags before a closing tag is reached.
Similarly, I don't want to accept something like:
"hey [joe] it's hot today[/joe] where [joe] is the ac"
where there is one correct pair but then an opening tag without a
closing one.
 
A

Arne Brasseur

Joe said:
The problem is I don't want it to accept things like:
"[joe] hello [joe] how are [/joe] you"
where there are two opening tags before a closing tag is reached.
Similarly, I don't want to accept something like:
"hey [joe] it's hot today[/joe] where [joe] is the ac"
where there is one correct pair but then an opening tag without a
closing one.
I missed the beginning of this thread, but if I recall correctly from my
course on formal languages, this sort if thing can't be done with
regular expressions.

Regular expressions can be used to test whether a string belongs to a
certain regular language, which is a subset of all possible languages
(where a language is a set of strings). Regular expressions are
equivalent to finite state automata in this respect. Since a finite
state automata can only be in a finite number of states. You'd like to
match a possibly infinitely large number of [joe][/joe] pairs. The FSA
would need a new state for every extra [joe] it reads to remember it
still needs to consume a matching [/joe] for it.

If this sounds like Chinese, just remember regexpes aren't keen on
matching this sort of stuff. Stacks on the other hand seem to be custom
designed for these purposes.

A.
 
J

Joe Peck

Regular expressions can be used to test whether a string belongs to a
certain regular language, which is a subset of all possible languages
(where a language is a set of strings). Regular expressions are
equivalent to finite state automata in this respect. Since a finite
state automata can only be in a finite number of states. You'd like to
match a possibly infinitely large number of [joe][/joe] pairs. The FSA
would need a new state for every extra [joe] it reads to remember it
still needs to consume a matching [/joe] for it.

If this sounds like Chinese, just remember regexpes aren't keen on
matching this sort of stuff. Stacks on the other hand seem to be custom
designed for these purposes.

A.
It doesn't sound like Chinese :)

If wouldn't have to be an infinite amount of states. Let's say these
are the states:

State 1 - no [joe] yet. If finds [joe], goes to state 2. If finds
[/joe], fails.
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
in this state, then fails. If it finds [/joe], increments count by 1
and moves to state 1.

If count goes above 3, fails.

But maybe I'll use something besides a regexp, although I thought there
would be a pretty easy way to do it.
 
D

dblack

Hi --

Joe said:
The problem is I don't want it to accept things like:
"[joe] hello [joe] how are [/joe] you"
where there are two opening tags before a closing tag is reached.
Similarly, I don't want to accept something like:
"hey [joe] it's hot today[/joe] where [joe] is the ac"
where there is one correct pair but then an opening tag without a closing
one.
I missed the beginning of this thread, but if I recall correctly from my
course on formal languages, this sort if thing can't be done with regular
expressions.

Regular expressions can be used to test whether a string belongs to a certain
regular language, which is a subset of all possible languages (where a
language is a set of strings). Regular expressions are equivalent to finite
state automata in this respect. Since a finite state automata can only be in
a finite number of states. You'd like to match a possibly infinitely large
number of [joe][/joe] pairs. The FSA would need a new state for every extra
[joe] it reads to remember it still needs to consume a matching [/joe] for
it.

If this sounds like Chinese, just remember regexpes aren't keen on matching
this sort of stuff. Stacks on the other hand seem to be custom designed for
these purposes.

If you're using scan, though, doesn't that mean that you're not really
trying to match one string to the regex, but rather a series of
strings? That means that the state machine gets completely restarted
as the scan method goes along the string. I think that's a different
situation. You're not really saying: match all the pairs; you're
saying: find the first substring that has a matching pair, then
discard it and don't worry about backtracking through it; etc.


David

--
Q. What's a good holiday present for the serious Rails developer?
A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black)
aka The Ruby book for Rails developers!
Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
A. Ruby Power and Light, LLC (http://www.rubypal.com)
 
W

Wilson Bilkovich

Regular expressions can be used to test whether a string belongs to a
certain regular language, which is a subset of all possible languages
(where a language is a set of strings). Regular expressions are
equivalent to finite state automata in this respect. Since a finite
state automata can only be in a finite number of states. You'd like to
match a possibly infinitely large number of [joe][/joe] pairs. The FSA
would need a new state for every extra [joe] it reads to remember it
still needs to consume a matching [/joe] for it.

If this sounds like Chinese, just remember regexpes aren't keen on
matching this sort of stuff. Stacks on the other hand seem to be custom
designed for these purposes.

A.
It doesn't sound like Chinese :)

If wouldn't have to be an infinite amount of states. Let's say these
are the states:

State 1 - no [joe] yet. If finds [joe], goes to state 2. If finds
[/joe], fails.
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
in this state, then fails. If it finds [/joe], increments count by 1
and moves to state 1.

If count goes above 3, fails.

But maybe I'll use something besides a regexp, although I thought there
would be a pretty easy way to do it.

To my knowledge, you can't do this with Ruby's current regexp engine,
though it is possible with Perl and .NET. Both of those languages
support something roughly analogous to a stack, within the expression.
I don't think Ruby 1.8's regexp engine is powerful enough to handle
this, but I would be happy to be proven wrong.

It's worth remembering that what we call 'regular expressions' these
days don't actually match the formal definition of that term, and are
much more powerful in some ways.
 
W

William James

Joe said:
Regular expressions can be used to test whether a string belongs to a
certain regular language, which is a subset of all possible languages
(where a language is a set of strings). Regular expressions are
equivalent to finite state automata in this respect. Since a finite
state automata can only be in a finite number of states. You'd like to
match a possibly infinitely large number of [joe][/joe] pairs. The FSA
would need a new state for every extra [joe] it reads to remember it
still needs to consume a matching [/joe] for it.

If this sounds like Chinese, just remember regexpes aren't keen on
matching this sort of stuff. Stacks on the other hand seem to be custom
designed for these purposes.

A.
It doesn't sound like Chinese :)

If wouldn't have to be an infinite amount of states. Let's say these
are the states:

State 1 - no [joe] yet. If finds [joe], goes to state 2. If finds
[/joe], fails.
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
in this state, then fails. If it finds [/joe], increments count by 1
and moves to state 1.

If count goes above 3, fails.

But maybe I'll use something besides a regexp, although I thought there
would be a pretty easy way to do it.

If a regular expression can't do it, does that mean we can't use
a regular expression?

No. We'll still use a regexp and add some code to help it.

If all the pairs are matched, then after partitioning and zipping
we wind up with the original pairs.

[
"ok [joe] ok [/joe] right",
"ok [joe] [/joe] [joe] foo [/joe]",
"bad [joe] [/joe] foo [joe]",
"bad [joe] [/joe] foo [/joe]",
"bad [joe] [joe] foo [/joe]",
"bad [joe] [joe] ",
"bad [/joe] [joe] ",
"bad [/joe] [/joe] "
].each { |s|
ary = s.scan( %r{\[/?joe\]} )
p ary
if ary == ary.partition{|t| "[joe]"==t}.inject{|a,b| a.zip(b)
}.flatten
puts "good\n"
else
puts "bad\n"
end
}

--- output -----
["[joe]", "[/joe]"]
good
["[joe]", "[/joe]", "[joe]", "[/joe]"]
good
["[joe]", "[/joe]", "[joe]"]
bad
["[joe]", "[/joe]", "[/joe]"]
bad
["[joe]", "[joe]", "[/joe]"]
bad
["[joe]", "[joe]"]
bad
["[/joe]", "[joe]"]
bad
["[/joe]", "[/joe]"]
bad
 
A

Andrew Johnson

Joe said:
The problem is I don't want it to accept things like:
"[joe] hello [joe] how are [/joe] you"
where there are two opening tags before a closing tag is reached.
Similarly, I don't want to accept something like:
"hey [joe] it's hot today[/joe] where [joe] is the ac"
where there is one correct pair but then an opening tag without a
closing one.


If I am reading your specs correctly:

/^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/


cheers,
andrew
 
W

William James

Andrew said:
Joe said:
The problem is I don't want it to accept things like:
"[joe] hello [joe] how are [/joe] you"
where there are two opening tags before a closing tag is reached.
Similarly, I don't want to accept something like:
"hey [joe] it's hot today[/joe] where [joe] is the ac"
where there is one correct pair but then an opening tag without a
closing one.


If I am reading your specs correctly:

/^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/

[
"good [joe] [/joe] [joe] [/joe]",
"bad [/joe] [joe] [/joe]",
"bad [joe] [/joe] [/joe]"
].each { |s|
p s =~
/^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/
}
 
A

Andrew Johnson

William said:
Andrew said:
If I am reading your specs correctly:

/^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/

[
"good [joe] [/joe] [joe] [/joe]",
"bad [/joe] [joe] [/joe]",
"bad [joe] [/joe] [/joe]"
].each { |s|
p s =~
/^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/
}

Quite right:

[
"good [joe] [/joe] [joe] [/joe]",
"bad [/joe] [joe] [/joe]",
"bad [joe] [/joe] [/joe]"
].each { |s|
p s if s =~
/^(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*$/
}

cheers
andrew
 
W

William James

Andrew said:
William said:
Andrew said:
If I am reading your specs correctly:

/^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/

[
"good [joe] [/joe] [joe] [/joe]",
"bad [/joe] [joe] [/joe]",
"bad [joe] [/joe] [/joe]"
].each { |s|
p s =~
/^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/
}

Quite right:

[
"good [joe] [/joe] [joe] [/joe]",
"bad [/joe] [joe] [/joe]",
"bad [joe] [/joe] [/joe]"
].each { |s|
p s if s =~
/^(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*$/
}

cheers
andrew

Yours is faster for very short strings; longer strings allow the array
method to pull ahead.

require 'benchmark'

$n = 10_000
$strings = [
"good [joe] Wasn't that what he was seeking? [/joe]
[joe] Can't you see that? [/joe]",
"bad was Peck's boy [/joe] [joe] But he'll never know. [/joe]",
"bad to the bone [joe] Or will he?! [/joe] mish mash mush
Marching on Tom Tidler's ground fatigues me. [/joe]",
"bad: too many [joe] [/joe] [joe] [/joe] [joe] [/joe] [joe] [/joe]",
"bad: too few"
]

def regexp
$regexp_good = 0
$n.times{ $strings.each { |s|
$regexp_good += 1 if s =~
/\A(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*\Z/m
} }
end
def array
$array_good = 0
$n.times{ $strings.each { |s|
ary = s.scan( %r{\[/?joe\]} )
if [2,4,6].include?(ary.size) and
ary == ary.partition{|t| "[joe]"==t}.inject{|a,b| a.zip(b)}.
flatten
$array_good += 1
end
} }
end

Benchmark.bmbm do |x|
x.report("regexp") { regexp }
x.report("array") { array }
end

puts ; p $regexp_good, $array_good

Rehearsal ------------------------------------------
regexp 6.870000 0.000000 6.870000 ( 7.391000)
array 2.653000 0.000000 2.653000 ( 2.874000)
--------------------------------- total: 9.523000sec

user system total real
regexp 6.940000 0.000000 6.940000 ( 7.441000)
array 2.634000 0.000000 2.634000 ( 2.854000)

10000
10000
 
A

Andrew Johnson

William James wrote:
[snip]
Rehearsal ------------------------------------------
regexp 6.870000 0.000000 6.870000 ( 7.391000)
array 2.653000 0.000000 2.653000 ( 2.874000)
--------------------------------- total: 9.523000sec

user system total real
regexp 6.940000 0.000000 6.940000 ( 7.441000)
array 2.634000 0.000000 2.634000 ( 2.854000)

10000
10000


The regex engine makes a difference in this case -- ruby1.8.5 + the
oniguruma engine gives:


Rehearsal ------------------------------------------
regexp 2.740000 0.010000 2.750000 ( 2.960385)
array 3.120000 0.000000 3.120000 ( 3.120808)
--------------------------------- total: 5.870000sec

user system total real
regexp 2.750000 0.000000 2.750000 ( 2.743031)
array 3.140000 0.000000 3.140000 ( 3.137098)

10000
10000


cheers,
andrew
 
W

William James

Andrew said:
William James wrote:
[snip]
Rehearsal ------------------------------------------
regexp 6.870000 0.000000 6.870000 ( 7.391000)
array 2.653000 0.000000 2.653000 ( 2.874000)
--------------------------------- total: 9.523000sec

user system total real
regexp 6.940000 0.000000 6.940000 ( 7.441000)
array 2.634000 0.000000 2.634000 ( 2.854000)

10000
10000


The regex engine makes a difference in this case -- ruby1.8.5 + the
oniguruma engine gives:


Rehearsal ------------------------------------------
regexp 2.740000 0.010000 2.750000 ( 2.960385)
array 3.120000 0.000000 3.120000 ( 3.120808)
--------------------------------- total: 5.870000sec

user system total real
regexp 2.750000 0.000000 2.750000 ( 2.743031)
array 3.140000 0.000000 3.140000 ( 3.137098)

10000
10000


cheers,
andrew

Oh, yeah? Try this on for size, Oni!

require 'benchmark'

$n = 10_000
$strings = [
%Q{
good [joe] Wasn't that what he was seeking? [/joe]
[joe] Can't you see that? [/joe]
Early one June morning in 1872 I murdered my father---an act
which made a deep impression on me at the time.

We know better the needs of ourselves than of others. To
serve oneself is economy of administration.

self-evident, adj. Evident to one's self and to nobody else.

senate, n. A body of elderly gentlemen charged with high
duties and misdemeanors.
[joe]
"Thou wretch! -- thou vixen! -- thou shrew!" said I to my
wife on the morning after our wedding; "thou witch! -- thou
hag! -- thou whippersnapper -- thou sink of iniquity! --
thou fiery-faced quintessence of all that is abominable! --
thou -- thou--" here standing upon tiptoe, seizing her by the
throat, and placing my mouth close to her ear, I was [/joe]
preparing to launch forth a new and more decided epithet of
opprobrium, which should not fail, if ejaculated, to
convince her of her insignificance, when to my extreme
horror and astonishment I discovered that I had lost my
breath
.
},
"bad was Peck's boy [/joe] [joe] But he'll never know. [/joe]",
"bad to the bone [joe] Or will he?! [/joe] mish mash mush
Marching on Tom Tidler's ground fatigues me. [/joe]",
"bad: too many [joe] [/joe] [joe] [/joe] [joe] [/joe] [joe] [/joe]",
"bad: too few"
]

def regexp
$regexp_good = 0
$n.times{ $strings.each { |s|
$regexp_good += 1 if s =~
/\A(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*\Z/m
} }
end
def array
$array_good = 0
$n.times{ $strings.each { |s|
ary = s.scan( %r{\[/?joe\]} )
if [2,4,6].include?(ary.size) and
ary == ary.partition{|t| "[joe]"==t}.inject{|a,b| a.zip(b)}.
flatten
$array_good += 1
end
} }
end

Benchmark.bmbm do |x|
x.report("regexp") { regexp }
x.report("array") { array }
end

puts ; p $regexp_good, $array_good

Rehearsal ------------------------------------------
regexp 29.432000 3.354000 32.786000 ( 36.513000)
array 3.225000 0.000000 3.225000 ( 3.485000)
-------------------------------- total: 36.011000sec

user system total real
regexp 29.382000 3.656000 33.038000 ( 36.793000)
array 3.195000 0.000000 3.195000 ( 3.525000)

10000
10000
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,754
Messages
2,569,521
Members
44,995
Latest member
PinupduzSap

Latest Threads

Top