regexp problem

C

Chai Muang thong

Hi,

I have been searching on the Internet about how to solve my problem. I
have a string that needs to be split using regexp in ruby. The problem
is that I cannot find a way to get it right. The string has a fix
format. The rules for matching are as followed:
1)A string starts and end with any character string
2)Matching string starts from "<<" and end with ">>"
3)Matching string must be a pair of "<<" and ">>" and must be nested
correctly
4)There may be any unwanted string between matching pairs but never be
inside the pairs

A sample string is as followed:
str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"

The result string I am looking for from the above string should be as
followed:
["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<<p5>>", "<<p6 <<p7 <<p8>>>>>>"]

The closest I could come up with is str.split(/(<<[^>].*(>>)+)/) and I
got the result as followed:
["", "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> <<p6 <<p7 <<p8>>>>>>",
">>"]

I cannot change the rule (using "<<" and ">>), so I had little
experience on how to match 2 consecutive characters. It would be
appreciated if anyone has an input or suggestion. Thank you.

Chai
 
B

Brian Candler

Chai said:
Hi,

I have been searching on the Internet about how to solve my problem. I
have a string that needs to be split using regexp in ruby. The problem
is that I cannot find a way to get it right. The string has a fix
format. The rules for matching are as followed:
1)A string starts and end with any character string
2)Matching string starts from "<<" and end with ">>"
3)Matching string must be a pair of "<<" and ">>" and must be nested
correctly

A Regular Expression (by the computer science definition of Regular
Expression) cannot match nested pairs. You'd need a parser for a
context-free grammar.

Having said that: many "regexp" implementations have a features crammed
into them which do that. But the penalty you pay is in complexity, and
often artefacts such as exponential time taken in backtracking.

So: if you want to do it right look at a parser generator such as antlr,
racc, coco, rockit, treetop etc.

However, I admit those may be very large hammers to crack a small nut.

What regexps *are* very good at is tokenising:

str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
tokens = str.scan(/<<|>>|(?:[^<>]|<(?!<)|>(?!>))+/)

# => ["<<", "p1 ", "<<", "p2", ">>", " ", "<<", "p3 ", "<<", "p4", ">>",
">>", ">>", " <and> ", "<<", "p5", ">>", " or ", "<<", "p6 ", "<<", "p7
", "<<", "p8", ">>", ">>", ">>"]

The constructs I've used are:

(?: ... ) grouping without capture
(?! ... ) zero-width negative lookahead (match if string
is not followed by ... without consuming it)

From this you can count << to find matching >>, or implement a recursive
parser.

A naive counting solution:

depth = 0
out = []
while t = tokens.shift
case t
when "<<"
res = "" if depth == 0
res << t
depth += 1
when ">>"
if depth > 0
res << t
depth -= 1
out << res if depth == 0
end
else
res << t if depth > 0
end
end
 
C

Constantine Karnacevych

Brian said:
Chai said:
Hi,

I have been searching on the Internet about how to solve my problem. I
have a string that needs to be split using regexp in ruby. The problem
is that I cannot find a way to get it right. The string has a fix
format. The rules for matching are as followed:
1)A string starts and end with any character string
2)Matching string starts from "<<" and end with ">>"
3)Matching string must be a pair of "<<" and ">>" and must be nested
correctly

A Regular Expression (by the computer science definition of Regular
Expression) cannot match nested pairs. You'd need a parser for a
context-free grammar.

Having said that: many "regexp" implementations have a features crammed
into them which do that. But the penalty you pay is in complexity, and
often artefacts such as exponential time taken in backtracking.

So: if you want to do it right look at a parser generator such as antlr,
racc, coco, rockit, treetop etc.

However, I admit those may be very large hammers to crack a small nut.

What regexps *are* very good at is tokenising:

str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
tokens = str.scan(/<<|>>|(?:[^<>]|<(?!<)|>(?!>))+/)

# => ["<<", "p1 ", "<<", "p2", ">>", " ", "<<", "p3 ", "<<", "p4", ">>",
">>", ">>", " <and> ", "<<", "p5", ">>", " or ", "<<", "p6 ", "<<", "p7
", "<<", "p8", ">>", ">>", ">>"]

The constructs I've used are:

(?: ... ) grouping without capture
(?! ... ) zero-width negative lookahead (match if string
is not followed by ... without consuming it)

From this you can count << to find matching >>, or implement a recursive
parser.

A naive counting solution:

depth = 0
out = []
while t = tokens.shift
case t
when "<<"
res = "" if depth == 0
res << t
depth += 1
when ">>"
if depth > 0
res << t
depth -= 1
out << res if depth == 0
end
else
res << t if depth > 0
end
end

def match_pairs s, l, r
d, o, e = 0, [], '(' + Regexp.escape(l) + '|' + Regexp.escape(r) +
')'
while m = s =~ /#{e}/ do
$1 == l ? (o << "" if 1 === d += 1) : d -= 1
o[o.length - 1] += o.last.empty? ? $1 : $` + $1
s = $'
end
o
end

str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
p match_pairs str, "<<", ">>"

=> ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<<p5>>", "<<p6 <<p7 <<p8>>>>>>"]

p match_pairs str, "<", ">"
=> ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<and>", "<<p5>>", "<<p6 <<p7
<<p8>>>>>>"]

str = '{p1 {p2} {p3 {p4}}} <and> {p5} or {p6 {p7 {p8}}}'
p match_pairs str, '{', '}'

=> ["{p1 {p2} {p3 {p4}}}", "{p5}", "{p6 {p7 {p8}}}"]
 
C

Chai Muang thong

Constantine said:
Brian Candler wrote:

def match_pairs s, l, r
d, o, e = 0, [], '(' + Regexp.escape(l) + '|' + Regexp.escape(r) +
')'
while m = s =~ /#{e}/ do
$1 == l ? (o << "" if 1 === d += 1) : d -= 1
o[o.length - 1] += o.last.empty? ? $1 : $` + $1
s = $'
end
o
end

str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
p match_pairs str, "<<", ">>"

=> ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<<p5>>", "<<p6 <<p7 <<p8>>>>>>"]

p match_pairs str, "<", ">"
=> ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<and>", "<<p5>>", "<<p6 <<p7
<<p8>>>>>>"]

str = '{p1 {p2} {p3 {p4}}} <and> {p5} or {p6 {p7 {p8}}}'
p match_pairs str, '{', '}'

=> ["{p1 {p2} {p3 {p4}}}", "{p5}", "{p6 {p7 {p8}}}"]

--------------

Thank you very much Brian. It works fine as long as an incoming string
is valid. In other words, if the string has extra pair to the left, it
will return wrong result. For example a string "Test <tag> or <<not>
correct tag" will return ["<<"] instead of [""]. Anyway, I have solved
the problem by implementing a method to handle this instead of using
only regexp. Thank you again.

Chai
 
C

Chai Muang thong

Oops, I was pointing to wrong person in my last post. Brian method is
correct but not Karnacevych. Anyway, thank you very much for both
replies.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,066
Latest member
VytoKetoReviews

Latest Threads

Top