regexp problem

Chai Muang thong · Nov 26, 2008

Hi,

I have been searching on the Internet about how to solve my problem. I
have a string that needs to be split using regexp in ruby. The problem
is that I cannot find a way to get it right. The string has a fix
format. The rules for matching are as followed:
1)A string starts and end with any character string
2)Matching string starts from "<<" and end with ">>"
3)Matching string must be a pair of "<<" and ">>" and must be nested
correctly
4)There may be any unwanted string between matching pairs but never be
inside the pairs

A sample string is as followed:
str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"

The result string I am looking for from the above string should be as
followed:
["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<<p5>>", "<<p6 <<p7 <<p8>>>>>>"]

The closest I could come up with is str.split(/(<<[^>].*(>>)+)/) and I
got the result as followed:
["", "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> <<p6 <<p7 <<p8>>>>>>",
">>"]

I cannot change the rule (using "<<" and ">>), so I had little
experience on how to match 2 consecutive characters. It would be
appreciated if anyone has an input or suggestion. Thank you.

Chai

Brian Candler · Nov 26, 2008

Chai said:
Hi,

I have been searching on the Internet about how to solve my problem. I
have a string that needs to be split using regexp in ruby. The problem
is that I cannot find a way to get it right. The string has a fix
format. The rules for matching are as followed:
1)A string starts and end with any character string
2)Matching string starts from "<<" and end with ">>"
3)Matching string must be a pair of "<<" and ">>" and must be nested
correctly

A Regular Expression (by the computer science definition of Regular
Expression) cannot match nested pairs. You'd need a parser for a
context-free grammar.

Having said that: many "regexp" implementations have a features crammed
into them which do that. But the penalty you pay is in complexity, and
often artefacts such as exponential time taken in backtracking.

So: if you want to do it right look at a parser generator such as antlr,
racc, coco, rockit, treetop etc.

However, I admit those may be very large hammers to crack a small nut.

What regexps *are* very good at is tokenising:

str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
tokens = str.scan(/<<|>>|(?:[^<>]|<(?!<)|>(?!>))+/)

# => ["<<", "p1 ", "<<", "p2", ">>", " ", "<<", "p3 ", "<<", "p4", ">>",
">>", ">>", " <and> ", "<<", "p5", ">>", " or ", "<<", "p6 ", "<<", "p7
", "<<", "p8", ">>", ">>", ">>"]

The constructs I've used are:

(?: ... ) grouping without capture
(?! ... ) zero-width negative lookahead (match if string
is not followed by ... without consuming it)

From this you can count << to find matching >>, or implement a recursive
parser.

A naive counting solution:

depth = 0
out = []
while t = tokens.shift
case t
when "<<"
res = "" if depth == 0
res << t
depth += 1
when ">>"
if depth > 0
res << t
depth -= 1
out << res if depth == 0
end
else
res << t if depth > 0
end
end

Constantine Karnacevych · Nov 27, 2008

Brian said:
Chai said:

Hi,

I have been searching on the Internet about how to solve my problem. I
have a string that needs to be split using regexp in ruby. The problem
is that I cannot find a way to get it right. The string has a fix
format. The rules for matching are as followed:
1)A string starts and end with any character string
2)Matching string starts from "<<" and end with ">>"
3)Matching string must be a pair of "<<" and ">>" and must be nested
correctly

Click to expand...

A Regular Expression (by the computer science definition of Regular
Expression) cannot match nested pairs. You'd need a parser for a
context-free grammar.

Having said that: many "regexp" implementations have a features crammed
into them which do that. But the penalty you pay is in complexity, and
often artefacts such as exponential time taken in backtracking.

So: if you want to do it right look at a parser generator such as antlr,
racc, coco, rockit, treetop etc.

However, I admit those may be very large hammers to crack a small nut.

What regexps *are* very good at is tokenising:

str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
tokens = str.scan(/<<|>>|(?:[^<>]|<(?!<)|>(?!>))+/)

# => ["<<", "p1 ", "<<", "p2", ">>", " ", "<<", "p3 ", "<<", "p4", ">>",
">>", ">>", " <and> ", "<<", "p5", ">>", " or ", "<<", "p6 ", "<<", "p7
", "<<", "p8", ">>", ">>", ">>"]

The constructs I've used are:

(?: ... ) grouping without capture
(?! ... ) zero-width negative lookahead (match if string
is not followed by ... without consuming it)

From this you can count << to find matching >>, or implement a recursive
parser.

A naive counting solution:

depth = 0
out = []
while t = tokens.shift
case t
when "<<"
res = "" if depth == 0
res << t
depth += 1
when ">>"
if depth > 0
res << t
depth -= 1
out << res if depth == 0
end
else
res << t if depth > 0
end
end

def match_pairs s, l, r
d, o, e = 0, [], '(' + Regexp.escape(l) + '|' + Regexp.escape(r) +
')'
while m = s =~ /#{e}/ do
$1 == l ? (o << "" if 1 === d += 1) : d -= 1
o[o.length - 1] += o.last.empty? ? $1 : $` + $1
s = $'
end
o
end

str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
p match_pairs str, "<<", ">>"

=> ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<<p5>>", "<<p6 <<p7 <<p8>>>>>>"]

p match_pairs str, "<", ">"
=> ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<and>", "<<p5>>", "<<p6 <<p7
<<p8>>>>>>"]

str = '{p1 {p2} {p3 {p4}}} <and> {p5} or {p6 {p7 {p8}}}'
p match_pairs str, '{', '}'

=> ["{p1 {p2} {p3 {p4}}}", "{p5}", "{p6 {p7 {p8}}}"]

Chai Muang thong · Dec 2, 2008

Constantine said:
Brian Candler wrote:

def match_pairs s, l, r
d, o, e = 0, [], '(' + Regexp.escape(l) + '|' + Regexp.escape(r) +
')'
while m = s =~ /#{e}/ do
$1 == l ? (o << "" if 1 === d += 1) : d -= 1
o[o.length - 1] += o.last.empty? ? $1 : $` + $1
s = $'
end
o
end

str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
p match_pairs str, "<<", ">>"

=> ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<<p5>>", "<<p6 <<p7 <<p8>>>>>>"]

p match_pairs str, "<", ">"
=> ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<and>", "<<p5>>", "<<p6 <<p7
<<p8>>>>>>"]

str = '{p1 {p2} {p3 {p4}}} <and> {p5} or {p6 {p7 {p8}}}'
p match_pairs str, '{', '}'

=> ["{p1 {p2} {p3 {p4}}}", "{p5}", "{p6 {p7 {p8}}}"]

--------------

Thank you very much Brian. It works fine as long as an incoming string
is valid. In other words, if the string has extra pair to the left, it
will return wrong result. For example a string "Test <tag> or <<not>
correct tag" will return ["<<"] instead of [""]. Anyway, I have solved
the problem by implementing a method to handle this instead of using
only regexp. Thank you again.

Chai

Chai Muang thong · Dec 2, 2008

Oops, I was pointing to wrong person in my last post. Brian method is
correct but not Karnacevych. Anyway, thank you very much for both
replies.

HOWTO: const and pointer variants in C and C++	10	Nov 11, 2008
MultiThreading in C++	6	Feb 6, 2007
distribution algo question	2	Mar 27, 2011
OOo and regexp	0	Dec 3, 2006
Python point location of intersect between two lines	0	Feb 28, 2018
problem in files	0	Jul 19, 2006
[ANN] NamedParameters v0.0.0 ('first', :second => 'release')	9	Jul 18, 2008
Program inefficiency?	17	Sep 29, 2007

regexp problem

Chai Muang thong

Brian Candler

Constantine Karnacevych

Chai Muang thong

Chai Muang thong

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads