Linear complete variation of MatchData#to_a, possible?

Trans · Apr 20, 2005

Hi--

I have a it bit of a puzzle for Regexp engine hackers out there. The
#to_a method on MatchData gives a simple list of matching portions of
the matched text.

md = /(1)(2(3))(4)/.match "012345"
md.to_a
["1234", "1", "23", "3", "4"]

But I would like a method the produces a linearly segmented list of the
text itself, like this:

["0", "1", "2", [ "3" ], "4", "5" ]

Notice the array depth corresponds to the subexpression depth.

My first stab was:

def matchlist
# good idea to cache?
@matchset ||= pre_match + self[1..-1] + post_match
end

But obviously that does not work when there are subexpressions.

Is such a method even possible?

Thanks,
T.

Pit Capitain · Apr 20, 2005

Trans said:
I have a it bit of a puzzle for Regexp engine hackers out there. The
#to_a method on MatchData gives a simple list of matching portions of
the matched text.

md = /(1)(2(3))(4)/.match "012345"
md.to_a
["1234", "1", "23", "3", "4"]

But I would like a method the produces a linearly segmented list of the
text itself, like this:

["0", "1", "2", [ "3" ], "4", "5" ]

Notice the array depth corresponds to the subexpression depth.

Do you really think that you want this nesting? I'd expect something like

[ "0", [ [ "1" ], [ "2", [ "3" ] ], [ "4" ] ], "5" ]

Could you try to specify the rules for your desired nesting and
splitting? I think it should be possible to implement something like this.

Regards,
Pit

Trans · Apr 20, 2005

Hi Capitain--

Pit said:
Do you really think that you want this nesting? I'd expect something like

[ "0", [ [ "1" ], [ "2", [ "3" ] ], [ "4" ] ], "5" ]

Could you try to specify the rules for your desired nesting and
splitting? I think it should be possible to implement something like

this.

Your nesting will work too. I excluded a couple layers were I felt it
wasn't significant. I think the pre and post matches can just be the
first and last elements respectively, so another that nesting isn't
relly needed (IMHO). But you may be quite right about the next layer. I
had thought it might be dropped without issue, but perhaps it is in
fact neccessary to keep the nesting consistant with the subexpressions.
Nonetheless, yes, you have the idea. Can it be done? I'm not sure how
one would "query" the level of subexpression nesting.

Thanks,
T.

David A. Black · Apr 20, 2005

Hi --

Pit said:
Hi Capitain--

Pit said:

Do you really think that you want this nesting? I'd expect something like

[ "0", [ [ "1" ], [ "2", [ "3" ] ], [ "4" ] ], "5" ]

Could you try to specify the rules for your desired nesting and
splitting? I think it should be possible to implement something like

Click to expand...

this.

Your nesting will work too. I excluded a couple layers were I felt it
wasn't significant. I think the pre and post matches can just be the
first and last elements respectively, so another that nesting isn't
relly needed (IMHO). But you may be quite right about the next layer. I
had thought it might be dropped without issue, but perhaps it is in
fact neccessary to keep the nesting consistant with the subexpressions.
Nonetheless, yes, you have the idea. Can it be done? I'm not sure how
one would "query" the level of subexpression nesting.

You could probably do something with MatchData#offset. The offsets of
the various captures can be compared to see whether or not they
overlap. No doubt there's a way to do this with inject (paging Robert
K.

David

Pit Capitain · Apr 20, 2005

Trans said:
Your nesting will work too. I excluded a couple layers were I felt it
wasn't significant. I think the pre and post matches can just be the
first and last elements respectively, so another that nesting isn't
relly needed (IMHO). But you may be quite right about the next layer. I
had thought it might be dropped without issue, but perhaps it is in
fact neccessary to keep the nesting consistant with the subexpressions.
Nonetheless, yes, you have the idea. Can it be done? I'm not sure how
one would "query" the level of subexpression nesting.

Hi Tom,

I'd try using MatchData#begin and MatchData#end:

md = /(1)(2(3))(4)/.match( "012345" )

md.size.times do |i|
puts( md.begin( i ) .. md.end( i ) )
end

The output is:

1..5
1..2
2..4
3..4
4..5

It should be possible to determine the splitting and the nesting out of
those numbers. If you're not successful, send me a mail and I'll try to
find some more time to help.

Regards,
Pit

Trans · Apr 20, 2005

Amazing Pit Capitain! Your hunch was right on the money. Not only was
your suggestion key to the solution, but the output was just as you
expected. Here's the solution I found. It's not all that elegant, but
it appears to work okay. I am certain there are much better solutions
to be had, so if anyone has one to offer...

class MatchData

def matchset
b = Hash.new(0)
e = Hash.new(0)

self.captures.size.times do |i|
b[self.begin(i)] += 1
e[self.end(i)] += 1
end

a = self.string.split(//)
c = ""
ca = []
stack = []

(a.size).times { |i|
# end
if e and e != 0
ca << c
e.times { ca = stack.pop }
c = nil
end
# begin
if b and b != 0
ca << c if c
b.times {
stack << ca
ca << []
ca = ca.last
}
c = nil
end
# content
c = "" unless c
c << a #if a
}
ca << c

return ca
end

end #class Matchdata

md = /(bb)(cc(dd))(ee)/.match "XXaabbccddeeffXX"
p md.to_a
p md.matchset
#=> ["XXaa", [["bb"], ["cc", ["dd"]], "ee"], "ffXX"]

Dominik Bathon · Apr 22, 2005

Amazing Pit Capitain! Your hunch was right on the money. Not only was
your suggestion key to the solution, but the output was just as you
expected. Here's the solution I found. It's not all that elegant, but
it appears to work okay. I am certain there are much better solutions
to be had, so if anyone has one to offer...

Here is my recursive version:

class MatchData

def matchtree(index=0)
ret=[]
b, e=self.begin(index), self.end(index)
while (index+=1)<=length
if index==length || (bi=self.begin(index))>=e
# we are finished, if something is left, then add it
ret << string[b, e-b] if e>b
break
else
if bi>=b
ret << string[b, bi-b] if bi>b
ret << matchtree(index)
b=self.end(index)
end
end
end
ret
end

def matchset
[pre_match, matchtree, post_match]
end

end

md = /(bb)(cc(dd))(ee)/.match "XXaabbccddeeffXX"
p md.to_a
p md.matchset
md.length.times { |i| p md.matchtree(i) }

Output:
["bbccddee", "bb", "ccdd", "dd", "ee"]
["XXaa", [["bb"], ["cc", ["dd"]], ["ee"]], "ffXX"]
[["bb"], ["cc", ["dd"]], ["ee"]]
["bb"]
["cc", ["dd"]]
["dd"]
["ee"]

Btw. your version seems to have a little problem:

#=> ["XXaa", [["bb"], ["cc", ["dd"]], "ee"], "ffXX"]

There are no brackets around the "ee".

Dominik

Trans · Apr 22, 2005

Dominik,

Much obliged and nicely done! A recrusive algortihm actually makes a
lot a sense.

Btw. your version seems to have a little problem:

Eek. I missed that --more like a big problem. But no matter, you saved
the day! If it's okay by you, I will include in Ruby Facets (credit
given to you, of course).

Thanks,
T.

Dominik Bathon · Apr 23, 2005

Dominik,

Much obliged and nicely done! A recrusive algortihm actually makes a
lot a sense.

Eek. I missed that --more like a big problem. But no matter, you saved
the day! If it's okay by you, I will include in Ruby Facets (credit
given to you, of course).

Sure, use it for whatever you want

MatchData matchset and matchtree	0	Apr 15, 2007
questions of idiom	3	Jun 7, 2010
How bad is $'? (Was: "Get substring of line")	4	Jan 18, 2005
FAQ update (roundup of pending requests - for comment)	6	Jan 7, 2004
Request for Feedback; a module making it easier to use regular expressions.	1	Jan 31, 2005
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	15	Apr 1, 2006
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	1	Feb 1, 2004

Linear complete variation of MatchData#to_a, possible?

Trans

Pit Capitain

Trans

David A. Black

Pit Capitain

Trans

Dominik Bathon

Trans

Dominik Bathon

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads