Linear complete variation of MatchData#to_a, possible?

T

Trans

Hi--

I have a it bit of a puzzle for Regexp engine hackers out there. The
#to_a method on MatchData gives a simple list of matching portions of
the matched text.

md = /(1)(2(3))(4)/.match "012345"
md.to_a
["1234", "1", "23", "3", "4"]

But I would like a method the produces a linearly segmented list of the
text itself, like this:

["0", "1", "2", [ "3" ], "4", "5" ]

Notice the array depth corresponds to the subexpression depth.

My first stab was:

def matchlist
# good idea to cache?
@matchset ||= pre_match + self[1..-1] + post_match
end

But obviously that does not work when there are subexpressions.

Is such a method even possible?

Thanks,
T.
 
P

Pit Capitain

Trans said:
I have a it bit of a puzzle for Regexp engine hackers out there. The
#to_a method on MatchData gives a simple list of matching portions of
the matched text.

md = /(1)(2(3))(4)/.match "012345"
md.to_a
["1234", "1", "23", "3", "4"]

But I would like a method the produces a linearly segmented list of the
text itself, like this:

["0", "1", "2", [ "3" ], "4", "5" ]

Notice the array depth corresponds to the subexpression depth.

Do you really think that you want this nesting? I'd expect something like

[ "0", [ [ "1" ], [ "2", [ "3" ] ], [ "4" ] ], "5" ]

Could you try to specify the rules for your desired nesting and
splitting? I think it should be possible to implement something like this.

Regards,
Pit
 
T

Trans

Hi Capitain--

Pit said:
Do you really think that you want this nesting? I'd expect something like

[ "0", [ [ "1" ], [ "2", [ "3" ] ], [ "4" ] ], "5" ]

Could you try to specify the rules for your desired nesting and
splitting? I think it should be possible to implement something like
this.

Your nesting will work too. I excluded a couple layers were I felt it
wasn't significant. I think the pre and post matches can just be the
first and last elements respectively, so another that nesting isn't
relly needed (IMHO). But you may be quite right about the next layer. I
had thought it might be dropped without issue, but perhaps it is in
fact neccessary to keep the nesting consistant with the subexpressions.
Nonetheless, yes, you have the idea. Can it be done? I'm not sure how
one would "query" the level of subexpression nesting.

Thanks,
T.
 
D

David A. Black

Hi --

Hi Capitain--

Pit said:
Do you really think that you want this nesting? I'd expect something like

[ "0", [ [ "1" ], [ "2", [ "3" ] ], [ "4" ] ], "5" ]

Could you try to specify the rules for your desired nesting and
splitting? I think it should be possible to implement something like
this.

Your nesting will work too. I excluded a couple layers were I felt it
wasn't significant. I think the pre and post matches can just be the
first and last elements respectively, so another that nesting isn't
relly needed (IMHO). But you may be quite right about the next layer. I
had thought it might be dropped without issue, but perhaps it is in
fact neccessary to keep the nesting consistant with the subexpressions.
Nonetheless, yes, you have the idea. Can it be done? I'm not sure how
one would "query" the level of subexpression nesting.

You could probably do something with MatchData#offset. The offsets of
the various captures can be compared to see whether or not they
overlap. No doubt there's a way to do this with inject (paging Robert
K. :)


David
 
P

Pit Capitain

Trans said:
Your nesting will work too. I excluded a couple layers were I felt it
wasn't significant. I think the pre and post matches can just be the
first and last elements respectively, so another that nesting isn't
relly needed (IMHO). But you may be quite right about the next layer. I
had thought it might be dropped without issue, but perhaps it is in
fact neccessary to keep the nesting consistant with the subexpressions.
Nonetheless, yes, you have the idea. Can it be done? I'm not sure how
one would "query" the level of subexpression nesting.

Hi Tom,

I'd try using MatchData#begin and MatchData#end:

md = /(1)(2(3))(4)/.match( "012345" )

md.size.times do |i|
puts( md.begin( i ) .. md.end( i ) )
end

The output is:

1..5
1..2
2..4
3..4
4..5

It should be possible to determine the splitting and the nesting out of
those numbers. If you're not successful, send me a mail and I'll try to
find some more time to help.

Regards,
Pit
 
T

Trans

Amazing Pit Capitain! Your hunch was right on the money. Not only was
your suggestion key to the solution, but the output was just as you
expected. Here's the solution I found. It's not all that elegant, but
it appears to work okay. I am certain there are much better solutions
to be had, so if anyone has one to offer...

class MatchData

def matchset
b = Hash.new(0)
e = Hash.new(0)

self.captures.size.times do |i|
b[self.begin(i)] += 1
e[self.end(i)] += 1
end

a = self.string.split(//)
c = ""
ca = []
stack = []

(a.size).times { |i|
# end
if e and e != 0
ca << c
e.times { ca = stack.pop }
c = nil
end
# begin
if b and b != 0
ca << c if c
b.times {
stack << ca
ca << []
ca = ca.last
}
c = nil
end
# content
c = "" unless c
c << a #if a
}
ca << c

return ca
end

end #class Matchdata

md = /(bb)(cc(dd))(ee)/.match "XXaabbccddeeffXX"
p md.to_a
p md.matchset
#=> ["XXaa", [["bb"], ["cc", ["dd"]], "ee"], "ffXX"]
 
D

Dominik Bathon

Amazing Pit Capitain! Your hunch was right on the money. Not only was
your suggestion key to the solution, but the output was just as you
expected. Here's the solution I found. It's not all that elegant, but
it appears to work okay. I am certain there are much better solutions
to be had, so if anyone has one to offer...

Here is my recursive version:

class MatchData

def matchtree(index=0)
ret=[]
b, e=self.begin(index), self.end(index)
while (index+=1)<=length
if index==length || (bi=self.begin(index))>=e
# we are finished, if something is left, then add it
ret << string[b, e-b] if e>b
break
else
if bi>=b
ret << string[b, bi-b] if bi>b
ret << matchtree(index)
b=self.end(index)
end
end
end
ret
end

def matchset
[pre_match, matchtree, post_match]
end

end

md = /(bb)(cc(dd))(ee)/.match "XXaabbccddeeffXX"
p md.to_a
p md.matchset
md.length.times { |i| p md.matchtree(i) }


Output:
["bbccddee", "bb", "ccdd", "dd", "ee"]
["XXaa", [["bb"], ["cc", ["dd"]], ["ee"]], "ffXX"]
[["bb"], ["cc", ["dd"]], ["ee"]]
["bb"]
["cc", ["dd"]]
["dd"]
["ee"]

Btw. your version seems to have a little problem:
#=> ["XXaa", [["bb"], ["cc", ["dd"]], "ee"], "ffXX"]
There are no brackets around the "ee".

Dominik
 
T

Trans

Dominik,

Much obliged and nicely done! A recrusive algortihm actually makes a
lot a sense.
Btw. your version seems to have a little problem:

Eek. I missed that --more like a big problem. But no matter, you saved
the day! If it's okay by you, I will include in Ruby Facets (credit
given to you, of course).

Thanks,
T.
 
D

Dominik Bathon

Dominik,

Much obliged and nicely done! A recrusive algortihm actually makes a
lot a sense.


Eek. I missed that --more like a big problem. But no matter, you saved
the day! If it's okay by you, I will include in Ruby Facets (credit
given to you, of course).

Sure, use it for whatever you want :)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,773
Messages
2,569,594
Members
45,119
Latest member
IrmaNorcro
Top