multiple regexp matches

Kevin Howe

I want to get multiple results of a regexp pattern match, offsets included.
The following code gets the proper results, but does not return offsets:

str = '<span id="1"> <span>...</span> </span>'
re = Regexp.new('(<(\/?)span>)', true) # match start or end tag
print str.scan(re).inspect

The Regexp class will return offsets, but Regexp#match only returns the
first match, so I'm not sure how to get the full list of matches.
 
Zach Dennis

Instead of using str.scan(re) use:

re.match( str );

which returns a MatchData object. You can use the MatchData object's
offset method to find your results.....i do believe.

Zach
 
Kevin Howe

Zach Dennis said:
Instead of using str.scan(re) use:

re.match( str );

which returns a MatchData object. You can use the MatchData object's
offset method to find your results.....i do believe.

Yes that's true, but if you read the second part of my message, I'd already
tried this:
 
Zach Dennis

According to rdoc you are mistaken.

I also think you are mistaken:

#!/usr/bin/ruby
t = "This is my 1 text"

re = /([^\s]*\s).*(\d)(\s.)/
md = re.match( t );
puts md.offset(0);
puts ""
puts md.offset( 1 );
puts ""
puts md.offset( 2 );
puts ""
puts md.offset( 3 );


It returns the correct offsets of the matches. offset(0) being the whole
regex, offset(1) does the first subexpression, offset(2) does the second
subexpression. It works.

Zach
 
David A. Black

Hi --

Zach Dennis said:
According to rdoc you are mistaken.

I also think you are mistaken:

#!/usr/bin/ruby
t = "This is my 1 text"

re = /([^\s]*\s).*(\d)(\s.)/
md = re.match( t );
puts md.offset(0);
puts ""
puts md.offset( 1 );
puts ""
puts md.offset( 2 );
puts ""
puts md.offset( 3 );


It returns the correct offsets of the matches. offset(0) being the whole
regex, offset(1) does the first subexpression, offset(2) does the second
subexpression. It works.

The problem is that Kevin wanted to scan a string more than once with
the same regex:

str = "abc abc abc"
re = /(\w+)/ # not /(\w+) (\w+) (\w+)/

re will scan against str three times. The difficulty is getting hold
of the offsets of all the matches from all three times, in relation to
the total length of the string.

Someone will probably post a simple or elegant solution; in the
meantime, here's mine:

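def find_offsets(str, re)
  offsets = []
  first = 0

  loop do
    # Match against the unsearched tail; offsets are relative to it.
    break unless m = re.match(str[first..-1])
    break if m.captures.empty?
    m.captures.each_with_index do |c, i|
      of = m.offset(i + 1)
      # Shift the offsets so they refer to the whole string.
      res = [c, [of[0] + first, of[1] + first]]
      yield res if block_given?
      offsets << res
    end
    # Resume the search just past this match (at least one character on).
    first += [m.end(0), 1].max
  end

  offsets
end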

# Little test:

str = '<span id="1"> <span>...</span> </span>'
re = /(<(\/?)span>)/i

puts str
(str.size/9).times { print "0123456789" }
puts; puts

find_offsets(str, re).each do |capture, (start, stop)|
  puts "\"#{capture}\" starts at #{start}, ends at #{stop}"
end

# Output:
<span id="1"> <span>...</span> </span>
0123456789012345678901234567890123456789

"<span>" starts at 14, ends at 20
"" starts at 15, ends at 15
"</span>" starts at 23, ends at 30
"/" starts at 24, ends at 25
"</span>" starts at 31, ends at 38
"/" starts at 32, ends at 33


David
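
A variant of the loop above can avoid slicing the string. Since Ruby 1.9, Regexp#match accepts a start position as its second argument, and the offsets it reports are then relative to the whole string. The sketch below relies on that; `all_matches` is a hypothetical helper name, not something Ruby provides.

```ruby
# Hypothetical helper: collect every MatchData for re in str.
def all_matches(str, re)
  matches = []
  pos = 0
  while pos <= str.length && (m = re.match(str, pos))
    matches << m
    # Step past the match; move one extra character after an empty match
    # so the loop always makes progress.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  matches
end

str = '<span id="1"> <span>...</span> </span>'
re  = /(<(\/?)span>)/i

all_matches(str, re).each do |m|
  start, stop = m.offset(0)
  puts "#{m[0].inspect} starts at #{start}, ends at #{stop}"
end
```

Because each element is a full MatchData, the group offsets are available through m.offset(1), m.offset(2) and so on without any arithmetic.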
 
Austin Ziegler

This should do what you want.

-austin

str = '<span id="1"> <span>...</span> </span>'
re = /(<(\/?)span>)/i

str.scan(re) # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]

matches = []
str.scan(re) do
  matches << Regexp.last_match
end

matches.each do |match|
  match.captures.each_with_index do |capture, ii|
    soff, eoff = match.offset(ii + 1)
    puts %Q("#{capture}" #{soff} .. #{eoff})
  end
end
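
The collecting loop above is sometimes written as a single expression by turning the scan into an enumerator. This is only a sketch of that idiom: it assumes Regexp.last_match still refers to the current match inside the block, as it does in the loop above, and it reuses the thread's own `str` and `re`.

```ruby
str = '<span id="1"> <span>...</span> </span>'
re  = /(<(\/?)span>)/i

# One MatchData per match, gathered without an explicit accumulator.
matches = str.to_enum(:scan, re).map { Regexp.last_match }

matches.each do |match|
  soff, eoff = match.offset(0)
  puts %Q("#{match[0]}" #{soff} .. #{eoff})
end
```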
 
Zach Dennis

Ah...thanks for the clarification David. I was mistaken.

Sorry for the confusion Kevin.

Zach
 
Kevin Howe

Awesome that works great thank you. I have to wonder why Ruby doesn't have
this built in, it's simple enough to add a method that returns a list of
MatchData objects as follows:

class MultiRegexp < Regexp
  def matches(str)
    str.scan(self) do
      yield Regexp.last_match
    end
  end
end

str = '<span id="1"> <span>...</span> </span>'
re = MultiRegexp.new('(<(\/?)span>)', true)
re.matches(str) { |i|
  capture = i.captures[0]
  start, stop = i.offset(0)
  puts "\"#{capture}\" starts at #{start}, ends at #{stop}"
}

An even nicer alternative would be to add a Regexp::MULTIMATCH constant:

str = '<span id="1"> <span>...</span> </span>'
re = Regexp.new('(<(\/?)span>)', Regexp::MULTIMATCH)
matches = re.match(str)

Just a thought :)


 
Robert Klemme

Austin Ziegler said:
This should do what you want.

-austin

str = '<span id="1"> <span>...</span> </span>'
re = /(<(\/?)span>)/i

str.scan(re) # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]

matches = []
str.scan(re) do
  matches << Regexp.last_match
end

matches.each do |match|
  match.captures.each_with_index do |capture, ii|
    soff, eoff = match.offset(ii + 1)
    puts %Q("#{capture}" #{soff} .. #{eoff})
  end
end

While that works, isn't it ridiculous that one has to resort to a class
method ("Regexp.last_match")? I mean, there should rather be something like

/o/.each( "foo" ) do |md|
  # md is MatchData
end

Or even

/o/.matcher( "foo" ).each do |md|
  # md is MatchData
end

That way Matcher could implement Enumerable.

Sounds like a candidate for a RCR. Any comments?

robert
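
To make the suggestion concrete, here is one way such a matcher might look in plain Ruby. Everything in it is hypothetical: Ruby has no `Matcher` class or `Regexp#matcher`, and the sketch assumes the two-argument form of Regexp#match (pattern start position) that Ruby has offered since 1.9.

```ruby
# Hypothetical Matcher: not part of Ruby, only a sketch of the proposed interface.
class Matcher
  include Enumerable

  def initialize(regexp, string)
    @regexp = regexp
    @string = string
  end

  # Yield one MatchData per non-overlapping match, left to right.
  def each
    return enum_for(:each) unless block_given?
    pos = 0
    while pos <= @string.length && (md = @regexp.match(@string, pos))
      yield md
      # Skip one extra character after an empty match to guarantee progress.
      pos = md.end(0) == md.begin(0) ? md.end(0) + 1 : md.end(0)
    end
    self
  end
end

matcher = Matcher.new(/o/, "foo")
matcher.each { |md| puts "#{md[0]} at #{md.begin(0)}" }
p matcher.map { |md| md.offset(0) }
```

Since the class includes Enumerable, map, select, first and the rest come for free, and nothing depends on Regexp.last_match or $~.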
 
Simon Strandgaard

Robert Klemme said:

While that works, isn't it ridiculous that one has to resort to a class
method ("Regexp.last_match")? I mean, there should rather be something
like

/o/.each( "foo" ) do |md|
# md is MatchData
end

Or even

/o/.matcher( "foo" ).each do |md|
# md is MatchData
end

That way Matcher could implement Enumerable.

Sounds like a candidate for a RCR. Any comments?


What about $~ ?

bash-2.05b$ ruby a.rb
[[0, 13], [14, 20], [23, 30], [31, 38]]
bash-2.05b$ expand -t2 a.rb
str = '<span id="1"> <span>...</span> </span>'
re = Regexp.new('(<(\/?)span[^\n/]*?>)', true) # match start or end tag
positions = []
str.scan(re) do
  positions << [$~.begin(0), $~.end(0)]
end
p positions
bash-2.05b$
 
Robert Klemme

Simon Strandgaard said:
What about $~ ?

bash-2.05b$ ruby a.rb
[[0, 13], [14, 20], [23, 30], [31, 38]]
bash-2.05b$ expand -t2 a.rb
str = '<span id="1"> <span>...</span> </span>'
re = Regexp.new('(<(\/?)span[^\n/]*?>)', true) # match start or end tag
positions = []
str.scan(re) do
  positions << [$~.begin(0), $~.end(0)]
end
p positions
bash-2.05b$

This has the same problem, only that in this case you don't use a class
method but a global variable. Neither of them is connected in any way to
the regexp you use other than through a hidden side effect of the matching
process. I would like a more explicit connection, similar to the one I suggested.

Kind regards

robert
 
Austin Ziegler

Robert Klemme said:
While that works, isn't it ridiculous that one has to resort to a
class method ("Regexp.last_match")? I mean, there should rather be
something like

/o/.each( "foo" ) do |md|
# md is MatchData
end

There's a simple solution, and I'll probably open an RCR about this
if others agree with it. String#scan, #sub, and #gsub should yield
MatchData objects, not Strings. There are probably others, but those
are the ones that come to mind. This *will* break some code,
unfortunately, but that can be mitigated by adding #to_str. IMO,
this will make #gsub much easier to deal with, as you won't have to
resort to either Regexp.last_match or $[0-9] variables to be able to
work with captures. My Regexp.last_match call only presumes that
Regexp.last_match is actually threadsafe, whereas we know that the
ugly Perlish $ variables are threadsafe. I think this is an
acceptable level of incompatibility because of the use of #to_str
and the amount of flexibility that would be gained. As far as I
know, it wouldn't require *that* big a change, because for
Regexp.last_match to work, there must still be a MatchData object
*somewhere*.

What do you think?

-austin
 
Simon Strandgaard

There's a simple solution, and I'll probably open an RCR about this
if others agree with it. String#scan, #sub, and #gsub should yield
MatchData objects, not Strings.
[snip]

Agree.. this would be nice.. I think I have seen an RCR about it a long time
ago (but I cannot locate that RCR).

btw: my ruby regexp engine does so.. it yields matchdata instead of string.
http://raa.ruby-lang.org/project/regexp/
 
Florian Gross

Austin said:
There's a simple solution, and I'll probably open an RCR about this
if others agree with it. String#scan, #sub, and #gsub should yield
MatchData objects, not Strings.

I agree with this, and it seems the only reason matz hasn't done this yet
is backwards compatibility.

I'm referring to this posting of his:

http://groups.google.com/[email protected]
What do you think?

I heavily agree with this. It's the way it should have been since the
beginning. #to_str sounds like a way that shouldn't break too much code,
and Ruby could issue a migration warning when it is called.

Rite was said to sacrifice compatibility for the sake of more elegance,
so now might be a good time to switch.

Regards,
Florian Gross
 
Robert Klemme

Austin Ziegler said:
There's a simple solution, and I'll probably open an RCR about this
if others agree with it. String#scan, #sub, and #gsub should yield
MatchData objects, not Strings. There are probably others, but those
are the ones that come to mind. This *will* break some code,
unfortunately, but that can be mitigated by adding #to_str. IMO,
this will make #gsub much easier to deal with, as you won't have to
resort to either Regexp.last_match or $[0-9] variables to be able to
work with captures. My Regexp.last_match call only presumes that
Regexp.last_match is actually threadsafe, whereas we know that the
ugly Perlish $ variables are threadsafe. I think this is an
acceptable level of incompatibility because of the use of #to_str
and the amount of flexibility that would be gained. As far as I
know, it wouldn't require *that* big a change, because for
Regexp.last_match to work, there must still be a MatchData object
*somewhere*.

What do you think?

I like the functionality very much, but I'd prefer *not* to change the
behavior of String#scan, #sub, and #gsub. I'd rather have Regexp#scan(str,
&block), Regexp#sub(str, replace=nil, &block) and Regexp#gsub(str,
replace=nil, &block) that yield MatchData if there is a block. There might
be other names, but since the behavior is quite similar to those methods in
String, these names are probably good. The only drawback I can see is that
they might cause confusion ("Which were the ones that yielded MatchData?"),
but IMHO people can cope with this, especially since the old behavior does not
change. (Personally I would find it easy to remember that Regexp <->
MatchData and String <-> String or Array of String.)

Kind regards

robert
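
A rough idea of how the Regexp-side variants could behave, written as plain module functions so that nothing in core is reopened. `RegexpOps` and its method names are hypothetical stand-ins for the proposed Regexp#scan and Regexp#gsub; the sketch simply wraps the existing String methods and hands the block a MatchData.

```ruby
# Hypothetical module; Ruby's Regexp has no #scan or #gsub of its own.
module RegexpOps
  module_function

  # Like String#scan with a block, but the block receives MatchData.
  def scan(re, str)
    str.scan(re) { yield Regexp.last_match }
    str
  end

  # Like String#gsub with a block, but the block receives MatchData.
  def gsub(re, str)
    str.gsub(re) { yield Regexp.last_match }
  end
end

str = '<span id="1"> <span>...</span> </span>'
re  = /(<(\/?)span>)/i

RegexpOps.scan(re, str) { |md| p md.offset(0) }
# Group 2 is "/" for a closing tag and empty for an opening one.
puts RegexpOps.gsub(re, str) { |md| md[2].empty? ? "<div>" : "</div>" }
```

String#scan and String#gsub keep their current behavior here, which is the point of putting the new variants somewhere else.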
 
Austin Ziegler

Here is the RCR I will be submitting. There is a server error on
rcrchive that prevents me from submitting it there.

Make String#scan, #gsub, and #sub yield MatchData objects
backwards compatibility [x]

Abstract:
A "least-break" change to <code>String#scan</code>,
<code>#gsub</code>, and <code>#sub</code> to provide the MatchData to
attached code blocks.

Problem:
<code>String#scan</code>, <code>#gsub</code>, and <code>#sub</code>
yield the string value of the matched regular expression to a provided
block, which is of very limited value. Currently, we must rely upon
either ugly numeric match variables (<code>$1</code> - <code>$9</code>,
etc.) or a class method (<code>Regexp.last_match</code>) to
obtain the match.

<pre>str = '<span id="1"> <span>...</span> </span>'
re = /(<(\/?)span>)/i

str.scan(re)
# => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]

matches = []
str.scan(re) do
  matches << Regexp.last_match
end

matches.each do |match|
  match.captures.each_with_index do |capture, ii|
    soff, eoff = match.offset(ii + 1)
    puts %Q("#{capture}" #{soff} .. #{eoff})
  end
end</pre>

Proposal:
<code>String#scan</code>, <code>#sub</code>, and <code>#gsub</code>
yield MatchData objects instead of Strings. I think that this could be
achieved while breaking the least amount of code by adding a #to_str
implementation to MatchData.

Analysis:
I have written code as noted in the problem section; it feels
unnecessarily complex and fragile. This change will work in all cases
where a single string is provided; it will require a change to code
that deals with array values (e.g., when String#scan is given a pattern
with groups, because of the use of rb_reg_nth_match in scan_once);
switching such code to MatchData#captures will work just fine.

Implementation:
I *think* that the changes look something like this:
<pre>
--- re.c.old 2004-08-22 00:24:09 Eastern Daylight Time
+++ re.c 2004-08-22 00:18:50 Eastern Daylight Time

@@ -2320,6 +2320,7 @@
rb_define_method(rb_cMatch, "pre_match", rb_reg_match_pre, 0);
rb_define_method(rb_cMatch, "post_match", rb_reg_match_post, 0);
rb_define_method(rb_cMatch, "to_s", match_to_s, 0);
+ rb_define_method(rb_cMatch, "to_str", match_to_s, 0);
rb_define_method(rb_cMatch, "inspect", rb_any_to_s, 0); /* in object.c */
rb_define_method(rb_cMatch, "string", match_string, 0);
}

--- string.c.old 2004-08-22 00:24:10 Eastern Daylight Time
+++ string.c 2004-08-22 00:20:35 Eastern Daylight Time

@@ -1928,7 +1928,7 @@

if (iter) {
rb_match_busy(match);
- repl = rb_obj_as_string(rb_yield(rb_reg_nth_match(0, match)));
+ repl = rb_obj_as_string(rb_yield(match));
rb_backref_set(match);
}
else {
@@ -2043,7 +2043,7 @@
regs = RMATCH(match)->regs;
if (iter) {
rb_match_busy(match);
- val = rb_obj_as_string(rb_yield(rb_reg_nth_match(0, match)));
+ val = rb_obj_as_string(rb_yield(match));
rb_backref_set(match);
}
else {
@@ -4164,15 +4164,7 @@
else {
*start = END(0);
}
- if (regs->num_regs == 1) {
- return rb_reg_nth_match(0, match);
- }
- result = rb_ary_new2(regs->num_regs);
- for (i=1; i < regs->num_regs; i++) {
- rb_ary_push(result, rb_reg_nth_match(i, match));
- }
-
- return result;
+ return match;
}
return Qnil;
}
</pre>

I'm not 100% sure that this is right, and I haven't tested it. The
equivalent Ruby code would be (note: this code appears to work, but
it does cause problems with irb):

<pre>class MatchData
  def to_str
    self.to_s
  end
end

class String
  alias_method :old_scan, :scan
  alias_method :old_gsub!, :gsub!
  alias_method :old_sub!, :sub!

  def scan(pattern)
    if block_given?
      old_scan(pattern) { yield Regexp.last_match }
    else
      old_scan(pattern)
    end
  end

  def gsub(pattern, repl = nil, &block)
    s = self.dup
    s.gsub!(pattern, repl, &block)
    s
  end

  def gsub!(pattern, repl = nil)
    if block_given? and repl.nil?
      old_gsub!(pattern) { yield Regexp.last_match }
    elsif repl.nil?
      old_gsub!(pattern)
    else
      old_gsub!(pattern, repl)
    end
  end

  def sub(pattern, repl = nil, &block)
    s = self.dup
    s.sub!(pattern, repl, &block)
    s
  end

  def sub!(pattern, repl = nil)
    if block_given? and repl.nil?
      old_sub!(pattern) { yield Regexp.last_match }
    elsif repl.nil?
      old_sub!(pattern)
    else
      old_sub!(pattern, repl)
    end
  end
end</pre>
 
nobu.nokada

Hi,

At Sun, 22 Aug 2004 14:00:45 +0900,
Austin Ziegler wrote in [ruby-talk:110110]:
Proposal:
<code>String#scan</code>, <code>#sub</code>, and <code>#gsub</code>
yield MatchData objects instead of Strings. I think that this could be
achieved while breaking the least amount of code by adding a #to_str
implementation to MatchData.

#to_str doesn't solve everything. MatchData#[] returns a matched
portion for sub-patterns, whereas String#[] returns a byte at
the position.
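
A small illustration of the mismatch described above, as a sketch rather than anything from the RCR. Note one assumption about versions: in the Ruby 1.8 of this thread String#[] with an integer returned a byte value, whereas from 1.9 on it returns a one-character string; the indexing difference is the same either way.

```ruby
md = /(fo)(o)/.match("foobar")

# MatchData#[] indexes the match and its capture groups...
p md[0]       # => "foo"  (whole match)
p md[1]       # => "fo"   (first group)

# ...while String#[] indexes into the matched text itself.
s = md.to_s
p s[0]        # => "f" on Ruby 1.9 and later
p s[1]        # => "o" on Ruby 1.9 and later
```

So code that received a String and indexed it would silently get different values if handed a MatchData, even with #to_str defined.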
 
Austin Ziegler

nobu.nokada said:
#to_str doesn't solve everything. MatchData#[] returns a matched
portion for sub-patterns, whereas String#[] returns a byte at
the position.

Agreed. It is also 100% incompatible on #scan with groups in the
regexp (e.g., "foobar".scan(/(..)(.)/) will yield [["fo", "o"], ["ba",
"r"]]). This is the argument for Regexp#scan instead of modifying
String#scan. However, this is something that I believe should be
changed. An alternative is to yield both the normal values and the
match -- but that itself will be incompatible with #scan and most
current uses of #gsub and #sub that use the match value.

Yet another alternative is to add an optional parameter in all cases.
String#gsub currently expects a regexp and a replace pattern OR a
regexp and a block. #gsub could be modified such that when it gets a
regexp, a "boolean", and a block, it yields something different. This
could be, for example:

String#gsub(pattern, true) { |match_data| ... }
String#gsub(pattern) { |string| ... }

I would actually rather see the opposite form, if we do this:

String#gsub(pattern, true) { |string| ... }
String#gsub(pattern) { |match_data| ... }

This would encourage the use of the new form. By doing it this way, a
transition period can be introduced for this (e.g., in 1.8.3 it may
warn that the current replace will be changed to yield a match_data
instead of a string; in 1.9 it yields a match_data instead of a
string).

I have *not* analysed code out there that uses #gsub/#scan/#sub, but I
think that this is an ideal change.

-austin (I'm also adding this to the discussion on RCR276)
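
For reference, this is what the current block form hands over when the pattern has groups, next to what is reachable through Regexp.last_match. It is only a demonstration of existing behavior, not of any proposed API.

```ruby
# With groups, the block receives an Array of the captures.
pairs = []
"foobar".scan(/(..)(.)/) { |groups| pairs << groups }
p pairs   # => [["fo", "o"], ["ba", "r"]]

# The whole match and its offsets are only reachable via the MatchData.
whole = []
"foobar".scan(/(..)(.)/) { whole << [Regexp.last_match[0], Regexp.last_match.offset(0)] }
p whole   # => [["foo", [0, 3]], ["bar", [3, 6]]]
```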
 
