How about Enumerable#find_pattern?

A

A. S. Bradbury

Ignore the name, I don't really know what it's best to call it. Basically I've
been thinking it might be useful if Enumerable had a method to efficiently
search for a pattern in an Enumerable object. Existing methods like #find and
#grep work only on matching a single object.

This method might be implemented using the Knuth-Morris-Pratt algorithm
(http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm). It just seems to
me that it's a useful feature for Enumerable objects to have. A brute force
search is easy using #each_cons, and of course users can write their own
KMP-based searches, it just strikes me as something that has general utility.
There'd need to be one to return all matches and another to return only the
first. The method should return the position of the start of the match.
Comments?

Alex
 
D

dblack

Hi --

Ignore the name, I don't really know what it's best to call it. Basically I've
been thinking it might be useful if Enumerable had a method to efficiently
search for a pattern in an Enumerable object. Existing methods like #find and
#grep work only on matching a single object.

#find, yes, but #grep does something similar to what I understand you
to be describing:

%w{a b c}.grep(/./)
=> ["a", "b", "c"]
irb(main):002:0> str = "a\nb\nc"
=> "a\nb\nc"
irb(main):003:0> str.grep(/./)
=> ["a\n", "b\n", "c"]


David

--
David A. Black | (e-mail address removed)
Author of "Ruby for Rails" [1] | Ruby/Rails training & consultancy [3]
DABlog (DAB's Weblog) [2] | Co-director, Ruby Central, Inc. [4]
[1] http://www.manning.com/black | [3] http://www.rubypowerandlight.com
[2] http://dablog.rubypal.com | [4] http://www.rubycentral.org
 
A

A. S. Bradbury

#find, yes, but #grep does something similar to what I understand you
to be describing:

%w{a b c}.grep(/./)
=> ["a", "b", "c"]
irb(main):002:0> str = "a\nb\nc"
=> "a\nb\nc"
irb(main):003:0> str.grep(/./)
=> ["a\n", "b\n", "c"]

I'm thinking this:
foo=[1,2,3,4,5,6,5,4,3,2,1]
foo.find_pattern(5,6,5) #=> 4 (position of the start of this pattern)

Alex
 
A

A. S. Bradbury

class Array
def pattern( *args )
l = args.length
c = dup
while c.first(l) != c do
c.shift
return nil if c.length < l
end
length - c.length
end
end

That is one possible implementation, and this isn't something I'm having a
problem implementing. I'm just asking whether it would be appropriate for
there to be a pattern matcher for Enumerable objects in core implemented
using an efficient algorithm such as Knuth-Morris-Pratt. To me it just seems
useful to be able to efficiently search Enumerable objects without having to
implement it myself. It's a very small addition that just makes Enumerable
objects that much more useful.

Alex
 
A

A. S. Bradbury

class Array
=A0 =A0 def pattern( *args )
=A0 =A0 =A0 =A0 l =3D args.length
=A0 =A0 =A0 =A0 c =3D dup
=A0 =A0 =A0 =A0 while c.first(l) !=3D c do=20
=A0 =A0 =A0 =A0 =A0 =A0 c.shift
=A0 =A0 =A0 =A0 =A0 =A0 return nil if c.length < l
=A0 =A0 =A0 =A0 end
=A0 =A0 =A0 =A0 length - c.length
=A0 =A0 end
end

'while c.first(l) !=3D c do' should of course be:
while c.first(l) !=3D args do

Alex
 
A

A. S. Bradbury

Well, here's a (not very good) implementation, kind of ported from some python
code.

module Enumerable

def find_pattern(*pat)
match_length=0
match_pos=0
shifts=compute_table(pat)
self.each do |obj|
while (match_length >=0) and !(pat[match_length]===obj)
match_pos+=shifts[match_length]
match_length-=shifts[match_length]
end
match_length+=1
if match_length == pat.length
return match_pos
end
end
return nil # failed match
end

private
def compute_table(pat)
shifts=Array.new(pat.size)
shift = 1
0.upto pat.size do |i|
a=pat[i-1]
b=pat[i-shift-1]
while (shift < i) and (pat[i-1] != pat[i-shift-1])
shift += shifts[i-shift-1]
end
shifts=shift
end
return shifts
end
end

a="aaaabbaabbab".split(//)
a.find_pattern 'a', 'b', 'b' #=> 3
a.find_pattern 'c' #=> nil
a.find_pattern /a|b/, 'a', 'b' #=> 2

The idea is to allow efficient searching for patterns in the elements of any
Enumerable object, the example above is contrived.

Alex
 
R

Rick DeNatale

Well, here's a (not very good) implementation, kind of ported from some python
code.

module Enumerable

def find_pattern(*pat)
match_length=0
match_pos=0
shifts=compute_table(pat)
self.each do |obj|
while (match_length >=0) and !(pat[match_length]===obj)
match_pos+=shifts[match_length]
match_length-=shifts[match_length]
end
match_length+=1
if match_length == pat.length
return match_pos
end
end
return nil # failed match
end

private
def compute_table(pat)
shifts=Array.new(pat.size)
shift = 1
0.upto pat.size do |i|
a=pat[i-1]
b=pat[i-shift-1]
while (shift < i) and (pat[i-1] != pat[i-shift-1])
shift += shifts[i-shift-1]
end
shifts=shift
end
return shifts
end
end

a="aaaabbaabbab".split(//)
a.find_pattern 'a', 'b', 'b' #=> 3
a.find_pattern 'c' #=> nil
a.find_pattern /a|b/, 'a', 'b' #=> 2

The idea is to allow efficient searching for patterns in the elements of any
Enumerable object, the example above is contrived.


Nice, but it can fall down if the pattern contains a regexp which
matches more than one element:

a = "aaaabbabbab".split(//)
a.find_pattern /aab/, 'b' #=> nil

There's probably a way to fix that, but I'm not sure I can see how.

I did a pretty slavish translation to ruby of the KMP algorithm as
given in the Wikipedia article. It involved turning the enumerable
into either an array or a string so that it could be indexed instead
of using each. That didn't allow regexps at all though.

I've been working on hacking that to work with regexps, but that would
really only work if the enumerable was a string anyway, so it would
probably be better to do a specialized implementation in String.


--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

IPMS/USA Region 12 Coordinator
http://ipmsr12.denhaven2.com/

Visit the Project Mercury Wiki Site
http://www.mercuryspacecraft.com/
 
A

A. S. Bradbury

Nice, but it can fall down if the pattern contains a regexp which
matches more than one element:

a = "aaaabbabbab".split(//)
a.find_pattern /aab/, 'b' #=> nil

There's probably a way to fix that, but I'm not sure I can see how.

I'd actually argue that's expected behaviour and you're just using it
wrong(!). /aab/ simply doesn't match any element, so it *should* fail. The
main problem with my current implementation is it doesn't really skip ahead
when it should do, it just continues to the next iteration. This is where
it's easier when you're using an index and a while loop (and the object
responds_to []), but we can't assume that for an Enumerable-compatible
implementation.

The Enumerable mixin provides #to_a, but relying on this seems like a poor
solution...
I did a pretty slavish translation to ruby of the KMP algorithm as
given in the Wikipedia article. It involved turning the enumerable
into either an array or a string so that it could be indexed instead
of using each. That didn't allow regexps at all though.
I've been working on hacking that to work with regexps, but that would
really only work if the enumerable was a string anyway, so it would
probably be better to do a specialized implementation in String.

The string approach really doesn't seem like the right way to go about this,
if I understand you correctly. Not for the problem I'm trying to solve at
least. The idea of #find_pattern is that it will work for any arbitrary
object.

Perhaps you should be able to specify a comparison function in a block upon
calling find_pattern.

Alex

Alex
 
R

Rick DeNatale

I'd actually argue that's expected behaviour and you're just using it
wrong(!). /aab/ simply doesn't match any element, so it *should* fail.

Well, I'd expect that if you allow regexps that you should allow
regexps! But see below.
The
main problem with my current implementation is it doesn't really skip ahead
when it should do, it just continues to the next iteration. This is where
it's easier when you're using an index and a while loop (and the object
responds_to []), but we can't assume that for an Enumerable-compatible
implementation.

The Enumerable mixin provides #to_a, but relying on this seems like a poor
solution...

Not sure why.
The string approach really doesn't seem like the right way to go about this,
if I understand you correctly. Not for the problem I'm trying to solve at
least. The idea of #find_pattern is that it will work for any arbitrary
object.

But then regexps won't work except for a string.


Also, strings are a funny kind of enumerable here, since by default,
they just yield themselves in each, unless they include newlines. I'd
think that the general use of find_pattern in a string would be to
search for the pattern in the string, not as a pattern of the lines in
the string.

Perhaps you should be able to specify a comparison function in a block upon
calling find_pattern.

Of course this would require careful interface design to expand the
generality while preserving the efficiency.
 
A

A. S. Bradbury

Well, I'd expect that if you allow regexps that you should allow
regexps! But see below.

But why would it be expected behaviour for a regexp to try to match across
multiple items? The regexp is applied to a single item of the Enumerable
object, surely that's the only sensible way to deal with this?.

%w[This is an example].find_pattern /an example/ #=> false.
Surely you wouldn't expect the above snippet to automagically match?
Not sure why.

I'm just thinking that then you have to build the array in memory. It's never
going to be a big deal unless you have *really large* arrays. It just seems
cleaner to seek a solution that examines a single member of the Enumerable
object at a time.
But then regexps won't work except for a string.

Well no, #find_pattern isn't meant to be tied to regexp matching. You just
could use it for that. I'm not sure I'm really following you that well on
this point.
Also, strings are a funny kind of enumerable here, since by default,
they just yield themselves in each, unless they include newlines. I'd
think that the general use of find_pattern in a string would be to
search for the pattern in the string, not as a pattern of the lines in
the string.

Well, you'd use it in a way that is meaningful for your purposes. I'm
basically using this to find patterns in an array of tokens. If my
tokenization rules were simple, perhaps a regexp could be compiled, but it's
really not possible in this situation.

Alex
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,342
Messages
2,571,407
Members
48,796
Latest member
katerack

Latest Threads

Top