finding positions in a string

B

bwv549

I have a long string and I need to know the starting position of all
substrings matching a particular sequence. Most importantly, it needs
to be fast. Secondly, it would be nice if it was somewhat concise.

Here's the method:

def substring_positions(substring, string)
## fast, concise method??
end

my_substring = 'this'
my_string = 'this string this is a string this is it'

substring_positions(my_substring, my_string) # -> should return
[0, 12, 29]

This seems trivial to do, but looking at StringScanner, String#match,
and String#scan, nothing simple comes to mind. There must be a one-
liner somewhere for this kind of thing. I checked through facets and
didn't see anything...

Thanks
 
D

dblack

Hi --

I have a long string and I need to know the starting position of all
substrings matching a particular sequence. Most importantly, it needs
to be fast. Secondly, it would be nice if it was somewhat concise.

Here's the method:

def substring_positions(substring, string)
## fast, concise method??

Would you settle for concise? :)
end

my_substring = 'this'
my_string = 'this string this is a string this is it'

substring_positions(my_substring, my_string) # -> should return
[0, 12, 29]

This seems trivial to do, but looking at StringScanner, String#match,
and String#scan, nothing simple comes to mind. There must be a one-
liner somewhere for this kind of thing. I checked through facets and
didn't see anything...

I'll get the ball rolling, though I doubt it's very efficient:

require 'enumerator'
def substring_positions(substring, string)
string.enum_for:)scan, substring).map { $~.offset(0)[0] }
end


David

--
* Books:
RAILS ROUTING (new! http://www.awprofessional.com/title/0321509242)
RUBY FOR RAILS (http://www.manning.com/black)
* Ruby/Rails training
& consulting: Ruby Power and Light, LLC (http://www.rubypal.com)
 
B

bwv549

Well, here's what I have so far:

def substring_positions(substring, string)
substring_len = substring.size
s = StringScanner.new(string)
regexp = Regexp.new(Regexp.escape(substring))
positions = []
while s.scan_until(regexp)
positions << (s.pos - substring_len)
end
positions
end

In theory, it looks fairly fast, but it's not all that concise...

Anyone have anything better?

Thanks
 
B

bwv549

require 'enumerator'
def substring_positions(substring, string)
string.enum_for:)scan, substring).map { $~.offset(0)[0] }
end

That's the one-line magic I knew was hiding out there. I will use it
and see if it is acceptably fast. Thanks
 
P

Peña, Botp

From: bwv549 [mailto:[email protected]]=20
# substring_positions(my_substring, my_string) # -> should return
# [0, 12, 29]
#=20
# This seems trivial to do, but looking at StringScanner, String#match,
# and String#scan, nothing simple comes to mind. There must be a one-
# liner somewhere for this kind of thing. =20

not a raw one-liner, but you can create one ;)

root@pc4all:~# cat -n test.rb
1 class String
2 def scan_p ss
3 a =3D []
4 self.scan(ss){a << Regexp.last_match.begin(0)}
5 a
6 end
7 end
8
9 p "this string this is a string this is it".scan_p("this")
10
root@pc4all:~# ruby test.rb
[0, 12, 29]
root@pc4all:~#

i think the keyword is Matchdata.

kind regards -botp
 
J

James Edward Gray II

I have a long string and I need to know the starting position of all
substrings matching a particular sequence. Most importantly, it needs
to be fast. Secondly, it would be nice if it was somewhat concise.

Here's the method:

def substring_positions(substring, string)
## fast, concise method??
end

my_substring = 'this'
my_string = 'this string this is a string this is it'

substring_positions(my_substring, my_string) # -> should return
[0, 12, 29]

Another idea is to use String#index() with the optional second
minimum index parameter which you keep incrementing.

James Edward Gray II
 
S

Skye Shaw!@#$

Another idea is to use String#index() with the optional second
minimum index parameter which you keep incrementing.

James Edward Gray II


Indeed.

d-o-ibook-g4:~ d-o$ uname -a
Darwin d-o-ibook-g4.local 8.8.0 Darwin Kernel Version 8.8.0: Fri Sep
8 17:18:57 PDT 2006; root:xnu-792.12.6.obj~1/RELEASE_PPC Power
Macintosh powerp

d-o-ibook-g4:~ d-o$ cat bs.rb
require 'benchmark'
require 'enumerator'

COUNT=1_000_000

class String
def scan_p ss
a = []
self.scan(ss){a << Regexp.last_match.begin(0)}
a
end
end

substr = Proc.new do |s,seq|
pos=0
index=[]

while((i=s.index(seq,pos))!=nil)
index<<i
pos=i+seq.length
end

index
end

substr_positions = Proc.new do |string,substring|
string.enum_for:)scan, substring).map { $~.offset(0)[0] }
end

str="assbasscassmassthatassbass"
seq="ss"

Benchmark.bm do |x|
x.report("with index(): ") do; 1.upto(COUNT) do;
substr.call(seq,str); end end
x.report("with enum_for()/map(): ") do; 1.upto(COUNT) do;
substr_positions.call(str,seq); end end
x.report("with scan(): ") do; 1.upto(COUNT) do; str.scan_p(seq); end
end
end

d-o-ibook-g4:~ d-o$ ruby bs.rb
user system total real
with index(): 10.380000 0.060000 10.440000 ( 10.670129)
with enum_for()/map(): 88.740000 0.910000 89.650000 ( 98.726685)
with scan(): 57.620000 0.590000 58.210000 ( 65.996496)
 
P

Peña, Botp

From: Skye Shaw!@#$ [mailto:[email protected]]=20
# On Jul 23, 9:10 pm, James Edward Gray II <[email protected]>
# wrote:
# > Another idea is to use String#index() with the optional second =20
# > minimum index parameter which you keep incrementing.
#=20

cool and simple indeed.

# substr =3D Proc.new do |s,seq|
# pos=3D0
# index=3D[]
^^^^^
rename. it might confuse us nubies :)

#=20
# while((i=3Ds.index(seq,pos))!=3Dnil)

simplify. while i=3Ds.index(seq,pos)

# index<<i
# pos=3Di+seq.length
^^^^^
this does not change. init this outside loop.

# end
#=20
# index
# end


hmmm, even the recursive index method is not bad..

root@pc4all:~# cat -n test.rb
1 require 'benchmark'
2 require 'enumerator'
3
4 COUNT=3D100_000
5
6 class String
7 def scan_p ss
8 a =3D []
9 self.scan(ss){a << Regexp.last_match.begin(0)}
10 a
11 end
12
13 # recursive scan using index
14 def scan_p2 ss, pos =3D 0
15 return [] unless i =3D index(ss,pos)
16 + scan_p2(ss, i + ss.length)
17 end
18 end
19
20 class String
21 def scan_i seq
22 pos=3D0
23 ndx=3D[]
24 slen =3D seq.length
25 while i=3Dindex(seq,pos)
26 ndx << i
27 pos =3D i + slen
28 end
29 ndx
30 end
31 end
32
33 class String
34 #substr_positions =3D Proc.new do |string,substring|
35 def scan_enum substring
36 enum_for:)scan, substring).map { $~.offset(0)[0] }
37 end
38 end
39
40 str=3D"assbasscassmassthatassbass"
41 seq=3D"ss"
42
43 Benchmark.bm do |x|
44 x.report("with index(): ") do; 1.upto(COUNT) do;
45 str.scan_i(seq); end end
46 x.report("with enum_for()/map(): ") do; 1.upto(COUNT) do;
47 str.scan_enum(seq); end end
48 x.report("with scan_p(): ") do; 1.upto(COUNT) do; =
str.scan_p(seq); end end
49 x.report("with scan_p2(): ") do; 1.upto(COUNT) do; =
str.scan_p2(seq); end end
50
51 end
52
53
root@pc4all:~# ruby test.rb
user system total real
with index(): 4.760000 0.010000 4.770000 ( 4.774163)
with enum_for()/map(): 22.100000 0.010000 22.110000 ( 22.139630)
with scan_p(): 16.680000 0.010000 16.690000 ( 16.702391)
with scan_p2(): 9.940000 0.010000 9.950000 ( 9.955084)
root@pc4all:~#

not bad at 100k loops :)

root@pc4all:~# cat /proc/cpuinfo | egrep -i '(mhz|name)'
model name : Pentium II (Klamath)
cpu MHz : 300.061
root@pc4all:~# cat /proc/meminfo | grep -i mem
MemTotal: 255616 kB
MemFree: 7920 kB

kind regards -botp
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,007
Latest member
obedient dusk

Latest Threads

Top