Count substrings in string, scan too slow

Danny Challis

Hello everyone,
I need to count the number of times a substring occurs in a string.
I am currently doing this using the scan method, but it is simply too
slow. I feel there should be a faster way to do this since the scan
method is really designed for more advanced things than this. I do not
need to do regex matching or to process the matches, just count
substrings. So what I want is something like this:

s = "you like to play with your yo-yo"
s.magical_count_method("yo") => 4

Once again, what I'm really looking for is something fast. I've tried
using external Linux commands such as awk, but that was much, much
slower. Any ideas?
Thanks,

Danny.
 
Jesús Gabriel y Galán

I don't know how slow scan is for you. An implementation using
String#index and a loop is a little bit faster, but not by much:

require 'benchmark'

TIMES = 100_000
s = "you like to play with your yo-yo"

Benchmark.bmbm do |x|
  x.report("scan") do
    TIMES.times do
      s.scan("yo").size
    end
  end
  x.report("while") do
    TIMES.times do
      index = -1
      count = 0
      while (index = s.index("yo", index+1))
        count += 1
      end
      count
    end
  end
end

$ ruby scan_vs_while.rb
Rehearsal -----------------------------------------
scan    0.560000   0.020000   0.580000 (  0.585972)
while   0.440000   0.060000   0.500000 (  0.492969)
-------------------------------- total: 1.080000sec

            user     system      total        real
scan    0.510000   0.010000   0.520000 (  0.519078)
while   0.470000   0.020000   0.490000 (  0.493562)

Don't know if this is enough for you, probably not :)

Jesus.
 
Danny Challis

Thanks Jesus,
This method actually decreased the runtime by quite a bit, so thanks
for the help! However, I still need something even faster if it exists,
so any other ideas would be appreciated. I may have to just implement
this part in C or something.

Danny.
 
Jesús Gabriel y Galán

I suppose that if you implement a C method that does what I did in
Ruby, that would be faster.
I mean doing the loop in C and calling String#index from there.

Jesus.
 
Robert Klemme

I took the liberty to extend the benchmark a bit:

http://gist.github.com/451622

I would have expected regexp to be faster...

Cheers

robert

 
Jesús Gabriel y Galán

This thing about adding the length of the match can be argued
depending on the requirements, I think.
What would you expect from:

"yoyoyoyo".magical_count_method("yoyo")

2 or 3?

If you add the length to the index you get 2. If you add 1, you get 3.


irb(main):018:0> s = "yoyoyoyo"
=> "yoyoyoyo"
irb(main):019:0> count = 0
=> 0
irb(main):020:0> len = s.length
=> 8
irb(main):021:0> search = "yoyo"
=> "yoyo"
irb(main):023:0> len = search.length
=> 4
irb(main):024:0> index = -len
=> -4
irb(main):025:0> while (index = s.index(search, index + len))
irb(main):026:1> count += 1
irb(main):027:1> end
=> nil
irb(main):028:0> count
=> 2

irb(main):029:0> count = 0
=> 0
irb(main):030:0> index = -1
=> -1
irb(main):031:0> while (index = s.index(search, index + 1))
irb(main):032:1> count += 1
irb(main):033:1> end
=> nil
irb(main):034:0> count
=> 3


So, I don't know. Of course, if the requirement is to get 2 from the
above situation, adding the length is better.
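
For reference, both behaviours can be wrapped in one small helper. This is only a sketch: the method name count_substring and the overlap keyword are made up for illustration and do not come from either gist, and keyword arguments need Ruby 2.0 or later (newer than the 1.8.7 used in this thread).

```ruby
# Hypothetical helper (name and keyword are illustrative, not from the thread).
# Counts occurrences of +search+ in +str+ with String#index.
def count_substring(str, search, overlap: false)
  return 0 if search.empty?            # an empty needle would match at every offset

  step  = overlap ? 1 : search.length  # how far to move past the start of each match
  count = 0
  index = str.index(search)
  while index
    count += 1
    index = str.index(search, index + step)
  end
  count
end

count_substring("yoyoyoyo", "yoyo")                 # non-overlapping, 2 as in the irb session
count_substring("yoyoyoyo", "yoyo", overlap: true)  # overlapping, 3 as in the irb session
```

The only difference between the two modes is the step added to the index after each match, which is the point made above.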

Also of note is that the block versions of scan are slower because
they have to call a block for each match.
I think I've read that the String#index method uses Rabin-Karp. It
would be interesting to compare this to a Boyer-Moore implementation.
Of course it will depend on the input data, if it's near the best or
worst case for each, but anyway.

Jesus.
 
Danny Challis

I'm looking for non-overlapping matches (so a 2 in your example).
I modified your code to do this for me like you showed and it works
fine. I was thinking of trying a Boyer-Moore implementation, but I
suspect if I implement this manually in Ruby it will be much slower.
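
To make the idea concrete, here is a rough pure-Ruby sketch of what such a manual implementation could look like. It uses Boyer-Moore-Horspool, a simplified variant that keeps only the bad-character shift, not the full Boyer-Moore algorithm, and it compares bytes rather than characters. The method name horspool_count is hypothetical. No timing is claimed for it; it would need to be added to the benchmark to see whether the suspicion above holds.

```ruby
# Hypothetical sketch: count non-overlapping matches with Boyer-Moore-Horspool
# (bad-character shift only). Operates on bytes, not characters.
def horspool_count(text, pattern)
  t = text.bytes.to_a
  pat = pattern.bytes.to_a
  n = t.length
  m = pat.length
  return 0 if m.zero? || m > n

  # For every byte of the pattern except the last one, remember how far its
  # rightmost occurrence is from the end. Unknown bytes shift by the full length.
  shift = Hash.new(m)
  (0...m - 1).each { |i| shift[pat[i]] = m - 1 - i }

  count = 0
  pos = 0
  while pos <= n - m
    # Compare the current window against the pattern from right to left.
    j = m - 1
    j -= 1 while j >= 0 && t[pos + j] == pat[j]
    if j < 0
      count += 1
      pos += m                        # jump past the whole match: non-overlapping
    else
      pos += shift[t[pos + m - 1]]    # shift based on the last byte of the window
    end
  end
  count
end

horspool_count("you like to play with your yo-yo", "yo")
```

With a two-character needle like "yo" the shift table can skip at most two bytes at a time, so the algorithm has little room to pay off on this particular input.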
 
Michael Fellinger

botp said:
you don't like strscan? :)
best regards -botp

I've just run some benchmarks with strscan, and it's at least in the
same ballpark as the other approaches, unless you're on rubinius, but
then all string processing is really slow on that anyway.

Benchmark with strscan here: http://gist.github.com/451675
 
brabuhr

http://en.literateprograms.org/Boyer-Moore_string_search_algorithm_(Java)

require 'java'
java_import 'BoyerMoore'

x.report 'boyer_moore' do
  count = BoyerMoore.match("yo", s).size
  check count
end

$ jruby -v yomark.rb
jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot(TM) Client VM 1.6.0_20) [i386-java]
Rehearsal -----------------------------------------------
scan         22.423000   0.000000  22.423000 ( 22.334000)
scan ++      36.738000   0.000000  36.738000 ( 36.738000)
scan re      19.451000   0.000000  19.451000 ( 19.451000)
scan re ++   39.222000   0.000000  39.222000 ( 39.222000)
while        22.621000   0.000000  22.621000 ( 22.622000)
strscan      29.075000   0.000000  29.075000 ( 29.076000)
boyer_moore   0.009000   0.000000   0.009000 (  0.009000)
------------------------------------ total: 169.539000sec

                  user     system      total        real
scan         18.050000   0.000000  18.050000 ( 18.051000)
scan ++      35.046000   0.000000  35.046000 ( 35.046000)
scan re      17.807000   0.000000  17.807000 ( 17.807000)
scan re ++   34.086000   0.000000  34.086000 ( 34.085000)
while        22.089000   0.000000  22.089000 ( 22.089000)
strscan      29.538000   0.000000  29.538000 ( 29.538000)
boyer_moore   0.005000   0.000000   0.005000 (  0.004000)

$ jruby -v --server --fast yomark.rb
jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot(TM) Server VM 1.6.0_20) [i386-java]
yobench.rb:50 warning: Useless use of a variable in void context.
Rehearsal -----------------------------------------------
scan         17.340000   0.000000  17.340000 ( 17.154000)
scan ++      23.986000   0.000000  23.986000 ( 23.987000)
scan re      15.170000   0.000000  15.170000 ( 15.169000)
scan re ++   22.805000   0.000000  22.805000 ( 22.806000)
while        12.050000   0.000000  12.050000 ( 12.050000)
strscan      31.396000   0.000000  31.396000 ( 31.396000)
boyer_moore   0.010000   0.000000   0.010000 (  0.010000)
------------------------------------ total: 122.756999sec

                  user     system      total        real
scan         15.201000   0.000000  15.201000 ( 15.201000)
scan ++      23.758000   0.000000  23.758000 ( 23.758000)
scan re      14.770000   0.000000  14.770000 ( 14.770000)
scan re ++   22.455000   0.000000  22.455000 ( 22.455000)
while        12.182000   0.000000  12.182000 ( 12.182000)
strscan      24.497000   0.000000  24.497000 ( 24.497000)
boyer_moore   0.002000   0.000000   0.002000 (  0.002000)
 
brabuhr

:-( that wasn't the right one :)

x.report 'boyer_moore' do
  TIMES.times do
    count = BoyerMoore.match("yo", s).size
    check count
  end
end

jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot(TM) Client VM 1.6.0_20) [i386-java]
Rehearsal -----------------------------------------------
boyer_moore  25.742000   0.000000  25.742000 ( 25.661000)
------------------------------------- total: 25.742000sec

                  user     system      total        real
boyer_moore  24.869000   0.000000  24.869000 ( 24.869000)

jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot(TM) Server VM 1.6.0_20) [i386-java]
Rehearsal -----------------------------------------------
boyer_moore  16.733000   0.000000  16.733000 ( 16.401000)
------------------------------------- total: 16.733000sec

                  user     system      total        real
boyer_moore  15.970000   0.000000  15.970000 ( 15.971000)
 
botp

On Fri, Jun 25, 2010 at 1:16 AM, Michael Fellinger wrote:
I've just run some benchmarks with strscan, and it's at least in the
same ballpark as the other approaches, unless you're on rubinius, but
then all string processing is really slow on that anyway.

Benchmark with strscan here: http://gist.github.com/451675

that is not fair for strscan.. you are recreating the object inside the loop :)

outside loop do:
s=StringScanner.new "some string foo..."
s2=s.dup

inside loop do:
s=s2
.... s.scan_until...

best regards -botp
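
A runnable version of the idea above might look like the sketch below. It is an assumption about what was meant, not code from the gist: the helper name strscan_count is hypothetical, and it rewinds the scanner with StringScanner#reset instead of the s=s2 assignment, because plain assignment would only alias the same scanner object and its already-advanced position.

```ruby
require 'strscan'

# Hypothetical helper: counts non-overlapping matches using a scanner that the
# caller creates once and reuses across calls.
def strscan_count(scanner, pattern)
  scanner.reset                 # rewind to the start of the string
  count = 0
  # scan_until returns nil once no further match exists.
  count += 1 while scanner.scan_until(pattern)
  count
end

# Outside the loop: build the scanner and the pattern once.
scanner = StringScanner.new("you like to play with your yo-yo")
pattern = /yo/

# Inside the loop: only reset and scan.
3.times { strscan_count(scanner, pattern) }
```

Whether this changes the benchmark numbers is not shown here; it only moves the object creation out of the timed loop.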
 
Michael Fellinger

botp said:
that is not fair for strscan.. you are recreating the object inside the loop :)

That's not fair for the others, and doesn't make any difference in the
benchmark anyway.
 
botp

Michael Fellinger said:
That's not fair for the others,

indeed, in general. but if multiple/repeated operations are done on the
same string, then strscan will make a very big difference.

Michael Fellinger said:
and doesn't make any difference in the
benchmark anyway.

which makes me think that it could be possible that ruby strings may be
strscan-ready without added init load :)

best regards -botp
 
Charles Oliver Nutter

brabuhr said:
:-( that wasn't the right one :)

x.report 'boyer_moore' do
  TIMES.times do
    count = BoyerMoore.match("yo", s).size
    check count
  end
end

FYI, a large part of the overhead here is probably the Java calls,
which are a bit slower than Ruby to Ruby calls (plus it's decoding the
"yo" string to UTF-16 each call). For a larger string and fewer calls,
the pure Java BoyerMoore performance would likely benchmark a lot
better than this.

- Charlie
 
brabuhr

I had a similar suspicion and had started a modified benchmark doing
fewer loops over larger data, but had to move on to other things.

This gives me a chance to try out the JRuby Mac Installer...

Original benchmark:

jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot(TM) 64-Bit Server VM 1.6.0_20) [x86_64-java]
Rehearsal -----------------------------------------------
scan          8.851000   0.000000   8.851000 (  8.784000)
scan ++      14.186000   0.000000  14.186000 ( 14.186000)
scan re       8.594000   0.000000   8.594000 (  8.594000)
scan re ++   15.558000   0.000000  15.558000 ( 15.558000)
while         8.102000   0.000000   8.102000 (  8.101000)
strscan      14.023000   0.000000  14.023000 ( 14.023000)
boyer_moore   7.446000   0.000000   7.446000 (  7.446000)
------------------------------------- total: 76.760000sec

                  user     system      total        real
scan          8.157000   0.000000   8.157000 (  8.157000)
scan ++      13.953000   0.000000  13.953000 ( 13.953000)
scan re       8.346000   0.000000   8.346000 (  8.346000)
scan re ++   15.332000   0.000000  15.332000 ( 15.333000)
while         8.087000   0.000000   8.087000 (  8.087000)
strscan      14.303000   0.000000  14.303000 ( 14.303000)
boyer_moore   6.885000   0.000000   6.885000 (  6.885000)

Even with the Ruby to Java call overhead, the Java BoyerMoore is
coming back the fastest on this machine. For comparison:

ruby 1.8.7 (2009-06-12 patchlevel 174) [universal-darwin10.0]
Rehearsal ----------------------------------------------
scan        31.030000   0.020000  31.050000 ( 31.094718)
scan ++     62.310000   0.900000  63.210000 ( 63.227271)
scan re     31.030000   0.030000  31.060000 ( 31.110528)
scan re ++  62.820000   0.870000  63.690000 ( 63.718876)
while       26.090000   0.020000  26.110000 ( 26.095308)
strscan     28.440000   0.010000  28.450000 ( 28.485140)
----------------------------------- total: 243.570000sec

                 user     system      total        real
scan        31.240000   0.020000  31.260000 ( 31.264699)
scan ++     64.000000   0.860000  64.860000 ( 64.865223)
scan re     31.570000   0.020000  31.590000 ( 31.581045)
scan re ++  64.180000   0.980000  65.160000 ( 65.401667)
while       26.580000   0.030000  26.610000 ( 26.757658)
strscan     28.730000   0.030000  28.760000 ( 28.831860)

Unfortunately, I do not have 1.9.x on this machine at the moment.
 
