String.scan - catching overlapping patterns with lookahead

Ardanwen · Dec 16, 2004

Hi all.

I wrote a small script to count the number of occurrences of one string
in another string (including somewhat overlapping occurrences). Then I
found out about the .scan method, which speeded up some things, but
unfortunately introduced the 'banana problem' -> ("banana".scan("ana")
returns only one "ana")

(This was the script I had before i found out about scan

------
$proteasome.each {|x|
i = VIRUS #virus length
begin
i = virus[0,i + 4].rindex(x)
count_prot += 1 if i != nil
end until i == nil
}
------

proteasome is an array filled with 12 strings ("00010" , "00100" ,
"10110" etc,
virus is a long string of 1's and 0's (~3000 in total).

Initially, I changed the following, which speeded up the whole program
about 25% but suffered from the banana problem (most of the program's
energy goes into the above algorithm)

------
$proteasome.each {|x|
virus.scan(x) {
count_prot += 1
}
}
------

So I changed it into the following, and then optimized a little bit
(first did virus.scan(/(?=#{x})/ which is a little slower then what I'm
doing now)

------
$proteasome.each {|x|
virus.scan(/#{x[0].chr}(?=#{x[1,4]})/) {
count_prot += 1
}
}
------

The sad thing is, the above is comparable in speed with the script I
had before I found out about the scan method. Did I miss any obvious
optimizations in the scanning/regexp method?

Thanks!,
Boris

Robert Klemme · Dec 16, 2004

So I changed it into the following, and then optimized a little bit
(first did virus.scan(/(?=#{x})/ which is a little slower then what I'm
doing now)

------
$proteasome.each {|x|
virus.scan(/#{x[0].chr}(?=#{x[1,4]})/) {
count_prot += 1
}
}
------

The sad thing is, the above is comparable in speed with the script I
had before I found out about the scan method. Did I miss any obvious
optimizations in the scanning/regexp method?

Do you have any metacharacters in the strings contained in $proteasome?
In that case you should use Regexp.escape.

Although it might be a small optimization, you could split your search
string at another position than the first. For example, if you search for
"foofii" then you can split after the second "o" without loosing anything.
In fact you could even skip the lookahead for "foofii" completely, because
if it matches the earliest next match is after the complete string.
However there's a certain overhead involved in finding this position so it
might not be worth it if the strings you search through are short. If you
search through huge piles of data then it might be worthwile.

See also http://en.wikipedia.org/wiki/String_searching_algorithm

Kind regards

robert

David A. Black · Dec 16, 2004

Hi all.

I wrote a small script to count the number of occurrences of one string
in another string (including somewhat overlapping occurrences). Then I
found out about the .scan method, which speeded up some things, but
unfortunately introduced the 'banana problem' -> ("banana".scan("ana")
returns only one "ana")

(This was the script I had before i found out about scan
------
$proteasome.each {|x|
i = VIRUS #virus length
begin
i = virus[0,i + 4].rindex(x)
count_prot += 1 if i != nil
end until i == nil
}
------

proteasome is an array filled with 12 strings ("00010" , "00100" ,
"10110" etc,
virus is a long string of 1's and 0's (~3000 in total).

Initially, I changed the following, which speeded up the whole program
about 25% but suffered from the banana problem (most of the program's
energy goes into the above algorithm)

------
$proteasome.each {|x|
virus.scan(x) {
count_prot += 1
}
}
------

So I changed it into the following, and then optimized a little bit
(first did virus.scan(/(?=#{x})/ which is a little slower then what I'm
doing now)

------
$proteasome.each {|x|
virus.scan(/#{x[0].chr}(?=#{x[1,4]})/) {
count_prot += 1
}
}
------

The sad thing is, the above is comparable in speed with the script I
had before I found out about the scan method. Did I miss any obvious
optimizations in the scanning/regexp method?

Are you sure about the benchmarks? I got results that suggest that
the plain scan version is much faster, assuming I've adapted your
original code correctly:

require 'benchmark'
include Benchmark

str = "bananaaananaaana" * 100
n = 5000

def boris(str)
c = 0
i = str.size
begin
i = str[0,i + 2].rindex("ana")
c += 1 if i != nil
end until i == nil
c
end

def david(str)
str.scan(/(?=ana)/).size
end

bm do |x|
x.report { n.times { boris(str) } }
x.report { n.times { david(str) } }
end

Results:

$ ruby ban.rb
user system total real
27.460000 0.490000 27.950000 ( 28.048449)
8.790000 0.020000 8.810000 ( 8.800876)

David

Robert Klemme · Dec 16, 2004

David A. Black said:
Hi all.

I wrote a small script to count the number of occurrences of one string
in another string (including somewhat overlapping occurrences). Then I
found out about the .scan method, which speeded up some things, but
unfortunately introduced the 'banana problem' -> ("banana".scan("ana")
returns only one "ana")

(This was the script I had before i found out about scan
------
$proteasome.each {|x|
i = VIRUS #virus length
begin
i = virus[0,i + 4].rindex(x)
count_prot += 1 if i != nil
end until i == nil
}
------

proteasome is an array filled with 12 strings ("00010" , "00100" ,
"10110" etc,
virus is a long string of 1's and 0's (~3000 in total).

Initially, I changed the following, which speeded up the whole program
about 25% but suffered from the banana problem (most of the program's
energy goes into the above algorithm)

------
$proteasome.each {|x|
virus.scan(x) {
count_prot += 1
}
}
------

So I changed it into the following, and then optimized a little bit
(first did virus.scan(/(?=#{x})/ which is a little slower then what I'm
doing now)

------
$proteasome.each {|x|
virus.scan(/#{x[0].chr}(?=#{x[1,4]})/) {
count_prot += 1
}
}
------

The sad thing is, the above is comparable in speed with the script I
had before I found out about the scan method. Did I miss any obvious
optimizations in the scanning/regexp method?

Click to expand...

Are you sure about the benchmarks? I got results that suggest that
the plain scan version is much faster, assuming I've adapted your
original code correctly:

require 'benchmark'
include Benchmark

str = "bananaaananaaana" * 100
n = 5000

def boris(str)
c = 0
i = str.size
begin
i = str[0,i + 2].rindex("ana")
c += 1 if i != nil
end until i == nil
c
end

def david(str)
str.scan(/(?=ana)/).size
end

bm do |x|
x.report { n.times { boris(str) } }
x.report { n.times { david(str) } }
end

Results:

$ ruby ban.rb
user system total real
27.460000 0.490000 27.950000 ( 28.048449)
8.790000 0.020000 8.810000 ( 8.800876)

David

As far as I remember he didn't do the size trick your used. That might be
one point.

robert

Ardanwen · Dec 16, 2004

First time trying to use ruby's benchmark module. Doing something
wrong,
----------
boris:falcon:~/SPEC:ruby bench-search.rb
user system total real
../benchmark.rb:148:in `times': undefined method `times' for
Process:Module (NameError)
----------
so I switched back to linux's "time bench-search.rb".

Re-rewrote your program to something that resembles mine more again,
and then
they're similarly slow again (n=50000).

boris:falcon:~/SPEC:time bench-search.rb 7.430u 0.060s 0:07.99 93.7%
0+0k 0+0io 359pf+0w (david)
boris:falcon:~/SPEC:time bench-search.rb 7.480u 0.010s 0:07.54 99.3%
0+0k 0+0io 348pf+0w (boris)

Not sure why you find something so different. Gutfeeling says it might
be because of your search size (3 chars instead of 5 chars, or the
specific search- and searched-string you use. Doing some testing right
now (n = 50000 though).

Now with a 5char nonchanging pattern, changing searched-string
[syntax: program pattern randomseed]
boris:falcon:~/SPEC:time boris.rb 11111 1 8.210u 0.140s 0:08.43 99.0%
boris:falcon:~/SPEC:time boris.rb 11111 1 8.320u 0.100s 0:08.57 98.2%
boris:falcon:~/SPEC:time boris.rb 11111 2 7.920u 0.030s 0:07.98 99.6%
boris:falcon:~/SPEC:time boris.rb 11111 2 7.980u 0.020s 0:07.99 100.1%
boris:falcon:~/SPEC:time boris.rb 11111 3 4.760u 0.010s 0:04.77 100.0%
boris:falcon:~/SPEC:time boris.rb 11111 3 4.770u 0.020s 0:04.82 99.3%
boris:falcon:~/SPEC:time david.rb 11111 1 7.780u 0.000s 0:07.79 99.8%
boris:falcon:~/SPEC:time david.rb 11111 1 7.840u 0.000s 0:07.91 99.1%
boris:falcon:~/SPEC:time david.rb 11111 2 7.530u 0.000s 0:07.58 99.3%
boris:falcon:~/SPEC:time david.rb 11111 2 7.440u 0.000s 0:07.51 99.0%
boris:falcon:~/SPEC:time david.rb 11111 3 5.450u 0.000s 0:05.42 100.5%
boris:falcon:~/SPEC:time david.rb 11111 3 5.510u 0.000s 0:05.60 98.3%

Seems like searched-string pattern matters. Now look at search-string.
Specific search string you use matters quite a lot.
boris:falcon:~/SPEC:time boris.rb 10101 1 9.520u 0.000s 0:09.52 100.0%
boris:falcon:~/SPEC:time boris.rb 10101 2 9.290u 0.010s 0:09.24 100.6%
boris:falcon:~/SPEC:time boris.rb 10101 3 8.620u 0.030s 0:08.70 99.4%
boris:falcon:~/SPEC:time david.rb 10101 1 9.420u 0.000s 0:09.51 99.0%
boris:falcon:~/SPEC:time david.rb 10101 2 8.710u 0.000s 0:08.78 99.2%
boris:falcon:~/SPEC:time david.rb 10101 3 8.540u 0.010s 0:08.79 97.2%

So not sure if that's enough to explain the difference between 1's and
0's and banana's. Another factor is the size of the search string, but
I grow a bit weary of testing

.

Anyway, thanks for testing. Time to look at that search-string link.

bench-search.rb
----------------------------------------
#! /usr/bin/env ruby
FIVE =
["00000","00001","00010","00011","00100","00101","00110","00111","01000","01001","01010","01011","01100","01101","01110","01111","10000","10001","10010","10011","10100","10101","10110","10111","11000","11001","11010","11011","11100","11101","11110","11111"]

str = ""
begin
str << "#{rand(2)}"
end until str.length == 1000

srand(1)

n = 50000

def boris(str)
pattern = FIVE[rand(32)]
c = 0
i = 1000
begin
i = str[0,i + 4].rindex(pattern)
c += 1 if i != nil
end until i == nil
c
end

def david(str)
pattern = FIVE[rand(32)]
str.scan(/(?=#{pattern})/).size
end

n.times { boris(str) }
#n.times { david(str) }
-----------------------------------------

boris.rb
------------------------
#! /usr/bin/env ruby

# This bit makes a string length 1000 filled with random 0's and 1's
srand(ARGV[1].to_i)
str = ""
begin
str << "#{rand(2)}"
end until str.length == 1000

# This is the method.
def boris(str)
c = 0
i = 1000
begin
i = str[0,i + 4].rindex(ARGV[0])
c += 1 if i != nil
end until i == nil
c
end

# Here we run it.
50000.times { boris(str) }
--------------------------

david.rb
------------
#! /usr/bin/env ruby

# This bit makes a string length 1000 filled with random 0's and 1's
srand(ARGV[1].to_i)
str = ""
begin
str << "#{rand(2)}"
end until str.length == 1000

# This is the method.
def david(str)
str.scan(/(?=#{ARGV[0]})/).size
end
# Here we run it.
50000.times { david(str) }
---------------------

Ardanwen · Dec 16, 2004

Ugh. I need to learn how to preserve indentation in google groups

.
New to newsgroups in general, I'll go look for a faq. For google groups
users, the original (but sadly still unindented code) is under 'show
options' 'show original'. For some reason, a part changes and becomes
blue in the normal google view.

problems with oniguruma lookahead	2	Sep 9, 2008
Help with pointers	1	Mar 13, 2022
Catching wx events	2	Jan 18, 2007
Catching overriding methods with wrong signature	9	Apr 28, 2007
[SUMMARY] Longest Repeated Substring (#153)	0	Jan 24, 2008
Fun with function argument counts	0	Feb 12, 2014
Get functions names with regex	5	Sep 25, 2009
Messing with the GC	7	Jan 19, 2013

String.scan - catching overlapping patterns with lookahead

Ardanwen

Robert Klemme

David A. Black

Robert Klemme

Ardanwen

Ardanwen

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads