What algorithm does Python use to evaluate: if substring in string

T

Tor Erik

I would be surprised if it is the naive:

m = 0
s1 = "me"
s2 = "locate me"
s1len = len(s1)
s2len = len(s2)
found = False

while m + s1len <= s2len:
if s1 == s2len[m:m+s1len]:
found = True
break
m += 1
 
M

Marc 'BlackJack' Rintsch

I would be surprised if it is the naive:

Why?

I guess it simply calls an appropriate C library function.

Ciao,
Marc 'BlackJack' Rintsch
 
T

Tor Erik

Alex said:
Yep -- it's "a mix between Boyer-Moore and Horspool with a few more
bells and whistles on the top", as documented and implemented in
Objects/stringlib/fastsearch.h in the Python sources and well discussed
and explained at http://effbot.org/zone/stringlib.htm .


Alex

Ok. Two questions:

1. Is "a in b" simply an alias for "b.find(a)"?

2. Is this algorithm exclusive to Python 2.5, or is it contained in 2.4
aswell?
 
A

Alex Martelli

Tor Erik said:
Ok. Two questions:

1. Is "a in b" simply an alias for "b.find(a)"?

The 'in' operator can be minutely better optimized, but they share the
underlying algorithm (in 2.5).
2. Is this algorithm exclusive to Python 2.5, or is it contained in 2.4
aswell?

It's 2.5 novelty. Look at the performance on the same machine (my 2.0
GHz MBP, MacOSX 10.4.7):

brain:~ alex$ python2.4 -mtimeit -s'x="foo";y="bar"*99+x+"baz"*77' 'x in
y'
100000 loops, best of 3: 9.04 usec per loop
brain:~ alex$ python2.4 -mtimeit -s'x="foo";y="bar"*99+x+"baz"*77'
'y.find(x)!=-1'
100000 loops, best of 3: 2.01 usec per loop
brain:~ alex$ python2.5 -mtimeit -s'x="foo";y="bar"*99+x+"baz"*77' 'x in
y'1000000 loops, best of 3: 0.452 usec per loop
brain:~ alex$ python2.5 -mtimeit -s'x="foo";y="bar"*99+x+"baz"*77'
'y.find(x)!=-1'
1000000 loops, best of 3: 0.842 usec per loop

find used to be way faster than 'in' -- now they share algorithms and
'in' can be more optimized (no need to track ``where'' it finds a match,
so to speak;-), so find is over twice as fast as it used to be, 'in' is
about 20 times as fast as it used to be, in this example -- it gets even
better if you look at larger and larger strings, e.g...:

brain:~ alex$ python2.4 -mtimeit -s'x="foo"*123;y="bar"*999+x+"baz"*777'
'x in y'
10000 loops, best of 3: 91.9 usec per loop
brain:~ alex$ python2.5 -mtimeit -s'x="foo"*123;y="bar"*999+x+"baz"*777'
'x in y'
100000 loops, best of 3: 3.01 usec per loop

here, we're going _30_ times as fast, not "just" 20;-).


Alex
 
F

Fredrik Lundh

Tor Erik said:
I would be surprised if it is the naive:

2.4 and earlier uses that algorithm (but with a better implementation).

And "naive" isn't really the right word here; on average, a brute force search is pretty
good for the find/index/in use case. Most fancy algorithms ignore the setup costs and
other overhead, which works well in theory, but not that well in practice.

</F>
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,763
Messages
2,569,563
Members
45,039
Latest member
CasimiraVa

Latest Threads

Top