String search vs regexp search

Anand Pillai · Oct 12, 2003

To search a word in a group of words, say a paragraph or a web page,
would a string search or a regexp search be faster?

The string search would of course be,

if str.find(substr) != -1:
domything()

And the regexp search assuming no case restriction would be,

strre=re.compile(substr, re.IGNORECASE)

m=strre.search(str)
if m:
domything()

I was about to do a test, then I thought someone here might have
some data on this already.

Thanks folks!

-Anan

Duncan Booth · Oct 13, 2003

(e-mail address removed) (Anand Pillai) wrote in

To search a word in a group of words, say a paragraph or a web page,
would a string search or a regexp search be faster?

The string search would of course be,

if str.find(substr) != -1:
domything()

And the regexp search assuming no case restriction would be,

strre=re.compile(substr, re.IGNORECASE)

m=strre.search(str)
if m:
domything()

I was about to do a test, then I thought someone here might have
some data on this already.

Yes. The answer is 'it all depends'.

Things it depends on include:

Your two bits of code do different things, one is case sensitive, one
ignores case. Which did you need?

How long is the string you are searching? How long is the substring?

Is the substring the same every time, or are you always searching for
different strings. Can the substring contain characters with special
meanings for regular expressions?

The regular expression code has a startup penalty since it has to compile
the regular expression at least once, however the actual searching may be
faster than the naive str.find. If the time spent doing the search is
sufficiently long compared with the time doing the compile, the regular
expression may win out.

Bottom line: write the code so it is as clean and maintainable as possible.
Only worry about optimising this if you have timed it and know that your
searches are a bottleneck.

Anand Pillai · Oct 13, 2003

Sorry for being too brief!

I was talking about a function which 'counts' the number
of occurences using string & regexp.

I wrote the code for the regexp search as well as the function
search and tested it on a rather large file (800 KB) for
occurences of a certain word. I find that the string search
is at least 2 times faster than the one with regexp, excluding
the time for the regexp.compile() method. This is particularly
noticeable when the file becomes quite large and the word is
spread out.

I also thought the regexp would beat string thumbs down and I
am suprised at the result that it is the other way around.

Here is the code. Note that I am using the 'count' methods that
count the number of occurences rather than the 'find' methods.

# Test to find out whether string search in a data
# is faster than regexp search.

# Results: String search is much faster when it comes
# to many occurences of the sub string.

import time

def strsearch1(s, substr):

t1 = time.time()
print 'Count 1 =>', s.count(substr)
t2 = time.time()
print 'Searching using string, Time taken => ', t2 - t1

def strsearch2(s, substr):

import re

r=re.compile(substr, re.IGNORECASE)
t1 = time.time()
print 'Count 2 =>', len(r.findall(s))
t2 = time.time()
print 'Searching using regexp, Time taken => ', t2 - t1

data=open("test.html", "r").read()
strsearch1(data, "Miriam")
strsearch2(data, "Miriam")

# Output here...

D:\Programming\python>python strsearch.py
Count 1 => 45
Searching using string, Time taken => 0.0599999427795
Count 2 => 45
Searching using regexp, Time taken => 0.110000014305

Test was done on a windows 98 machine using Python 2.3, running
on 248 MB RAM, Intel 1.7 GHz chipset.

I was thinking of using regexp searches in my code, but this convinces
me to stick on to the good old string search.

Thanks for the replies.

-Anand

Jeremy Fincher · Oct 14, 2003

Duncan Booth said:
The regular expression code has a startup penalty since it has to compile
the regular expression at least once, however the actual searching may be
faster than the naive str.find. If the time spent doing the search is
sufficiently long compared with the time doing the compile, the regular
expression may win out.

Both regular expression searching and string.find will do searching
one character at a time; given that, it seems impossible to me that
the hand-coded-in-C "naive" string.find could be slower than the
machine-translated-coded-in-Python regular expression search.
Compilation time only serves to further increase string.find's
advantage.

Jeremy

Dennis Reinhardt · Oct 14, 2003

I find that the string search

is at least 2 times faster than the one with regexp
r=re.compile(substr, re.IGNORECASE)

Off the top, maybe the regex is taking twice the time because it is doing
twice the work looking for both the lower case character and the upper case
character. It does not seem a fair test because the string find is case
sensitive.

Duncan Booth · Oct 14, 2003

(e-mail address removed) (Jeremy Fincher) wrote in

Both regular expression searching and string.find will do searching
one character at a time; given that, it seems impossible to me that
the hand-coded-in-C "naive" string.find could be slower than the
machine-translated-coded-in-Python regular expression search.
Compilation time only serves to further increase string.find's
advantage.

I may have misremembered, but I thought there was a thread discussing this
a little while back which claimed that the regular expression library
looked for constant strings at the start of the regex, and if it found one
used Boyer-Moore to do the search. If it does, then regular expressions
searching for a constant string certainly ought to be much faster than a
plain string.find (as the length of the searched string tends towards
infinity).

If it doesn't, then it should.

Duncan Booth · Oct 14, 2003

I may have misremembered, but I thought there was a thread discussing
this a little while back which claimed that the regular expression
library looked for constant strings at the start of the regex, and if
it found one used Boyer-Moore to do the search. If it does, then
regular expressions searching for a constant string certainly ought to
be much faster than a plain string.find (as the length of the searched
string tends towards infinity).

If it doesn't, then it should.

Ok, found the code. Regular expression searches do indeed use a form of
Boyer-Moore, but not if you are ignoring case. So by specifying
re.IGNORECASE the OP got a double hit, not only does the code have to do
case insensitive comparisons, but it also has to crawl along looking at
every character in the search string instead of skipping most of them.

Jeremy Fincher · Oct 14, 2003

Duncan Booth said:
Ok, found the code. Regular expression searches do indeed use a form of
Boyer-Moore, but not if you are ignoring case. So by specifying
re.IGNORECASE the OP got a double hit, not only does the code have to do
case insensitive comparisons, but it also has to crawl along looking at
every character in the search string instead of skipping most of them.

That's cool! Where'd you find the code?

Jeremy

Alex Martelli · Oct 14, 2003

Jeremy said:
That's cool! Where'd you find the code?

Hmmm, dist/src/Modules/_sre.c in the Python CVS tree? Or Modules/_sre.c in
a standard source distribution?

Alex

Duncan Booth · Oct 15, 2003

Hmmm, dist/src/Modules/_sre.c in the Python CVS tree? Or
Modules/_sre.c in a standard source distribution?

Close, but it is actually a little bit complicated. _sre.c has the code
that does the search, including the skipping forward using an overlap
table, but the bit which checks for a literal prefix and builds the overlap
table is in lib/src_compile.py. So its in the Python code you have to look
to find that it ignores literal prefixes when ignoring case.

Anand Pillai · Oct 15, 2003

Small point. If you notice the code, compilation time is
not included in the actual time calculated for regexp.

-Anand

regexp search on infinite string?	6	Sep 14, 2007
Patricia trie vs binary search.	32	May 24, 2012
a interesting Parallel Programing Problem: asciify-string	0	Mar 6, 2012
std::string::find vs std::find	1	Jul 5, 2009
best search method	5	Mar 14, 2011
get back my simple little string after re search and replace	0	Jan 13, 2010
Multi-line replace with string, not regexp	11	Aug 8, 2010
GoogleHack Search	0	Jun 6, 2009

String search vs regexp search

Anand Pillai

Duncan Booth

Anand Pillai

Jeremy Fincher

Dennis Reinhardt

Duncan Booth

Duncan Booth

Jeremy Fincher

Alex Martelli

Duncan Booth

Anand Pillai

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads