RegExp performance?

Discussion in 'Python' started by Christian Sonne, Feb 25, 2007.

  1. Long story short, I'm trying to find all ISBN-10 numbers in a multiline
    string (approximately 10 pages of a normal book), and as far as I can
    tell, the *correct* thing to match would be this:
    ".*\D*(\d{10}|\d{9}X)\D*.*"

    (it should be noted that I've removed all '-'s in the string, because
    they have a tendency to be mixed into ISBN's)

    however, on my 3200+ amd64, running the following:

    reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
    isbn10s = reISBN10.findall(contents)

    (where contents is the string)

    this takes about 14 minutes - and there are only one or two matches...

    if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
    losing results, but it runs in about 0.3 seconds

    So what's the deal? - why would it take so long to run the correct one?
    - especially when a slight modification makes it run as fast as I'd
    expect from the beginning...


    I'm sorry I cannot supply test data, in my case, it comes from
    copyrighted material - however if it proves needed, I can probably
    construct dummy data to illustrate the problem


    Any and all guidance would be greatly appreciated,
    kind regards
    Christian Sonne

    PS: be gentle - it's my first post here :)
     
    Christian Sonne, Feb 25, 2007
    #1

  2. On Sun, 25 Feb 2007 05:21:49 -0300, Christian Sonne <>
    wrote:

    > Long story short, I'm trying to find all ISBN-10 numbers in a multiline
    > string (approximately 10 pages of a normal book), and as far as I can
    > tell, the *correct* thing to match would be this:
    > ".*\D*(\d{10}|\d{9}X)\D*.*"


    Why the .* at the start and end? You don't want to match those, and they
    make your regexp slow.
    You didn't say what exactly an ISBN-10 number looks like, but if you want
    to match 10 digits, or 9 digits followed by an X:
    reISBN10 = re.compile("\d{10}|\d{9}X")
    That is, just the () group in your expression. But perhaps this other one
    is better (I think it should be faster, but you should measure it):
    reISBN10 = re.compile("\d{9}[\dX]")
    ("Nine digits followed by another digit or an X")
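
    A quick sanity check of the two suggested patterns (a sketch with
    made-up sample text; the ISBN-like strings are just illustrative):

```python
import re

# Made-up sample text containing two ISBN-like strings.
text = "ids: 0672328976 and 155860832X here"
pat_alt = re.compile(r"\d{10}|\d{9}X")  # alternation: 10 digits, or 9 digits + X
pat_cls = re.compile(r"\d{9}[\dX]")     # single scan: 9 digits, then digit or X
print(pat_alt.findall(text))  # ['0672328976', '155860832X']
print(pat_cls.findall(text))  # ['0672328976', '155860832X']
```

    Both forms find the same candidates here; the character-class version
    just avoids re-scanning the first 9 digits when the alternation's first
    branch fails.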

    > if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
    > losing results, but it runs in about 0.3 seconds


    Using my suggested expressions you might match some garbage, but not lose
    anything (except two ISBN numbers joined together without any separator in
    between). Assuming you have stripped all the "-", as you said.

    > So what's the deal? - why would it take so long to run the correct one?
    > - especially when a slight modification makes it run as fast as I'd
    > expect from the beginning...


    Those .* make the expression match a LOT of things at first, just to
    discard it in the next step.

    --
    Gabriel Genellina
     
    Gabriel Genellina, Feb 25, 2007
    #2

  3. In <45e145a0$0$90264$>, Christian Sonne wrote:

    > Long story short, I'm trying to find all ISBN-10 numbers in a multiline
    > string (approximately 10 pages of a normal book), and as far as I can
    > tell, the *correct* thing to match would be this:
    > ".*\D*(\d{10}|\d{9}X)\D*.*"
    >
    > (it should be noted that I've removed all '-'s in the string, because
    > they have a tendency to be mixed into ISBN's)
    >
    > however, on my 3200+ amd64, running the following:
    >
    > reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
    > isbn10s = reISBN10.findall(contents)
    >
    > (where contents is the string)
    >
    > this takes about 14 minutes - and there are only one or two matches...


    First of all try to get rid of the '.*' at both ends of the regexp. Don't
    let the re engine search for any characters that you are not interested in
    anyway.

    Then leave off the '*' after '\D'. It doesn't matter if there are
    multiple non-digits before or after the ISBN; there just has to be at
    least one. BTW, with the star it even matches *zero* non-digits!

    So the re looks like this: '\D(\d{10}|\d{9}X)\D'
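
    One edge case worth noting with this pattern (a quick sketch; the sample
    strings are made up): because \D must consume an actual character, an
    ISBN at the very start or end of the string is missed:

```python
import re

pat = re.compile(r"\D(\d{10}|\d{9}X)\D")
print(pat.findall(" 0672328976 "))  # ['0672328976'] -- non-digits on both sides
print(pat.findall("0672328976"))    # [] -- nothing before/after for \D to match
```

    The \b approach in the next reply avoids this, since \b matches at the
    string boundaries without consuming a character.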

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Feb 25, 2007
    #3
  4. Christian Sonne

    John Machin Guest

    On Feb 25, 7:21 pm, Christian Sonne <> wrote:
    > Long story short, I'm trying to find all ISBN-10 numbers in a multiline
    > string (approximately 10 pages of a normal book), and as far as I can
    > tell, the *correct* thing to match would be this:
    > ".*\D*(\d{10}|\d{9}X)\D*.*"


    All of those *s are making it work too hard.

    Starting with your r".*\D*(\d{10}|\d{9}X)\D*.*" [you do have the
    r"..." not just "....", don't you?]

    Step 1: Lose the .* off each end -- this is meaningless in the context
    of a search() or findall() and would slow the re engine down if it
    doesn't optimise it away.

    r"\D*(\d{10}|\d{9}X)\D*"

    Step 2: I presume that the \D* at each (remaining) end is to ensure
    that you don't pick up a number with 11 or more digits. You only need
    to test that your presumed ISBN is not preceded/followed by ONE
    suspect character. Is ABC1234567890DEF OK? I think not; I'd use \b
    instead of \D

    r"\b(\d{10}|\d{9}X)\b"

    Step 3: Now that we have only \b (which matches 0 characters) at each
    end of the ISBN, we can lose the capturing ()

    r"\b\d{10}|\d{9}X\b"

    Step 4: In the case of 123456789X, it fails on the X and then scans
    the 123456789 again -- a tiny waste compared to all the * stuff, but
    still worth fixing.

    r"\b\d{9}[0-9X]\b"

    Give that a whirl and let us know how correct and how fast it is.
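
    A minimal check of the Step 4 pattern (a sketch; the 11-digit run is
    made up) showing that the \b anchors keep it from matching inside longer
    digit runs:

```python
import re

pat = re.compile(r"\b\d{9}[0-9X]\b")
sample = "good 0672328976 and 155860832X, but not 06723289761"
print(pat.findall(sample))  # ['0672328976', '155860832X'] -- 11-digit run skipped
```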

    >
    > (it should be noted that I've removed all '-'s in the string, because
    > they have a tendency to be mixed into ISBN's)
    >
    > however, on my 3200+ amd64, running the following:
    >
    > reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")


    You should really get into the habit of using raw strings with re.

    > isbn10s = reISBN10.findall(contents)
    >
    > (where contents is the string)
    >
    > this takes about 14 minutes - and there are only one or two matches...


    How many actual matches and how many expected matches?

    Note on "and there are only one or two matches": if your data
    consisted only of valid ISBNs separated by a single space or comma, it
    would run much faster. It is all the quadratic/exponential mucking
    about with the in-between bits that slows it down. To demonstrate
    this, try timing dummy data like "1234567890 " * 1000000 and
    "123456789X " * 1000000 with your various regexes, and with the step1,
    step2 etc regexes above.
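
    That timing experiment can be sketched like this (scaled down to 10,000
    repeats so it finishes quickly; exact times will vary by machine, and
    note that on this well-behaved dummy data even the original pattern is
    quick -- interleave long runs of non-digit junk to see the slowdown):

```python
import re
import timeit

data = "1234567890 " * 10000  # dummy data: ISBNs separated by single spaces
patterns = {
    "original": r".*\D*(\d{10}|\d{9}X)\D*.*",
    "step 2":   r"\b(\d{10}|\d{9}X)\b",
    "step 4":   r"\b\d{9}[0-9X]\b",
}
for name, pat in patterns.items():
    rx = re.compile(pat)
    secs = timeit.timeit(lambda: rx.findall(data), number=1)
    print(name, round(secs, 4))
```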

    >
    > if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
    > losing results, but it runs in about 0.3 seconds
    >
    > So what's the deal? - why would it take so long to run the correct one?


    Because you have .*\D*, in other words 0 or more occurrences of almost
    anything followed by 0 or more non-digits. Even
    assuming it ignores the .*, it will find the longest possible sequence
    of non-digits, then try to match the ISBN stuff. If it fails, it will
    shorten that sequence of non-digits, try the ISBN stuff again, etc.,
    until it matches the ISBN stuff or that sequence of non-digits is down
    to zero length. It will do that for each character position in the
    file contents. Why is it faster when you change \D* to [ ]*? Presumably
    because in your data, sequences of non-digits are longer than
    sequences of spaces, IOW there is less stuff to backtrack over.


    > - especially when a slight modification makes it run as fast as I'd
    > expect from the beginning...
    >
    > I'm sorry I cannot supply test data, in my case, it comes from
    > copyrighted material - however if it proves needed, I can probably
    > construct dummy data to illustrate the problem


    What you call "dummy data", I'd call "test data". You should make a
    sample of "dummy data" and test that your regex is (a) finding all
    ISBNs that it should (b) not reporting incorrect matches, *BEFORE*
    being concerned with speed.

    >
    > Any and all guidance would be greatly appreciated,


    For some light reading :) borrow a copy of Friedl's book (mentioned
    in the Python re docs) and read the parts on how backtracking regex
    engines work.

    HTH,
    John
     
    John Machin, Feb 25, 2007
    #4
  5. John Machin wrote:
    > On Feb 25, 7:21 pm, Christian Sonne <> wrote:
    >> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
    >> string (approximately 10 pages of a normal book) [...]
    >
    > [...]
    >
    > r"\b\d{9}[0-9X]\b"
    >
    > Give that a whirl and let us know how correct and how fast it is.
    >
    > [...]


    Thanks to all of you for your replies - they have been most helpful, and
    my program is now running at a reasonable pace...


    I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
    turns out to misbehave in further testing, I'll know where to turn :p
     
    Christian Sonne, Feb 25, 2007
    #5
  6. Christian Sonne

    Kirk Sluder Guest

    In article <45e1d367$0$90273$>,
    Christian Sonne <> wrote:

    > Thanks to all of you for your replies - they have been most helpful, and
    > my program is now running at a reasonable pace...
    >
    >
    > I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
    > turns out to misbehave in further testing, I'll know where to turn :p


    Anything with variable-length wildcard matching (*+?) is going to
    drag your performance down. There was an earlier thread on this very
    topic. Another stupid question is how are you planning on handling
    ISBNs formatted with hyphens for readability?

    In general I've found the following factors to be critical in
    getting good performance from re:

    1: Reducing the number of times you call re.match or re.search.
    2: Reducing the number of bytes that re has to search through.
    3: Minimizing the use of wildcards in the expression.

    If you can pre-filter your input with string.find before running
    re.match you will improve performance quite a bit compared to
    running re expressions over all 10 pages. I played around a bit and
    attached some example code below searching over 21K of text for the
    ISBN number. testPrefilter() runs about 1/5th the execution time of
    line-by-line re calls or a single re call over a 21K string.
    Interestingly, this ratio scales up to something as big as Moby
    Dick.

    The searchLabels() function below beats all the other functions by
    searching for "ISBN" or "International Book" and then running RE on
    the surrounding 500 bytes. You might also try searching for
    "Copyright" or "Library of Congress" since most modern books will
    have it all on the same page.

    A caveat here is that this only works if you can find a reasonably
    unique string at or near what you want to find with re. If you need
    to run re.search on every byte of the file anyway, this isn't going
    to help.

    ---------------
    timing test code
    ---------------

    #!/usr/bin/env python

    from timeit import Timer
    import re

    textString = """The text of a sample page using an ISBN 10
    number ISBN 0672328976 and some more text to compare."""

    # add the full text of Moby Dick to make the search functions
    # work for their bread.
    fileHandle = open("./mobey.txt")
    junkText = fileHandle.readlines()
    junkText.append(textString)
    textString = ''.join(junkText)
    #print textString

    # compile the regex
    isbn10Re = re.compile(r"\b\d{9}[0-9X]\b")

    def testPrefilter():
        """Work through a pre-loaded array running re only on lines
        containing ISBN"""
        for line in junkText:
            # search for 'ISBN'
            if line.find('ISBN') > -1:
                thisMatch = isbn10Re.search(line)
                if thisMatch:
                    return thisMatch.group(0)

    def testNofilter():
        """Run re.search on every line."""
        for line in junkText:
            # searching using RE
            thisMatch = isbn10Re.search(line)
            if thisMatch:
                return thisMatch.group(0)

    def testFullre():
        """Run re.search on a long text string."""
        thisMatch = isbn10Re.search(textString)
        if thisMatch:
            return thisMatch.group(0)

    def searchLabels():
        # identify some text that might be near an ISBN number.
        isbnLabels = ["ISBN",
                      "International Book"]

        # use the fast string.find method to locate those
        # labels in the text
        isbnIndexes = [textString.find(x) for x in isbnLabels]

        # exclude labels not found in the text.
        isbnIndexes = [y for y in isbnIndexes if y > -1]

        # run re.search on a ~500 character window around each label
        # (clamping at 0 so a label near the start of the string
        # doesn't produce a negative slice index).
        for x in isbnIndexes:
            thisMatch = isbn10Re.search(textString[max(x - 250, 0):x + 250])
            if thisMatch:
                return thisMatch.group(0)


    #print searchLabels()
    #print testPrefilter()
    #print testNofilter()
    t = Timer("testNofilter()", "from __main__ import testNofilter")
    print t.timeit(100)
    u = Timer("testPrefilter()", "from __main__ import testPrefilter")
    print u.timeit(100)
    v = Timer("testFullre()", "from __main__ import testFullre")
    print v.timeit(100)
    w = Timer("searchLabels()", "from __main__ import searchLabels")
    print w.timeit(100)
     
    Kirk Sluder, Feb 26, 2007
    #6
  7. Christian Sonne

    John Machin Guest

    On Feb 26, 2:01 pm, Kirk Sluder <> wrote:
    > In article <45e1d367$0$90273$>,
    > Christian Sonne <> wrote:
    >
    > > Thanks to all of you for your replies - they have been most helpful, and
    > > my program is now running at a reasonable pace...

    >
    > > I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
    > > turns out to misbehave in further testing, I'll know where to turn :p

    >
    > Anything with variable-length wildcard matching (*+?) is going to
    > drag your performance down. There was an earlier thread on this very
    > topic. Another stupid question is how are you planning on handling
    > ISBNs formatted with hyphens for readability?


    According to the OP's first message, 2nd paragraph:
    """
    (it should be noted that I've removed all '-'s in the string, because
    they have a tendency to be mixed into ISBN's)
    """

    Given a low density of ISBNs in the text, it may well be better to
    avoid the preliminary pass to rip out the '-'s, and instead:

    1. use an RE like r"\b\d[-\d]{8,11}[\dX]\b" (allows up to 3 '-'s
    inside the number)

    2. post-process the matches: strip out any '-'s, check for remaining
    length == 10.
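
    A sketch of that two-step approach (the sample strings are made up, and
    the helper name find_isbn10s is mine, not from the thread):

```python
import re

pat = re.compile(r"\b\d[-\d]{8,11}[\dX]\b")  # 10 chars plus up to 3 '-'s inside

def find_isbn10s(text):
    """Match hyphenated candidates, then strip '-'s and keep 10-char results."""
    results = []
    for cand in pat.findall(text):
        core = cand.replace("-", "")
        if len(core) == 10:   # post-process: must be 10 chars after stripping
            results.append(core)
    return results

print(find_isbn10s("see 0-672-32897-6 and also 155860832X"))
# ['0672328976', '155860832X']
```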

    Another thought for the OP: Consider (irrespective of how you arrive
    at a candidate ISBN) validating the ISBN check-digit.
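
    For the check-digit idea: an ISBN-10 is valid when the weighted sum
    (first character times 10, down to the last times 1, with 'X' worth 10
    in the final position) is divisible by 11. A small sketch (isbn10_ok is
    my name for it):

```python
def isbn10_ok(isbn):
    """Validate an ISBN-10 check digit ('-'s already stripped)."""
    if len(isbn) != 10:
        return False
    total = 0
    for weight, ch in zip(range(10, 0, -1), isbn):
        if ch == "X" and weight == 1:   # 'X' means 10, last position only
            value = 10
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += weight * value
    return total % 11 == 0

print(isbn10_ok("0672328976"))  # True
print(isbn10_ok("0672328977"))  # False -- last digit off by one
```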

    Cheers,
    John
     
    John Machin, Feb 26, 2007
    #7
