RegExp performance?

Discussion in 'Python' started by Christian Sonne, Feb 25, 2007.

  1. Long story short, I'm trying to find all ISBN-10 numbers in a multiline
    string (approximately 10 pages of a normal book), and as far as I can
    tell, the *correct* thing to match would be this:
    ".*\D*(\d{10}|\d{9}X)\D*.*"

    (it should be noted that I've removed all '-'s in the string, because
    they have a tendency to be mixed into ISBN's)

    however, on my 3200+ amd64, running the following:

    reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
    isbn10s = reISBN10.findall(contents)

    (where contents is the string)

    this takes about 14 minutes - and there are only one or two matches...

    if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
    losing results, but it runs in about 0.3 seconds

    So what's the deal? - why would it take so long to run the correct one?
    - especially when a slight modification makes it run as fast as I'd
    expect from the beginning...


    I'm sorry I cannot supply test data, in my case, it comes from
    copyrighted material - however if it proves needed, I can probably
    construct dummy data to illustrate the problem


    Any and all guidance would be greatly appreciated,
    kind regards
    Christian Sonne

    PS: be gentle - it's my first post here :)
     
    Christian Sonne, Feb 25, 2007
    #1

  2. On Sun, 25 Feb 2007 05:21:49 -0300, Christian Sonne <>
    wrote:

    > Long story short, I'm trying to find all ISBN-10 numbers in a multiline
    > string (approximately 10 pages of a normal book), and as far as I can
    > tell, the *correct* thing to match would be this:
    > ".*\D*(\d{10}|\d{9}X)\D*.*"


    Why the .* at the start and end? You don't want to match those, and they
    make your regexp slow.
    You didn't say what exactly an ISBN-10 number looks like, but if you want
    to match 10 digits, or 9 digits followed by an X:
    reISBN10 = re.compile("\d{10}|\d{9}X")
    That is, just the () group in your expression. But perhaps this other one
    is better (I think it should be faster, but you should measure it):
    reISBN10 = re.compile("\d{9}[\dX]")
    ("Nine digits followed by another digit or an X")
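
    A quick sanity check of the two suggested patterns (a sketch with
    made-up sample text; the ISBN-like strings are just illustrative):

```python
import re

# Made-up sample text containing two ISBN-like strings.
text = "ids: 0672328976 and 155860832X here"
pat_alt = re.compile(r"\d{10}|\d{9}X")  # alternation: 10 digits, or 9 digits + X
pat_cls = re.compile(r"\d{9}[\dX]")     # single scan: 9 digits, then digit or X
print(pat_alt.findall(text))  # ['0672328976', '155860832X']
print(pat_cls.findall(text))  # ['0672328976', '155860832X']
```

    Both forms find the same candidates here; the character-class version
    just avoids re-scanning the first 9 digits when the alternation's first
    branch fails.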

    > if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
    > losing results, but it runs in about 0.3 seconds


    Using my suggested expressions you might match some garbage, but not lose
    anything (except two ISBN numbers joined together without any separator in
    between). Assuming you have stripped all the "-", as you said.

    > So what's the deal? - why would it take so long to run the correct one?
    > - especially when a slight modification makes it run as fast as I'd
    > expect from the beginning...


    Those .* make the expression match a LOT of things at first, just to
    discard it in the next step.

    --
    Gabriel Genellina
     
    Gabriel Genellina, Feb 25, 2007
    #2

  3. In <45e145a0$0$90264$>, Christian Sonne wrote:

    > Long story short, I'm trying to find all ISBN-10 numbers in a multiline
    > string (approximately 10 pages of a normal book), and as far as I can
    > tell, the *correct* thing to match would be this:
    > ".*\D*(\d{10}|\d{9}X)\D*.*"
    >
    > (it should be noted that I've removed all '-'s in the string, because
    > they have a tendency to be mixed into ISBN's)
    >
    > however, on my 3200+ amd64, running the following:
    >
    > reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
    > isbn10s = reISBN10.findall(contents)
    >
    > (where contents is the string)
    >
    > this takes about 14 minutes - and there are only one or two matches...


    First of all try to get rid of the '.*' at both ends of the regexp. Don't
    let the re engine search for any characters that you are not interested in
    anyway.

    Then leave off the '*' after '\D'. It doesn't matter if there are
    multiple non-digits before or after the ISBN; there just has to be at
    least one. BTW, with the star it even matches *zero* non-digits!

    So the re looks like this: '\D(\d{10}|\d{9}X)\D'
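
    One edge case worth noting with this pattern (a quick sketch; the sample
    strings are made up): because \D must consume an actual character, an
    ISBN at the very start or end of the string is missed:

```python
import re

pat = re.compile(r"\D(\d{10}|\d{9}X)\D")
print(pat.findall(" 0672328976 "))  # ['0672328976'] -- non-digits on both sides
print(pat.findall("0672328976"))    # [] -- nothing before/after for \D to match
```

    The \b approach in the next reply avoids this, since \b matches at the
    string boundaries without consuming a character.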

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Feb 25, 2007
    #3
  4. Christian Sonne

    John Machin Guest

    On Feb 25, 7:21 pm, Christian Sonne <> wrote:
    > Long story short, I'm trying to find all ISBN-10 numbers in a multiline
    > string (approximately 10 pages of a normal book), and as far as I can
    > tell, the *correct* thing to match would be this:
    > ".*\D*(\d{10}|\d{9}X)\D*.*"


    All of those *s are making it work too hard.

    Starting with your r".*\D*(\d{10}|\d{9}X)\D*.*" [you do have the
    r"..." not just "....", don't you?]

    Step 1: Lose the .* off each end -- this is meaningless in the context
    of a search() or findall() and would slow the re engine down if it
    doesn't optimise it away.

    r"\D*(\d{10}|\d{9}X)\D*"

    Step 2: I presume that the \D* at each (remaining) end is to ensure
    that you don't pick up a number with 11 or more digits. You only need
    to test that your presumed ISBN is not preceded/followed by ONE
    suspect character. Is ABC1234567890DEF OK? I think not; I'd use \b
    instead of \D

    r"\b(\d{10}|\d{9}X)\b"

    Step 3: Now that we have only \b (which matches 0 characters) at each
    end of the ISBN, we can lose the capturing ()

    r"\b\d{10}|\d{9}X\b"

    Step 4: In the case of 123456789X, it fails on the X and then scans
    the 123456789 again -- a tiny waste compared to all the * stuff, but
    still worth fixing.

    r"\b\d{9}[0-9X]\b"

    Give that a whirl and let us know how correct and how fast it is.
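
    A minimal check of the Step 4 pattern (a sketch; the 11-digit run is
    made up) showing that the \b anchors keep it from matching inside longer
    digit runs:

```python
import re

pat = re.compile(r"\b\d{9}[0-9X]\b")
sample = "good 0672328976 and 155860832X, but not 06723289761"
print(pat.findall(sample))  # ['0672328976', '155860832X'] -- 11-digit run skipped
```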

    >
    > (it should be noted that I've removed all '-'s in the string, because
    > they have a tendency to be mixed into ISBN's)
    >
    > however, on my 3200+ amd64, running the following:
    >
    > reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")


    You should really get into the habit of using raw strings with re.

    > isbn10s = reISBN10.findall(contents)
    >
    > (where contents is the string)
    >
    > this takes about 14 minutes - and there are only one or two matches...


    How many actual matches and how many expected matches?

    Note on "and there are only one or two matches": if your data
    consisted only of valid ISBNs separated by a single space or comma, it
    would run much faster. It is all the quadratic/exponential mucking
    about with the in-between bits that slows it down. To demonstrate
    this, try timing dummy data like "1234567890 " * 1000000 and
    "123456789X " * 1000000 with your various regexes, and with the step1,
    step2 etc regexes above.
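
    That timing experiment can be sketched like this (scaled down to 10,000
    repeats so it finishes quickly; exact times will vary by machine, and
    note that on this well-behaved dummy data even the original pattern is
    quick -- interleave long runs of non-digit junk to see the slowdown):

```python
import re
import timeit

data = "1234567890 " * 10000  # dummy data: ISBNs separated by single spaces
patterns = {
    "original": r".*\D*(\d{10}|\d{9}X)\D*.*",
    "step 2":   r"\b(\d{10}|\d{9}X)\b",
    "step 4":   r"\b\d{9}[0-9X]\b",
}
for name, pat in patterns.items():
    rx = re.compile(pat)
    secs = timeit.timeit(lambda: rx.findall(data), number=1)
    print(name, round(secs, 4))
```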

    >
    > if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
    > losing results, but it runs in about 0.3 seconds
    >
    > So what's the deal? - why would it take so long to run the correct one?


    Because you have .*\D*, in other words 0 or more occurrences of almost
    anything followed by 0 or more non-digits. Even
    assuming it ignores the .*, it will find the longest possible sequence
    of non-digits, then try to match the ISBN stuff. If it fails, it will
    shorten that sequence of non-digits, try the ISBN stuff again, etc.,
    until it matches the ISBN stuff or that sequence of non-digits is down
    to zero length. It will do that for each character position in the
    file contents. Why is it faster when you change \D* to [ ]*? Presumably
    because in your data, sequences of non-digits are longer than
    sequences of spaces, IOW there is less stuff to backtrack over.


    > - especially when a slight modification makes it run as fast as I'd
    > expect from the beginning...
    >
    > I'm sorry I cannot supply test data, in my case, it comes from
    > copyrighted material - however if it proves needed, I can probably
    > construct dummy data to illustrate the problem


    What you call "dummy data", I'd call "test data". You should make a
    sample of "dummy data" and test that your regex is (a) finding all
    ISBNs that it should (b) not reporting incorrect matches, *BEFORE*
    being concerned with speed.

    >
    > Any and all guidance would be greatly appreciated,


    For some light reading :) borrow a copy of Friedl's book (mentioned
    in the Python re docs) and read the parts on how backtracking regex
    engines work.

    HTH,
    John
     
    John Machin, Feb 25, 2007
    #4
  5. John Machin wrote:
    > On Feb 25, 7:21 pm, Christian Sonne <> wrote:
    >> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
    >> string (approximately 10 pages of a normal book) [...]
    >
    > [...]
    >
    > r"\b\d{9}[0-9X]\b"
    >
    > Give that a whirl and let us know how correct and how fast it is.
    >
    > [...]


    Thanks to all of you for your replies - they have been most helpful, and
    my program is now running at a reasonable pace...


    I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
    turns out to misbehave in further testing, I'll know where to turn :p
     
    Christian Sonne, Feb 25, 2007
    #5
  6. Christian Sonne

    Kirk Sluder Guest

    In article <45e1d367$0$90273$>,
    Christian Sonne <> wrote:

    > Thanks to all of you for your replies - they have been most helpful, and
    > my program is now running at a reasonable pace...
    >
    >
    > I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
    > turns out to misbehave in further testing, I'll know where to turn :p


    Anything with variable-length wildcard matching (*+?) is going to
    drag your performance down. There was an earlier thread on this very
    topic. Another stupid question is how are you planning on handling
    ISBNs formatted with hyphens for readability?

    In general I've found the following factors to be critical in
    getting good performance from re:

    1: Reducing the number of times you call re.match or re.search.
    2: Reducing the number of bytes that re has to search through.
    3: Minimizing the use of wildcards in the expression.

    If you can pre-filter your input with string.find before running
    re.match you will improve performance quite a bit compared to
    running re expressions over all 10 pages. I played around a bit and
    attached some example code below searching over 21K of text for the
    ISBN number. testPrefilter() runs about 1/5th the execution time of
    line-by-line re calls or a single re call over a 21K string.
    Interestingly, this ratio scales up to something as big as Moby
    Dick.

    The searchLabels() function below beats all the other functions by
    searching for "ISBN" or "International Book" and then running RE on
    the surrounding 500 bytes. You might also try searching for
    "Copyright" or "Library of Congress" since most modern books will
    have it all on the same page.

    A caveat here is that this only works if you can find a reasonably
    unique string at or near what you want to find with re. If you need
    to run re.search on every byte of the file anyway, this isn't going
    to help.

    ---------------
    timing test code
    ---------------

    #!/usr/bin/env python

    from timeit import Timer
    import re

    textString = """The text of a sample page using an ISBN 10
    number ISBN 0672328976 and some more text to compare."""

    # add the full text of Moby Dick to make the search functions
    # work for their bread.
    fileHandle = open("./mobey.txt")
    junkText = fileHandle.readlines()
    junkText.append(textString)
    textString = ''.join(junkText)
    #print textString

    # compile the regex
    isbn10Re = re.compile(r"\b\d{9}[0-9X]\b")

    def testPrefilter():
        """Work through a pre-loaded array running re only on lines
        containing ISBN"""
        for line in junkText:
            # search for 'ISBN'
            if line.find('ISBN') > -1:
                thisMatch = isbn10Re.search(line)
                if thisMatch:
                    return thisMatch.group(0)

    def testNofilter():
        """Run re.search on every line."""
        for line in junkText:
            # searching using RE
            thisMatch = isbn10Re.search(line)
            if thisMatch:
                return thisMatch.group(0)

    def testFullre():
        """Run re.search on a long text string."""
        thisMatch = isbn10Re.search(textString)
        if thisMatch:
            return thisMatch.group(0)

    def searchLabels():
        # identify some text that might be near an ISBN number.
        isbnLabels = ["ISBN",
                      "International Book"]

        # use the fast string.find method to locate those
        # labels in the text
        isbnIndexes = [textString.find(x) for x in isbnLabels]

        # exclude labels not found in the text.
        isbnIndexes = [y for y in isbnIndexes if y > -1]

        # run re.search on a ~500 character window around each label
        # (clamping at 0 so a label near the start of the string
        # doesn't produce a negative slice index).
        for x in isbnIndexes:
            thisMatch = isbn10Re.search(textString[max(x - 250, 0):x + 250])
            if thisMatch:
                return thisMatch.group(0)


    #print searchLabels()
    #print testPrefilter()
    #print testNofilter()
    t = Timer("testNofilter()", "from __main__ import testNofilter")
    print t.timeit(100)
    u = Timer("testPrefilter()", "from __main__ import testPrefilter")
    print u.timeit(100)
    v = Timer("testFullre()", "from __main__ import testFullre")
    print v.timeit(100)
    w = Timer("searchLabels()", "from __main__ import searchLabels")
    print w.timeit(100)
     
    Kirk Sluder, Feb 26, 2007
    #6
  7. Christian Sonne

    John Machin Guest

    On Feb 26, 2:01 pm, Kirk Sluder <> wrote:
    > In article <45e1d367$0$90273$>,
    > Christian Sonne <> wrote:
    >
    > > Thanks to all of you for your replies - they have been most helpful, and
    > > my program is now running at a reasonable pace...

    >
    > > I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
    > > turns out to misbehave in further testing, I'll know where to turn :p

    >
    > Anything with variable-length wildcard matching (*+?) is going to
    > drag your performance down. There was an earlier thread on this very
    > topic. Another stupid question is how are you planning on handling
    > ISBNs formatted with hyphens for readability?


    According to the OP's first message, 2nd paragraph:
    """
    (it should be noted that I've removed all '-'s in the string, because
    they have a tendency to be mixed into ISBN's)
    """

    Given a low density of ISBNs in the text, it may well be better to
    avoid the preliminary pass to rip out the '-'s, and instead:

    1. use an RE like r"\b\d[-\d]{8,11}[\dX]\b" (allows up to 3 '-'s
    inside the number)

    2. post-process the matches: strip out any '-'s, check for remaining
    length == 10.
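
    A sketch of that two-step approach (the sample strings are made up, and
    the helper name find_isbn10s is mine, not from the thread):

```python
import re

pat = re.compile(r"\b\d[-\d]{8,11}[\dX]\b")  # 10 chars plus up to 3 '-'s inside

def find_isbn10s(text):
    """Match hyphenated candidates, then strip '-'s and keep 10-char results."""
    results = []
    for cand in pat.findall(text):
        core = cand.replace("-", "")
        if len(core) == 10:   # post-process: must be 10 chars after stripping
            results.append(core)
    return results

print(find_isbn10s("see 0-672-32897-6 and also 155860832X"))
# ['0672328976', '155860832X']
```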

    Another thought for the OP: Consider (irrespective of how you arrive
    at a candidate ISBN) validating the ISBN check-digit.
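
    For the check-digit idea: an ISBN-10 is valid when the weighted sum
    (first character times 10, down to the last times 1, with 'X' worth 10
    in the final position) is divisible by 11. A small sketch (isbn10_ok is
    my name for it):

```python
def isbn10_ok(isbn):
    """Validate an ISBN-10 check digit ('-'s already stripped)."""
    if len(isbn) != 10:
        return False
    total = 0
    for weight, ch in zip(range(10, 0, -1), isbn):
        if ch == "X" and weight == 1:   # 'X' means 10, last position only
            value = 10
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += weight * value
    return total % 11 == 0

print(isbn10_ok("0672328976"))  # True
print(isbn10_ok("0672328977"))  # False -- last digit off by one
```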

    Cheers,
    John
     
    John Machin, Feb 26, 2007
    #7
