Re: Curious to see alternate approach on a search/replace via regex

Discussion in 'Python' started by Demian Brecht, Feb 6, 2013.

  1. Well, an alternative /could/ be:

    from urlparse import urlparse

    parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
    print '%s%s_%s' % (parts.netloc.replace('.', '_'),
                       parts.path.replace('/', '_'),
                       parts.query.replace('&', '_').replace('=', '_')
                       )


    Although with the result of:

    alongnameofasite1234567_com_q_sports_run_a_1_b_1
    1288 function calls in 0.004 seconds


    Compared to regex method:

    498 function calls (480 primitive calls) in 0.000 seconds

    I'd prefer the regex method myself.
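    (The call counts above read like cProfile output. A minimal sketch of how to reproduce that kind of measurement, using Python 3's urllib.parse in place of the old urlparse module the thread was written against:)

    ```python
    # Sketch: producing "N function calls in X seconds" numbers with cProfile.
    # Uses Python 3's urllib.parse; the original posts used Python 2's urlparse.
    import cProfile
    from urllib.parse import urlparse

    def mangle(url):
        parts = urlparse(url)
        return '%s%s_%s' % (parts.netloc.replace('.', '_'),
                            parts.path.replace('/', '_'),
                            parts.query.replace('&', '_').replace('=', '_'))

    # runctx profiles the statement against our own namespace.
    cProfile.runctx("mangle(u)", globals(),
                    {'u': 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'})
    ```

    Note that a single profiled call like this mostly measures profiler and startup overhead, which is part of why the thread moves on to timeit below.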

    Demian Brecht
    http://demianbrecht.github.com




    On 2013-02-06 1:41 PM, "rh" <> wrote:

    >http://alongnameofasite1234567.com/q?sports=run&a=1&b=1
     
    Demian Brecht, Feb 6, 2013
    #1

  2. On Wed, 06 Feb 2013 13:55:58 -0800, Demian Brecht wrote:

    > Well, an alternative /could/ be:
    >
    > from urlparse import urlparse
    >
    > parts =
    > urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
    > print '%s%s_%s' % (parts.netloc.replace('.', '_'),
    >     parts.path.replace('/', '_'),
    >     parts.query.replace('&', '_').replace('=', '_') )
    >
    >
    > Although with the result of:
    >
    > alongnameofasite1234567_com_q_sports_run_a_1_b_1
    > 1288 function calls in 0.004 seconds
    >
    >
    > Compared to regex method:
    >
    > 498 function calls (480 primitive calls) in 0.000 seconds
    >
    > I'd prefer the regex method myself.


    I dispute those results. I think you are mostly measuring the time to
    print the result, and I/O is quite slow. My tests show that using urlparse
    is 33% faster than using regexes, and far more understandable and
    maintainable.


    py> from urlparse import urlparse
    py> def mangle(url):
    ...     parts = urlparse(url)
    ...     return '%s%s_%s' % (parts.netloc.replace('.', '_'),
    ...         parts.path.replace('/', '_'),
    ...         parts.query.replace('&', '_').replace('=', '_')
    ...         )
    ...
    py> import re
    py> def u2f(u):
    ...     nx = re.compile(r'https?://(.+)$')
    ...     u = nx.search(u).group(1)
    ...     ux = re.compile(r'([-:./?&=]+)')
    ...     return ux.sub('_', u)
    ...
    py> s = 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'
    py> assert u2f(s) == mangle(s)
    py>
    py> from timeit import Timer
    py> setup = 'from __main__ import s, u2f, mangle'
    py> t1 = Timer('mangle(s)', setup)
    py> t2 = Timer('u2f(s)', setup)
    py>
    py> min(t1.repeat(repeat=7))
    7.2962000370025635
    py> min(t2.repeat(repeat=7))
    10.981598854064941
    py>
    py> (10.98-7.29)/10.98
    0.33606557377049184


    (Timings done using Python 2.6 on my laptop -- your speeds may vary.)
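    A rough Python 3 port of the same benchmark, for anyone following along on a current interpreter (urllib.parse replaces urlparse; this is a sketch, not a re-run of the original measurement, and absolute numbers will differ):

    ```python
    # Python 3 port of the mangle-vs-u2f benchmark above (timings vary by machine).
    import re
    from timeit import Timer
    from urllib.parse import urlparse

    def mangle(url):
        parts = urlparse(url)
        return '%s%s_%s' % (parts.netloc.replace('.', '_'),
                            parts.path.replace('/', '_'),
                            parts.query.replace('&', '_').replace('=', '_'))

    def u2f(u):
        nx = re.compile(r'https?://(.+)$')   # served from re's cache after the first call
        u = nx.search(u).group(1)
        ux = re.compile(r'([-:./?&=]+)')
        return ux.sub('_', u)

    s = 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'
    assert u2f(s) == mangle(s)

    # globals=globals() (Python 3.5+) replaces the 'from __main__ import ...' setup string.
    print(min(Timer('mangle(s)', globals=globals()).repeat(repeat=3, number=100000)))
    print(min(Timer('u2f(s)', globals=globals()).repeat(repeat=3, number=100000)))
    ```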



    --
    Steven
     
    Steven D'Aprano, Feb 7, 2013
    #2

  3. rh Guest

    On 07 Feb 2013 03:04:39 GMT
    Steven D'Aprano <> wrote:

    > On Wed, 06 Feb 2013 13:55:58 -0800, Demian Brecht wrote:
    >
    > > Well, an alternative /could/ be:
    > >
    > > from urlparse import urlparse
    > >
    > > parts =
    > > urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
    > > print '%s%s_%s' % (parts.netloc.replace('.', '_'),
    > >     parts.path.replace('/', '_'),
    > >     parts.query.replace('&', '_').replace('=', '_') )
    > >
    > >
    > > Although with the result of:
    > >
    > > alongnameofasite1234567_com_q_sports_run_a_1_b_1
    > > 1288 function calls in 0.004 seconds
    > >
    > >
    > > Compared to regex method:
    > >
    > > 498 function calls (480 primitive calls) in 0.000 seconds
    > >
    > > I'd prefer the regex method myself.

    >
    > I dispute those results. I think you are mostly measuring the time to
    > print the result, and I/O is quite slow. My tests show that using
    > urlparse is 33% faster than using regexes, and far more
    > understandable and maintainable.
    >
    >
    > py> from urlparse import urlparse
    > py> def mangle(url):
    > ...     parts = urlparse(url)
    > ...     return '%s%s_%s' % (parts.netloc.replace('.', '_'),
    > ...         parts.path.replace('/', '_'),
    > ...         parts.query.replace('&', '_').replace('=', '_')
    > ...         )
    > ...
    > py> import re
    > py> def u2f(u):
    > ...     nx = re.compile(r'https?://(.+)$')
    > ...     u = nx.search(u).group(1)
    > ...     ux = re.compile(r'([-:./?&=]+)')
    > ...     return ux.sub('_', u)
    > ...
    > py> s = 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'
    > py> assert u2f(s) == mangle(s)
    > py>
    > py> from timeit import Timer
    > py> setup = 'from __main__ import s, u2f, mangle'
    > py> t1 = Timer('mangle(s)', setup)
    > py> t2 = Timer('u2f(s)', setup)
    > py>
    > py> min(t1.repeat(repeat=7))
    > 7.2962000370025635
    > py> min(t2.repeat(repeat=7))
    > 10.981598854064941
    > py>
    > py> (10.98-7.29)/10.98
    > 0.33606557377049184
    >
    >
    > (Timings done using Python 2.6 on my laptop -- your speeds may vary.)


    I am using 2.7.3, and I moved the re.compile calls outside the function;
    it performed faster than urlparse. I don't print out the data.

    Fast
    ^
    | compiled regex
    | urlparse
    | plain regex
    | all-at-once search/replace with alternation
    Slow
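    rh never posted the exact code, but the "compiled regex" variant at the top of that ranking was presumably something like this sketch, with both patterns hoisted to module level:

    ```python
    import re

    # Patterns compiled once at import time rather than on every call.
    # NX and UX are illustrative names; rh's actual code wasn't shown.
    NX = re.compile(r'https?://(.+)$')
    UX = re.compile(r'([-:./?&=]+)')

    def u2f(u):
        # Per call, only the pre-compiled search and sub run.
        return UX.sub('_', NX.search(u).group(1))
    ```

    This skips the per-call cache lookup that `re.compile` inside the function still pays, which is consistent with the ranking above.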

    >
    >
    >
    > --
    > Steven
     
    rh, Feb 7, 2013
    #3
  4. jmfauth Guest

    On 7 Feb, 04:04, Steven D'Aprano <steve> wrote:
    > On Wed, 06 Feb 2013 13:55:58 -0800, Demian Brecht wrote:
    > > Well, an alternative /could/ be:

    >
    > ...
    > py> s = 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'
    > py> assert u2f(s) == mangle(s)
    > py>
    > py> from timeit import Timer
    > py> setup = 'from __main__ import s, u2f, mangle'
    > py> t1 = Timer('mangle(s)', setup)
    > py> t2 = Timer('u2f(s)', setup)
    > py>
    > py> min(t1.repeat(repeat=7))
    > 7.2962000370025635
    > py> min(t2.repeat(repeat=7))
    > 10.981598854064941
    > py>
    > py> (10.98-7.29)/10.98
    > 0.33606557377049184
    >
    > (Timings done using Python 2.6 on my laptop -- your speeds may vary.)
    >


    --------


    [OT] Sorry, but I find all these "timeit" I see here and there
    more and more ridiculous.

    Maybe it's the language itself, which became ridiculous.


    code:

    r = repeat("('WHERE IN THE WORLD IS CARMEN?'*10).lower()")
    print('1:', r)

    r = repeat("('WHERE IN THE WORLD IS HÉLÈNE?'*10).lower()")
    print('2:', r)

    t = Timer("re.sub('CARMEN', 'CARMEN', 'WHERE IN THE WORLD IS CARMEN?'*10)", "import re")
    r = t.repeat()
    print('3:', r)

    t = Timer("re.sub('HÉLÈNE', 'HÉLÈNE', 'WHERE IN THE WORLD IS HÉLÈNE?'*10)", "import re")
    r = t.repeat()
    print('4:', r)

    result:

    >c:\python32\pythonw -u "vitesse3.py"

    1: [2.578785478740226, 2.5738459157233833, 2.5739002658825543]
    2: [2.57605654937141, 2.5784755252962572, 2.5775366066044896]
    3: [11.856728254324088, 11.856321809655501, 11.857456073846905]
    4: [12.111787643688231, 12.102743462128771, 12.098514783440208]
    >Exit code: 0
    >c:\Python33\pythonw -u "vitesse3.py"

    1: [0.6063335264470632, 0.6104798922133946, 0.6078580877959869]
    2: [4.080205081267272, 4.079303183698418, 4.0786836706522145]
    3: [18.093742209318215, 18.079666699618095, 18.07107661757692]
    4: [18.852576768615222, 18.841418050790622, 18.840745369110437]
    >Exit code: 0


    The future is bright for ... ascii users.

    jmf
     
    jmfauth, Feb 7, 2013
    #4
  5. On Thu, Feb 7, 2013 at 10:08 PM, jmfauth <> wrote:
    > The future is bright for ... ascii users.
    >
    > jmf


    So you're admitting to being not very bright?

    *ducks*

    Seriously jmf, please don't hijack threads just to whine about
    contrived issues of Unicode performance yet again. That horse is dead.
    Go fork Python and reimplement buggy narrow builds if you want to, the
    rest of us are happy with a bug-free Python.

    ChrisA
     
    Chris Angelico, Feb 7, 2013
    #5
  6. rh wrote:

    > I am using 2.7.3 and I put the re.compile outside the function and it
    > performed faster than urlparse. I don't print out the data.


    I find that hard to believe. re.compile caches its results, so except for
    the very first time it is called, it is very fast -- basically a function
    call and a dict lookup. I find it implausible that a micro-optimization
    such as you describe could be responsible for speeding the code up by over
    33%.

    But since you don't demonstrate any actual working code, you could be
    correct, or you could be timing it wrong. Without seeing your timing code,
    my guess is that you are doing it wrong. Timing code is tricky, which is
    why I always show my work. If I get it wrong, someone will hopefully tell
    me. Otherwise, I might as well be making up the numbers.
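    (The caching described above is easy to observe on CPython, where compiling the same pattern twice hands back the very same object. This is an implementation detail, not a documented guarantee:)

    ```python
    import re

    # CPython caches compiled patterns, so a second compile of the same
    # pattern and flags is just a dict lookup returning the cached object.
    a = re.compile(r'https?://(.+)$')
    b = re.compile(r'https?://(.+)$')
    print(a is b)   # True on CPython (implementation detail)

    re.purge()      # empty the internal cache...
    c = re.compile(r'https?://(.+)$')
    print(a is c)   # ...and a fresh object is built: False
    ```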



    --
    Steven
     
    Steven D'Aprano, Feb 7, 2013
    #6
  7. rh Guest

    On Fri, 08 Feb 2013 09:45:41 +1100
    Steven D'Aprano <> wrote:

    > rh wrote:
    >
    > > I am using 2.7.3 and I put the re.compile outside the function and
    > > it performed faster than urlparse. I don't print out the data.

    >
    > I find that hard to believe. re.compile caches its results, so except
    > for the very first time it is called, it is very fast -- basically a
    > function call and a dict lookup. I find it implausible that a
    > micro-optimization such as you describe could be responsible for
    > speeding the code up by over 33%.


    Not sure where you came up with that number. Maybe another post?
    I never gave any numbers, just comparisons.

    >
    > But since you don't demonstrate any actual working code, you could be
    > correct, or you could be timing it wrong. Without seeing your timing
    > code, my guess is that you are doing it wrong. Timing code is tricky,
    > which is why I always show my work. If I get it wrong, someone will
    > hopefully tell me. Otherwise, I might as well be making up the
    > numbers.


    re.compile
    starttime = time.time()
    for i in range(numloops):
        u2f()

    msg = '\nElapsed {0:.3f}'.format(time.time() - starttime)
    print(msg)

    >
    >
    >
    > --
    > Steven
    >



    --
     
    rh, Feb 7, 2013
    #7
  8. rh wrote:

    > On Fri, 08 Feb 2013 09:45:41 +1100
    > Steven D'Aprano <> wrote:
    >
    >> rh wrote:
    >>
    >> > I am using 2.7.3 and I put the re.compile outside the function and
    >> > it performed faster than urlparse. I don't print out the data.

    >>
    >> I find that hard to believe. re.compile caches its results, so except
    >> for the very first time it is called, it is very fast -- basically a
    >> function call and a dict lookup. I find it implausible that a
    >> micro-optimization such as you describe could be responsible for
    >> speeding the code up by over 33%.

    >
    > Not sure where you came up with that number. Maybe another post?


    That number comes from my post, which you replied to.

    http://mail.python.org/pipermail/python-list/2013-February/640056.html

    By the way, are you aware that you are setting the X-No-Archive header on
    your posts?



    > I never gave any numbers, just comparisons.
    >
    >>
    >> But since you don't demonstrate any actual working code, you could be
    >> correct, or you could be timing it wrong. Without seeing your timing
    >> code, my guess is that you are doing it wrong. Timing code is tricky,
    >> which is why I always show my work. If I get it wrong, someone will
    >> hopefully tell me. Otherwise, I might as well be making up the
    >> numbers.

    >
    > re.compile
    > starttime = time.time()
    > for i in range(numloops):
    >     u2f()
    >
    > msg = '\nElapsed {0:.3f}'.format(time.time() - starttime)
    > print(msg)



    I suggest you go back to my earlier post, the one you responded to, and look
    at how I use the timeit module to time small code snippets. Then read the
    documentation for it, and the comments in the source code. If you can get
    hold of the Python Cookbook, read Tim Peters' comments in that.

    http://docs.python.org/2/library/timeit.html
    http://docs.python.org/3/library/timeit.html
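    For reference, the timeit pattern those docs recommend for a snippet like u2f looks roughly like this (pattern and URL taken from earlier in the thread):

    ```python
    # Timing a snippet with timeit instead of a hand-rolled time.time() loop.
    from timeit import Timer

    setup = """
    import re
    nx = re.compile(r'https?://(.+)$')
    ux = re.compile(r'([-:./?&=]+)')
    u = 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'
    """
    t = Timer("ux.sub('_', nx.search(u).group(1))", setup)
    # repeat() runs the statement `number` times, several times over;
    # take the minimum to reduce interference from other processes.
    print(min(t.repeat(repeat=5, number=100000)))
    ```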



    Oh, one last thing... pulling out "re.compile" outside of the function does
    absolutely nothing. You don't even compile anything. It basically looks up
    that a compile function exists in the re module, and that's all.
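    (The distinction, as a sketch: a bare `re.compile` is an expression statement that evaluates the attribute and throws it away, whereas hoisting means *calling* it once and keeping the result:)

    ```python
    import re

    re.compile  # bare expression statement: looks up the attribute, discards it, compiles nothing

    # Hoisting actually means: call compile once, bind the result, reuse it.
    # NX and strip_scheme are illustrative names, not from the thread.
    NX = re.compile(r'https?://(.+)$')

    def strip_scheme(u):
        return NX.search(u).group(1)   # per-call work is just the search
    ```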



    --
    Steven
     
    Steven D'Aprano, Feb 7, 2013
    #8
  9. Ian Kelly Guest

    On Thu, Feb 7, 2013 at 4:59 PM, Steven D'Aprano
    <> wrote:
    > Oh, one last thing... pulling out "re.compile" outside of the function does
    > absolutely nothing. You don't even compile anything. It basically looks up
    > that a compile function exists in the re module, and that's all.


    Using Python 2.7:

    >>> t1 = Timer("""
    ... nx = re.compile(r'https?://(.+)$')
    ... v = nx.search(u).group(1)
    ... ux = re.compile(r'([-:./?&=]+)')
    ... ux.sub('_', v)""", """
    ... import re
    ... u = 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'""")
    >>> t2 = Timer("""
    ... v = nx.search(u).group(1)
    ... ux.sub('_', v)""", """
    ... import re
    ... nx = re.compile(r'https?://(.+)$')
    ... ux = re.compile(r'([-:./?&=]+)')
    ... u = 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'""")
    >>> min(t1.repeat())
    11.625409933385388
    >>> min(t2.repeat())
    8.825254885746652

    Whatever caching is being done by re.compile, that's still a 24%
    savings by moving the compile calls into the setup.
     
    Ian Kelly, Feb 8, 2013
    #9
  10. Ian Kelly Guest

    On Thu, Feb 7, 2013 at 5:55 PM, Ian Kelly <> wrote:
    > Whatever caching is being done by re.compile, that's still a 24%
    > savings by moving the compile calls into the setup.


    On the other hand, if you add an re.purge() call to the start of t1 to
    clear the cache:

    >>> t3 = Timer("""
    ... re.purge()
    ... nx = re.compile(r'https?://(.+)$')
    ... v = nx.search(u).group(1)
    ... ux = re.compile(r'([-:./?&=]+)')
    ... ux.sub('_', v)""", """
    ... import re
    ... u = 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'""")
    >>> min(t3.repeat(number=10000))
    3.5532990924824617

    Which is approximately 30 times slower, so clearly the regular
    expression *is* being cached. I think what we're seeing here is that
    the time needed to look up the compiled regular expression in the
    cache is a significant fraction of the time needed to actually execute
    it.
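    The same three-way effect (cached lookup vs. purged recompile vs. pre-compiled) can be sketched on a modern Python; absolute numbers vary by machine, but the purged variant should stay the slowest by far:

    ```python
    import timeit

    # Three variants of the same match, differing only in per-call compile cost.
    stmt_cached = "re.match(r'https?://(.+)$', u)"             # hits re's internal cache
    stmt_purged = "re.purge(); re.match(r'https?://(.+)$', u)"  # recompiles every time
    setup = "import re; u = 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'"

    cached = timeit.timeit(stmt_cached, setup, number=10000)
    purged = timeit.timeit(stmt_purged, setup, number=10000)
    precompiled = timeit.timeit(
        "nx.match(u)",
        setup + "; nx = re.compile(r'https?://(.+)$')",
        number=10000)

    print(cached, purged, precompiled)  # expect purged to be slowest by a wide margin
    ```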
     
    Ian Kelly, Feb 8, 2013
    #10
  11. Ian Kelly wrote:

    > On Thu, Feb 7, 2013 at 4:59 PM, Steven D'Aprano
    > <> wrote:
    >> Oh, one last thing... pulling out "re.compile" outside of the function
    >> does absolutely nothing. You don't even compile anything. It basically
    >> looks up that a compile function exists in the re module, and that's all.

    >
    > Using Python 2.7:

    [...]
    > Whatever caching is being done by re.compile, that's still a 24%
    > savings by moving the compile calls into the setup.


    That may or may not be the case, but rh didn't compile anything. He
    moved "re.compile" literally, with no arguments, out of the timing code.
    That clearly does nothing except confirm that re.compile exists.



    --
    Steven
     
    Steven D'Aprano, Feb 8, 2013
    #11
  12. rh Guest

    On Fri, 08 Feb 2013 14:02:14 +1100
    Steven D'Aprano <> wrote:

    > Ian Kelly wrote:
    >
    > > On Thu, Feb 7, 2013 at 4:59 PM, Steven D'Aprano
    > > <> wrote:
    > >> Oh, one last thing... pulling out "re.compile" outside of the
    > >> function does absolutely nothing. You don't even compile anything.
    > >> It basically looks up that a compile function exists in the re
    > >> module, and that's all.

    > >
    > > Using Python 2.7:

    > [...]
    > > Whatever caching is being done by re.compile, that's still a 24%
    > > savings by moving the compile calls into the setup.

    >
    > That may or may not be the case, but rh didn't compile anything. He
    > moved "re.compile" literally, with no arguments, out of the timing
    > code. That clearly does nothing except confirm that re.compile exists.


    My initial post has the function and in there are two re.compile calls.
    I moved those out of the function and see repeatable time efficiency
    improvements.

    FWIW the fastest so far was posted by Peter Otten and didn't
    use regex.

    As a new learner of python (or any language) I like to know what
    habits will serve me well into the future. So the only reason I look
    at the time it takes is as a sanity check to make sure I'm not
    learning bad habits. In this case someone else pointed out time
    comparisons and off the thread went into timings!

    I did take note of your previous post using timeit and filed
    that away into the gray matter for some other day.

    >
    >
    >
    > --
    > Steven
    >



    --
     
    rh, Feb 8, 2013
    #12
  13. rh Guest

    On Thu, 7 Feb 2013 18:08:00 -0700
    Ian Kelly <> wrote:

    > On Thu, Feb 7, 2013 at 5:55 PM, Ian Kelly <>
    > wrote:
    > > Whatever caching is being done by re.compile, that's still a 24%
    > > savings by moving the compile calls into the setup.

    >
    > On the other hand, if you add an re.purge() call to the start of t1 to
    > clear the cache:
    >
    > >>> t3 = Timer("""
    > ... re.purge()
    > ... nx = re.compile(r'https?://(.+)$')
    > ... v = nx.search(u).group(1)
    > ... ux = re.compile(r'([-:./?&=]+)')
    > ... ux.sub('_', v)""", """
    > ... import re
    > ... u = 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'""")
    > >>> min(t3.repeat(number=10000))
    > 3.5532990924824617
    >
    > Which is approximately 30 times slower, so clearly the regular
    > expression *is* being cached. I think what we're seeing here is that
    > the time needed to look up the compiled regular expression in the
    > cache is a significant fraction of the time needed to actually execute
    > it.


    By "actually execute" you mean to apply the compiled expression
    to the search or sub? Or do you mean the time needed to compile
    the pattern into a regex obj?

    I presumed that compiling the pattern at each iteration was expensive
    and that's why I expected moving it out of the function to reduce the
    time needed to search/sub.
     
    rh, Feb 8, 2013
    #13
  14. Dave Angel Guest

    On 02/07/2013 06:13 PM, rh wrote:
    > On Fri, 08 Feb 2013 09:45:41 +1100
    > Steven D'Aprano <> wrote:
    >
    >> <snip>
    >>
    >> But since you don't demonstrate any actual working code, you could be
    >> correct, or you could be timing it wrong. Without seeing your timing
    >> code, my guess is that you are doing it wrong. Timing code is tricky,
    >> which is why I always show my work. If I get it wrong, someone will
    >> hopefully tell me. Otherwise, I might as well be making up the
    >> numbers.

    >
    > re.compile


    That statement explicitly does nothing useful. It certainly doesn't
    compile anything or call any regex code.

    > starttime = time.time()
    > for i in range(numloops):
    >     u2f()
    >
    > msg = '\nElapsed {0:.3f}'.format(time.time() - starttime)
    > print(msg)
    >



    --
    DaveA
     
    Dave Angel, Feb 8, 2013
    #14
  15. Ian Kelly Guest

    On Thu, Feb 7, 2013 at 10:57 PM, rh <> wrote:
    > On Thu, 7 Feb 2013 18:08:00 -0700
    > Ian Kelly <> wrote:
    >
    >> Which is approximately 30 times slower, so clearly the regular
    >> expression *is* being cached. I think what we're seeing here is that
    >> the time needed to look up the compiled regular expression in the
    >> cache is a significant fraction of the time needed to actually execute
    >> it.

    >
    > By "actually execute" you mean to apply the compiled expression
    > to the search or sub? Or do you mean the time needed to compile
    > the pattern into a regex obj?


    The former. Both are dwarfed by the time needed to compile the pattern.
     
    Ian Kelly, Feb 8, 2013
    #15
  16. Ian Kelly wrote:

    > On Thu, Feb 7, 2013 at 10:57 PM, rh <> wrote:
    >> On Thu, 7 Feb 2013 18:08:00 -0700
    >> Ian Kelly <> wrote:
    >>
    >>> Which is approximately 30 times slower, so clearly the regular
    >>> expression *is* being cached. I think what we're seeing here is that
    >>> the time needed to look up the compiled regular expression in the
    >>> cache is a significant fraction of the time needed to actually execute
    >>> it.

    >>
    >> By "actually execute" you mean to apply the compiled expression
    >> to the search or sub? Or do you mean the time needed to compile
    >> the pattern into a regex obj?

    >
    > The former. Both are dwarfed by the time needed to compile the pattern.


    Surely that depends on the size of the pattern, and the size of the data
    being worked on.

    Compiling the pattern "s[ai]t" doesn't take that much work, it's only six
    characters and very simple. Applying it to:

    "sazsid"*1000000 + "sat"

    on the other hand may be a tad expensive.

    Sweeping generalities about the cost of compiling regexes versus searching
    with them are risky.
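    Steven's point is easy to demonstrate: with a tiny pattern and a large subject, searching utterly dominates compiling (a sketch using his own pattern and data; numbers are machine-dependent):

    ```python
    import timeit

    # Compiling the six-character pattern, with the cache cleared each time...
    compile_t = timeit.timeit("re.purge(); re.compile('s[ai]t')",
                              "import re", number=100)

    # ...versus searching a ~6 MB string whose only match is at the very end.
    search_t = timeit.timeit(
        "pat.search(data)",
        "import re; pat = re.compile('s[ai]t'); data = 'sazsid' * 1000000 + 'sat'",
        number=100)

    print(compile_t, search_t)  # here the searching time dwarfs the compiling time
    ```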



    --
    Steven
     
    Steven D'Aprano, Feb 8, 2013
    #16
  17. Ian Kelly Guest

    On Fri, Feb 8, 2013 at 4:43 AM, Steven D'Aprano
    <> wrote:
    > Ian Kelly wrote:
    > Surely that depends on the size of the pattern, and the size of the data
    > being worked on.


    Naturally.

    > Compiling the pattern "s[ai]t" doesn't take that much work, it's only six
    > characters and very simple. Applying it to:
    >
    > "sazsid"*1000000 + "sat"
    >
    > on the other hand may be a tad expensive.
    >
    > Sweeping generalities about the cost of compiling regexes versus searching
    > with them are risky.


    I was referring to the specific timing measurements I made earlier in
    this thread, not generalizing.
     
    Ian Kelly, Feb 8, 2013
    #17
  18. On 08.02.13 03:08, Ian Kelly wrote:
    > I think what we're seeing here is that
    > the time needed to look up the compiled regular expression in the
    > cache is a significant fraction of the time needed to actually execute
    > it.


    There is a bug issue for this. See http://bugs.python.org/issue16389 .
     
    Serhiy Storchaka, Feb 15, 2013
    #18
  19. rh Guest

    On Fri, 15 Feb 2013 22:58:30 +0200
    Serhiy Storchaka <> wrote:

    > On 08.02.13 03:08, Ian Kelly wrote:
    > > I think what we're seeing here is that
    > > the time needed to look up the compiled regular expression in the
    > > cache is a significant fraction of the time needed to actually
    > > execute it.

    >
    > There is a bug issue for this. See http://bugs.python.org/issue16389 .
    >


    I can't tell what the problem is. Is it fixed or still in progress?
     
    rh, Feb 26, 2013
    #19
