Using regexes versus "in" membership test?

Discussion in 'Python' started by Victor Hooi, Dec 12, 2012.

  1. Victor Hooi

    Victor Hooi Guest

    Hi,

    I have a script that trawls through log files looking for certain error conditions. These are identified via certain keywords (all different) in those lines.

    I then process those lines using regex groups to extract certain fields.

    Currently, I'm using a for loop to iterate through the file, and a dict of regexes:

    breaches = {
        'type1': re.compile(r'some_regex_expression'),
        'type2': re.compile(r'some_regex_expression'),
        'type3': re.compile(r'some_regex_expression'),
        'type4': re.compile(r'some_regex_expression'),
        'type5': re.compile(r'some_regex_expression'),
    }
    ...
    with open('blah.log', 'r') as f:
        for line in f:
            for breach in breaches:
                results = breaches[breach].search(line)
                if results:
                    self.logger.info('We found an error - {0} - {1}'.format(
                        results.group('errorcode'), results.group('errormsg')))
                    # We do other things with other regex groups as well.

    (This isn't the *exact* code, but it shows the logic/flow fairly closely).

    Also, my regexes could possibly be tuned; they look something like this:

    (?P<timestamp>\d{2}:\d{2}:\d{2}.\d{9})\s*\[(?P<logginglevel>\w+)\s*\]\s*\[(?P<module>\w+)\s*\]\s*\[{0,1}\]{0,1}\s*\[(?P<function>\w+)\s*\]\s*level\(\d\) broadcast\s*\(\[(?P<f1_instance>\w+)\]\s*\[(?P<foo>\w+)\]\s*(?P<bar>\w{4}):(?P<feedcode>\w+) failed order: (?P<side>\w+) (?P<volume>\d+) @ (?P<price>[\d.]+), error on update \(\d+ : Some error string. Active Orders=(?P<value>\d+) Limit=(?P<limit>\d+)\)\)

    (Feel free to suggest any tuning, if you think they need it).

    My question is - I've heard that using the "in" membership operator is substantially faster than using Python regexes.

    Is this true? What is the technical explanation for this? And what sort of performance characteristics are there between the two?

    (I couldn't find much in the way of docs for "in", just the brief mention here - http://docs.python.org/2/reference/expressions.html#not-in )

    Would I be substantially better off using a list of strings and using "in" against each line, then using a second pass of regex only on the matched lines?

    (Log files are compressed; I'm actually using bz2 to read them in. Uncompressed size is around 40-50 GB.)
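
    For reference, the reading side is nothing fancy - roughly the loop below (the filename is just a placeholder):

    import bz2

    # Iterate over a bzip2-compressed log line by line, without decompressing
    # it to disk first (BZ2File supports the with statement on Python 2.7).
    with bz2.BZ2File('blah.log.bz2') as f:
        for line in f:
            pass  # run the keyword / regex checks on each line here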



    Cheers,
    Victor
    Victor Hooi, Dec 12, 2012
    #1

  2. On Wed, 12 Dec 2012 14:35:41 -0800, Victor Hooi wrote:

    > Hi,
    >
    > I have a script that trawls through log files looking for certain error
    > conditions. These are identified via certain keywords (all different) in
    > those lines
    >
    > I then process those lines using regex groups to extract certain fields.

    [...]
    > Also, my regexes could possibly be tuned; they look something like this:
    >
    > (?P<timestamp>\d{2}:\d{2}:\d{2}.\d{9})\s*\[(?P<logginglevel>\w+)\s*
    > \]\s*\[(?P<module>\w+)\s*\]\s*\[{0,1}\]{0,1}\s*\[(?P<function>\w+)\s*\]
    > \s*level\(\d\) broadcast\s*\(\[(?P<f1_instance>\w+)\]\s*\[(?P<foo>\w+)\]
    > \s*(?P<bar>\w{4}):(?P<feedcode>\w+) failed order: (?P<side>\w+) (?
    > P<volume>\d+) @ (?P<price>[\d.]+), error on update \(\d+ : Some error
    > string. Active Orders=(?P<value>\d+) Limit=(?P<limit>\d+)\)\)
    >
    > (Feel free to suggest any tuning, if you think they need it).


    "Tuning"? I think it needs to be taken out and killed with a stake to the
    heart, then buried in concrete! :)

    An appropriate quote:

    Some people, when confronted with a problem, think "I know,
    I'll use regular expressions." Now they have two problems.
    -- Jamie Zawinski

    Is this actually meant to be a single regex, or did your email somehow
    mangle multiple regexes into a single line?

    At the very least, you should write your regexes using the VERBOSE flag,
    so you can use non-significant whitespace and comments. There is no
    performance cost to using VERBOSE once they are compiled, but a huge
    maintainability benefit.
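
    For instance, the first few fields of your pattern could be laid out along
    these lines (a sketch of the layout only, not a drop-in replacement for
    your full pattern):

    import re

    # Sketch only: with re.VERBOSE, unescaped whitespace in the pattern is
    # ignored, so each field can sit on its own commented line.
    pattern = re.compile(r"""
        (?P<timestamp>\d{2}:\d{2}:\d{2}\.\d{9})   # hh:mm:ss.nnnnnnnnn
        \s* \[ (?P<logginglevel>\w+) \s* \]       # logging level
        \s* \[ (?P<module>\w+)       \s* \]       # module name
        \s* \[ (?P<function>\w+)     \s* \]       # function name
        """, re.VERBOSE)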


    > My question is - I've heard that using the "in" membership operator is
    > substantially faster than using Python regexes.
    >
    > Is this true? What is the technical explanation for this? And what sort
    > of performance characteristics are there between the two?


    Yes, it is true. The technical explanation is simple:

    * the "in" operator implements simple substring matching,
    which is trivial to perform and fast;

    * regexes are an interpreted mini-language which operate via
    a complex state machine that needs to do a lot of work,
    which is complicated to perform and slow.

    Python's regex engine is not as finely tuned as (say) Perl's, but even in
    Perl simple substring matching ought to be faster, simply because you are
    doing less work to match a substring than to run a regex.

    But the real advantage to using "in" is readability and maintainability.

    As for the performance characteristics, you really need to do your own
    testing. Performance will depend on what you are searching for, where you
    are searching for it, whether it is found or not, your version of Python,
    your operating system, your hardware.

    At some level of complexity, you are better off just using a regex rather
    than implementing your own buggy, complicated expression matcher: for
    some matching tasks, there is no reasonable substitute for regexes. But
    for *simple* uses, you should prefer *simple* code:

    [steve@ando ~]$ python -m timeit \
    > -s "data = 'abcd'*1000 + 'xyz' + 'abcd'*1000" \
    > "'xyz' in data"

    100000 loops, best of 3: 4.17 usec per loop

    [steve@ando ~]$ python -m timeit \
    > -s "data = 'abcd'*1000 + 'xyz' + 'abcd'*1000" \
    > -s "from re import search" \
    > "search('xyz', data)"

    100000 loops, best of 3: 10.9 usec per loop



    > (I couldn't find much in the way of docs for "in", just the brief
    > mention here -
    > http://docs.python.org/2/reference/expressions.html#not-in )
    >
    > Would I be substantially better off using a list of strings and using
    > "in" against each line, then using a second pass of regex only on the
    > matched lines?


    That's hard to say. It depends on whether you are matching on a substring
    that will frequently be found close to the start of each line, or
    something else.

    Where I expect you *will* see a good benefit is:

    * you have many lines to search;
    * but only a few actually match the regex;
    * the regex is quite complicated, and needs to backtrack a lot;
    * but you can eliminate most of the "no match" cases with a simple
    substring match.

    If you are in this situation, then very likely you will see a big benefit
    from a two-pass search:

    for line in log:
        if any(substr in line for substr in list_of_substrings):
            # now test against a regex
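
    Fleshed out with the dict of compiled regexes from your first post, that
    might look roughly like this (the keyword list is a placeholder - use
    whatever literal text reliably appears in the lines you care about):

    import re

    breaches = {
        'type1': re.compile(r'some_regex_expression'),
        # ... the rest of your compiled patterns ...
    }
    keywords = ['failed order', 'error on update']   # placeholder substrings

    with open('blah.log', 'r') as f:
        for line in f:
            # First pass: cheap substring membership tests.
            if not any(keyword in line for keyword in keywords):
                continue
            # Second pass: only the surviving lines pay for the regexes.
            for name, pattern in breaches.items():
                results = pattern.search(line)
                if results:
                    # extract the named groups / log the breach as before
                    break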


    Otherwise, maybe, maybe not.


    --
    Steven
    Steven D'Aprano, Dec 13, 2012
    #2

  3. Victor Hooi

    Victor Hooi Guest

    Hi,

    That was actually *one* regex expression...lol.

    And yes, it probably is a bit convoluted.

    Thanks for the pointer about using VERBOSE - I'll use that and comment my regex.

    Are there any other general pointers you might give for that regex? The lines I'm trying to match look something like this:

    07:40:05.793627975 [Info ] [SOME_MODULE] [SOME_FUNCTION] [SOME_OTHER_FLAG] [RequestTag=0 ErrorCode=3 ErrorText="some error message" ID=0:0x0000000000000000 Foo=1 Bar=5 Joe=5]

    Essentially, I'd want to strip out the timestamp, logging level, module, function, etc. - and possibly the tag-value pairs?

    And yes, based on what you said, I probably will use the "in" test first, before the regex - the lines I'm searching for are fairly few compared to the overall log size.
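
    Roughly, I'm thinking of one regex for the fixed fields and then pulling the Tag=Value pairs out of the trailing bracket separately - something like the sketch below (the group names and patterns are just placeholders at this point):

    import re

    # Sketch: split the fixed-position fields first, then parse the
    # Tag=Value pairs out of the final bracketed section separately.
    line_re = re.compile(r"""
        (?P<timestamp>\d{2}:\d{2}:\d{2}\.\d{9}) \s+
        \[ (?P<logginglevel>\w+) \s* \] \s*
        \[ (?P<module>\w+)       \s* \] \s*
        \[ (?P<function>\w+)     \s* \] \s*
        \[ (?P<flag>\w+)         \s* \] \s*
        \[ (?P<fields>.*) \] \s* $
        """, re.VERBOSE)

    # Tag=Value, where the value is either a quoted string or a bare token.
    pair_re = re.compile(r'(\w+)=("[^"]*"|\S+)')

    line = ('07:40:05.793627975 [Info ] [SOME_MODULE] [SOME_FUNCTION] '
            '[SOME_OTHER_FLAG] [RequestTag=0 ErrorCode=3 '
            'ErrorText="some error message" ID=0:0x0000000000000000 '
            'Foo=1 Bar=5 Joe=5]')

    match = line_re.match(line)
    if match:
        fields = dict(pair_re.findall(match.group('fields')))
        # fields['ErrorCode'] == '3', fields['ErrorText'] == '"some error message"'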

    Cheers,
    Victor

    On Thursday, 13 December 2012 12:09:33 UTC+11, Steven D'Aprano wrote:
    > [...]
    Victor Hooi, Dec 13, 2012
    #3
  4. On Thu, Dec 13, 2012 at 5:10 PM, Victor Hooi <> wrote:
    > Are there any other general pointers you might give for that regex? The lines I'm trying to match look something like this:
    >
    > 07:40:05.793627975 [Info ] [SOME_MODULE] [SOME_FUNCTION] [SOME_OTHER_FLAG] [RequestTag=0 ErrorCode=3 ErrorText="some error message" ID=0:0x0000000000000000 Foo=1 Bar=5 Joe=5]
    >
    > Essentially, I'd want to strip out the timestamp, logging-level, module, function etc - and possibly the tag-value pairs?


    If possible, can you do a simple test to find out whether or not you
    want a line, and then do more complex parsing to get the info you want
    out of it? For instance, perhaps the presence of the word "ErrorCode"
    is all you need to check - it wouldn't hurt if you have a few percent
    of false positives that get discarded during the parse phase; it'll
    still be quicker to do a single string-in-string check than a complex
    regex to figure out whether you even need to process the line at all.
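
    Roughly this shape, say (the pattern here is purely illustrative):

    import re

    # Purely illustrative: 'pattern' stands in for whatever regex does the
    # real extraction. The cheap 'in' test runs on every line; the regex
    # only runs on the few lines that survive it, and any false positives
    # simply fail to match and get dropped.
    pattern = re.compile(r'ErrorCode=(\d+)')

    def error_matches(lines):
        for line in lines:
            if 'ErrorCode' not in line:    # fast substring rejection
                continue
            match = pattern.search(line)   # slower regex, run rarely
            if match:
                yield line, match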

    ChrisA
    Chris Angelico, Dec 13, 2012
    #4
  5. Victor Hooi

    Victor Hooi Guest

    Heya,

    See my original first post =):

    > Would I be substantially better off using a list of strings and using "in" against each line, then using a second pass of regex only on the matched lines?


    Based on what Steven said, and what I know about the logs in question, it's definitely better to do it that way.

    However, I'd still like to fix up the regex, or fix any glaring issues with it as well.

    Cheers,
    Victor

    On Thursday, 13 December 2012 17:19:57 UTC+11, Chris Angelico wrote:
    > [...]
    Victor Hooi, Dec 13, 2012
    #5
  6. On Thu, Dec 13, 2012 at 5:35 PM, Victor Hooi <> wrote:
    > Heya,
    >
    > See my original first post =):
    >


    Oops! Mea culpa. Yes, you did say that. And then definitely, if it's a
    one-keyword search for each and then the regex to parse, that would be
    much more efficient than using the regex to pick lines.

    Sorry! I didn't read it properly.

    ChrisA
    Chris Angelico, Dec 13, 2012
    #6