Compiling Regexp only once

Discussion in 'Ruby' started by singsang, Sep 4, 2006.

  1. singsang

    singsang Guest

    Dear all,

    Writing some httpd logfile pre-processing (splitting it up, getting
    already some basic numbers), I think that I should compile the Regexp
    for the logfile entry only once.

    So my guess is that I should have perhaps a class LogFormat that holds
    this as a class variable or a class constant. Below I use a non-tested
    regular expression that is not complete yet.

    So the idea is to have:

    class LogFormat
    @@RegEx = Regexp.new( '(\S+) (\S+) (\S+) \[(\d+)/(\w+)/(\d+)
    [+\-]\d+?\]' )
    def LogFormat.regex
    @@RegEx
    end
    end

    If now from a class LogLine (instanciated for each line in the logfile)
    I use something like

    class LogLine
    # ...
    ip, rfc931, user, day, month, year, offset =
    line.match(LogFormat.regex)
    # ...
    end

    My question: How often is the Regexp compiled? When?
    When the definition of LogFormat is read first?

    Btw: If anybody has a ready-to-use regex for the common log format this
    would be great, but I will get that done as far as I need by myself.
    ;-)
    Other question: Does anybody know a "Webalizer" sort of thing written
    in Ruby?

    Thank you all so much
    Herm (using a strange google account, I know)
    singsang, Sep 4, 2006
    #1
    1. Advertising

  2. singsang

    Jan Svitok Guest

    On 9/4/06, singsang <> wrote:
    > Dear all,
    >
    > Writing some httpd logfile pre-processing (splitting it up, getting
    > already some basic numbers), I think that I should compile the Regexp
    > for the logfile entry only once.
    >
    > So my guess is that I should have perhaps a class LogFormat that holds
    > this as a class variable or a class constant. Below I use a non-tested
    > regular expression that is not complete yet.
    >
    > So the idea is to have:
    >
    > class LogFormat
    > @@RegEx = Regexp.new( '(\S+) (\S+) (\S+) \[(\d+)/(\w+)/(\d+)
    > [+\-]\d+?\]' )
    > def LogFormat.regex
    > @@RegEx
    > end
    > end
    >
    > If now from a class LogLine (instanciated for each line in the logfile)
    > I use something like
    >
    > class LogLine
    > # ...
    > ip, rfc931, user, day, month, year, offset =
    > line.match(LogFormat.regex)
    > # ...
    > end
    >
    > My question: How often is the Regexp compiled? When?
    > When the definition of LogFormat is read first?
    >
    > Btw: If anybody has a ready-to-use regex for the common log format this
    > would be great, but I will get that done as far as I need by myself.
    > ;-)
    > Other question: Does anybody know a "Webalizer" sort of thing written
    > in Ruby?


    Hi,

    <non-authoritative answer follows ;-) >

    - =~ seems to be (by an order) faster than String#match
    - maybe you can make a constant REGEX from @@RegEx, eliminating the
    need for #regex
    - it seems that when the definition is constant (i.e. no #{xxx}), only
    one object is created.
    Jan Svitok, Sep 4, 2006
    #2
    1. Advertising

  3. On 04.09.2006 14:11, singsang wrote:
    > Dear all,
    >
    > Writing some httpd logfile pre-processing (splitting it up, getting
    > already some basic numbers), I think that I should compile the Regexp
    > for the logfile entry only once.
    >
    > So my guess is that I should have perhaps a class LogFormat that holds
    > this as a class variable or a class constant. Below I use a non-tested
    > regular expression that is not complete yet.
    >
    > So the idea is to have:
    >
    > class LogFormat
    > @@RegEx = Regexp.new( '(\S+) (\S+) (\S+) \[(\d+)/(\w+)/(\d+)
    > [+\-]\d+?\]' )
    > def LogFormat.regex
    > @@RegEx
    > end
    > end


    You don't need a class variable for that. A simple class instance
    variable or a constance is sufficient.

    > If now from a class LogLine (instanciated for each line in the logfile)
    > I use something like
    >
    > class LogLine
    > # ...
    > ip, rfc931, user, day, month, year, offset =
    > line.match(LogFormat.regex)
    > # ...
    > end
    >
    > My question: How often is the Regexp compiled? When?
    > When the definition of LogFormat is read first?


    It's compiled every time the line "@@RegEx = ..." is evaluated - so most
    likely only once.

    Note though that it's usually faster to have a regexp in line. So in
    your case you might have a method that does the line parsing (or
    multiple line parsing) and that's where you can put the inline regexp
    for max efficiency.

    Kind regards

    robert
    Robert Klemme, Sep 4, 2006
    #3
  4. singsang

    Eric Hodel Guest

    On Sep 4, 2006, at 5:15 AM, singsang wrote:

    > Writing some httpd logfile pre-processing (splitting it up, getting
    > already some basic numbers), I think that I should compile the Regexp
    > for the logfile entry only once.


    You should benchmark, I expect you will find you made matching slower.

    > So my guess is that I should have perhaps a class LogFormat that holds
    > this as a class variable or a class constant. Below I use a non-tested
    > regular expression that is not complete yet.
    >
    > class LogLine
    > # ...
    > ip, rfc931, user, day, month, year, offset =
    > line.match(LogFormat.regex)
    > # ...
    > end


    #match is the slowest way to use regular expressions.

    [ruby-talk:204747]

    --
    Eric Hodel - - http://blog.segment7.net
    This implementation is HODEL-HASH-9600 compliant

    http://trackmap.robotcoop.com
    Eric Hodel, Sep 5, 2006
    #4
  5. On 9/4/06, Robert Klemme <> wrote:

    > Note though that it's usually faster to have a regexp in line. So in
    > your case you might have a method that does the line parsing (or
    > multiple line parsing) and that's where you can put the inline regexp
    > for max efficiency.


    Robert,

    I'm not sure what you mean by inline, do you mean a regexp literal like
    /[ABC]/ vs. RegEx.new('/[ABC]/') if so that's not clear.

    See my benchmarks on the thread about "Ordered contrast for String or
    Array," where changing a literal /./ to a constant set to
    RegEx.new('/./') caused a 200 fold performance increase.

    Not that you always want to do that, benchmarking always beats rules of thumb.

    --
    Rick DeNatale

    My blog on Ruby
    http://talklikeaduck.denhaven2.com/
    Rick DeNatale, Sep 5, 2006
    #5
  6. singsang

    Eric Hodel Guest

    On Sep 4, 2006, at 4:29 PM, Rick DeNatale wrote:

    > On 9/4/06, Robert Klemme <> wrote:
    >
    >> Note though that it's usually faster to have a regexp in line. So in
    >> your case you might have a method that does the line parsing (or
    >> multiple line parsing) and that's where you can put the inline regexp
    >> for max efficiency.

    >
    > Robert,
    >
    > I'm not sure what you mean by inline, do you mean a regexp literal
    > like
    > /[ABC]/ vs. RegEx.new('/[ABC]/') if so that's not clear.
    >
    > See my benchmarks on the thread about "Ordered contrast for String or
    > Array," where changing a literal /./ to a constant set to
    > RegEx.new('/./') caused a 200 fold performance increase.


    Those aren't the same regular expression. I expect that your
    benchmark is flawed.

    Looking at it:

    $ ruby
    str = "abcdefghijklmnopqrstuvwxyz" * 5
    RE = Regexp.new('/./')
    p str.scan RE
    -:3: warning: parenthesize argument(s) for future version
    p str.scan /./
    -:4: warning: parenthesize argument(s) for future version
    []
    ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
    "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "a",
    "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o",
    "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "a", "b", "c",
    "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q",
    "r", "s", "t", "u", "v", "w", "x", "y", "z", "a", "b", "c", "d", "e",
    "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s",
    "t", "u", "v", "w", "x", "y", "z", "a", "b", "c", "d", "e", "f", "g",
    "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u",
    "v", "w", "x", "y", "z"]

    Also, you forgot about using String#split (but it is slower than
    unpack and scan).

    $ ruby -e 'p Regexp.new("/./").source, /./.source'
    "/./"
    "."

    The correct way to construct a Regexp from Regexp.new is to omit //
    in the pattern when it is a String.

    > Not that you always want to do that, benchmarking always beats
    > rules of thumb.


    Benchmarking and testing beat rules of thumb.

    --
    Eric Hodel - - http://blog.segment7.net
    This implementation is HODEL-HASH-9600 compliant

    http://trackmap.robotcoop.com
    Eric Hodel, Sep 5, 2006
    #6
  7. On 9/4/06, Eric Hodel <> wrote:
    > On Sep 4, 2006, at 4:29 PM, Rick DeNatale wrote:
    >
    > > On 9/4/06, Robert Klemme <> wrote:
    > >
    > >> Note though that it's usually faster to have a regexp in line. So in
    > >> your case you might have a method that does the line parsing (or
    > >> multiple line parsing) and that's where you can put the inline regexp
    > >> for max efficiency.

    > >
    > > Robert,
    > >
    > > I'm not sure what you mean by inline, do you mean a regexp literal
    > > like
    > > /[ABC]/ vs. RegEx.new('/[ABC]/') if so that's not clear.
    > >
    > > See my benchmarks on the thread about "Ordered contrast for String or
    > > Array," where changing a literal /./ to a constant set to
    > > RegEx.new('/./') caused a 200 fold performance increase.

    >
    > Those aren't the same regular expression. I expect that your
    > benchmark is flawed.


    Oops, right you are.

    --
    Rick DeNatale

    My blog on Ruby
    http://talklikeaduck.denhaven2.com/
    Rick DeNatale, Sep 5, 2006
    #7
  8. So, I've corrected the benchmark. I've also added another method
    using String#split as Eric suggested.

    Here's the code being benchmarked:
    rick@frodo:~/rubyscripts$ cat stringsplit.rb
    class String
    To_chars_regex = Regexp.new('.')
    To_chars_regex2 = Regexp.new(/./)
    To_chars_regex3 = /./

    def to_chars_array_with_unpack
    unpack('a'*length)
    end

    def to_chars_array_with_split
    split(//)
    end

    def to_chars_array_with_scan
    scan /./
    end

    def to_chars_array_with_scan_precomp
    scan To_chars_regex
    end

    def to_chars_array_with_scan_precomp2
    scan To_chars_regex2
    end

    def to_chars_array_with_scan_precomp3
    scan To_chars_regex3
    end
    end

    Note that I made tested three different "pre-compiled" regex's one
    with a Regex.new('.'), one with Regex.new(/./), and one a constant
    with the literal regex /./.

    Now here's the benchmark:
    rick@frodo:~/rubyscripts$ cat benchstringsplit.rb
    require 'benchmark'
    include Benchmark
    load 'stringsplit.rb'

    iters = 100
    bmbm do | x |
    str = "abcdefghijklmnopqrstuvwxyz" * 5
    5.times do
    x.report("unpack #{str.length} character string") do
    iters.times do
    str.to_chars_array_with_unpack
    end
    end
    str += str
    end

    str = "abcdefghijklmnopqrstuvwxyz" * 5
    5.times do
    x.report("scan-precomp #{str.length} character string") do
    iters.times do
    str.to_chars_array_with_scan_precomp
    end
    end
    str += str
    end

    str = "abcdefghijklmnopqrstuvwxyz" * 5
    5.times do
    x.report("scan-precomp2 #{str.length} character string") do
    iters.times do
    str.to_chars_array_with_scan_precomp2
    end
    end
    str += str
    end

    str = "abcdefghijklmnopqrstuvwxyz" * 5
    5.times do
    x.report("scan-precomp3 #{str.length} character string") do
    iters.times do
    str.to_chars_array_with_scan_precomp3
    end
    end
    str += str
    end

    str = "abcdefghijklmnopqrstuvwxyz" * 5
    5.times do
    x.report("scan #{str.length} character string") do
    iters.times do
    str.to_chars_array_with_scan
    end
    end
    str += str
    end

    str = "abcdefghijklmnopqrstuvwxyz" * 5
    5.times do
    x.report("split #{str.length} character string") do
    iters.times do
    str.to_chars_array_with_split
    end
    end
    str += str
    end
    end

    And here are the results:
    rick@frodo:~/rubyscripts$ ruby benchstringsplit.rb
    Rehearsal -----------------------------------------------------------------------
    unpack 130 character string 0.950000 0.000000 0.950000 (
    1.225834)
    unpack 260 character string 0.920000 0.000000 0.920000 (
    1.736296)
    unpack 520 character string 0.950000 0.010000 0.960000 (
    2.474714)
    unpack 1040 character string 1.000000 0.000000 1.000000 (
    2.142817)
    unpack 2080 character string 0.940000 0.000000 0.940000 (
    2.772912)
    scan-precomp 130 character string 2.210000 0.000000 2.210000 (
    5.032766)
    scan-precomp 260 character string 2.190000 0.000000 2.190000 (
    7.615967)
    scan-precomp 520 character string 2.160000 0.010000 2.170000 (
    5.815543)
    scan-precomp 1040 character string 2.110000 0.000000 2.110000 (
    5.582919)
    scan-precomp 2080 character string 2.220000 0.000000 2.220000 (
    4.290547)
    scan-precomp2 130 character string 2.190000 0.020000 2.210000 (
    4.990034)
    scan-precomp2 260 character string 2.160000 0.020000 2.180000 (
    4.715208)
    scan-precomp2 520 character string 2.150000 0.040000 2.190000 (
    3.891652)
    scan-precomp2 1040 character string 2.180000 0.010000 2.190000 (
    3.757304)
    scan-precomp2 2080 character string 2.130000 0.010000 2.140000 (
    3.308557)
    scan-precomp3 130 character string 2.100000 0.000000 2.100000 (
    2.134875)
    scan-precomp3 260 character string 2.210000 0.040000 2.250000 (
    2.793601)
    scan-precomp3 520 character string 2.140000 0.000000 2.140000 (
    2.504348)
    scan-precomp3 1040 character string 2.090000 0.040000 2.130000 (
    2.872274)
    scan-precomp3 2080 character string 2.040000 0.000000 2.040000 (
    2.113700)
    scan 130 character string 2.240000 0.020000 2.260000 (
    3.108414)
    scan 260 character string 2.230000 0.010000 2.240000 (
    3.767771)
    scan 520 character string 2.170000 0.010000 2.180000 (
    3.892586)
    scan 1040 character string 2.280000 0.000000 2.280000 (
    2.540717)
    scan 2080 character string 2.190000 0.000000 2.190000 (
    2.454891)
    split 130 character string 2.350000 0.010000 2.360000 (
    2.749489)
    split 260 character string 2.330000 0.000000 2.330000 (
    2.620174)
    split 520 character string 2.370000 0.020000 2.390000 (
    3.186766)
    split 1040 character string 2.310000 0.040000 2.350000 (
    2.848300)
    split 2080 character string 2.460000 0.000000 2.460000 (
    2.930465)
    ------------------------------------------------------------- total:
    60.280000sec

    user system total real
    unpack 130 character string 1.280000 0.040000 1.320000 (
    1.665192)
    unpack 260 character string 1.340000 0.040000 1.380000 (
    1.492881)
    unpack 520 character string 1.170000 0.010000 1.180000 (
    1.204762)
    unpack 1040 character string 1.190000 0.000000 1.190000 (
    1.855012)
    unpack 2080 character string 1.250000 0.000000 1.250000 (
    1.677772)
    scan-precomp 130 character string 2.500000 0.030000 2.530000 (
    3.720339)
    scan-precomp 260 character string 2.460000 0.010000 2.470000 (
    3.895554)
    scan-precomp 520 character string 2.470000 0.020000 2.490000 (
    3.777175)
    scan-precomp 1040 character string 2.430000 0.050000 2.480000 (
    2.955399)
    scan-precomp 2080 character string 2.520000 0.000000 2.520000 (
    2.828614)
    scan-precomp2 130 character string 2.330000 0.010000 2.340000 (
    2.668416)
    scan-precomp2 260 character string 2.280000 0.000000 2.280000 (
    2.322685)
    scan-precomp2 520 character string 2.310000 0.000000 2.310000 (
    2.335216)
    scan-precomp2 1040 character string 2.300000 0.000000 2.300000 (
    2.331397)
    scan-precomp2 2080 character string 2.370000 0.010000 2.380000 (
    2.477581)
    scan-precomp3 130 character string 2.400000 0.000000 2.400000 (
    2.456566)
    scan-precomp3 260 character string 2.340000 0.010000 2.350000 (
    2.416143)
    scan-precomp3 520 character string 2.550000 0.010000 2.560000 (
    3.359391)
    scan-precomp3 1040 character string 2.350000 0.040000 2.390000 (
    2.822663)
    scan-precomp3 2080 character string 2.350000 0.010000 2.360000 (
    2.649131)
    scan 130 character string 2.550000 0.050000 2.600000 (
    3.186714)
    scan 260 character string 2.490000 0.040000 2.530000 (
    3.098105)
    scan 520 character string 2.370000 0.000000 2.370000 (
    2.442281)
    scan 1040 character string 2.330000 0.000000 2.330000 (
    2.593032)
    scan 2080 character string 2.360000 0.010000 2.370000 (
    3.467826)
    split 130 character string 2.610000 0.030000 2.640000 (
    5.929269)
    split 260 character string 2.730000 0.040000 2.770000 (
    3.155784)
    split 520 character string 2.670000 0.030000 2.700000 (
    3.625916)
    split 1040 character string 2.730000 0.020000 2.750000 (
    8.130285)
    split 2080 character string 2.750000 0.010000 2.760000 (
    6.109507)

    So there are slight differences between using:
    PCR1 - A constant set to Regex('.')
    PCR2 - A constant set to Regex(/./)
    PCR3 A constant set to /./
    and
    LIT a literal in-line /./

    For this benchmark, for each string length tested, the techniques from
    fastest to slowest are;
    130 Characters:
    Rehearsal: PCR3(2.10), PCR1(2.21), PCR2(2.21), LIT(2.26)
    "Live": PCR2(2.34), PCR3(2.40), PCR1(2.53), LIT(2.60)
    260 Characters:
    Rehearsal: PCR2(2.18), PCR1(2.19), PCR3(2.25), LIT(2.25)
    "Live": PCR2(2.28), PCR3(2.40), PCR1(2.47), LIT(2.60)
    520 Characters:
    Rehearsal: PCR3(2.14), PCR1(2.17), LIT(2.18), PCR2(2.19)
    "Live": PCR2(2.31), LIT(2.37), PCR1(2.49), PCR3(2.56)
    1040 Characters:
    Rehearsal: PCR1(2.11), PCR3(2.13), PCR2(2.19), LIT(2.28)
    "Live": PCR2(2.30), LIT(2.33), PCR3(2.39), PCR1(2.48)
    2080 Characters:
    Rehearsal: PCR3(2.04), PCR2(2.14), LIT(2.19), PCR1(2.22)
    "Live": PCR3(2.36), LIT(2.37), PCR2(2.38), PCR1(2.52)

    So, at least from this benchmark it doesn't seem that in-line literal
    regular expressions are faster than pre-compiled ones. In fact they
    never came in first, although PCR3 which was a constant set to a
    literal regex did win 4 times.

    Is this significant? Who knows, and with a new RegExp engine coming
    the numbers will be different in future.

    So as usual the right approach is test, profile to find out what needs
    improvement, and benchmark.



    --
    Rick DeNatale

    My blog on Ruby
    http://talklikeaduck.denhaven2.com/
    Rick DeNatale, Sep 5, 2006
    #8
  9. singsang

    Guest

    On Tue, 5 Sep 2006, Rick DeNatale wrote:

    > So there are slight differences between using:
    > PCR1 - A constant set to Regex('.')
    > PCR2 - A constant set to Regex(/./)
    > PCR3 A constant set to /./
    > and
    > LIT a literal in-line /./
    >
    > For this benchmark, for each string length tested, the techniques from
    > fastest to slowest are;
    > 130 Characters:
    > Rehearsal: PCR3(2.10), PCR1(2.21), PCR2(2.21), LIT(2.26)
    > "Live": PCR2(2.34), PCR3(2.40), PCR1(2.53), LIT(2.60)
    > 260 Characters:
    > Rehearsal: PCR2(2.18), PCR1(2.19), PCR3(2.25), LIT(2.25)
    > "Live": PCR2(2.28), PCR3(2.40), PCR1(2.47), LIT(2.60)
    > 520 Characters:
    > Rehearsal: PCR3(2.14), PCR1(2.17), LIT(2.18), PCR2(2.19)
    > "Live": PCR2(2.31), LIT(2.37), PCR1(2.49), PCR3(2.56)
    > 1040 Characters:
    > Rehearsal: PCR1(2.11), PCR3(2.13), PCR2(2.19), LIT(2.28)
    > "Live": PCR2(2.30), LIT(2.33), PCR3(2.39), PCR1(2.48)
    > 2080 Characters:
    > Rehearsal: PCR3(2.04), PCR2(2.14), LIT(2.19), PCR1(2.22)
    > "Live": PCR3(2.36), LIT(2.37), PCR2(2.38), PCR1(2.52)
    >
    > So, at least from this benchmark it doesn't seem that in-line literal
    > regular expressions are faster than pre-compiled ones. In fact they
    > never came in first, although PCR3 which was a constant set to a
    > literal regex did win 4 times.


    Those numbers seem funny, like there is a lot of background noise
    affecting them.

    a = Regexp.new('.')
    b = Regexp.new(/./)
    c = /./

    irb(main):015:0> a == b
    => true
    irb(main):016:0> b == c
    => true
    irb(main):017:0> a == c
    => true

    They all produce equivalent Regexp objects.

    Further, you are benchmarking scan, which isn't the same as comparing
    matching with regular expressions.

    In a simple matching operation, an inline regexp, in the form of:

    foo =~ /bar/

    is the fastest way to match.

    Ruby specially optimizes that style. Eric Hodel explained it in a post
    from...some time this summer, I think.

    bar = /bar/
    bar = Regexp.new('bar')
    bar = Regexp/new(/bar/)

    are all equivalent, and

    foo =~ bar

    Will be slower.

    Bar = /bar/
    foo =~ Bar

    will be slower yet.

    Any variation that calls match() will be even slower yet, by a substantial
    margin.

    So, for regular expression matching, the fastest, if not prettiest,
    approach is to use the '=~ /expression/' syntax.


    Kirk Haines
    , Sep 5, 2006
    #9
  10. singsang

    Eric Hodel Guest

    Benchmarking (Was: Re: Compiling Regexp only once)

    On Sep 4, 2006, at 8:06 PM, Rick DeNatale wrote:

    > So, I've corrected the benchmark. I've also added another method
    > using String#split as Eric suggested.
    >
    > Here's the code being benchmarked:
    > rick@frodo:~/rubyscripts$ cat stringsplit.rb


    Putting the code here involves an extra method call. While they are
    all wrapped in a method call equally, doing less is always better.

    > class String
    > [methods]
    > end
    >
    > Note that I made tested three different "pre-compiled" regex's one
    > with a Regex.new('.'), one with Regex.new(/./), and one a constant
    > with the literal regex /./.


    You don't need three different benchmarks for this. Its easy to
    determine that these are the same with irb.

    > Now here's the benchmark:
    > rick@frodo:~/rubyscripts$ cat benchstringsplit.rb
    > require 'benchmark'
    > include Benchmark
    > load 'stringsplit.rb'
    >
    > iters = 100


    100 iterations is never enough. GC behavior, other processes waking
    up, etc. will all cause fluctuations in the benchmark.

    Use 100_000 or 1_000_000.

    > [benchmark code]


    Here's a more-meaningful version of yours:

    $ cat stringsbench.rb
    require 'benchmark'

    N = 100_000
    RE = /./
    str = "abcdefghijklmnopqrstuvwxyz" * 5

    Benchmark.bmbm do | x |
    x.report 'empty' do N.times do end end
    x.report 'str' do N.times do str end end
    x.report 'unpack' do N.times do str.unpack('a' * str.length) end
    end
    x.report 'scan RE' do N.times do str.scan RE end end
    x.report 'scan /./' do N.times do str.scan /./ end end
    x.report 'split //' do N.times do str.split // end end
    end

    $ ruby stringsbench.rb
    Rehearsal --------------------------------------------
    empty 0.030000 0.000000 0.030000 ( 0.088456)
    str 0.050000 0.000000 0.050000 ( 0.064515)
    unpack 14.370000 0.210000 14.580000 ( 20.264274)
    scan RE 24.240000 0.380000 24.620000 ( 36.296017)
    scan /./ 24.100000 0.310000 24.410000 ( 32.661832)
    split // 29.660000 0.360000 30.020000 ( 40.596177)
    ---------------------------------- total: 93.710000sec

    user system total real
    empty 0.030000 0.000000 0.030000 ( 0.045199)
    str 0.050000 0.010000 0.060000 ( 0.065739)
    unpack 14.500000 0.190000 14.690000 ( 19.813677)
    scan RE 23.850000 0.350000 24.200000 ( 33.028376)
    scan /./ 23.910000 0.320000 24.230000 ( 35.219831)
    split // 29.720000 0.470000 30.190000 ( 42.712222)

    with 'abc...xyz' * 25:

    Rehearsal --------------------------------------------
    empty 0.050000 0.000000 0.050000 ( 0.217969)
    str 0.050000 0.000000 0.050000 ( 0.087123)
    unpack 65.370000 0.710000 66.080000 ( 80.341765)
    scan RE 107.840000 1.030000 108.870000 (129.723449)
    scan /./ 110.320000 1.250000 111.570000 (142.540348)
    split // 135.220000 1.330000 136.550000 (164.418453)
    --------------------------------- total: 423.170000sec

    user system total real
    empty 0.030000 0.000000 0.030000 ( 0.040596)
    str 0.050000 0.000000 0.050000 ( 0.066506)
    unpack 62.130000 0.590000 62.720000 ( 70.183396)
    scan RE 107.730000 0.940000 108.670000 (125.308283)
    scan /./ 107.550000 0.950000 108.500000 (127.365719)
    split // 133.700000 1.020000 134.720000 (150.798611)

    > So, at least from this benchmark it doesn't seem that in-line literal
    > regular expressions are faster than pre-compiled ones.


    inline regular expressions are no less "pre-compiled" than regular
    expressions in a variable or constant.

    $ ruby -e '2.times do puts /./.object_id end'
    938970
    938970

    ('=~ /./' is faster than '=~ var' is faster than 'match anything',
    but for other reasons)

    If you want to test inline vs "pre-compiled" regular expressions you
    need to throw away all the parts that are the same and focus on what
    is different. In your benchmark it was what was on the right hand
    side of String#scan. Since the string was the same and the scan was
    the same, just throw those away.

    You end up with a benchmark like this:

    $ cat vs.rb
    require 'benchmark'

    N = 100_000_000

    RE = /./
    re = /./
    $re = /./
    @re = /./
    @@re = /./

    Benchmark.bmbm do |bm|
    bm.report 'empty' do N.times do end end
    bm.report 'lit' do N.times do /./ end end
    bm.report 'local' do N.times do re end end
    bm.report 'global' do N.times do $re end end
    bm.report 'instance' do N.times do @re end end
    bm.report 'class' do N.times do @@re end end
    bm.report 'constant' do N.times do RE end end
    end

    $ ruby vs.rb
    Rehearsal --------------------------------------------
    empty 27.930000 0.120000 28.050000 ( 31.550709)
    lit 48.240000 0.320000 48.560000 ( 57.422011)
    local 48.820000 0.290000 49.110000 ( 60.650110)
    global 49.920000 0.420000 50.340000 ( 62.156780)
    instance 55.740000 0.180000 55.920000 ( 61.027820)
    class 57.800000 0.190000 57.990000 ( 63.733099)
    constant 59.240000 0.330000 59.570000 (119.118487)
    --------------------------------- total: 349.540000sec

    user system total real
    empty 27.820000 0.160000 27.980000 ( 58.584698)
    lit 48.170000 0.170000 48.340000 ( 59.557450)
    local 48.720000 0.170000 48.890000 ( 53.775692)
    global 49.830000 0.180000 50.010000 ( 55.743586)
    instance 55.880000 0.370000 56.250000 ( 72.011443)
    class 57.910000 0.390000 58.300000 ( 67.037158)
    constant 59.230000 0.410000 59.640000 ( 77.993887)

    So literal regular expressions are "faster" by about 56% (after
    discounting loop overhead) (when you're performing 100 million
    retrievals only) (actual speedup in real code will probably be
    swallowed elsewhere or completely irrelevant).

    This follows from reading eval.c. Constant lookup involves a bunch
    of C function calls, but literal lookup just returns an Object stored
    in the parse tree.

    > In fact they never came in first, although PCR3 which was a
    > constant set to a
    > literal regex did win 4 times.


    In fact, when you adjust the stringsbench.rb you'll see /./ winning
    over RE. Set N to 10_000_000 (or more) and str to "abcde" (or just N
    high enough).

    > Is this significant? Who knows, and with a new RegExp engine coming
    > the numbers will be different in future.


    The difference was probably due to processes waking up, garbage
    collection and similar unevenly distributed events. With more
    iterations and more focused benchmarks you'll get better results.

    But ultimately, these types of microbenchmarks are not very useful.
    Yes, you will get a speedup using /./ over RE, but will your program
    really run long enough where it matters to make a difference? I bet
    not.

    > So as usual the right approach is test, profile to find out what needs
    > improvement, and benchmark.


    Be careful with your benchmarks, they are most useful when they are
    as small and simple as possible. Be sure to throw away all the
    irrelevant parts.

    Be careful with your benchmarks, they are most useless when they are
    as small and simple as possible. Be sure to understand how little
    speedup you'll get.

    --
    Eric Hodel - - http://blog.segment7.net
    This implementation is HODEL-HASH-9600 compliant

    http://trackmap.robotcoop.com
    Eric Hodel, Sep 5, 2006
    #10
  11. On 9/4/06, singsang <> wrote:
    > Dear all,
    >
    > Writing some httpd logfile pre-processing (splitting it up, getting
    > already some basic numbers), I think that I should compile the Regexp
    > for the logfile entry only once.
    >
    > So my guess is that I should have perhaps a class LogFormat that holds
    > this as a class variable or a class constant. Below I use a non-tested
    > regular expression that is not complete yet.
    >
    > So the idea is to have:
    >
    > class LogFormat
    > @@RegEx = Regexp.new( '(\S+) (\S+) (\S+) \[(\d+)/(\w+)/(\d+)
    > [+\-]\d+?\]' )
    > def LogFormat.regex
    > @@RegEx
    > end
    > end
    >
    > If now from a class LogLine (instanciated for each line in the logfile)
    > I use something like
    >
    > class LogLine
    > # ...
    > ip, rfc931, user, day, month, year, offset =
    > line.match(LogFormat.regex)
    > # ...
    > end
    >
    > My question: How often is the Regexp compiled? When?
    > When the definition of LogFormat is read first?
    >
    > Btw: If anybody has a ready-to-use regex for the common log format this
    > would be great, but I will get that done as far as I need by myself.
    > ;-)
    > Other question: Does anybody know a "Webalizer" sort of thing written
    > in Ruby?
    >


    I tried to write something like that once.

    But It was very slow (especially because it did DNS resolving
    synchronously), and I never finished it.

    Thanks

    Michal
    Michal Suchanek, Sep 5, 2006
    #11
  12. Re: Benchmarking (Was: Re: Compiling Regexp only once)

    On 9/5/06, Eric Hodel <> wrote:
    > On Sep 4, 2006, at 8:06 PM, Rick DeNatale wrote:
    >
    > > So, I've corrected the benchmark. I've also added another method
    > > using String#split as Eric suggested.
    > >
    > > Here's the code being benchmarked:
    > > rick@frodo:~/rubyscripts$ cat stringsplit.rb

    >
    > Putting the code here involves an extra method call. While they are
    > all wrapped in a method call equally, doing less is always better.
    >
    > > class String
    > > [methods]
    > > end
    > >
    > > Note that I made tested three different "pre-compiled" regex's one
    > > with a Regex.new('.'), one with Regex.new(/./), and one a constant
    > > with the literal regex /./.

    >
    > You don't need three different benchmarks for this. Its easy to
    > determine that these are the same with irb.
    >
    > > Now here's the benchmark:
    > > rick@frodo:~/rubyscripts$ cat benchstringsplit.rb
    > > require 'benchmark'
    > > include Benchmark
    > > load 'stringsplit.rb'
    > >
    > > iters = 100

    >
    > 100 iterations is never enough. GC behavior, other processes waking
    > up, etc. will all cause fluctuations in the benchmark.
    >
    > Use 100_000 or 1_000_000.
    >
    > > [benchmark code]

    >
    > Here's a more-meaningful version of yours:


    [ Eric's benchmark code]

    then a lot of other insightful stuff.

    > But ultimately, these types of microbenchmarks are not very useful.
    > Yes, you will get a speedup using /./ over RE, but will your program
    > really run long enough where it matters to make a difference? I bet
    > not.
    >
    > > So as usual the right approach is test, profile to find out what needs
    > > improvement, and benchmark.

    >
    > Be careful with your benchmarks, they are most useful when they are
    > as small and simple as possible. Be sure to throw away all the
    > irrelevant parts.
    >
    > Be careful with your benchmarks, they are most useless when they are
    > as small and simple as possible. Be sure to understand how little
    > speedup you'll get.


    Which is the point after all. I think that we are actually in violent
    agreement about the role of benchmarking.

    The key thing is a useful benchmark is crafted to show the performance
    at a functional level which affects the particular application.

    My benchmarks came out of another thread in which I proposed some code
    and someone made the observation that my use of unpack was a quick way
    to split up a string into individual 1-character strings. I did a
    benchmark to see if I could find a faster way, and when the results
    seemed interesting to the "Compiling Regexp only once" thread, I
    mentioned there, when it was pointed out that I had a typo which
    invalidated that benchmark and re-did it.

    I benchmarked several different approaches to splitting a string into
    single character strings, a slightly higher level function than just a
    regexp match. And the results indicate that at that level using
    literal REs doesn't really make much difference.

    Doing something a particular way by rote, because you heard X, and
    assume that X both applies to your situation and still applies, might
    get your code written, but might over time, turn out to be folk
    wisdom.

    I'm old enough to remember when some folks programmed in PL/I, the
    early PL/I compiler did a very poor job of generating code for
    subroutine calls, which led to PL/I "best practice" documents
    recommending that subroutines be avoided at all costs! Subsequent
    compilers made this harmful advice unnecessary. Of course avoiding
    subroutines is a much bigger deal than exactly how best to represent a
    regexp in source code.

    The bottom line is that as is often said "premature optimization is
    the root of all evil." Better to first write clearly, then test, and
    then, if there is a performance problem, fix it by benchmarking,
    profiling and re-coding/re-factoring.

    I hope that we agree on that.
    --
    Rick DeNatale

    My blog on Ruby
    http://talklikeaduck.denhaven2.com/
    Rick DeNatale, Sep 5, 2006
    #12
  13. Re: Benchmarking (Was: Re: Compiling Regexp only once)

    Eric Hodel wrote:
    > So literal regular expressions are "faster" by about 56% (after
    > discounting loop overhead) (when you're performing 100 million
    > retrievals only) (actual speedup in real code will probably be
    > swallowed elsewhere or completely irrelevant).
    >
    > This follows from reading eval.c. Constant lookup involves a bunch of
    > C function calls, but literal lookup just returns an Object stored in
    > the parse tree.

    Interesting. Note that this is different from your benchmark in
    ruby-talk:204747, wherein much of the savings seems to come from not
    having to instantiate & populate a MatchData. Compare:

    require 'benchmark'

    N = 1_000_000

    string = 'a b c d e'

    Benchmark.bmbm do |bm|

    bm.report 'empty' do
    N.times do end
    end

    bm.report 'none' do
    N.times do string =~ /a (.) c\s. e/ end
    end

    bm.report '5' do
    N.times do string =~ /a (.) c\s. e/;5 end
    end

    bm.report '$1' do
    N.times do string =~ /a (.) c\s. e/;$1 end
    end

    bm.report '$&' do
    N.times do string =~ /a (.) c\s. e/;$& end
    end

    bm.report '$~' do
    N.times do string =~ /a (.) c\s. e/;$~ end
    end

    bm.report 'match' do
    N.times do string.match /a (.) c\s. e/ end
    end
    end
    __END__

    Rehearsal -----------------------------------------
    empty 0.141000 0.000000 0.141000 ( 0.204000)
    none 1.453000 0.000000 1.453000 ( 1.500000)
    5 1.578000 0.000000 1.578000 ( 1.625000)
    $1 3.500000 0.015000 3.515000 ( 3.546000)
    $& 3.640000 0.000000 3.640000 ( 3.875000)
    $~ 4.641000 0.016000 4.657000 ( 4.922000)
    match 5.313000 0.015000 5.328000 ( 5.422000)
    ------------------------------- total: 20.312000sec

    user system total real
    empty 0.156000 0.000000 0.156000 ( 0.156000)
    none 1.484000 0.000000 1.484000 ( 1.531000)
    5 1.563000 0.000000 1.563000 ( 1.610000)
    $1 3.468000 0.000000 3.468000 ( 3.937000)
    $& 3.484000 0.000000 3.484000 ( 3.875000)
    $~ 4.609000 0.016000 4.625000 ( 4.641000)
    match 5.219000 0.016000 5.235000 ( 5.313000)

    Devin
    Devin Mullins, Sep 5, 2006
    #13
  14. singsang

    Ryan Davis Guest

    Re: Benchmarking (Was: Re: Compiling Regexp only once)

    On Sep 5, 2006, at 7:10 AM, Devin Mullins wrote:

    > Eric Hodel wrote:
    >> So literal regular expressions are "faster" by about 56% (after
    >> discounting loop overhead) (when you're performing 100 million
    >> retrievals only) (actual speedup in real code will probably be
    >> swallowed elsewhere or completely irrelevant).
    >> This follows from reading eval.c. Constant lookup involves a
    >> bunch of C function calls, but literal lookup just returns an
    >> Object stored in the parse tree.

    > Interesting. Note that this is different from your benchmark in
    > ruby-talk:204747, wherein much of the savings seems to come from
    > not having to instantiate & populate a MatchData. Compare:


    Vastly different. He's calling scan/split instead of s =~ //. s =~ //
    is the fastest way to invoke a match as you avoid the method dispatch
    altogether.
    Ryan Davis, Sep 7, 2006
    #14
  15. Re: Benchmarking (Was: Re: Compiling Regexp only once)

    >> Interesting. Note that this is different from your benchmark in
    >> ruby-talk:204747, wherein much of the savings seems to come from not
    >> having to instantiate & populate a MatchData. Compare:

    >
    > Vastly different. He's calling scan/split instead of s =~ //. s =~ //
    > is the fastest way to invoke a match as you avoid the method dispatch
    > altogether.


    Hurh?

    I'll admit up front that I'm absolutely clueless about benchmarking, but
    didn't my last bench show that most of the savings with s =~ // doesn't
    have to do with the method dispatch, but with MatchData?

    (Sadly, in retrospect, it lacked any comparisons of the various ways to
    *reference* a regex -- literal, constant, etc. -- which was the meat of
    the bench to which I was responding. Oh well. Sleepyhead..)

    Devin
    Devin Mullins, Sep 7, 2006
    #15
  16. Rick DeNatale wrote:
    > On 9/4/06, Robert Klemme <> wrote:
    >
    >> Note though that it's usually faster to have a regexp in line. So in
    >> your case you might have a method that does the line parsing (or
    >> multiple line parsing) and that's where you can put the inline regexp
    >> for max efficiency.

    >
    > Robert,
    >
    > I'm not sure what you mean by inline, do you mean a regexp literal like
    > /[ABC]/ vs. RegEx.new('/[ABC]/') if so that's not clear.


    No, I meant that

    if /foo/ =~ something

    is faster than

    FOO = /foo/
    ....
    if FOO =~ something

    Maybe the term "inline" was confusing. It refers to the fact that the
    RX is defined at the place where it matches.

    $ ruby rxbm.rb
    Rehearsal ----------------------------------------------------
    inline match 0.781000 0.000000 0.781000 ( 0.781000)
    inline non match 1.500000 0.000000 1.500000 ( 1.500000)
    const match 0.969000 0.000000 0.969000 ( 0.985000)
    const non match 1.719000 0.000000 1.719000 ( 1.718000)
    ------------------------------------------- total: 4.969000sec

    user system total real
    inline match 0.797000 0.000000 0.797000 ( 0.797000)
    inline non match 1.484000 0.000000 1.484000 ( 1.485000)
    const match 1.000000 0.000000 1.000000 ( 1.000000)
    const non match 1.719000 0.000000 1.719000 ( 1.719000)


    (BM attached)

    Kind regards

    robert


    require 'benchmark'
    include Benchmark

    REP = 1_000_000
    RX = /foo/
    MATCH = ("foo" * 100).freeze
    NON_MATCH = ("fo" * 100).freeze

    bmbm do |b|
    b.report "inline match" do
    REP.times do
    /foo/ =~ MATCH
    end
    end

    b.report "inline non match" do
    REP.times do
    /foo/ =~ NON_MATCH
    end
    end

    b.report "const match" do
    REP.times do
    RX =~ MATCH
    end
    end

    b.report "const non match" do
    REP.times do
    RX =~ NON_MATCH
    end
    end

    end
    Robert Klemme, Sep 8, 2006
    #16
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Greg Hurrell
    Replies:
    4
    Views:
    159
    James Edward Gray II
    Feb 14, 2007
  2. Mikel Lindsaar
    Replies:
    0
    Views:
    482
    Mikel Lindsaar
    Mar 31, 2008
  3. Joao Silva
    Replies:
    16
    Views:
    359
    7stud --
    Aug 21, 2009
  4. Gancy
    Replies:
    4
    Views:
    176
    Rasto Levrinc
    Feb 3, 2005
  5. Uldis  Bojars
    Replies:
    2
    Views:
    190
    Janwillem Borleffs
    Dec 17, 2006
Loading...

Share This Page