Slow Regex Code

Discussion in 'C++' started by brad, Jun 8, 2008.

  1. brad

    brad Guest

    Still learning C++. I'm writing some regex using boost. It works great.
    Only thing is... this code seems slow to me compared to equivalent Perl
    and Python. I'm sure I'm doing something incorrect. Any tips?

    #include <boost/regex.hpp>
    #include <iostream>
    #include <string>

    // g++ numbers.cpp -o numbers -I/usr/local/include/boost-1_35
    //     /usr/local/lib/libboost_regex-gcc41-mt-s.a
    // g++ numbers.cpp -o numbers.exe
    //     -Ic://Boost/include/boost-1_35://Boost/lib/libboost_regex-mgw34-mt-s.lib

    void number_search(const std::string& portion)
    {
        // Compiled once, on the first call.
        static const boost::regex Numbers("\\b\\d{9}\\b");
        static const boost::regex& rNumbers = Numbers;
        boost::smatch matches;

        std::string::const_iterator Start = portion.begin();
        std::string::const_iterator End = portion.end();

        // Print every match, resuming the search after the previous one.
        while (boost::regex_search(Start, End, matches, rNumbers))
        {
            std::cout << matches.str() << std::endl;
            Start = matches[0].second;
        }
    }

    int main()
    {
        std::string portion;
        while (std::getline(std::cin, portion))
        {
            number_search(portion);
        }
        return 0;
    }
     
    brad, Jun 8, 2008
    #1

  2. brad

    James Kanze Guest

    On Jun 8, 6:32 pm, brad <> wrote:
    > Still learning C++. I'm writing some regex using boost. It
    > works great. Only thing is... this code seems slow to me
    > compared to equivelent Perl and Python.


    Does it only seem slow, or is it measurably slower? There are two
    possibilities:

    1. it only seems slower, because the rest of the code is
    significantly faster, or

    2. it really is slower, because perl and python can compile it
    into some sort of efficient byte code, since they already
    have an "execution" machine for such byte code loaded.

    Note that pure (non-extended) regular expressions can be made to
    run considerably faster, since they can be converted to a pure
    DFA. My own regular expression class does this. For most
    purposes, however, boost::regex will be fast enough, and worth
    the added flexibility. (My own regular expression class was
    designed for a very specific use, where it doesn't need the
    extensions but does need some additional features which
    aren't in Boost. For most general use, boost::regex is
    preferable.)
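
    To illustrate the DFA point, here is a minimal hand-written sketch. It is
    not James's class and not anything from Boost; the function name is
    made up for illustration. For plain ASCII input it counts the same
    things as brad's pattern \b\d{9}\b, in one pass and without backtracking,
    by tracking a few pieces of state per word:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts words that consist of exactly nine digits,
// which is what \b\d{9}\b matches on ASCII text. One pass, no backtracking.
std::size_t count_nine_digit_words(const std::string& text)
{
    std::size_t count = 0;
    std::size_t digits = 0;    // digits seen in the current word
    bool in_word = false;      // currently inside a run of word characters
    bool all_digits = false;   // current word has contained only digits so far

    // Iterate one position past the end so the last word is closed too.
    for (std::size_t i = 0; i <= text.size(); ++i) {
        const bool at_end = (i == text.size());
        const unsigned char c =
            at_end ? ' ' : static_cast<unsigned char>(text[i]);
        const bool word_char = !at_end && (std::isalnum(c) || c == '_');

        if (word_char) {
            if (!in_word) {            // a word boundary: start a new word
                in_word = true;
                all_digits = true;
                digits = 0;
            }
            if (std::isdigit(c))
                ++digits;
            else
                all_digits = false;
        } else {
            if (in_word && all_digits && digits == 9)
                ++count;               // word just ended and it qualifies
            in_word = false;
        }
    }
    return count;
}
```

    For example, count_nine_digit_words("id 123456789, x123456789") gives 1,
    since the second number is glued to a letter and has no word boundary
    in front of it.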

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, Jun 9, 2008
    #2

  3. brad wrote:
    > // g++ numbers.cpp -o numbers -I/usr/local/include/boost-1_35
    > /usr/local/lib/libboost_regex-gcc41-mt-s.a
    > // g++ numbers.cpp -o numbers.exe
    > -Ic://Boost/include/boost-1_35://Boost/lib/libboost_regex-mgw34-mt-s.lib


    For starters, you could try adding some optimization flags, such as
    -O3 and -march=<your architecture> (e.g. -march=pentium4).

    (No, I don't know if that will make the regexp matching faster, but it
    doesn't hurt to try.)
     
    Juha Nieminen, Jun 9, 2008
    #3
  4. On Sun, 08 Jun 2008 12:32:30 -0400, brad <> wrote:
    >I'm writing some regex using boost. It works great.
    >Only thing is... this code seems slow to me compared to equivelent Perl
    >and Python. I'm sure I'm doing something incorrect. Any tips?


    Try PCRE.



    --
    Roland Pibinger
    "The best software is simple, elegant, and full of drama" - Grady Booch
     
    Roland Pibinger, Jun 9, 2008
    #4
  5. brad

    Mirco Wahab Guest

    brad wrote:
    > Still learning C++. I'm writing some regex using boost. It works great.
    > Only thing is... this code seems slow to me compared to equivelent Perl
    > and Python. I'm sure I'm doing something incorrect. Any tips?


    It's not necessarily slower, though it most probably is. This caught my
    attention, so I did some tests. Your code mainly spends its time on the
    initialization done inside the function, which has nothing to
    do with boost::regex.

    I modified your code to do the following:

    - slurp (read into a buffer) a text file of about 112 MB (actually,
      it's the Nietzsche full text, copied 8 times ;-)
    - find all "free" numbers >= 10 (those that have 2 or more digits and
      word boundaries on the left and right sides)
    - show the total count of these numbers
    - do the same in Perl.

    The results (multicore results are "single-threaded"):

    [Windows XP-32, Athlon-64/3200+,@2290MHz]
    - Visual Studio 2008 + Boost 1.35.0          9.3 sec
    - Perl 5.10 (Active-)                       10.4 sec

    [Linux 2.6.23, Pentium4,@2660MHz]
    - gcc 4.3, -O2, Boost 1.33.1                13.2 sec
    - Perl 5.8.8                                 8.2 sec

    [Linux 2.6.23, Core2/Q6600,@3240MHz]
    - gcc 4.3, -O2, Boost 1.33.1                 6.3 sec
    - Perl 5.8.8 (i586, use64bitint=undef)       3.2 sec

    [Linux 2.6.24, Core2/Q9300,@3338MHz]
    - gcc 4.3, -O2, Boost 1.34.1                 'std::runtime_error' (??)
    - Perl 5.10 (i586, use64bitint=undef)       10.4 sec

    The latter system is not installed completely
    (it's a test w/SuSE 11 Release Candidate),
    so the results may get better soon there ;-)


    Code, C++:
    ==>
    #include <boost/regex.hpp>
    #include <cstddef>
    #include <fstream>
    #include <iostream>

    // Count non-overlapping matches of the pattern in [block, block+len).
    int number_count(const char *block, std::size_t len)
    {
        boost::match_flag_type flags = boost::match_default;
        boost::regex reg("\\b\\d{2,}\\b");
        boost::cmatch m;

        const char *from = block, *to = block + len;
        int n = 0;
        while (boost::regex_search(from, to, m, reg, flags)) {
            from = m[0].second;                  // resume after this match
            ++n;
        }
        return n;
    }

    int main()
    {
        // this is a 112 MB file, it's 8 x the Nietzsche fulltext in plain ASCII;
        // binary mode keeps the byte count equal to the file size
        std::ifstream in("nietzsche8.txt", std::ios::binary);
        if (in) {
            in.seekg(0, std::ios::end);          // get to EOF
            std::size_t len = static_cast<std::size_t>(in.tellg()); // read file pointer
            in.seekg(0, std::ios::beg);          // back to pos 0

            char *block = new char[len + 1];     // don't be stingy
            in.read(block, static_cast<std::streamsize>(len));      // slurp the file
            int n = number_count(block, len);    // process data
            std::cout << "The text (" << len / 1024 << "KB) has "
                      << n << " numbers >= 10!" << std::endl;
            delete [] block;                     // play fair
        }
        return 0;
    }
    <==
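
    As a side note on the "slurp" step: the manual new[]/delete[] buffer
    above can be replaced by a std::string, which removes the need to track
    the length separately. This is only a sketch with standard streams; the
    helper name slurp is made up here, and it has not been timed against
    the version in the post.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read a whole file into a std::string.
// Binary mode avoids newline translation, so the size matches the file.
std::string slurp(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    std::ostringstream buffer;
    buffer << in.rdbuf();      // copy the entire stream in one go
    return buffer.str();
}
```

    The result can be handed to the search loop as text.data() and
    text.size(), or searched directly through its iterators.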

    Code, Perl:

    ==>
    open my $fh, '<', 'nietzsche8.txt' or die "what? $!";
    my $block;
    do { local $/; $block = <$fh> };
    close $fh;

    my $n;
    ++$n while $block =~ /\b\d{2,}\b/g; # process data
    print "The text (" . int(length($block)/1024) ."KB) has $n numbers >= 10!\n";
    <==

    Regards

    Mirco
     
    Mirco Wahab, Jun 9, 2008
    #5
  6. brad

    peter koch Guest

    On 8 Jun., 18:32, brad <> wrote:
    > Still learning C++. I'm writing some regex using boost. It works great.
    > Only thing is... this code seems slow to me compared to equivelent Perl
    > and Python. I'm sure I'm doing something incorrect. Any tips?
    >
    > #include <boost/regex.hpp>
    > #include <iostream>
    >
    > // g++ numbers.cpp -o numbers -I/usr/local/include/boost-1_35
    > /usr/local/lib/libboost_regex-gcc41-mt-s.a
    > // g++ numbers.cpp -o numbers.exe
    > -Ic://Boost/include/boost-1_35://Boost/lib/libboost_regex-mgw34-mt-s.lib
    >
    > void number_search(const std::string& portion)
    >    {
    >
    >      static const boost::regex Numbers("\\b\\d{9}\\b");
    >      static const boost::regex& rNumbers = Numbers;
    >      boost::smatch matches;
    >
    >      std::string::const_iterator Start = portion.begin();
    >      std::string::const_iterator End = portion.end();
    >
    >      while (boost::regex_search(Start, End, matches, rNumbers))
    >        {
    >        std::cout << matches.str() << std::endl;
    >        Start = matches[0].second;
    >        }
    >    }
    >
    > int main ()
    >    {
    >    std::string portion;
    >    while (std::getline(std::cin, portion))
    >        {
    >        number_search(portion);
    >        }
    >    return 0;
    >    }


    As others have pointed out, there are probably two factors here:

    - you might not be optimising your code. This can easily cost a
    factor of 5-10.
    - you might be measuring other parts of the library. I/O is the
    obvious candidate, and if you are using Microsoft's newer C++ compilers
    you might also be caught by the "secure STL" checking code, which is
    only disabled when you add a special define to your build.
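
    One way to check the second point is to time the matching loop on text
    that is already in memory, so that no I/O is included. The sketch below
    is an assumption-laden stand-in: it uses std::regex and std::chrono
    (both C++11, so newer than the compilers in this thread) instead of
    boost::regex, and the names SearchTiming and time_regex_only are made up
    for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type: match count plus wall-clock time of the search.
struct SearchTiming {
    std::size_t matches;
    double milliseconds;
};

// Times only the regex iteration over an in-memory string (no I/O).
SearchTiming time_regex_only(const std::string& text, const std::regex& re)
{
    typedef std::chrono::steady_clock steady;
    const steady::time_point start = steady::now();

    std::size_t n = 0;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
        ++n;

    const std::chrono::duration<double, std::milli> elapsed =
        steady::now() - start;
    return SearchTiming{n, elapsed.count()};
}
```

    Comparing this number with the total run time of the program gives a
    rough idea of how much is spent in getline and std::cout rather than in
    the regex engine.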

    I would not expect this kind of code to be fast compared to e.g. Perl.
    Perl is sort of built with regex in mind, and that part probably is
    heavily optimised - maybe even written (partly) in assembly.

    /Peter
     
    peter koch, Jun 9, 2008
    #6
  7. brad

    Mirco Wahab Guest

    Razii wrote:
    > On Mon, 9 Jun 2008 14:36:52 -0700 (PDT), peter koch
    > <> wrote:
    >> Perl is sort of built with regex in mind, and that part probably is
    >> heavily optimised - maybe even written (partly) in assembly.

    >
    > Perl regex apparently is much slower than Tcl.


    This is like saying: a rocket is much faster than an
    airplane. It is true sometimes but means nothing.

    From my own experience, P5-REs are much more versatile
    than Tcl REs (P5-REs are not 'regular' anymore), and in
    the hands of an experienced programmer the speed difference
    (which might be notable sometimes if many alternations
    are involved) approaches zero.

    For example, there used to be an algorithm-oriented language
    implementation comparison (http://shootout.alioth.debian.org)
    where you may find all sorts of results. In a reverse-DNA dump
    test (http://shootout.alioth.debian.org/gp4/benchmark.php?test=revcomp&lang=all)
    Perl completes in 2 seconds, Tcl in 11 seconds. In another regex-heavy
    test (http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all),
    Tcl runs in 3.3 seconds, whereas the first (allowed) Perl
    implementation comes in at 12 seconds. But, using a more
    Perl-like approach (not allowed in this contest), the Perl
    program (Perl #3, Perl #6 on the bottom) will complete in
    1.2 seconds.

    Regards

    Mirco
     
    Mirco Wahab, Jun 10, 2008
    #7
  8. brad

    Mirco Wahab Guest

    boost::regex - open ranges a no no? was: Slow Regex Code

    Mirco Wahab wrote:

    I modified the expression:

    > ...
    > boost::regex reg("\\b\\d{2,}\\b");
    > ...


    to:
    ...
    boost::regex reg("\\b\\d\\d+\\b");
    ...

    with tremendous improvements:

    > [Windows XP-32, Athlon-64/3200+,@2290MHz]
    > - Visual Studio 2008 + Boost 1.35.0 9.3 sec
    > - Perl 5.10 (Active-) 10.4 sec


    [Windows XP(32bit), Athlon-64/3200+ @2290MHz]
    Visual Studio 2008 + Boost 1.35.0 1.8 sec
    Perl 5.10.003 (AP, use64bitint=undef) 9.5 sec

    > [Linux 2.6.23, Pentium4,@2660MHz]
    > - gcc 4.3, -O2, Boost 1.33.1 13.2 sec
    > - Perl 5.8.8 8.2 sec


    [Linux 2.6.23(32bit), Pentium4/NW @2660MHz]
    gcc 4.3.1 -O2, Boost 1.33.1 1.2 sec (user)
    Perl 5.8.8 (32bit, use64bitint=undef) 6.2 sec (user)

    > [Linux 2.6.23, Core2/Q6600,@3240MHz]
    > - gcc 4.3, -O2, Boost 1.33.1 6.3 sec
    > - Perl 5.8.8 (i586, use64bitint=undef) 3.2 sec


    [Linux 2.6.23(32bit), Core2/Q6600,@3240MHz]
    gcc 4.3.1 -O2, Boost 1.33.1 0.55sec (user)
    Perl 5.8.8 (32bit, use64bitint=undef) 2.4 sec (user)

    > [Linux 2.6.24, Core2/Q9300,@3338MHz]
    > - gcc 4.3, -O2, Boost 1.34.1 'std::runtime_error' (??)
    > - Perl 5.10 (i586, use64bitint=undef) 10.4 sec


    [Linux 2.6.25(32bit), Core2/Q9300,@3338MHz]
    gcc 4.3.1, -O3, Boost 1.34.1 0.42sec (user) [*]
    Perl 5.10.0 (32bit, use64bitint=undef) 4.0 sec (user)

    [*] => after kernel update & gcc update,
    g++ -O3 -c boostrg.cxx -o boostrg.o
    works now


    modified Code, C++:
    ==>
    #include <boost/regex.hpp>
    #include <cstddef>
    #include <fstream>
    #include <iostream>

    // Count non-overlapping matches of the pattern in [block, block+len).
    int number_count(const char *block, std::size_t len)
    {
        boost::match_flag_type flags = boost::match_default;
        boost::regex reg("\\b\\d\\d+\\b");
        boost::cmatch what;

        const char *from = block, *to = block + len;
        int n = 0;
        while (boost::regex_search(from, to, what, reg, flags)) {
            from = what[0].second;               // resume after this match
            ++n;
        }
        return n;
    }

    int main()
    {
        // this is a 112 MB file, it's 8 x the Nietzsche fulltext in plain ASCII;
        // binary mode keeps the byte count equal to the file size
        std::ifstream in("nietzsche8.txt", std::ios::binary);
        if (in) {
            in.seekg(0, std::ios::end);          // get to EOF
            std::size_t len = static_cast<std::size_t>(in.tellg()); // read file pointer
            in.seekg(0, std::ios::beg);          // back to pos 0

            char *block = new char[len + 1];     // don't be stingy
            in.read(block, static_cast<std::streamsize>(len));      // slurp the file
            int n = number_count(block, len);    // process data
            std::cout << "The text (" << len / 1024 << "KB) has "
                      << n << " numbers >= 10!" << std::endl;
            delete [] block;                     // play fair
        }
        return 0;
    }
    <==

    modified Code, Perl:
    ==>

    open my $fh, '<', 'nietzsche8.txt' or die "what? $!";
    my $block;
    do { local $/; $block = <$fh> };
    close $fh;

    my $n;
    ++$n while $block =~ /\b\d\d+\b/g; # process data
    print "The text (" . int(length($block)/1024) ."KB) has $n numbers >= 10!\n";

    <==


    At least for me, a very interesting difference.
    Boost.Regex beats Perl by a significant margin here.
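
    Before trusting such a rewrite, it is worth confirming that both
    spellings find the same numbers. The sketch below uses the same search
    loop shape as the code above, but with std::regex (C++11) standing in
    for boost::regex so that it builds without Boost. The helper name
    count_matches is made up, and the check says nothing about speed.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: counts non-overlapping matches of `pattern` in `text`
// by repeatedly searching from the end of the previous match.
std::size_t count_matches(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);
    std::smatch m;
    std::string::const_iterator from = text.begin();
    const std::string::const_iterator to = text.end();

    std::size_t n = 0;
    while (std::regex_search(from, to, m, re)) {
        from = m[0].second;   // continue right after this match
        ++n;
    }
    return n;
}
```

    On the text "7 42 1999 a12 3 100", both "\\b\\d{2,}\\b" and
    "\\b\\d\\d+\\b" count three numbers (42, 1999 and 100).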

    Regards

    Mirco
     
    Mirco Wahab, Jun 10, 2008
    #8
  9. brad

    Mirco Wahab Guest

    Razii wrote:
    > How do you know that Tcl won't speed up and remain faster than Perl if
    > it's allowed to split the regex at |


    It may or it may not. But the difference
    will most probably approach zero, as I
    tried to say.

    Regards

    Mirco
     
    Mirco Wahab, Jun 10, 2008
    #9
  10. brad

    brad Guest

    Re: boost::regex - open ranges a no no? was: Slow Regex Code

    Mirco Wahab wrote:
    > Mirco Wahab wrote:
    >
    > I modified the expression:
    >
    >> ...
    >> boost::regex reg("\\b\\d{2,}\\b");
    >> ...

    >
    > to:
    > ...
    > boost::regex reg("\\b\\d\\d+\\b");


    Wow... I changed my RE to use \\d nine times instead of \\d{9} and it's
    now twice as fast. Amazing. I never would have thought of something as
    simple as this. Thanks for the idea.
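
    Typing \\d nine times by hand is easy to get wrong, so the unrolled
    pattern can be built in code and compared against the counted form.
    This is only a sketch: unrolled_digit_pattern and all_matches are
    made-up helper names, and std::regex (C++11) stands in for boost::regex
    so the example builds without Boost. Whether the unrolled form is
    faster depends on the regex library and version.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: builds \b\d\d...\d\b with n copies of \d.
std::string unrolled_digit_pattern(std::size_t n)
{
    std::string pattern = "\\b";
    for (std::size_t i = 0; i < n; ++i)
        pattern += "\\d";
    pattern += "\\b";
    return pattern;
}

// Hypothetical helper: collects every match of `pattern` in `text`.
std::vector<std::string> all_matches(const std::string& text,
                                     const std::string& pattern)
{
    const std::regex re(pattern);
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
        found.push_back(it->str());
    return found;
}
```

    all_matches(text, unrolled_digit_pattern(9)) should return the same
    list as all_matches(text, "\\b\\d{9}\\b") for any input text.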
     
    brad, Jun 10, 2008
    #10
  11. brad

    Mirco Wahab Guest

    Razii wrote:
    > How do you know when you have not even tried it yet? Perhaps Tcl will
    > be still twice faster than Perl.
    > In any case, Tcl regex is faster
    > http://swtch.com/~rsc/regexp/regexp1.html


    I'm sure you have your reasons for that opinion.
    Nevertheless, I tested this on my box, where I installed
    Tcl 8.4 into Cygwin and have Tcl 8.5 from ActiveState
    around (Athlon-64/3200+).

    [cygwin on WinXP 32bit]
    $ time /usr/bin/tclsh84 boo.tcl ==> user 0m6.874s
    $ time /usr/bin/perl boo.pl ==> user 0m4.155s

    (BONUS #1: .tcl via XP-installed Active-Tcl 8.5)
    $ time /cygdrive/d/Tcl/bin/tclsh85.exe boo.tcl ==> real 0m5.633s

    (BONUS #2: C++, Win32-mingw-3.4.2 + Boost 1.33.1)
    $ time dcboo/boostrg.exe ==> real 0m1.952s

    So, at least with my implementations (I don't have
    much experience in Tcl programming), there is a clear winner here.


    Here's the code (the file in question is 112415 KB
    containing 823968 numbers >= 10):

    [Tcl] ==>

    set fl [file size "nietzsche8.txt"]
    set fh [open "nietzsche8.txt" r]
    set block [read $fh $fl]
    close $fh

    set n [regexp -all {\y\d\d+\y} $block]
    set k [expr {$fl / 1024}]
    puts "The text ($k KB) has $n numbers >= 10!\n";

    <==

    [Perl] ==>

    open my $fh, '<', 'nietzsche8.txt' or die "what? $!";
    my $block;
    do { local $/; $block = <$fh> };
    close $fh;

    my $n;
    ++$n while $block =~ /\b\d\d+\b/g; # process data
    print "The text (" . int(length($block)/1024) ."KB) has $n numbers >= 10!\n";

    <==

    Regards

    Mirco
     
    Mirco Wahab, Jun 10, 2008
    #11