Standard C Library regex performance issue


igor.kulkin

I have a small utility program written in Python which runs pretty
slowly, so I decided to implement it in C.
I did some benchmarking of the Python code's performance. One of the parts
of the program uses Python's standard re (regular expressions)
module to parse the input file. As Python's routines to read from the
file and regular expressions are most likely implemented via native
libraries, I would expect that the C code, which reads and parses the
file using exactly the same scheme, would show approximately the same
performance (or maybe better).
I was surprised to find out that the code in C works way slower
(actually about 300 times slower!!!) than the same code in Python.
I am running it all under Unix (I guess the version should not really
matter) and am using gcc with -O2 to compile C code.

The code does exactly the same in both languages:

1. includes regexp library (module re in Python, regex.h in C)
2. compiles expression (the same expression is used with small
differences caused by slightly different syntax accepted by the
libraries)
3. reads input file line by line
4. parses the line using compiled regexp (in Python it's the call
of .match(...), in C it's the call of regexec(...)).
NOTHING MORE!

Does anyone know what's the problem?
 

Hallvard B Furuseth

I was surprised to find out that the code in C works way slower
(actually about 300 times slower!!!) than the same code in Python.
(...)
The code does exactly the same in both languages:

1. includes regexp library (module re in Python, regex.h in C)

These probably use different regexp implementations with quite different
semantics. Maybe you didn't translate your Python regexps to the
equivalent regex.h regexps. Or maybe you use some very inefficient
regexps which Python re manages to optimize but regex.h does not.
One can write _very_ slow regexps with great ease.
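For example (an illustration of my own, not from your code), a pattern with
nested quantifiers can make some backtracking implementations grind away on
input that almost matches:

#include <stdio.h>
#include <regex.h>

int main(void)
{
    regex_t re;
    regmatch_t m;

    /* Nested quantifiers like "(a+)+b" can force a backtracking matcher to
     * try an enormous number of paths on a subject that nearly matches but
     * ultimately fails. Whether this is actually slow depends entirely on
     * the implementation. */
    if (regcomp(&re, "^(a+)+b$", REG_EXTENDED) != 0)
        return 1;

    /* A run of 'a' characters with no 'b': a near-miss that never matches. */
    const char *subject = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
    int rc = regexec(&re, subject, 1, &m, 0);
    printf("regexec returned %d\n", rc);   /* expect REG_NOMATCH */

    regfree(&re);
    return 0;
}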
2. compiles expression (the same expression is used with small
differences caused by slightly different syntax accepted by the
libraries)

Hopefully you do this just once, outside the loop?
3. reads input file line by line

How? One fgets() into a char buffer[], or something more clever?
4. parses the line using compiled regexp (in Python it's the call
of .match(...), in C it's the call of regexec(...)).
NOTHING MORE!

Does anyone know what's the problem?

Might also be related to something else entirely in your code, which you
haven't mentioned. Hard to guess when you don't post your code.
 

Christopher Layne

The code does exactly the same in both languages:

1. includes regexp library (module re in Python, regex.h in C)
2. compiles expression (the same expression is used with small
differences caused by slightly different syntax accepted by the
libraries)
3. reads input file line by line
4. parses the line using compiled regexp (in Python it's the call
of .match(...), in C it's the call of regexec(...)).
NOTHING MORE!

Does anyone know what's the problem?

See if Python was compiled to use PCRE or not. If so, then your version using
regcomp and regexec is not the same. comp.unix.programmer territory btw.
 

igor.kulkin

These probably use different regexp implementations with quite different
semantics. Maybe you didn't translate your Python regexps to the
equivalent regex.h regexps.

I've actually tried running the regexps that I adjusted for regex.h,
and they match what I would want them to match perfectly.
Or maybe you use some very inefficient
regexps which Python re manages to optimize but regex.h does not.
One can write _very_ slow regexps with great ease.

That might be true. Still, the regexp, in spite of being very long, should be
very straightforward.
Here is the regexp (I would understand if no one reads it):

^([[:alpha:]]{3} +[[:digit:]]{1,2} +[[:digit:]]{1,2}:[[:digit:]]{1,2}:[[:digit:]]{1,2}) +([^ ]+)-([[:digit:]]+)-([[:alnum:]]+)\\[([[:digit:]]+)\\] +([^ ]+)( +(([^ ]+)\\(([[:digit:]]*)\\)))?: (.*)\\n?$

It simply matches the line in the log file and has no fancy stuff
involved.
Hopefully you do this just once, outside the loop?

Sure I do.
3. reads input file line by line

How? One fgets() into a char buffer[], or something more clever?

The way I read the file should not matter, as I've tried running the read-
file-line-by-line code separately (I commented out the regexp stuff) and
it ran really fast.
Might also be related to something else entirely in your code, which you
haven't mentioned. Hard to guess when you don't post your code.

Here is the Python code I've benchmarked:

import re

PATTERN = re.compile(r"^(\w{3}\s*\d{1,2}\s*\d{1,2}\:\d{1,2}\:\d{1,2})\s*(([^\s]+)\[(\d*)\])\s*([^\s]+)\s*(([^\s]+)\((\d*)\))\:\s(.*?)\n?$")

fp = open("some.log", "r")

for line in fp:
    mo = PATTERN.match(line)

fp.close()



And here is the C code:


#include <stdio.h>
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* strncpy */
#include <regex.h>


// Just a helper function.
char *read_line(FILE *in) {
    size_t line_len;
    char *buf;

    buf = fgetln(in, &line_len);

    if (!buf) return NULL;

    // Strip trailing LF/CR characters.
    while (line_len > 0 && (buf[line_len - 1] == (char) 10 ||
                            buf[line_len - 1] == (char) 13)) line_len--;

    char *line = malloc(line_len + 1);

    strncpy(line, buf, line_len);

    line[line_len] = (char) 0;

    return line;
}


int main(int argc, char **argv) {
    regex_t regex;
    int errc = regcomp(&regex,
        "^([a-zA-Z_]{3} +[0-9]{1,2} +[0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}) +([^ ]+)-([0-9]+)-([a-zA-Z_]+)\\[([0-9]+)\\] +([^ ]+)( +(([^ ]+)\\(([0-9]*)\\)))?: (.*)\\n?$",
        REG_EXTENDED | REG_ICASE);
    if (errc) {
        // error processing logic.
        // never happens as the regex compiles normally.
    }

    int n = regex.re_nsub + 1;
    regmatch_t *pmatch = malloc(n * sizeof(regmatch_t));

    FILE *in = fopen("some.log", "r");

    while (1) {
        char *line = read_line(in);
        if (!line) break;
        // I've commented out the next line to test reading performance.
        regexec(&regex, line, n, pmatch, 0);
        free(line);
    }

    fclose(in);

    return 0;
}
 

Christopher Layne

That might be true. Still, the regexp, in spite of being very long, should be
very straightforward.
Here is the regexp (I would understand if no one reads it):

^([[:alpha:]]{3} +[[:digit:]]{1,2} +[[:digit:]]{1,2}:[[:digit:]]{1,2}:[[:digit:]]{1,2}) +([^ ]+)-([[:digit:]]+)-([[:alnum:]]+)\\[([[:digit:]]+)\\] +([^ ]+)( +(([^ ]+)\\(([[:digit:]]*)\\)))?: (.*)\\n?$

This is hideous.
It simply matches the line in the log file and has no fancy stuff
involved.

Lots of fancy stuff involved for dubious reasons.
The way I read the file should not matter, as I've tried running the read-
file-line-by-line code separately (I commented out the regexp stuff) and
it ran really fast.

So you're saying it runs very fast when you read a file line by line but
doesn't run fast when you don't? Well yes then of course it matters.
Here is the Python code I've benchmarked:

import re

PATTERN = re.compile(r"^(\w{3}\s*\d{1,2}\s*\d{1,2}\:\d{1,2}\:\d{1,2})\s*(([^\s]+)\[(\d*)\])\s*([^\s]+)\s*(([^\s]+)\((\d*)\))\:\s(.*?)\n?$")

----
You can set regex matching modes by specifying a special constant as a third
parameter to re.search(). re.I or re.IGNORECASE applies the pattern case
insensitively. re.S or re.DOTALL makes the dot match newlines. re.M or
re.MULTILINE makes the caret and dollar match after and before line breaks in
the subject string. There is no difference between the single-letter and
descriptive options, except for the number of characters you have to type in.
To specify more than one option, "or" them together with the | operator:
re.search("^a", "abc", re.I | re.M).

By default, Python's regex engine only considers the letters A through Z, the
digits 0 through 9, and the underscore as "word characters". Specify the flag
re.L or re.LOCALE to make \w match all characters that are considered letters
given the current locale settings. Alternatively, you can specify re.U or
re.UNICODE to treat all letters from all scripts as word characters. The
setting also affects word boundaries.
----

The above implies that Python's newline mode is *ON* by default. POSIX
regcomp() is NOT newline-on by default.
fp = open("some.log", "r")

for line in fp:
    mo = PATTERN.match(line)

fp.close()

Which is line by line.
And here is the C code:


#include <stdio.h>
#include <regex.h>


// Just a helper function.
char * read_line(FILE * in) {
size_t line_len;
char * buf;
buf = fgetln(in, &line_len);

if (!buf) return NULL;

while (line_len > 0 && (buf[line_len - 1] == (char) 10 ||
buf[line_len - 1] == (char) 13)) line_len--;

Remove that. There's no reason you should have to do any of that if fgetln()
is doing what it's told.

DESCRIPTION
The fgetln() function returns a pointer to the next line from the stream
referenced by stream. This line is not a C string as it does not end
with a terminating NUL character. The length of the line, including the
final newline, is stored in the memory location to which len points.
(Note, however, that if the line is the last in a file that does not end
in a newline, the returned text will not contain a newline.)

Looks like some BSD4.4 function. Also looks like it operates on a FILE stream
and uses a static pointer of some sort. In short, please just avoid this
function altogether and use fgets().
char *line = malloc(line_len + 1);

strncpy(line, buf, line_len);

line[line_len] = (char) 0;

return line;
}

Right. So the issue here is that you don't know your maximum line length, which
is probably what led you to find a function like fgetln() in the first place.
This is one area where you've got to either establish a reasonable boundary
size and use that as the size of your temporary buffer, or use fread() and do
buffer management yourself. What this means is that if you do not foresee any
line being longer than, let's say, 1024 characters, use a simple temporary
buffer of 1024 chars and throw an error when you hit the maximum line length.
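Something like this, roughly (the 1024 limit, the stand-in pattern, and the
error handling are just placeholders):

#include <stdio.h>
#include <string.h>
#include <regex.h>

#define MAX_LINE 1024   /* assumed upper bound for a log line */

int main(void)
{
    regex_t re;
    regmatch_t m[2];
    char line[MAX_LINE];

    /* Trivial stand-in pattern; the point here is the reading loop. */
    if (regcomp(&re, "^([A-Za-z]+)", REG_EXTENDED) != 0)
        return 1;

    while (fgets(line, sizeof line, stdin) != NULL) {
        size_t len = strlen(line);

        /* Buffer filled and no newline seen: the line exceeds the assumed limit. */
        if (len == sizeof line - 1 && line[len - 1] != '\n') {
            fprintf(stderr, "line too long (max %d chars)\n", MAX_LINE - 1);
            regfree(&re);
            return 1;
        }
        if (len > 0 && line[len - 1] == '\n')
            line[--len] = '\0';

        regexec(&re, line, 2, m, 0);
    }

    regfree(&re);
    return 0;
}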
int main(int argc, char **argv) {
regex_t regex;
int errc = regcomp(&regex,
    "^([a-zA-Z_]{3} +[0-9]{1,2} +[0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}) +([^ ]+)-([0-9]+)-([a-zA-Z_]+)\\[([0-9]+)\\] +([^ ]+)( +(([^ ]+)\\(([0-9]*)\\)))?: (.*)\\n?$",
    REG_EXTENDED | REG_ICASE);

Add REG_NEWLINE.
Please remove "\\n?" from your regex.

Also, your regex:
^<0 or 1 matches of a formatted string>: <0 or more chars>$
is not the most efficient use of regex, and you should probably examine your
logfile format as well.

However, the problem to me is that you did not set REG_NEWLINE.
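Roughly (a sketch; the pattern here is only a stand-in for your real one,
minus the trailing "\\n?"):

#include <stdio.h>
#include <regex.h>

int main(void)
{
    regex_t regex;
    /* Placeholder pattern; substitute your own expression. */
    const char *pattern = "^([A-Za-z_]{3}) +(.*)$";
    int errc = regcomp(&regex, pattern, REG_EXTENDED | REG_ICASE | REG_NEWLINE);

    if (errc) {
        char msg[256];
        regerror(errc, &regex, msg, sizeof msg);
        fprintf(stderr, "regcomp: %s\n", msg);
        return 1;
    }

    regfree(&regex);
    return 0;
}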
 

igor.kulkin

Just in case someone faces the same problem: it seems like I've been
able to find the answer to the question "why is Unix's regex from
libc so slow".

The answer is:
Because it is really so slow. It seems like it is just a very slow
implementation of the library (and by "very slow" I mean "very, very,
very slow"; it is several hundred times slower than a normal
implementation).
What I did is I simply replaced this library with a POSIX version of
the PCRE (Perl Compatible Regular Expressions) library (which is, btw,
completely free and can be used in commercial projects). That alone
gave me a dramatic performance gain.
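For reference, the swap itself was mostly mechanical. Roughly (header and
library names can differ between systems and PCRE versions):

/* Sketch of the swap, assuming PCRE's POSIX wrapper is installed. */
#include <stdio.h>
#include <pcreposix.h>   /* instead of <regex.h> */

int main(void)
{
    regex_t re;
    regmatch_t m[2];

    /* The regcomp()/regexec()/regfree() calls stay the same; only the header
     * changes, plus the link line, e.g.: gcc -O2 prog.c -lpcreposix -lpcre */
    if (regcomp(&re, "^([0-9]+)", REG_EXTENDED) != 0)
        return 1;

    if (regexec(&re, "123 abc", 2, m, 0) == 0)
        printf("matched\n");

    regfree(&re);
    return 0;
}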

I've also heard that there is a GNU implementation of regex library
which is even faster than PCRE, but it's under GPL.
 

CBFalconer

.... snip ...

I've also heard that there is a GNU implementation of regex
library which is even faster than PCRE, but it's under GPL.

So? That doesn't prevent you from using it. It doesn't even make
you publish your source code. It does make you publish linkable
binaries, so that improved libraries can be linked. And that only
if you publish/sell your code. You don't even lose your copyright.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews
 

matevzb

Just in case someone faces the same problem: it seems like I've been
able to find the answer to the question "why is Unix's regex from
libc so slow".

The answer is:
Because it is really so slow. It seems like it is just a very slow
implementation of the library (and by "very slow" I mean "very, very,
very slow"; it is several hundred times slower than a normal
implementation).
Although this is heavily off-topic, I'm still interested in the
particular Unix flavour you're using. The reason I'm asking is your
statement "I guess the version should not really matter", which is an
underestimation. True, they differed more in the past, but it's still
a long road to SUS/POSIX.
What I did is I simply replaced this library with a POSIX version of
the PCRE (Perl Compatible Regular Expressions) library (which is, btw,
completely free and can be used in commercial projects). That alone
gave me a dramatic performance gain.
PCRE isn't a POSIX library; it only implements a "set of wrapper
functions that correspond to the POSIX regular expression API".
I've also heard that there is a GNU implementation of regex library
which is even faster than PCRE, but it's under GPL.
That shouldn't be a problem, should it?
 

William Ahern

On Fri, 16 Feb 2007 02:36:46 -0800, igor.kulkin wrote:
I was surprised to find out that the code in C works way slower
(actually about 300 times slower!!!) than the same code in Python.
I am running it all under Unix (I guess the version should not really
matter) and am using gcc with -O2 to compile C code.

A 300x difference likely means there are several things going on here at
once. You should take this over to comp.unix.programmer. But, FWIW,
it's well known (by those who care to know) that many/most POSIX/SUS regex
implementations are significantly slower than both Perl's regex
implementation and the PCRE library (which I assume Python is
using).
 
