re.search much slower than grep on some regular expressions

  • Thread starter Henning_Thornblad

Sebastian \lunar\ Wiesner

Terry Reedy said:
Twenty years ago, there was. Calling an extended re-derived grammar
expression like Perl's a 'regular-expression' is a bit like calling a
Hummer a 'car' -- perhaps to hide its gas-guzzling behavior.

Yet, to many programmers the Hummer is not only just a car, it's _the_ car.
Everything else is just silly, bloody old POSIX crap ;)

(meaning that PCREs are often seen as true regular expressions, and
everything else is referred to as "being somehow restricted")
 

sjdevnull

Apparently, grep and Tcl convert a regex to a finite state machine. ....
But not all PCREs can be converted to a finite state machine ....
Part of the problem is a lack of agreement on what
'regular expression' means. Strictly speaking, PCREs aren't
regular expressions at all, for some values of the term
'regular expression'. See

http://en.wikipedia.org/wiki/Regular_expression

Formally, there's no lack of agreement that I'm aware of. Anyone
formally defining it will use something along the lines of what
Wikipedia specifies (with "Regular expressions in this sense can
express the regular languages, exactly the class of languages accepted
by finite state automata."). Colloquially it's used to mean "any
text-matching scheme that looks vaguely like sed/grep regexes".

PCREs are certainly not "real" regular expressions in the formal sense,
but they are colloquially.
 

Mark Wooding

Sebastian "lunar" Wiesner said:
I just wanted to illustrate that the speed of the given search is
related to the complexity of the engine.

Btw, other PCRE implementations are as slow as Python or "grep -P". I tried
some sample C++ code using pcre++ (a wrapper around libpcre) and saw it
run for just as long.

So some other implementations are equally poor. I note that Perl itself
handles this case very quickly, as does Edi Weitz's CL-PPCRE library.

Yes, Perl-compatible `regular expressions' are more complicated than
POSIX extended regular expressions; but that doesn't mean that you have
to implement them in a dumb way. Indeed, it stands to reason that
expressions describing truly regular languages can be matched using the
same old algorithms that grep uses.
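
For illustration, a minimal sketch of that idea for the specific pattern
[^ "=]*/ (plain Python, not mdw's code): one left-to-right pass that only
tracks whether a [^ "=]* run is still alive, and where it started, is
enough to report where a match first completes. The greedy extension that
a backtracking engine would then perform is deliberately left out.

def first_match_end(text):
    # Tiny automaton for [^ "=]*/ : run_start is the earliest position of
    # a still-alive [^ "=]* run (None if none is alive); '/' completes it.
    excluded = ' "='
    run_start = None
    for i, ch in enumerate(text):
        if run_start is None:
            run_start = i            # an attempt may start at every position
        if ch == '/':
            return run_start, i + 1  # '/' completes the earliest pending attempt
        if ch in excluded:
            run_start = None         # this character kills every pending attempt
    return None                      # no match anywhere in the text

print first_match_end('a' * 156000)       # None, returned after one pass
print first_match_end('aaa///bbb///ccc')  # (0, 4); a greedy engine reports (0, 12)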

-- [mdw]
 

Sebastian \lunar\ Wiesner

Mark Wooding said:
So some other implementations are equally poor. I note that Perl itself
handles this case very quickly, as does Edi Weitz's CL-PPCRE library.

I don't know about the latter, but Perl doesn't use any smarter algorithm;
it just relies heavily on memoization to gain speed.

This makes perl perform poorly in other situations, as mentioned in the
paper cited by Mark Dickinson:

# perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

In Python, on the other hand, this pattern works fine, at the cost of
performance losses on other patterns.
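
For reference, the Python counterpart of that one-liner (Python 2 syntax,
as used elsewhere in the thread):

import re

# The pattern that crashes the quoted perl one-liner; Python's engine
# completes it (how quickly depends on the input, as noted above).
print re.search(r'^(ab?)*$', 'a' * 100000)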

It'd be interesting to know how CL-PPCRE performs here (I don't know this
library).
Yes, Perl-compatible `regular expressions' are more complicated than
POSIX extended regular expressions; but that doesn't mean that you have
to implement them in a dumb way. Indeed, it stands to reason that
expressions describing truly regular languages can be matched using the
same old algorithms that grep uses.

I completely agree. I'd just assume that combining a finite state
machine for "classic" expressions with backtracking code is terribly
hard to implement. But I'm not an expert in this; probably some
libraries out there already do this. In that case, it'd be time to give
Python's re engine a more sophisticated implementation ;)
 

Terry Reedy

Sebastian said:
I completely agree. I'd just assume that combining a finite state
machine for "classic" expressions with backtracking code is terribly
hard to implement. But I'm not an expert in this; probably some
libraries out there already do this. In that case, it'd be time to give
Python's re engine a more sophisticated implementation ;)

One thing to remember in making comparisons is that Python moved to its
own implementation to support Unicode as well as extended ASCII (bytes
in 3.0). Does grep do that? Has PCRE, which Python once used, been
upgraded to do that?
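
For example, a quick check that a Unicode pattern and subject string go
straight through Python's re (Python 2; the sample string is made up):

import re

# A unicode pattern applied to a unicode string; the match is u'caf\xe9/'.
print repr(re.search(u'[^ "=]*/', u'caf\u00e9/menu').group(0))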
 

Mark Wooding

Sebastian "lunar" Wiesner said:
# perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

(Did you really run that as root?)
It'd be interesting to know how CL-PPCRE performs here (I don't know this
library).

Stack overflow condition. :-(

-- [mdw]
 

Marc 'BlackJack' Rintsch

How come you think so?

The '#' is usually the prompt of root accounts, while ordinary users get a '$'.

Ciao,
Marc 'BlackJack' Rintsch
 

Henning Thornblad

When trying to find an alternative way of solving my problem i found
that running this script:

#!/usr/bin/env python

import re

row = ""
for a in range(156000):
    row += "a"
print "How many, dude?"
print re.search('/[^ "=]*', row)    # (the / has moved)

wouldn't take even a second (The re.search part of course)

This is for me very strange since this,
in theory, is the same problem as before.

/Henning Thornblad
 

Kris Kennaway

Paddy said:
Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)


This script takes about 5 min to run on my computer:

#!/usr/bin/env python
import re
row = ""
for a in range(156000):
    row += "a"
print re.search('[^ "=]*/', row)

While doing a simple grep:

grep '[^ "=]*/' input    (input contains 156,000 a's in one row)

doesn't even take a second.
Is this a bug in Python?
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter

It is not a smarter algorithm that is used in grep. Python REs have
more capabilities than grep REs, and those capabilities need a slower,
more complex algorithm.
You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in, but ....

I can and do :)

It's a major problem that regular expression matching in Python has
exponential complexity when polynomial algorithms (for a subset of
regexps, e.g. excluding back-references) are well-known.

It rules out using Python for entire classes of applications where
regexp matching is on the critical path.

Kris
 

John Machin

When trying to find an alternative way of solving my problem i found
that running this script:

#!/usr/bin/env python

import re

row = ""
for a in range(156000):
    row += "a"
print "How many, dude?"
print re.search('/[^ "=]*', row)    # (the / has moved)

wouldn't take even a second (The re.search part of course)

This is for me very strange since this,
in theory, is the same problem as before.

In theory. In practice, it is the same problem *iff* you start at the
end of the text and work backwards.

Your original pattern can be characterised as X*Y, where X and Y are
character classes; your new one is YX*. You are asking each to search
a text that contains many Xs and no Ys.

The second pattern will cause a naive search engine to examine each
character in the text (is it a Y?) as a candidate for the start of a
successful match; each test fails, and the whole expedition has cost
O(N).

OTOH, the first pattern will start at offset 0, cheerfully cruising
past all those lovely Xs, but failing (no Y) at the end of the text.
It will then think that it's been too greedy, reduce the matched span
of X* from N characters to N-1, fail to find Y, N-2, fail, ... so
that's O(N**2) just for the first trial starting from offset 0. Being
naive, it will try again starting from offset 1, 2, 3, ... an O(N**3)
process.

If you were to tell us what you are actually trying to do, we could
probably help you with a faster method.

BTW, if the text is 'aaa///bbb///ccc', your original search will find
'aaa///bbb///'; if you want it to find 'aaa/', the pattern should be:
'[^ "=/]*/'

Cheers,
John
 

John Machin

When trying to find an alternative way of solving my problem i found
that running this script:

#!/usr/bin/env python

import re

row = ""
for a in range(156000):
    row += "a"
print "How many, dude?"
print re.search('/[^ "=]*', row)    # (the / has moved)

wouldn't take even a second (The re.search part of course)

This is for me very strange since this,
in theory, is the same problem as before.

In theory. In practice, it is the same problem *iff* you start at the
end of the text and work backwards.

Your original pattern can be characterised as X*Y, where X and Y are
character classes; your new one is YX*. You are asking each to search
a text that contains many Xs and no Ys.

The second pattern will cause a naive search engine to examine each
character in the text (is it a Y?) as a candidate for the start of a
successful match; each test fails, and the whole expedition has cost
O(N).

OTOH, the first pattern will start at offset 0, cheerfully cruising
past all those lovely Xs, but failing (no Y) at the end of the text.
It will then think that it's been too greedy, reduce the matched span
of X* from N characters to N-1, fail to find Y, N-2, fail, ... so
that's O(N) just for the first trial starting from offset 0. Being
naive, it will try again starting from offset 1, 2, 3, ... an O(N**2)
process. If Y were instead something more complicated, it would be
worse than O(N**2).
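
For illustration, a rough timing sketch of the two forms (sizes kept small
so the quadratic case finishes quickly; absolute numbers will vary):

import re
import time

# X*Y backtracks: doubling the input roughly quadruples the time.
# YX* gives up after a single linear scan.
for n in (2000, 4000, 8000):
    row = 'a' * n
    for pattern in ('[^ "=]*/', '/[^ "=]*'):
        start = time.time()
        re.search(pattern, row)
        print '%-12r n=%-6d %.3f s' % (pattern, n, time.time() - start)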

If you were to tell us what you are actually trying to do, we could
probably help you with a faster method.

BTW, if the text is 'aaa///bbb///ccc', your original search will find
'aaa///bbb///'; if you want it to find 'aaa/', the pattern should be:
'[^ "=/]*/'

Cheers,
John
 

samwyse

What can be the cause of the large difference between re.search and
grep?
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "

I haven't tested this, but I think it would do what you want:

from Plex import *

lexicon = Lexicon([
    (Rep(AnyBut(' "=')) + Str('/'), TEXT),
    (AnyBut('\n'), IGNORE),
])
filename = "my_file.txt"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
while 1:
    token = scanner.read()
    print token
    if token[0] is None:
        break
 

Kris Kennaway

samwyse said:
What can be the cause of the large difference between re.search and
grep?
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "

Very interesting! Thanks very much for the pointer.

Kris
 

Kris Kennaway

samwyse said:
What can be the cause of the large difference between re.search and
grep?
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "

I haven't tested this, but I think it would do what you want:

from Plex import *

lexicon = Lexicon([
    (Rep(AnyBut(' "=')) + Str('/'), TEXT),
    (AnyBut('\n'), IGNORE),
])
filename = "my_file.txt"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
while 1:
    token = scanner.read()
    print token
    if token[0] is None:
        break

Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching was fast then I could possibly pickle the
lexer though (but it's not).

Kris

 

Henning Thornblad

When trying to find an alternative way of solving my problem i found
that running this script:
#!/usr/bin/env python
import re
row = ""
for a in range(156000):
    row += "a"
print "How many, dude?"
print re.search('/[^ "=]*', row)    # (the / has moved)
wouldn't take even a second (The re.search part of course)
This is for me very strange since this,
in theory, is the same problem as before.

In theory. In practice, it is the same problem *iff* you start at the
end of the text and work backwards.

Your original pattern can be characterised as X*Y, where X and Y are
character classes; your new one is YX*. You are asking each to search
a text that contains many Xs and no Ys.

The second pattern will cause a naive search engine to examine each
character in the text (is it a Y?) as a candidate for the start of a
successful match; each test fails, and the whole expedition has cost
O(N).

OTOH, the first pattern will start at offset 0, cheerfully cruising
past all those lovely Xs, but failing (no Y) at the end of the text.
It will then think that it's been too greedy, reduce the matched span
of X* from N characters to N-1, fail to find Y, N-2, fail, ... so
that's O(N) just for the first trial starting from offset 0. Being
naive, it will try again starting from offset 1, 2, 3, ... an O(N**2)
process. If Y were instead something more complicated, it would be
worse than O(N**2).

If you were to tell us what you are actually trying to do, we could
probably help you with a faster method.

BTW, if the text is 'aaa///bbb///ccc', your original search will find
'aaa///bbb///'; if you want it to find 'aaa/', the pattern should be:
'[^ "=/]*/'

Cheers,
John

What we had was a regex that searched text files, row by row,
for an absolute path starting/ending with '"', '=' or a blank space.

The path has a known signature which is pretty unique, and the regex
looked like this:
[^ ="]*/[^ ="]*signature[^ =]*/

So we didn't really see any performance loss until we got a text file
with some really long rows, one reaching 156,000 characters. We
removed the cause of these long rows, but also decided to tune the
regex to make it more scalable.

I have thought of two possible solutions:
1. Invert [^ ="] so that it becomes [a lot of characters you could
find in a path]
2. Add [ ="] on each side: [ ="][^ ="]*/[^ ="]*signature[^ =]*/[ ="]
and just trim the delimiters later
They both perform well, but I can't figure out which one is less ugly
(see the sketch below).
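
A rough sketch of the two options (the positive character class in option 1
and the sample row are made up for illustration; 'signature' stands in for
the real marker):

import re

row = 'foo bar /opt/thing/signature/data/ baz'

# Option 1: spell out the characters a path may contain instead of negating.
opt1 = re.compile(r'[A-Za-z0-9_./-]*/[A-Za-z0-9_./-]*signature[A-Za-z0-9_./-]*/')

# Option 2: require a delimiter on each side and strip it off afterwards.
opt2 = re.compile(r'[ ="][^ ="]*/[^ ="]*signature[^ =]*/[ ="]')

m1 = opt1.search(row)
m2 = opt2.search(row)
print m1.group(0) if m1 else None               # /opt/thing/signature/data/
print m2.group(0).strip(' ="') if m2 else None  # /opt/thing/signature/data/

One caveat: as written, option 2 needs a delimiter character on both sides,
so it would miss a path at the very start or end of a row; whether that
matters depends on the real data.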

Anyway, I'm glad I asked the question. I have learned more about
regexes than I ever thought I'd need :)

Many Thanks
Henning Thornblad
 

John Machin

samwyse said:
What can be the cause of the large difference between re.search and
grep?
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "
I haven't tested this, but I think it would do what you want:
from Plex import *
lexicon = Lexicon([
    (Rep(AnyBut(' "='))+Str('/'),  TEXT),
    (AnyBut('\n'), IGNORE),
])
filename = "my_file.txt"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
while 1:
    token = scanner.read()
    print token
    if token[0] is None:
        break

Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute).  If the matching was fast then I could possibly pickle the
lexer though (but it's not).

Can you give us some examples of the kinds of patterns that you are
using in practice and are slow using Python re? How large is "large"?
What kind of text?

Instead of grep, you might like to try nrgrep ... google("nrgrep
Navarro Raffinot"): PDF paper about it on CiteSeer (if it's up),
postscript paper and C source findable from Gonzalo Navarro's
home page.

Cheers,
John
 

Kris Kennaway

John said:
Can you give us some examples of the kinds of patterns that you are
using in practice and are slow using Python re?

Trivial stuff like:

(Str('error in pkg_delete'), ('mtree', 'mtree')),
(Str('filesystem was touched prior to .make install'), ('mtree', 'mtree')),
(Str('list of extra files and directories'), ('mtree', 'mtree')),
(Str('list of files present before this port was installed'), ('mtree', 'mtree')),
(Str('list of filesystem changes from before and after'), ('mtree', 'mtree')),

(re('Configuration .* not supported'), ('arch', 'arch')),

(re('(configure: error:|Script.*configure.*failed unexpectedly|script.*failed: here are the contents of)'),
 ('configure_error', 'configure')),
....

There are about 150 of them and I want to find which is the first match
in a text file that ranges from a few KB up to 512MB in size.

> How large is "large"? What kind of text?

It's compiler/build output.
Instead of grep, you might like to try nrgrep ... google("nrgrep
Navarro Raffinot"): PDF paper about it on CiteSeer (if it's up),
postscript paper and C source findable from Gonzalo Navarro's
home page.

Thanks, looks interesting but I don't think it is the best fit here. I
would like to avoid spawning hundreds of processes to process each file
(since I have tens of thousands of them to process).
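
For what it's worth, here is a rough in-process sketch (rule strings and
tags are made up, standing in for the ~150 real ones) of folding many
literal and regex rules into a single compiled alternation, so each line
of build output is scanned once. Python's backtracking engine still tries
each alternative, so this mostly removes per-rule call overhead rather
than changing the complexity:

import re

# (pattern, tag) pairs in the spirit of the rules quoted above.  None of
# these patterns may contain capturing groups of their own, otherwise the
# lastgroup trick below picks up the wrong group.
rules = [
    (re.escape('error in pkg_delete'), 'mtree'),
    (re.escape('list of extra files and directories'), 'mtree'),
    ('Configuration .* not supported', 'arch'),
    ('configure: error:|Script.*configure.*failed unexpectedly', 'configure_error'),
]

# One named group per rule, joined into a single alternation.
combined = re.compile('|'.join('(?P<r%d>%s)' % (i, pat)
                               for i, (pat, tag) in enumerate(rules)))

def first_match(path):
    for line in open(path):
        m = combined.search(line)
        if m:
            return rules[int(m.lastgroup[1:])][1], line   # (tag, matching line)
    return None

# print first_match('some-build-log.txt')   # hypothetical log file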

Kris
 

Jeroen Ruigrok van der Werven

-On [20080709 14:08] Kris Kennaway said:
It's compiler/build output.

Sounds like the FreeBSD ports build cluster. :)

Kris, have you tried a PGO build of Python with your specific usage? I
cannot guarantee it will significantly speed things up though.

Also, a while ago I did tests with various GCC compilers and their effect on
Python running time, as well as Intel's cc. Intel won on (nearly) all
counts, meaning it was faster overall.
 

Kris Kennaway

Jeroen said:
-On [20080709 14:08] said:
It's compiler/build output.

Sounds like the FreeBSD ports build cluster. :)

Yes indeed!
Kris, have you tried a PGO build of Python with your specific usage? I
cannot guarantee it will significantly speed things up though.

I am pretty sure the problem is algorithmic, not bad byte code :) If it
were a matter of a few % then that would be in the scope of compiler
tweaks, but we're talking orders of magnitude.

Kris
 
