New implementation of re module


MRAB

Hi all,

I've been working on a new implementation of the re module. The details
are at http://bugs.python.org/issue2636, specifically from
http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
Python 2.6 on Windows if you want to try it out.

I'm interested in how fast it is generally, compared with the current re
module, but especially when faced with those 'pathological' regular
expressions which seem to take a long time to finish, for example:

re.search(r"^(.+|D)*A$", "x" * 25 + "B")

which on my PC (1.8GHz) takes 18.98 secs with the re module but
<0.01 secs with this new implementation.
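For anyone wanting to reproduce the comparison, here is a minimal timing sketch against the stdlib re (a sketch, not MRAB's benchmark; the input sizes are kept small so the backtracking matcher still finishes quickly):

```python
import re
import time

# Pathological pattern from above: the backtracking matcher tries
# exponentially many ways to split the x's among the .+ alternatives
# before concluding there is no match.
pattern = re.compile(r"^(.+|D)*A$")

for n in (5, 10, 15):
    text = "x" * n + "B"
    start = time.time()
    result = pattern.search(text)  # always None: the text ends in B, not A
    print("n=%2d -> %r in %.4fs" % (n, result, time.time() - start))
```

Raising n a few more steps roughly doubles the time per step with the stdlib re, which is the exponential behaviour being discussed.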

TIA
 

William Dode

Hi all,

I've been working on a new implementation of the re module. The details
are at http://bugs.python.org/issue2636, specifically from
http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
Python 2.6 on Windows if you want to try it out.

Can someone remind me how to compile it (on Debian Lenny), if possible
with Python 2.5? I also have Python 3.1, which I built myself...

I could test it with pytextile; I have a bunch of texts to benchmark
and compare.

Did you announce it on the unladen-swallow list? They wanted to hack
on re as well...
 

MRAB

William said:
> Can someone remind me how to compile it (on Debian Lenny), if
> possible with Python 2.5? I also have Python 3.1, which I built
> myself...
>
> I could test it with pytextile; I have a bunch of texts to benchmark
> and compare.

All I can do is point you to
http://docs.python.org/extending/extending.html.

For Linux (which I don't have) you'll need to use _regex.h and _regex.c
to compile to _regex.so instead of _regex.pyd.

> Did you announce it on the unladen-swallow list? They wanted to hack
> on re as well...

No. I haven't subscribed to it.
 

Aahz

MRAB said:
> I've been working on a new implementation of the re module. The details
> are at http://bugs.python.org/issue2636, specifically from
> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
> Python 2.6 on Windows if you want to try it out.

How does it handle the re module's unit tests?
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer
 

Aahz

MRAB said:
> Basically, it passes all those tests I expect it to pass. :)
>
> It fails those where the intended behaviour has changed, such as re.sub
> treating unmatched groups as empty strings, as requested in
> http://bugs.python.org/issue1519638.

Then you should definitely publish to PyPI and post a message to
c.l.py.announce to get more users.
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

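The behaviour change MRAB mentions can be illustrated with a quick sketch (this is the semantics requested in issue1519638, later adopted by the stdlib itself in Python 3.5; under Python 2.6 the same call raised an "unmatched group" error):

```python
import re

# Alternation where each match sets only one of the two groups; the
# unmatched group contributes the empty string instead of an error.
result = re.sub(r'(\d+)|(cat)', r'[\1\2]', '7 cats')
# result == '[7] [cat]s'
```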
 

Mike

MRAB said:
> I've been working on a new implementation of the re module.

Fabulous!

If you're extending/changing the interface, there are a couple of sore
points in the current implementation I'd love to see addressed:

- findall/finditer doesn't find overlapping matches. Sometimes you
really *do* want to know all possible matches, even if they overlap.
This comes up in bioinformatics, for example.

- split won't split on empty patterns, e.g. empty lookahead patterns.
This means that it can't be used for a whole class of interesting
cases. This has been discussed previously:

http://bugs.python.org/issue3262
http://bugs.python.org/issue852532
http://bugs.python.org/issue988761

- It'd be nice to have a version of split that generates the parts
(one by one) rather than returning the whole list.

- Repeated subgroup match information is not available. That is, for
a match like this

re.match('(.){3}', 'xyz')

there's no way to discover that the subgroup first matched 'x', then
matched 'y', and finally matched 'z'. Here is one past proposal
(mine), perhaps over-complex, to address this problem:

http://mail.python.org/pipermail/python-dev/2004-August/047238.html
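As an aside on the first point: overlapping matches can already be coaxed out of the stdlib re by wrapping the pattern in a zero-width lookahead, at the cost of some awkwardness. A sketch (the helper name is mine, not part of any module):

```python
import re

# Zero-width lookahead workaround: the lookahead consumes no text, so
# the scan resumes one character later and overlapping hits are found.
def findall_overlapping(pattern, text):
    return [m.group(1) for m in re.finditer('(?=(%s))' % pattern, text)]

# Toy bioinformatics-style query:
# findall_overlapping('ATA', 'ATATA') gives ['ATA', 'ATA'], whereas
# plain re.findall('ATA', 'ATATA') finds only the first occurrence.
```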

Mike
 

MRAB

Mike said:
> Fabulous!
>
> If you're extending/changing the interface, there are a couple of sore
> points in the current implementation I'd love to see addressed:
>
> - findall/finditer doesn't find overlapping matches. Sometimes you
> really *do* want to know all possible matches, even if they overlap.
> This comes up in bioinformatics, for example.

Perhaps by adding "overlapped=True"?

> - split won't split on empty patterns, e.g. empty lookahead patterns.
> This means that it can't be used for a whole class of interesting
> cases. This has been discussed previously:
>
> http://bugs.python.org/issue3262
> http://bugs.python.org/issue852532
> http://bugs.python.org/issue988761

Already addressed (see issue2636 for the full details).

> - It'd be nice to have a version of split that generates the parts
> (one by one) rather than returning the whole list.

Hmm, re.splititer() perhaps.

> - Repeated subgroup match information is not available. That is, for
> a match like this
>
> re.match('(.){3}', 'xyz')
>
> there's no way to discover that the subgroup first matched 'x', then
> matched 'y', and finally matched 'z'. Here is one past proposal
> (mine), perhaps over-complex, to address this problem:
>
> http://mail.python.org/pipermail/python-dev/2004-August/047238.html

Yikes! I think I'll let you code that... :)
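A splititer() in the spirit of this suggestion is straightforward to sketch on top of the existing finditer (a sketch only, assuming a pattern without capture groups, not MRAB's eventual API):

```python
import re

# Hypothetical splititer(): yield the pieces of a split one at a time
# instead of building the whole list.
def splititer(pattern, text):
    pos = 0
    for m in re.finditer(pattern, text):
        yield text[pos:m.start()]
        pos = m.end()
    yield text[pos:]  # the tail after the last separator
```

For example, list(splititer(r',\s*', 'a, b,c')) yields the same pieces as re.split(r',\s*', 'a, b,c').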
 

Mike

> Perhaps by adding "overlapped=True"?

Something like that would be great, yes.

> Already addressed (see issue2636 for the full details).

Glad to hear it.

> Yikes! I think I'll let you code that... :)

I agree that that document looks a little scary--maybe I was trying to
bite off too much at once.

My intuition, though, is that the basic idea should be fairly simple
to implement, at least for a depth-first matcher. The repeated match
subgroups are already being discovered, it's just that they're not
being saved, so there's no way to report them out once a complete
match is found. If some trail of breadcrumbs were pushed onto a stack
during the DFS, it could be traced at the end. And the whole thing
might not even be that expensive to do.

The hardest parts about this, in my mind, are figuring out how to
report the repeated matches out in a useful form (hence all that
detail in the proposal), and getting users to understand that using
this feature *could* suck up a lot of memory, if they're not careful.

As always, it's possible that my intuition is totally wrong. Plus I'm
not sure how this would work out in the breadth-first case.

Details aside, I would really, really, really like to have a way to
get at the repeated subgroup matches. I write a lot of code that
would be one-liners if this capability existed. Plus, it just plain
burns me that Python is discovering this information but impudently
refuses to tell me what it's found! ;-)
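The limitation being discussed is easy to demonstrate, along with the usual workaround of matching the repeated unit separately (a sketch, not the interface proposed in the python-dev thread):

```python
import re

m = re.match('(.){3}', 'xyz')
# The stdlib keeps only the last repetition of the group:
# m.group(1) == 'z'

# Workaround: run the unit pattern over the matched span and collect
# every capture individually.
pieces = [u.group(1) for u in re.finditer('(.)', m.group(0))]
# pieces == ['x', 'y', 'z']
```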
 

Piet van Oostrum

MRAB said:
M> Hi all,
M> I've been working on a new implementation of the re module. The details
M> are at http://bugs.python.org/issue2636, specifically from
M> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
M> Python 2.6 on Windows if you want to try it out.
M> I'm interested in how fast it is generally, compared with the current re
M> module, but especially when faced with those 'pathological' regular
M> expressions which seem to take a long time to finish, for example:
M> re.search(r"^(.+|D)*A$", "x" * 25 + "B")
M> which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs
M> with this new implementation.

Is this version also going to use the Thompson approach?
 

MRAB

Piet said:
Is this version also going to use the Thompson approach?

That approach is what inspired the method in the new implementation, but
its speed comes from asking _whether_ there's a match, not _where_
there's a match, so it can ignore the route taken to get the match.

"grep", for example, selects the lines which match a pattern, whereas
the typical Python (and Perl) usage is to extract part of a string, or
to find out which part of the string matched.

Trying to support lazy and possessive quantifiers in Thompson's approach
is tricky (at least, I haven't been able to figure it out), and
back-references are right out! :)

There's always the chance that I could come up with some sort of hybrid
scheme, applying different approaches depending on certain conditions.
It already switches from depth-first to breadth-first when it's taken
too many iterations without advancing and merges similar states, which
is why it's so much better with 'pathological' patterns; I just have to
get the depth-first phase to be as fast as the current 're' module.
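For readers unfamiliar with the Thompson approach being referred to: the matcher advances a *set* of NFA states in lockstep over the input, so the cost per character is bounded by the number of states and no backtracking ever happens. A toy sketch for the fixed pattern a*b (hand-built states, not a general compiler):

```python
# Minimal breadth-first (Thompson-style) matcher for the pattern a*b.
def matches_a_star_b(text):
    LOOP, ACCEPT = 0, 1          # state 0: inside a*; state 1: b consumed
    states = {LOOP}
    for ch in text:
        nxt = set()
        if LOOP in states:
            if ch == 'a':
                nxt.add(LOOP)    # stay in the a* loop
            elif ch == 'b':
                nxt.add(ACCEPT)  # take the final b
        states = nxt             # every live path advances together
    return ACCEPT in states
```

Because duplicate states are merged at each step, pathological inputs cost linear time; the trade-off, as noted above, is that tracking *where* and *how* the match happened (captures, lazy/possessive quantifiers, back-references) no longer comes for free.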
 

John Machin

MRAB said:
> Hi all,
>
> I've been working on a new implementation of the re module. The details
> are at http://bugs.python.org/issue2636, specifically from
> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
> Python 2.6 on Windows if you want to try it out.

Where/how should we report perceived bugs: On that bugs.python.org
issue? Here? Private e-mail?
 

Alex Willmer

MRAB said:
> Hi all,
>
> I've been working on a new implementation of the re module. The details
> are at http://bugs.python.org/issue2636, specifically from
> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
> Python 2.6 on Windows if you want to try it out.

Firstly Matthew, thank you for all your work on this. It brings some
very nice regex features to Python.

I've used Christopher Arndt's post as a basis and created a package
from your latest upload (issue2636-20090804.zip), which builds for
Python 2.5 and 2.6. I'd like to see this on PyPI, so that it's easier
to install the module and your work gets wider exposure. Would this be
alright, and would you prefer to have control of the upload, as this is
your work?

Below is the setup.py; the unicodedata_db.h files are taken from the
appropriate branches on svn.python.org:

#!/usr/bin/env python

import shutil
import sys
from distutils.core import setup, Extension

MAJOR, MINOR = sys.version_info[:2]

# Copy the appropriate unicodedata_db.h, which is not found in the
# published includes.
if (MAJOR, MINOR) == (2, 6):
    shutil.copy('Python26/unicodedata_db.h', './')
elif (MAJOR, MINOR) == (2, 5):
    shutil.copy('Python25/unicodedata_db.h', './')
else:
    print "WARNING: No unicodedata_db.h prepared."

setup(
    name='regex',
    version='20080804',
    description='Alternate regular expression module, to replace re.',
    author='Matthew Barnett',
    author_email='(e-mail address removed)',  # Obfuscated
    url='http://bugs.python.org/issue2636',
    py_modules=['regex'],
    ext_modules=[Extension('_regex', ['_regex.c'])],
)


Sincerely, Alex Willmer
 
