New implementation of re module


MRAB

Hi all,

I've been working on a new implementation of the re module. The details
are at http://bugs.python.org/issue2636, specifically from
http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
Python 2.6 on Windows if you want to try it out.

I'm interested in how fast it is generally, compared with the current re
module, but especially when faced with those 'pathological' regular
expressions which seem to take a long time to finish, for example:

re.search(r"^(.+|D)*A$", "x" * 25 + "B")

which on my PC (1.8GHz) takes 18.98 secs with the re module but
<0.01 secs with this new implementation.
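For anyone wanting to reproduce the comparison, here is a minimal timing sketch against the stdlib re (a sketch, not MRAB's benchmark; the input sizes are kept small so the backtracking matcher still finishes quickly):

```python
import re
import time

# Pathological pattern from above: the backtracking matcher tries
# exponentially many ways to split the x's among the .+ alternatives
# before concluding there is no match.
pattern = re.compile(r"^(.+|D)*A$")

for n in (5, 10, 15):
    text = "x" * n + "B"
    start = time.time()
    result = pattern.search(text)  # always None: the text ends in B, not A
    print("n=%2d -> %r in %.4fs" % (n, result, time.time() - start))
```

Raising n a few more steps roughly doubles the time per step with the stdlib re, which is the exponential behaviour being discussed.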

TIA
 

William Dode

Hi all,

I've been working on a new implementation of the re module. The details
are at http://bugs.python.org/issue2636, specifically from
http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
Python 2.6 on Windows if you want to try it out.

Can someone remind me how to compile it (on Debian Lenny), if possible
with Python 2.5? I also have Python 3.1, which I built myself...

I could test it with pytextile; I have a bunch of texts to benchmark
and compare.

Did you announce it on the unladen-swallow list? They wanted to hack
on re as well...
 

MRAB

William said:
> Can someone remind me how to compile it (on Debian Lenny), if
> possible with Python 2.5? I also have Python 3.1, which I built
> myself...
>
> I could test it with pytextile; I have a bunch of texts to benchmark
> and compare.

All I can do is point you to
http://docs.python.org/extending/extending.html.

For Linux (which I don't have) you'll need to use _regex.h and _regex.c
to compile to _regex.so instead of _regex.pyd.

> Did you announce it on the unladen-swallow list? They wanted to hack
> on re as well...

No. I haven't subscribed to it.
 

Aahz

MRAB said:
> I've been working on a new implementation of the re module. The details
> are at http://bugs.python.org/issue2636, specifically from
> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
> Python 2.6 on Windows if you want to try it out.

How does it handle the re module's unit tests?
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer
 

Aahz

MRAB said:
> Basically, it passes all those tests I expect it to pass. :)
>
> It fails those where the intended behaviour has changed, such as re.sub
> treating unmatched groups as empty strings, as requested in
> http://bugs.python.org/issue1519638.

Then you should definitely publish to PyPI and post a message to
c.l.py.announce to get more users.
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

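The behaviour change MRAB mentions can be illustrated with a quick sketch (this is the semantics requested in issue1519638, later adopted by the stdlib itself in Python 3.5; under Python 2.6 the same call raised an "unmatched group" error):

```python
import re

# Alternation where each match sets only one of the two groups; the
# unmatched group contributes the empty string instead of an error.
result = re.sub(r'(\d+)|(cat)', r'[\1\2]', '7 cats')
# result == '[7] [cat]s'
```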
 

Mike

MRAB said:
> I've been working on a new implementation of the re module.

Fabulous!

If you're extending/changing the interface, there are a couple of sore
points in the current implementation I'd love to see addressed:

- findall/finditer doesn't find overlapping matches. Sometimes you
really *do* want to know all possible matches, even if they overlap.
This comes up in bioinformatics, for example.

- split won't split on empty patterns, e.g. empty lookahead patterns.
This means that it can't be used for a whole class of interesting
cases. This has been discussed previously:

http://bugs.python.org/issue3262
http://bugs.python.org/issue852532
http://bugs.python.org/issue988761

- It'd be nice to have a version of split that generates the parts
(one by one) rather than returning the whole list.

- Repeated subgroup match information is not available. That is, for
a match like this

re.match('(.){3}', 'xyz')

there's no way to discover that the subgroup first matched 'x', then
matched 'y', and finally matched 'z'. Here is one past proposal
(mine), perhaps over-complex, to address this problem:

http://mail.python.org/pipermail/python-dev/2004-August/047238.html
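As an aside on the first point: overlapping matches can already be coaxed out of the stdlib re by wrapping the pattern in a zero-width lookahead, at the cost of some awkwardness. A sketch (the helper name is mine, not part of any module):

```python
import re

# Zero-width lookahead workaround: the lookahead consumes no text, so
# the scan resumes one character later and overlapping hits are found.
def findall_overlapping(pattern, text):
    return [m.group(1) for m in re.finditer('(?=(%s))' % pattern, text)]

# Toy bioinformatics-style query:
# findall_overlapping('ATA', 'ATATA') gives ['ATA', 'ATA'], whereas
# plain re.findall('ATA', 'ATATA') finds only the first occurrence.
```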

Mike
 

MRAB

Mike said:
> Fabulous!
>
> If you're extending/changing the interface, there are a couple of sore
> points in the current implementation I'd love to see addressed:
>
> - findall/finditer doesn't find overlapping matches. Sometimes you
> really *do* want to know all possible matches, even if they overlap.
> This comes up in bioinformatics, for example.

Perhaps by adding "overlapped=True"?

> - split won't split on empty patterns, e.g. empty lookahead patterns.
> This means that it can't be used for a whole class of interesting
> cases. This has been discussed previously:
>
> http://bugs.python.org/issue3262
> http://bugs.python.org/issue852532
> http://bugs.python.org/issue988761

Already addressed (see issue2636 for the full details).

> - It'd be nice to have a version of split that generates the parts
> (one by one) rather than returning the whole list.

Hmm, re.splititer() perhaps.

> - Repeated subgroup match information is not available. That is, for
> a match like this
>
> re.match('(.){3}', 'xyz')
>
> there's no way to discover that the subgroup first matched 'x', then
> matched 'y', and finally matched 'z'. Here is one past proposal
> (mine), perhaps over-complex, to address this problem:
>
> http://mail.python.org/pipermail/python-dev/2004-August/047238.html

Yikes! I think I'll let you code that... :)
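A splititer() in the spirit of this suggestion is straightforward to sketch on top of the existing finditer (a sketch only, assuming a pattern without capture groups, not MRAB's eventual API):

```python
import re

# Hypothetical splititer(): yield the pieces of a split one at a time
# instead of building the whole list.
def splititer(pattern, text):
    pos = 0
    for m in re.finditer(pattern, text):
        yield text[pos:m.start()]
        pos = m.end()
    yield text[pos:]  # the tail after the last separator
```

For example, list(splititer(r',\s*', 'a, b,c')) yields the same pieces as re.split(r',\s*', 'a, b,c').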
 

Mike

> Perhaps by adding "overlapped=True"?

Something like that would be great, yes.

> Already addressed (see issue2636 for the full details).

Glad to hear it.

> Yikes! I think I'll let you code that... :)

I agree that that document looks a little scary--maybe I was trying to
bite off too much at once.

My intuition, though, is that the basic idea should be fairly simple
to implement, at least for a depth-first matcher. The repeated match
subgroups are already being discovered, it's just that they're not
being saved, so there's no way to report them out once a complete
match is found. If some trail of breadcrumbs were pushed onto a stack
during the DFS, it could be traced at the end. And the whole thing
might not even be that expensive to do.

The hardest parts about this, in my mind, are figuring out how to
report the repeated matches out in a useful form (hence all that
detail in the proposal), and getting users to understand that using
this feature *could* suck up a lot of memory, if they're not careful.

As always, it's possible that my intuition is totally wrong. Plus I'm
not sure how this would work out in the breadth-first case.

Details aside, I would really, really, really like to have a way to
get at the repeated subgroup matches. I write a lot of code that
would be one-liners if this capability existed. Plus, it just plain
burns me that Python is discovering this information but impudently
refuses to tell me what it's found! ;-)
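The limitation being discussed is easy to demonstrate, along with the usual workaround of matching the repeated unit separately (a sketch, not the interface proposed in the python-dev thread):

```python
import re

m = re.match('(.){3}', 'xyz')
# The stdlib keeps only the last repetition of the group:
# m.group(1) == 'z'

# Workaround: run the unit pattern over the matched span and collect
# every capture individually.
pieces = [u.group(1) for u in re.finditer('(.)', m.group(0))]
# pieces == ['x', 'y', 'z']
```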
 

Piet van Oostrum

MRAB said:
M> Hi all,
M> I've been working on a new implementation of the re module. The details
M> are at http://bugs.python.org/issue2636, specifically from
M> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
M> Python 2.6 on Windows if you want to try it out.
M> I'm interested in how fast it is generally, compared with the current re
M> module, but especially when faced with those 'pathological' regular
M> expressions which seem to take a long time to finish, for example:
M> re.search(r"^(.+|D)*A$", "x" * 25 + "B")
M> which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs
M> with this new implementation.

Is this version also going to use the Thompson approach?
 

MRAB

Piet said:
Is this version also going to use the Thompson approach?

That approach is what inspired the method in the new implementation, but
its speed comes from asking _whether_ there's a match, not _where_
there's a match, so it can ignore the route taken to get the match.

"grep", for example, selects the lines which match a pattern, whereas
the typical Python (and Perl) usage is to extract part of a string, or
to find out which part of the string matched.

Trying to support lazy and possessive quantifiers in Thompson's approach
is tricky (at least, I haven't been able to figure it out), and
back-references are right out! :)

There's always the chance that I could come up with some sort of hybrid
scheme, applying different approaches depending on certain conditions.
It already switches from depth-first to breadth-first when it's taken
too many iterations without advancing and merges similar states, which
is why it's so much better with 'pathological' patterns; I just have to
get the depth-first phase to be as fast as the current 're' module.
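For readers unfamiliar with the Thompson approach being referred to: the matcher advances a *set* of NFA states in lockstep over the input, so the cost per character is bounded by the number of states and no backtracking ever happens. A toy sketch for the fixed pattern a*b (hand-built states, not a general compiler):

```python
# Minimal breadth-first (Thompson-style) matcher for the pattern a*b.
def matches_a_star_b(text):
    LOOP, ACCEPT = 0, 1          # state 0: inside a*; state 1: b consumed
    states = {LOOP}
    for ch in text:
        nxt = set()
        if LOOP in states:
            if ch == 'a':
                nxt.add(LOOP)    # stay in the a* loop
            elif ch == 'b':
                nxt.add(ACCEPT)  # take the final b
        states = nxt             # every live path advances together
    return ACCEPT in states
```

Because duplicate states are merged at each step, pathological inputs cost linear time; the trade-off, as noted above, is that tracking *where* and *how* the match happened (captures, lazy/possessive quantifiers, back-references) no longer comes for free.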
 

John Machin

MRAB said:
> Hi all,
>
> I've been working on a new implementation of the re module. The details
> are at http://bugs.python.org/issue2636, specifically from
> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
> Python 2.6 on Windows if you want to try it out.

Where/how should we report perceived bugs: On that bugs.python.org
issue? Here? Private e-mail?
 

Alex Willmer

MRAB said:
> Hi all,
>
> I've been working on a new implementation of the re module. The details
> are at http://bugs.python.org/issue2636, specifically from
> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
> Python 2.6 on Windows if you want to try it out.

Firstly Matthew, thank you for all your work on this. It brings some
very nice regex features to Python.

I've used Christopher Arndt's post as a basis and created a package
from your latest upload (issue2636-20090804.zip), which builds for
Python 2.5 and 2.6. I'd like to see this on PyPI, so that it's easier
to install the module and your work gets wider exposure. Would this be
alright, and would you prefer to have control of the upload, as this is
your work?

Below is the setup.py; the unicodedata_db.h files are taken from the
appropriate branches on svn.python.org:

#!/usr/bin/env python

import shutil
import sys
from distutils.core import setup, Extension

MAJOR, MINOR = sys.version_info[:2]

# Copy the appropriate unicodedata_db.h, which is not found in the
# published includes.
if (MAJOR, MINOR) == (2, 6):
    shutil.copy('Python26/unicodedata_db.h', './')
elif (MAJOR, MINOR) == (2, 5):
    shutil.copy('Python25/unicodedata_db.h', './')
else:
    print "WARNING: No unicodedata_db.h prepared."

setup(
    name='regex',
    version='20080804',
    description='Alternate regular expression module, to replace re.',
    author='Matthew Barnett',
    author_email='(e-mail address removed)',  # Obfuscated
    url='http://bugs.python.org/issue2636',
    py_modules=['regex'],
    ext_modules=[Extension('_regex', ['_regex.c'])],
)


Sincerely, Alex Willmer
 
