more than 100 capturing groups in a regex

J

Joerg Schuster

Hello,

Python regular expressions must not have more than 100 capturing
groups. The source code responsible for this reads as follows:


# XXX: <fl> get rid of this limitation!
if p.pattern.groups > 100:
raise AssertionError(
"sorry, but this version only supports 100 named groups"
)

I have been waiting a long time now for Python to get rid of this
limitation.
I could make a program of mine a lot faster with an easy hack if Python
did not have it.

My question is: Does anyone know if the problem is going to be fixed in
the next few months or so? Or is there a way to circumvent it?


Jörg Schuster
 
S

skip

Joerg> Or is there a way to circumvent [capturing groups limitation]?

Sure, submit a patch to SourceForge that removes the restriction.

I've never come anywhere close to creating regular expressions that need to
capture 100 groups even though I generate regular expressions from a
higher-level representation. I suspect few will have hit that limit.
Perhaps explain what motivates you to want to capture that many groups.
Other people may be able to suggest alternatives. And remember:

Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. --Jamie Zawinski

Skip
 
J

Joerg Schuster

Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two problems.
--Jamie Zawinski

Thanks for the citation.

If my goal had been to redesign my program, I would not ask questions
about regular expressions. I do not have the time to redesign my
program. And knowing that my situation would be better, if I had
written other code in the past, does not help me at all.

I just want to use more than 100 capturing groups. If someone told me
that it is very unlikely for Python to get rid of the said limitation,
I would recode part of my program in C++ using pcre. But I would prefer
to be able to do everything in Python. That is why I asked.

Jörg
 
P

Peter Hansen

Joerg said:
I just want to use more than 100 capturing groups. If someone told me
that it is very unlikely for Python to get rid of the said limitation,
I would recode part of my program in C++ using pcre.

It is very unlikely for Python to get rid of the said limitation.

-Peter
 
F

Frithiof Andreas Jensen

Hello,

Python regular expressions must not have more than 100 capturing
groups.

Really ??

I have been waiting a long time now for Python to get rid of this
limitation.

Ahh - The "dark side" of Open Source:

If nobody cares, then you will have to do it yourself (and often people do
not care because nobody had the need to go there - for good reasons).

My question is: Does anyone know if the problem is going to be fixed in
the next few months or so? Or is there a way to circumvent it?

After a quick glean across the source code for the sre module, it appears
that the only place the code mentions a limit of 100 groups is in fact the
place that you quote.

I suspect it is there for some historic reason - the people to ask is of
course "pythonware.com" who wrote it; there may well be a good reason for
the limitation.

What happens if you up the limit to whatever you need?
 
F

Frithiof Andreas Jensen

Joerg Schuster said:
Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two problems.
--Jamie Zawinski

Thanks for the citation.

If my goal had been to redesign my program, I would not ask questions
about regular expressions. I do not have the time to redesign my
program. And knowing that my situation would be better, if I had
written other code in the past, does not help me at all.

Experience shows that, in any project, there is always time redo what was
not made properly the first time a deadline was dreamed up.

I just want to use more than 100 capturing groups. If someone told me
that it is very unlikely for Python to get rid of the said limitation,
I would recode part of my program in C++ using pcre.

See!? There *is* Time.
 
J

Joerg Schuster

What happens if you up the limit to whatever you need?

Good idea. I just tried this. Nothing evil seems to happen. This seems
to be a solution. Thanks.

Jörg
 
J

Joerg Schuster

No limitation at all would be best. If a limitation is necessary, then
the more capturing groups, the better. At the time being, I would be
really happy about having the possibility to use 10000 capturing
groups.

Jörg
 
J

Jorge Godoy

Joerg Schuster said:
No limitation at all would be best. If a limitation is necessary, then
the more capturing groups, the better. At the time being, I would be
really happy about having the possibility to use 10000 capturing
groups.

I'm sorry, I missed the beginning of this thread and it has already expired on
my news server, but what is the reason for so much capturing groups? I
imagine that coding this and keeping code maintenable is a huge effort. Oh,
and I came from Perl, where I used to think in regexps... In Python I almost
never use them.
 
S

Steve Holden

Joerg said:
Good idea. I just tried this. Nothing evil seems to happen. This seems
to be a solution. Thanks.

Jörg
The joys of open source. Just remember you have now made your program
non-portable. Hope this isn't an issue.

regards
Steve
 
J

Joerg Schuster

but what is the reason for so much capturing groups? I
imagine that coding this and keeping code maintenable is a huge effort.

User input is compiled to regular expressions. The user does not have
to worry about those groups.
 
S

Steven D'Aprano

No limitation at all would be best. If a limitation is necessary, then
the more capturing groups, the better. At the time being, I would be
really happy about having the possibility to use 10000 capturing
groups.

Do you really think that the regular expression needed to do that would be
maintainable?

I'm also curious, what sort of usage case would need ten thousand
capturing groups? I'd love to see the performance, especially if all ten
thousand of them do backtracking.
 
J

Joerg Schuster

The joys of open source. Just remember you have now
made your program
non-portable. Hope this isn't an issue.

Of course portability is an issue -- on the long run. But on the short
run I am really glad to be able to do a 1 second demo run on my
notebook instead of a 20 seconds demo run. And I am especially glad to
get this 1 second demo by doing a 30-minute hack. (Hopefully ...)
 
J

Jorge Godoy

Joerg Schuster said:
User input is compiled to regular expressions. The user does not have
to worry about those groups.

And what is the problem with something like getopt, optparse, etc.?

And 10K groups on a single input line?
 
S

Steven D'Aprano



Nice in principle, not always practical. Sometimes the choice is, "do you
want it today with arbitrary limits, or six months from now with bugs
but no limits?"

If assigning arbitrary limits prevents worse problems, well, then go for
the limit. For instance, anyone who has fought browser pops ups ("close
one window, and ten more open") may have wished that the browser
implemented an arbitrary limit of, say, ten pop ups. Or even zero :)

Of course, having said that, in those cases, the limit isn't really
arbitrary. In cases of genuinely arbitrary limits, I agree they are
pointless and annoying (as opposed to having a point but still being
annoying).
 
J

Joerg Schuster

You did not quite understand me. I will give you some details:

My program is a compiler for a certain type of linguistic grammars.
I.e. the user gives *grammar files* to my program. When the grammar
files have been compiled, they can be applied to strings (of a certain
language, e.g. English).

In the grammar files, the user does not have to deal with "capturing
groups". He even does not have to deal with regular expressions. He
just writes a grammar of a certain type. My program then compiles the
grammar into a cascade of transducers. Each of the transducers is
internally represented as a pair (REGEX, ACTION), where REGEX is a
Python regular expression and ACTION a Python function. I.e.: The
meaning of the grammar is: For each line of the input string: if REGEX
matches the line, then apply ACTION to it.

On various levels, the user may produce *disjunctions*. At the time
being, these disjunctions are internally represented by something like:

if regex1: action1()
elif regex2: action2()
elif ...
eliif regexn: actionn()

It would be nicer (and faster) to have just one regex and run that
action A such that the *capturing group* with name A ("?P<A>...")
matched.

Now, I could of course internally create my very own transducers. But
the re module is a module that generates fsa and fsa do part of the
work that a transducer does. So why reinvent the wheel?


Jörg
 
I

Iain King

Steven said:
Nice in principle, not always practical. Sometimes the choice is, "do you
want it today with arbitrary limits, or six months from now with bugs
but no limits?"

If assigning arbitrary limits prevents worse problems, well, then go for
the limit. For instance, anyone who has fought browser pops ups ("close
one window, and ten more open") may have wished that the browser
implemented an arbitrary limit of, say, ten pop ups. Or even zero :)

Well, exactly. Why limit to ten? The user is either going to want to
see pop-ups, or not. So either limit to 0, or to infinity (and indeed,
this is what most browsers do).
Note the jargon entry defines infinity in this case to me the largest
possible amount given whatever ram/disk space/processing power you have
available.

Also: These arbitrary limits tend to stem from languages which
predominantly use fixed size arrays - DIM in basic, or malloc in C.
The native python component is the list, which doesn't have a max size,
so these problems should be encountered less in python code; by it's
nature, python steers you away from this mistake.

Iain
 
S

Steven D'Aprano

Well, exactly. Why limit to ten? The user is either going to want to
see pop-ups, or not. So either limit to 0, or to infinity (and indeed,
this is what most browsers do).

I haven't been troubled by exponentially increasing numbers of pop up
windows for a long, long time. But consider your question "why limit to
ten?" in a wider context.

Elevators always have a weight limit: the lift will operate up to N
kilograms, and stop at N+1. This limit is, in a sense, quite arbitrary,
since that value of N is well below the breaking point of the elevator
cables. Lift engineers, I'm told, use a safety factor of 10 (if the
cable will carry X kg without breaking, set N = X/10). This safety
factor is obviously arbitrary: a more cautious engineer might use a
factor of 100, or even 1000, while another might choose a factor of 5 or
2 or even 1.1. If engineers followed your advice, they would build lifts
that either carried nothing at all, or accepted as much weight until the
cable stretched and snapped.

Perhaps computer programmers would have fewer buffer overflow security
exploits if they took a leaf out of engineers' book and built in a few
more arbitrary safety factors into their data-handling routines. We can
argue whether 256 bytes is long enough for a URL or not, but I think we
can all agree that 3 MB for a URL is more than any person needs.

When you are creating an iterative solution to a problem, the ending
condition is not always well-specified. Trig functions such as sine and
cosine are an easy case: although they theoretically require an infinite
number of terms to generate an exact answer, the terms will eventually
underflow to zero allowing us to stop the calculation.

But unfortunately that isn't the case for all mathematical calculations.
Often, the terms of our sequence do not converge to zero, due to round-off
error. Our answer cycles backwards and forwards between two or more
floating point approximations, e.g. 1.276805 <-> 1.276804. The developer
must make an arbitrary choice to stop after N iterations, if the answer
has not converged. Zero iterations is clearly pointless. One is useless.
And infinite iterations will simply never return an answer. So an
arbitrary choice for N is the only sensible way out.

In a database, we might like to associate (say) multiple phone numbers
with a single account. Most good databases will allow you to do that, but
there is still the question of how to collect that information: you have
to provide some sort of user interface. Now, perhaps you are willing to
build some sort of web-based front-end that allows the user to add new
fields, put their phone number in the new field, with no limit. But
perhaps you are also collecting data using paper forms. So you make an
arbitrary choice: leave two (or three, or ten) boxes for phone numbers.

There are many other reasons why you might decide rationally to impose an
arbitrary limit on some process -- arbitrary does not necessarily mean
"for no good reason". Just make sure that the reason is a good one.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,763
Messages
2,569,562
Members
45,039
Latest member
CasimiraVa

Latest Threads

Top