more than 100 capturing groups in a regex


Iain King

Steven said:
I haven't been troubled by exponentially increasing numbers of pop up
windows for a long, long time. But consider your question "why limit to
ten?" in a wider context.

Elevators always have a weight limit: the lift will operate up to N
kilograms, and stop at N+1. This limit is, in a sense, quite arbitrary,
since that value of N is well below the breaking point of the elevator
cables. Lift engineers, I'm told, use a safety factor of 10 (if the
cable will carry X kg without breaking, set N = X/10). This safety
factor is obviously arbitrary: a more cautious engineer might use a
factor of 100, or even 1000, while another might choose a factor of 5 or
2 or even 1.1. If engineers followed your advice, they would build
lifts that either carried nothing at all, or accepted any amount of
weight until the cable stretched and snapped.

Perhaps computer programmers would have fewer buffer overflow security
exploits if they took a leaf out of the engineers' book and built a few
more arbitrary safety factors into their data-handling routines. We can
argue whether 256 bytes is long enough for a URL or not, but I think we
can all agree that 3 MB for a URL is more than any person needs.

When you are creating an iterative solution to a problem, the ending
condition is not always well-specified. Trig functions such as sine and
cosine are an easy case: although they theoretically require an infinite
number of terms to generate an exact answer, the terms will eventually
underflow to zero, allowing us to stop the calculation.
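
For example, here is sine done that way (a sketch, fine for modest
arguments; the loop stops as soon as the next term is too small to
change the running total):

    def sine(x):
        # Taylor series: x - x**3/3! + x**5/5! - ...
        total, term, n = 0.0, x, 1
        while total + term != total:  # next term underflowed -> stop
            total += term
            term *= -x * x / ((n + 1) * (n + 2))
            n += 2
        return total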

But unfortunately that isn't the case for all mathematical calculations.
Often, the terms of our sequence do not converge to zero, due to round-off
error. Our answer cycles backwards and forwards between two or more
floating point approximations, e.g. 1.276805 <-> 1.276804. The developer
must make an arbitrary choice to stop after N iterations, if the answer
has not converged. Zero iterations is clearly pointless. One is useless.
And infinite iterations will simply never return an answer. So an
arbitrary choice for N is the only sensible way out.
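
In code, that arbitrary cap looks something like this (a sketch;
max_iter and tol are whatever values the developer judges reasonable):

    def fixed_point(f, x0, max_iter=100, tol=1e-12):
        x = x0
        for _ in range(max_iter):  # the arbitrary-but-necessary limit
            nxt = f(x)
            if abs(nxt - x) <= tol:
                return nxt  # converged
            x = nxt
        return x  # never converged; give up after max_iter rounds

fixed_point(math.cos, 1.0) settles near 0.739085 well inside the cap,
while a function that oscillates between two neighbouring floats still
returns after at most max_iter rounds.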

In a database, we might like to associate (say) multiple phone numbers
with a single account. Most good databases will allow you to do that, but
there is still the question of how to collect that information: you have
to provide some sort of user interface. Now, perhaps you are willing to
build some sort of web-based front-end that allows the user to add new
fields, put their phone number in the new field, with no limit. But
perhaps you are also collecting data using paper forms. So you make an
arbitrary choice: leave two (or three, or ten) boxes for phone numbers.
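
The storage side is the easy part -- a minimal sqlite3 sketch of the
one-to-many arrangement (table and column names invented for
illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE phone_number (
            account_id INTEGER REFERENCES account(id),
            number TEXT
        );
    """)
    con.execute("INSERT INTO account VALUES (1, 'Fred')")
    con.executemany("INSERT INTO phone_number VALUES (1, ?)",
                    [("555-0100",), ("555-0101",), ("555-0102",)])

The table happily takes any number of rows per account; the limit, if
any, lives in the form, not the schema.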

There are many other reasons why you might decide rationally to impose an
arbitrary limit on some process -- arbitrary does not necessarily mean
"for no good reason". Just make sure that the reason is a good one.


I think we are arguing at cross-purposes, mainly because the term
'arbitrary' has snuck in. The actual rule:

"Allow none of foo, one of foo, or any number of foo." A rule of
thumb for software design, which instructs one to not place random
limits on the number of instances of a given entity.

Firstly, 'for software design'. Not for field engineers servicing
elevators :)

Second, it's [random], not [arbitrary]. I took your use of arbitrary
to mean much the same thing - a number picked without any real
judgement involved, simply because it was deemed larger than some
assumed maximum size. The rule does not apply to a number selected for
good reason.

I don't think I get your phone record example: Surely you'd have the
client record in a one-to-many relationship with the phone number
records, so there would be (theoretically) no limit?

Your web interface rang a bell though - in GMail's contacts info page,
each contact has info stored in sections. Each of these sections
stores a heading, an address, and some fields. It defaults to two
fields, with an 'add field' button. Hitting it a lot, I found this
maxed out at 20 fields per section. You can also add more sections
though - I got bored hitting the add section button once I got to 51
sections with the button still active. I assume there is some limit to
the number of sections, but I don't know what it is :) GMail is awesome.

Anyway, back to the OP: in this specific case, the cap of 100 groups in
a RE seems random to me, so I think the rule applies.
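
You can poke at the cap directly with a quick probe (a sketch; it
assumes the limit shows up as an exception at compile time, though the
exact exception type may vary between versions):

    import re

    def max_groups(probe_up_to=300):
        for n in range(1, probe_up_to + 1):
            try:
                re.compile("(a)" * n)  # pattern with n capturing groups
            except (re.error, AssertionError):
                return n - 1
        return probe_up_to  # no cap found within the probe range

    print(max_groups())  # 99 on an engine capped at 100 groups
                         # (group 0 plus groups 1..99)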

Also, see "C Programmer's Disease":
http://www.catb.org/~esr/jargon/html/C/C-Programmers-Disease.html

Iain
 

Fredrik Lundh

Iain said:
Anyway, back to the OP: in this specific case, the cap of 100 groups in
a RE seems random to me, so I think the rule applies.

perhaps in the "indistinguishable from magic" sense.

if you want to know why 100 is a reasonable and non-random choice, I
suggest checking the RE documentation for "99 groups" and the special
meaning of group 0.

</F>
 

Joerg Schuster

if you want to know why 100 is a reasonable and non-random choice, I
suggest checking the RE documentation for "99 groups" and the special
meaning of group 0.

I have read everything I found about Python regular expressions. But I
am not able to understand what you mean. What is so special about 99?
 

Fredrik Lundh

Joerg said:
I have read everything I found about Python regular expressions. But I
am not able to understand what you mean. What is so special about 99?

it's the largest number that can be written with two decimal digits.
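
concretely (a quick sketch):

    import re

    m = re.match(r"(ab)\1", "abab")
    print(m.group(0))  # 'abab' -- group 0 is the whole match
    print(m.group(1))  # 'ab'   -- numbered groups start at 1

    # three octal digits are something else entirely:
    print(re.match(r"\101", "A").group())  # 'A', i.e. chr(0o101)

so backreference notation has room for \1 through \99, and group 0 is
already taken by the whole match: 100 groups in all.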

</F>
 

Iain King

Fredrik said:
perhaps in the "indistinguishable from magic" sense.

if you want to know why 100 is a reasonable and non-random choice, I
suggest checking the RE documentation for "99 groups" and the special
meaning of group 0.

</F>

Ah, doh! Of course. Oh well then... still, doesn't Python's RE
engine support named groups? That would be cumbersome, but would allow
you to go above 100...
 

André Malo

* "Iain King said:
Ah, doh! Of course. Oh well then... still, doesn't python's RE
engine support named groups? That would be cumbersome, but would allow
you to go above 100...

The named groups are built on top of numbered captures. They are mapped by the
parser and the match instance's group method. The regex matcher itself never
sees these names.
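
You can see the mapping on the compiled pattern object (a quick
demonstration):

    import re

    p = re.compile(r"(?P<word>\w+)\s+(?P<num>\d+)")
    print(p.groupindex)  # maps each name to its number: word -> 1, num -> 2
    m = p.match("spam 42")
    print(m.group("num") == m.group(2))  # True: the very same capture

So named groups make long patterns easier to read, but they don't buy
you any extra capture slots.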

nd
 

Joerg Schuster

My first test program was far too naive. Evil things do happen. Simply
removing the code that restricts the number of capturing groups to 100
is not a solution.
 

D H

Fredrik said:
it's the largest number that can be written with two decimal digits.


It's a conflict between python's syntax for regex back references and
octal number literals. Probably wasn't noticed until way too late, and
now it will never change.
 

Joerg Schuster

It's a conflict between python's syntax for regex back references and
octal number literals. Probably wasn't noticed until way too late, and
now it will never change.

So "reasonable choice" is not a really good description of the
phenomenon.
 

skip

DH> It's a conflict between python's syntax for regex back references
DH> and octal number literals. Probably wasn't noticed until way too
DH> late, and now it will never change.

I suspect it comes from Perl, since Python's regular expression engine tries
pretty hard to be compatible with Perl's, at least for the basics.

Skip
 

Tim Peters

[DH]
It's a conflict between python's syntax for regex back references
and octal number literals. Probably wasn't noticed until way too
late, and now it will never change.
[[email protected]]
I suspect it comes from Perl, since Python's regular expression engine tries
pretty hard to be compatible with Perl's, at least for the basics.

"No" to all the above <wink>. The limitation to 99 in backreference
notation was thoroughly discussed on the Python String-SIG at the
time, and it was deliberately not bug-compatible with the Perl of that
time.

In the Perl of that time (no idea what's true now), e.g., \123 in a
regexp was an octal escape if it appeared before or within the 123rd
capturing group, but was a backreference to the 123rd capturing group
if it appeared after the 123rd capturing group. So, yes, two
different instances of "\123" in a single regexp could have different
meanings (meaning chr(83) in one place, and a backreference to group
123 in another, and there's no way to tell the difference without
counting the number of preceding capturing groups).

That's so horridly un-Pythonic that we drew the line there. Nobody
had a sane use case for more than 99 backreferences, so "who cares?"
won.

Note that this isn't a reason for limiting the number of capturing
groups. It only accounts for why we didn't care that you couldn't
write a _backreference_ to a capturing group higher than number 99
using "\nnn" notation.
 
