more than 100 capturing groups in a regex

Discussion in 'Python' started by Joerg Schuster, Oct 24, 2005.

  1. Hello,

    Python regular expressions must not have more than 100 capturing
    groups. The source code responsible for this reads as follows:


    # XXX: <fl> get rid of this limitation!
    if p.pattern.groups > 100:
    raise AssertionError(
    "sorry, but this version only supports 100 named groups"
    )

    I have been waiting a long time now for Python to get rid of this
    limitation.
    I could make a program of mine a lot faster with an easy hack if Python
    did not have it.

    My question is: Does anyone know if the problem is going to be fixed in
    the next few months or so? Or is there a way to circumvent it?


    Jörg Schuster
     
    Joerg Schuster, Oct 24, 2005
    #1
    1. Advertising

  2. Joerg Schuster

    Guest

    Joerg> Or is there a way to circumvent [capturing groups limitation]?

    Sure, submit a patch to SourceForge that removes the restriction.

    I've never come anywhere close to creating regular expressions that need to
    capture 100 groups even though I generate regular expressions from a
    higher-level representation. I suspect few will have hit that limit.
    Perhaps explain what motivates you to want to capture that many groups.
    Other people may be able to suggest alternatives. And remember:

    Some people, when confronted with a problem, think "I know, I'll use
    regular expressions." Now they have two problems. --Jamie Zawinski

    Skip
     
    , Oct 24, 2005
    #2
    1. Advertising

  3. > Some people, when confronted with a problem, think "I know,
    > I'll use regular expressions." Now they have two problems.
    > --Jamie Zawinski


    Thanks for the citation.

    If my goal had been to redesign my program, I would not ask questions
    about regular expressions. I do not have the time to redesign my
    program. And knowing that my situation would be better, if I had
    written other code in the past, does not help me at all.

    I just want to use more than 100 capturing groups. If someone told me
    that it is very unlikely for Python to get rid of the said limitation,
    I would recode part of my program in C++ using pcre. But I would prefer
    to be able to do everything in Python. That is why I asked.

    Jörg
     
    Joerg Schuster, Oct 25, 2005
    #3
  4. Joerg Schuster wrote:

    > I just want to use more than 100 capturing groups.


    define "more" (101, 200, 1000, 100000, ... ?)

    </F>
     
    Fredrik Lundh, Oct 25, 2005
    #4
  5. Joerg Schuster

    Peter Hansen Guest

    Joerg Schuster wrote:
    > I just want to use more than 100 capturing groups. If someone told me
    > that it is very unlikely for Python to get rid of the said limitation,
    > I would recode part of my program in C++ using pcre.


    It is very unlikely for Python to get rid of the said limitation.

    -Peter
     
    Peter Hansen, Oct 25, 2005
    #5
  6. "Joerg Schuster" <> wrote in message
    news:...
    Hello,

    Python regular expressions must not have more than 100 capturing
    groups.

    Really ??

    I have been waiting a long time now for Python to get rid of this
    limitation.

    Ahh - The "dark side" of Open Source:

    If nobody cares, then you will have to do it yourself (and often people do
    not care because nobody had the need to go there - for good reasons).

    My question is: Does anyone know if the problem is going to be fixed in
    the next few months or so? Or is there a way to circumvent it?

    After a quick glean across the source code for the sre module, it appears
    that the only place the code mentions a limit of 100 groups is in fact the
    place that you quote.

    I suspect it is there for some historic reason - the people to ask is of
    course "pythonware.com" who wrote it; there may well be a good reason for
    the limitation.

    What happens if you up the limit to whatever you need?
     
    Frithiof Andreas Jensen, Oct 25, 2005
    #6
  7. "Joerg Schuster" <> wrote in message
    news:...
    > Some people, when confronted with a problem, think "I know,
    > I'll use regular expressions." Now they have two problems.
    > --Jamie Zawinski


    Thanks for the citation.

    If my goal had been to redesign my program, I would not ask questions
    about regular expressions. I do not have the time to redesign my
    program. And knowing that my situation would be better, if I had
    written other code in the past, does not help me at all.

    Experience shows that, in any project, there is always time redo what was
    not made properly the first time a deadline was dreamed up.

    I just want to use more than 100 capturing groups. If someone told me
    that it is very unlikely for Python to get rid of the said limitation,
    I would recode part of my program in C++ using pcre.

    See!? There *is* Time.
     
    Frithiof Andreas Jensen, Oct 25, 2005
    #7
  8. > What happens if you up the limit to whatever you need?

    Good idea. I just tried this. Nothing evil seems to happen. This seems
    to be a solution. Thanks.

    Jörg
     
    Joerg Schuster, Oct 25, 2005
    #8
  9. No limitation at all would be best. If a limitation is necessary, then
    the more capturing groups, the better. At the time being, I would be
    really happy about having the possibility to use 10000 capturing
    groups.

    Jörg
     
    Joerg Schuster, Oct 25, 2005
    #9
  10. Joerg Schuster

    Jorge Godoy Guest

    "Joerg Schuster" <> writes:

    > No limitation at all would be best. If a limitation is necessary, then
    > the more capturing groups, the better. At the time being, I would be
    > really happy about having the possibility to use 10000 capturing
    > groups.


    I'm sorry, I missed the beginning of this thread and it has already expired on
    my news server, but what is the reason for so much capturing groups? I
    imagine that coding this and keeping code maintenable is a huge effort. Oh,
    and I came from Perl, where I used to think in regexps... In Python I almost
    never use them.

    --
    Jorge Godoy <>
     
    Jorge Godoy, Oct 25, 2005
    #10
  11. Joerg Schuster

    Steve Holden Guest

    Joerg Schuster wrote:
    >>What happens if you up the limit to whatever you need?

    >
    >
    > Good idea. I just tried this. Nothing evil seems to happen. This seems
    > to be a solution. Thanks.
    >
    > Jörg
    >

    The joys of open source. Just remember you have now made your program
    non-portable. Hope this isn't an issue.

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC www.holdenweb.com
    PyCon TX 2006 www.python.org/pycon/
     
    Steve Holden, Oct 25, 2005
    #11
  12. > but what is the reason for so much capturing groups? I
    > imagine that coding this and keeping code maintenable is a huge effort.


    User input is compiled to regular expressions. The user does not have
    to worry about those groups.
     
    Joerg Schuster, Oct 25, 2005
    #12
  13. On Tue, 25 Oct 2005 03:55:17 -0700, Joerg Schuster wrote:

    > No limitation at all would be best. If a limitation is necessary, then
    > the more capturing groups, the better. At the time being, I would be
    > really happy about having the possibility to use 10000 capturing
    > groups.


    Do you really think that the regular expression needed to do that would be
    maintainable?

    I'm also curious, what sort of usage case would need ten thousand
    capturing groups? I'd love to see the performance, especially if all ten
    thousand of them do backtracking.

    --
    Steven.
     
    Steven D'Aprano, Oct 25, 2005
    #13
  14. > The joys of open source. Just remember you have now
    > made your program
    > non-portable. Hope this isn't an issue.


    Of course portability is an issue -- on the long run. But on the short
    run I am really glad to be able to do a 1 second demo run on my
    notebook instead of a 20 seconds demo run. And I am especially glad to
    get this 1 second demo by doing a 30-minute hack. (Hopefully ...)
     
    Joerg Schuster, Oct 25, 2005
    #14
  15. Joerg Schuster

    Iain King Guest

    Iain King, Oct 25, 2005
    #15
  16. Joerg Schuster

    Jorge Godoy Guest

    "Joerg Schuster" <> writes:

    >> but what is the reason for so much capturing groups? I
    >> imagine that coding this and keeping code maintenable is a huge effort.

    >
    > User input is compiled to regular expressions. The user does not have
    > to worry about those groups.


    And what is the problem with something like getopt, optparse, etc.?

    And 10K groups on a single input line?

    --
    Jorge Godoy <>
     
    Jorge Godoy, Oct 25, 2005
    #16
  17. On Tue, 25 Oct 2005 05:17:52 -0700, Iain King wrote:

    >
    > Fredrik Lundh wrote:
    >> Joerg Schuster wrote:
    >>
    >> > I just want to use more than 100 capturing groups.

    >>
    >> define "more" (101, 200, 1000, 100000, ... ?)
    >>
    >> </F>

    >
    > The Zero-One-Infinity Rule:
    >
    > http://www.catb.org/~esr/jargon/html/Z/Zero-One-Infinity-Rule.html



    Nice in principle, not always practical. Sometimes the choice is, "do you
    want it today with arbitrary limits, or six months from now with bugs
    but no limits?"

    If assigning arbitrary limits prevents worse problems, well, then go for
    the limit. For instance, anyone who has fought browser pops ups ("close
    one window, and ten more open") may have wished that the browser
    implemented an arbitrary limit of, say, ten pop ups. Or even zero :)

    Of course, having said that, in those cases, the limit isn't really
    arbitrary. In cases of genuinely arbitrary limits, I agree they are
    pointless and annoying (as opposed to having a point but still being
    annoying).


    --
    Steven.
     
    Steven D'Aprano, Oct 25, 2005
    #17
  18. You did not quite understand me. I will give you some details:

    My program is a compiler for a certain type of linguistic grammars.
    I.e. the user gives *grammar files* to my program. When the grammar
    files have been compiled, they can be applied to strings (of a certain
    language, e.g. English).

    In the grammar files, the user does not have to deal with "capturing
    groups". He even does not have to deal with regular expressions. He
    just writes a grammar of a certain type. My program then compiles the
    grammar into a cascade of transducers. Each of the transducers is
    internally represented as a pair (REGEX, ACTION), where REGEX is a
    Python regular expression and ACTION a Python function. I.e.: The
    meaning of the grammar is: For each line of the input string: if REGEX
    matches the line, then apply ACTION to it.

    On various levels, the user may produce *disjunctions*. At the time
    being, these disjunctions are internally represented by something like:

    if regex1: action1()
    elif regex2: action2()
    elif ...
    eliif regexn: actionn()

    It would be nicer (and faster) to have just one regex and run that
    action A such that the *capturing group* with name A ("?P<A>...")
    matched.

    Now, I could of course internally create my very own transducers. But
    the re module is a module that generates fsa and fsa do part of the
    work that a transducer does. So why reinvent the wheel?


    Jörg
     
    Joerg Schuster, Oct 25, 2005
    #18
  19. Joerg Schuster

    Iain King Guest

    Steven D'Aprano wrote:
    > On Tue, 25 Oct 2005 05:17:52 -0700, Iain King wrote:
    >
    > >
    > > Fredrik Lundh wrote:
    > >> Joerg Schuster wrote:
    > >>
    > >> > I just want to use more than 100 capturing groups.
    > >>
    > >> define "more" (101, 200, 1000, 100000, ... ?)
    > >>
    > >> </F>

    > >
    > > The Zero-One-Infinity Rule:
    > >
    > > http://www.catb.org/~esr/jargon/html/Z/Zero-One-Infinity-Rule.html

    >
    >
    > Nice in principle, not always practical. Sometimes the choice is, "do you
    > want it today with arbitrary limits, or six months from now with bugs
    > but no limits?"
    >
    > If assigning arbitrary limits prevents worse problems, well, then go for
    > the limit. For instance, anyone who has fought browser pops ups ("close
    > one window, and ten more open") may have wished that the browser
    > implemented an arbitrary limit of, say, ten pop ups. Or even zero :)
    >


    Well, exactly. Why limit to ten? The user is either going to want to
    see pop-ups, or not. So either limit to 0, or to infinity (and indeed,
    this is what most browsers do).
    Note the jargon entry defines infinity in this case to me the largest
    possible amount given whatever ram/disk space/processing power you have
    available.

    Also: These arbitrary limits tend to stem from languages which
    predominantly use fixed size arrays - DIM in basic, or malloc in C.
    The native python component is the list, which doesn't have a max size,
    so these problems should be encountered less in python code; by it's
    nature, python steers you away from this mistake.

    Iain
     
    Iain King, Oct 25, 2005
    #19
  20. On Tue, 25 Oct 2005 06:30:35 -0700, Iain King wrote:

    >
    > Steven D'Aprano wrote:
    >> On Tue, 25 Oct 2005 05:17:52 -0700, Iain King wrote:
    >>
    >> >
    >> > Fredrik Lundh wrote:
    >> >> Joerg Schuster wrote:
    >> >>
    >> >> > I just want to use more than 100 capturing groups.
    >> >>
    >> >> define "more" (101, 200, 1000, 100000, ... ?)
    >> >>
    >> >> </F>
    >> >
    >> > The Zero-One-Infinity Rule:
    >> >
    >> > http://www.catb.org/~esr/jargon/html/Z/Zero-One-Infinity-Rule.html

    >>
    >>
    >> Nice in principle, not always practical. Sometimes the choice is, "do you
    >> want it today with arbitrary limits, or six months from now with bugs
    >> but no limits?"
    >>
    >> If assigning arbitrary limits prevents worse problems, well, then go for
    >> the limit. For instance, anyone who has fought browser pops ups ("close
    >> one window, and ten more open") may have wished that the browser
    >> implemented an arbitrary limit of, say, ten pop ups. Or even zero :)
    >>

    >
    > Well, exactly. Why limit to ten? The user is either going to want to
    > see pop-ups, or not. So either limit to 0, or to infinity (and indeed,
    > this is what most browsers do).


    I haven't been troubled by exponentially increasing numbers of pop up
    windows for a long, long time. But consider your question "why limit to
    ten?" in a wider context.

    Elevators always have a weight limit: the lift will operate up to N
    kilograms, and stop at N+1. This limit is, in a sense, quite arbitrary,
    since that value of N is well below the breaking point of the elevator
    cables. Lift engineers, I'm told, use a safety factor of 10 (if the
    cable will carry X kg without breaking, set N = X/10). This safety
    factor is obviously arbitrary: a more cautious engineer might use a
    factor of 100, or even 1000, while another might choose a factor of 5 or
    2 or even 1.1. If engineers followed your advice, they would build lifts
    that either carried nothing at all, or accepted as much weight until the
    cable stretched and snapped.

    Perhaps computer programmers would have fewer buffer overflow security
    exploits if they took a leaf out of engineers' book and built in a few
    more arbitrary safety factors into their data-handling routines. We can
    argue whether 256 bytes is long enough for a URL or not, but I think we
    can all agree that 3 MB for a URL is more than any person needs.

    When you are creating an iterative solution to a problem, the ending
    condition is not always well-specified. Trig functions such as sine and
    cosine are an easy case: although they theoretically require an infinite
    number of terms to generate an exact answer, the terms will eventually
    underflow to zero allowing us to stop the calculation.

    But unfortunately that isn't the case for all mathematical calculations.
    Often, the terms of our sequence do not converge to zero, due to round-off
    error. Our answer cycles backwards and forwards between two or more
    floating point approximations, e.g. 1.276805 <-> 1.276804. The developer
    must make an arbitrary choice to stop after N iterations, if the answer
    has not converged. Zero iterations is clearly pointless. One is useless.
    And infinite iterations will simply never return an answer. So an
    arbitrary choice for N is the only sensible way out.

    In a database, we might like to associate (say) multiple phone numbers
    with a single account. Most good databases will allow you to do that, but
    there is still the question of how to collect that information: you have
    to provide some sort of user interface. Now, perhaps you are willing to
    build some sort of web-based front-end that allows the user to add new
    fields, put their phone number in the new field, with no limit. But
    perhaps you are also collecting data using paper forms. So you make an
    arbitrary choice: leave two (or three, or ten) boxes for phone numbers.

    There are many other reasons why you might decide rationally to impose an
    arbitrary limit on some process -- arbitrary does not necessarily mean
    "for no good reason". Just make sure that the reason is a good one.


    --
    Steven.
     
    Steven D'Aprano, Oct 25, 2005
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Charles Shannon Hendrix

    REGEX: capturing on optional groups which fail

    Charles Shannon Hendrix, Jun 14, 2004, in forum: Perl
    Replies:
    0
    Views:
    554
    Charles Shannon Hendrix
    Jun 14, 2004
  2. Replies:
    8
    Views:
    6,696
    Neredbojias
    Dec 9, 2005
  3. pinkisntwell

    OT: GNU regex library and non-capturing groups

    pinkisntwell, Nov 13, 2009, in forum: C Programming
    Replies:
    1
    Views:
    376
    Seebs
    Nov 13, 2009
  4. fred
    Replies:
    3
    Views:
    284
    Zifud
    Mar 17, 2005
  5. Replies:
    5
    Views:
    891
Loading...

Share This Page