Freeze problem with Regular Expression

Kirk

Hi All,
the following regular expression match seems to enter an infinite
loop:

################
import re
text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
re.findall('[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]*[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$', text)
#################

Perl has no problem with the same expression:

#################
$s = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una ';
$s =~ /[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]*[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$/;
print $1;
#################

I'm running Python 2.5.2 on Ubuntu 8.04.
Any ideas?
Thanks!
 
cirfu



What are you trying to do?
 
Reedick, Andrew



It locks up on 2.5.2 on Windows as well. Probably too much recursion going
on.


What's with the |'s in [0-9|a-z|\-]? Inside a character class, '|' is a
literal character, not an 'or' operator. I think you meant either
'[0-9a-z\-]' or '[0-9a-z\-|]'.
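
A quick way to check this for yourself is the short sketch below. The names `with_bars` and `without_bars` are only illustrative; the two classes are the ones quoted above.

```python
import re

# The class as written in the original post, and the same class without bars.
with_bars = re.compile(r"[0-9|a-z|\-]")
without_bars = re.compile(r"[0-9a-z\-]")

# Inside [...] the bar is an ordinary character, so only the first class
# accepts a literal '|'.
print(bool(with_bars.match("|")))     # True
print(bool(without_bars.match("|")))  # False

# Digits, lowercase letters and the hyphen are accepted by both classes.
for ch in "5q-":
    print(ch, bool(with_bars.match(ch)), bool(without_bars.match(ch)))
```

So the bars do not break the pattern; they just make it accept '|' as well, which is probably not what was intended.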



 
Maric Michaud

On Wednesday 25 June 2008 18:40:08, cirfu wrote:

what are you trying to do?

That is indeed the right question.

Whatever the implementation or language, something like that may happen to
work, but I doubt you'll find one that can tell you whether it *should* work
or not; my brain-embedded parser is going into an infinite loop too...

That said, "[0-9|a-z|\-]" is strange in itself: a pipe (|) between square
brackets is the literal character '|', so there is no reason for it to appear
twice.

Very complicated regexps are always evil, and two- or three-stage filtering
is likely to do the job with good, or at least better, readability.

But once more, what are you trying to do? It is not even clear that regexp
matching is the best tool for it.
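
To make the staged idea concrete, here is a minimal sketch. It assumes, purely for illustration, that the goal is to pull out runs of consecutive "capitalised" words; the original poster never confirmed that, and the names `WORD` and `capitalised_runs` are made up for this example.

```python
import re
from itertools import groupby

# Stage 2 test: optional digits, one uppercase letter, then letters,
# digits or hyphens. Anything after that (such as ')') is ignored.
WORD = re.compile(r"[0-9]*[A-Z][0-9A-Za-z\-]*")

def capitalised_runs(text):
    # Stage 1: split on whitespace.
    # Stage 2: tag each piece with its match object (None if it is not a word).
    tagged = [(WORD.match(piece), piece) for piece in text.split()]
    # Stage 3: join consecutive matching pieces into runs.
    runs = []
    for is_word, group in groupby(tagged, key=lambda pair: pair[0] is not None):
        if is_word:
            runs.append(" ".join(mobj.group() for mobj, _ in group))
    return runs

text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
print(capitalised_runs(text))
# ['MSX INTERNATIONAL HOLDINGS ITALIA', 'MSX ITALIA']
```

Each stage uses a pattern with a single quantified class, so there is nothing for the engine to backtrack over, and each step can be inspected on its own.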
 
John Machin


Several problems:
(1) Lose the vertical bars (as advised by others).
(2) ALWAYS use a raw string for regexes; your \s* will match on lowercase
's', not on spaces.
(3) Why are you using findall on a pattern that ends in "$"?
(4) Using non-verbose regexes of that length means you haven't got a
petrol drum's hope in hell of understanding what's going on.
(5) Too many variable-length patterns; it will take a finite (but very
long) time to evaluate.
(6) As remarked by others, you haven't said what you are trying to do;
what it is actually doing doesn't look sensible (see below).
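
Point (5) can be seen in isolation with a small sketch. It takes only the first repeated group of the regex (bars removed), anchors it with "$", and feeds it a run of capitals followed by a character that cannot match. The names `NESTED` and `time_failed_match` are illustrative, and the timings depend on the machine.

```python
import re
import time

# A quantified [A-Z]+ inside a quantified group: a run of n capitals can be
# split between iterations in about 2**(n-1) ways, and a backtracking engine
# tries them all before giving up.
NESTED = re.compile(r"(?:[0-9]*[A-Z]+[0-9a-z\-]*)+$")

def time_failed_match(n):
    subject = "A" * n + "!"   # the '!' makes the final '$' fail
    start = time.time()
    result = NESTED.match(subject)
    return result, time.time() - start

for n in (8, 12, 16):
    result, elapsed = time_failed_match(n)
    print(n, result, "%.4f s" % elapsed)
```

Every result is None, and the time grows roughly sixteen-fold for each four extra letters. The full regex has two such groups plus several optional \s* and [a-z]* pieces, and findall retries from every starting position, which is why it appears to hang rather than loop forever.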

The following code is what results after fixing problems 1, 2, 3 and 4:

C:\junk>type infinitere.py
import re
text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
regex0 = r"""
    [^A-Z0-9]*              # match leading space
    (
        (?:
            [0-9]*          # match nothing
            [A-Z]+          # match "MSX"
            [0-9a-z\-]*     # match nothing
        )+                  # match "MSX"
        \s*                 # match " "
        [a-z]*              # match nothing
        \s*                 # match nothing
        (?:
            [0-9]*
            [A-Z]+
            [0-9a-z\-]*
            \s*
        )*                  # match "INTERNATIONAL HOLDINGS ITALIA "
    )
    ([^A-Z]*)               # match "srl (di seguito "
"""
regex1 = regex0 + "$"
for rxno, rx in enumerate([regex0, regex1]):
    mobj = re.compile(rx, re.VERBOSE).match(text)
    if mobj:
        print rxno, mobj.groups()
    else:
        print rxno, "failed"

C:\junk>infinitere.py
0 ('MSX INTERNATIONAL HOLDINGS ITALIA ', 'srl (di seguito ')
### taking a long time, interrupted
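
One way to keep the "$" and still get an answer is to rewrite the repeated groups so that each word can be matched in only one way. The sketch below is an assumption-laden illustration, not something posted in the thread: `WORD` and `PATTERN` are made-up names, and the claim that the rewritten pattern accepts the same strings as regex1 rests on informal reasoning (a run of [0-9]*[A-Z]+[0-9a-z\-]* pieces with no whitespace between them is just digits, one capital, then any letters, digits or hyphens). It uses search rather than match so that it behaves like the original findall call.

```python
import re

# One word: optional digits, one capital, then letters, digits or hyphens.
# Assumed to stand in for (?:[0-9]*[A-Z]+[0-9a-z\-]*)+ without the ambiguity.
WORD = r"[0-9]*[A-Z][0-9A-Za-z\-]*"

PATTERN = re.compile(r"""
    [^A-Z0-9]*                  # skip anything before the first word
    (
        %(word)s                # first word
        \s* [a-z]* \s*          # optional lowercase run
        (?:
            %(word)s
            (?: \s+ %(word)s )* # further words must be separated by whitespace
            \s*
        )?
    )
    ([^A-Z]*)                   # trailing text with no capitals
    $
""" % {"word": WORD}, re.VERBOSE)

text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
mobj = PATTERN.search(text)
print(mobj.groups())
# ('MSX ITALIA', ') una ')
```

The match found is the second "MSX ITALIA", because the final group cannot cross a capital letter and the pattern has to reach the end of the string.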

HTH,
John
 
John Machin

John Machin wrote:
(2) ALWAYS use a raw string for regexes; your \s* will match on lowercase
's', not on spaces

and should have written:
(2) ALWAYS use a raw string for regexes. <<<=== Big fat full stop,
aka period.

but he was at the time only halfway through the first cup of coffee
for the day :)
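
Some background on the correction: in a non-raw Python string, '\s' is not a recognised string escape, so the backslash survives and the regex engine still sees \s (newer Python versions warn about such unrecognised escapes). Other sequences are not so lucky. The sketch below uses '\b', which Python turns into a backspace character before the regex engine ever sees it; the sample string is made up for illustration.

```python
import re

# '\b' is a string escape (backspace, one character); r'\b' keeps both
# characters, which the regex engine reads as a word boundary.
print(len('\b'), len(r'\b'))                  # 1 2

# Without the raw string the pattern looks for literal backspaces.
print(re.findall('\bMSX\b', 'an MSX unit'))   # []
print(re.findall(r'\bMSX\b', 'an MSX unit'))  # ['MSX']
```

Using raw strings everywhere avoids having to remember which escapes are safe.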
 
Kirk

Several problems:

Ciao John (and all participating in this thread),
first of all, I'm sorry for the delay, but I was away on business.

(1) lose the vertical bars (as advised by others) (2) ALWAYS use a raw
string for regexes; your \s* will match on lowercase 's', not on
spaces

right! thanks!
(3) why are you using findall on a pattern that ends in "$"?

Yes, you are right, I started with a different need and then it changed
over time...
(6) as remarked by others, you haven't said what you are trying to do;

My reply to all of you on this point: that's not important,
although I very much appreciate your suggestions!
My point was 'something that works in Perl has problems in Python'.
In that respect, I thank Peter for his analysis.
Perl probably uses a different pattern-matching algorithm.

Thanks again to all of you!

Bye!
 
John Machin



My reply to all of you on this point: that's not important,
although I very much appreciate your suggestions!
My point was 'something that works in Perl has problems in Python'.

It *is* important; our point was 'you didn't define "works", and it
was near-impossible (without transcribing your regex into verbose
mode) to guess at what you suppose it might do sometimes'.
 
Kirk

It *is* important; our point was 'you didn't define "works", and it was
ok...

near-impossible (without transcribing your regex into verbose mode) to
guess at what you suppose it might do sometimes'.

fine: it's supposed to terminate! :)

Do you think that hanging is *admissible* behavior? Couldn't we learn
something from the Perl implementation?

This is my point.

Bye
 
