Questions about regex

Jared.S.Bauer

Hello,

I'm new to Python and I'm having problems with a regular expression. I
use TextMate as my editor, and when I run the regex in TextMate it
works fine, but when I run it as part of the script it freezes. Could
anyone help me figure out why this is happening and how to fix it?
Here is the script:


======================================================
# regular expression search and replace
import sys, os, re, string, csv

#Open the file and taking its data
myfile=open('Steve_query3.csv') #Steve_query_test.csv
#create an error flag to loop the script twice
#store all file's data in the string object 'text'
myfile.seek(0)
text = myfile.read()

for i in range(2):
    #def textParse(text, reRun):
    print 'how many times is this getting executed', i

    #Now to create the newfile 'test' and write our 'text'
    newfile = open('Steve_query3_out.csv', 'w')
    #open the new file and set it with 'w' for "write"
    #loop trough 'text' clean them up and write them into the 'newfile'
    #sub( pattern, repl, string[, count])
    #"sub("(?i)b+", "x", "bbbb BBBB")" returns 'x x'.
    text = re.sub('(\<(/?[^\>]+)\>)', "", text) #remove the HTML
    text = re.sub('/<!--(.|\s)*?-->/', "", text) #remove comments <!--[^\-]+-->
    text = re.sub('\/\*(.|\s)*?;}', "", text) #remove css formatting
    #remove a bunch of word formatting yuck
    text = re.sub("&nbsp;", " ", text)
    text = re.sub("&lt;", "<", text)
    text = re.sub("&gt;", ">", text)
    text = re.sub("&quot;|&rquot;|&ldquo;", "\'", text)
    #===================================
    #The two following lines are the ones giving me the problems
    text = re.sub("w:(.|\s)*?\n", "", text)
    text = re.sub("UnhideWhenUsed=(.|\s)*?\n", "", text)
    #===========================================
    text = re.sub(re.compile('^\r?\n?$', re.MULTILINE), '', text) #remove the extra whitespace
    #now write out the new file and close it
    newfile.write(text)
    newfile.close()

    #open the newfile and run the script again
    #Open the file and taking its data

    myfile=open('Steve_query3_out.csv') #Steve_query_test.csv
    #store all file's data in the string object 'text'
    myfile.seek(0)
    text = myfile.read()

Thanks for the help,

-Jared
 
Bobby

Can you give a string that you would expect the regex to match, and
what the expected result would be? Currently, it looks like the
interesting part of the regex, (.|\s)*?, matches a run of any
characters of any length, as short as possible. There seems to be some
redundancy that makes it more confusing than it needs to be. By default
. matches everything that \s matches except a newline, so the
alternation only adds newlines. Or maybe you just need to escape the .
because you meant for it to be a literal.
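
A quick check in the interpreter shows how . and \s differ on newlines. This is only a sketch in Python 3 syntax, and the sample string is made up for illustration:

```python
import re

# '.' matches any character except a newline, unless re.DOTALL is given.
print(re.match(r".", "\n"))                          # None
# '\s' matches whitespace, and that includes '\n'.
print(re.match(r"\s", "\n") is not None)             # True
print(re.match(r".", "\n", re.DOTALL) is not None)   # True

# So (.|\s) means "any character at all", like '.' under re.DOTALL.
sample = "w:one\ntwo\n"
a = re.sub(r"w:(.|\s)*?\n", "", sample)
b = re.sub(r"w:.*?\n", "", sample, flags=re.DOTALL)
print(a == b, repr(a))                               # True 'two\n'
```

If the . was meant as a literal dot, it would have to be written as \. instead.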
 
bearophileHUGS

Jared.S., even if a regex doesn't look like a program, it is one: a
small program written in a strange language. And you have to test and
comment your programs.
So I suggest you program in a tidier way and add unit tests (doctests
may suffice here) to your regexes. You can also use verbose mode to
comment them, and you can even indent their sub-parts like pieces of a
program.
You must test all your bricks (in Python, not in TextMate) before
using them to build something bigger.
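
As a rough illustration of a commented, tested regex, here is a minimal sketch in Python 3. The names W_LINE and strip_w_lines are made up, and the pattern is only a guess at what "remove the w: line" should mean:

```python
import re

# In verbose mode, whitespace in the pattern is ignored and '#' starts a comment,
# so the regex can be laid out and annotated like a small program.
W_LINE = re.compile(r"""
    w:          # literal prefix of the unwanted markup
    [^\n]*      # the rest of the line (anything but a newline)
    \n          # the line terminator itself
""", re.VERBOSE)


def strip_w_lines(text):
    r"""Remove everything from 'w:' to the end of that line.

    >>> strip_w_lines("keep\nw:junk here\nalso keep\n")
    'keep\nalso keep\n'
    >>> strip_w_lines("no markup at all")
    'no markup at all'
    """
    return W_LINE.sub("", text)


if __name__ == "__main__":
    import doctest
    doctest.testmod()
```

Running the file directly executes the doctests; they print nothing when every example passes.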

Bye,
bearophile
 
Steven D'Aprano

Sure. To figure out why it is happening, the first thing you must do is
figure out *what* is happening. So first you have to isolate the fault:
what part of your script is freezing?

I'm going to assume that it is the regex:
#The two following lines are the ones giving me the problems
text = re.sub("w:(.|\s)*?\n", "", text)
text = re.sub("UnhideWhenUsed=(.|\s)*?\n", "", text)

What happens when you call those two lines in isolation, away from the
rest of your script? (Obviously you need to initialise a value for text.)
Do they still freeze?

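For example, I can do this:

>>> import re
>>> text = 'Nobody expects the Spanish Inquisition!'
>>> text = re.sub(r"w:(.|\s)*?\n", "", text)
>>> text = re.sub(r"UnhideWhenUsed=(.|\s)*?\n", "", text)
>>> text
'Nobody expects the Spanish Inquisition!'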

and it doesn't freeze. It works fine.

I suspect that your problem is that the regex hasn't actually *frozen*,
it's just taking a very, very long time to complete. My guess is that it
probably has something to do with:

(.|\s)*?

This says, "Match as few as possible of any character or whitespace".
Because \s also matches newlines, and because every whitespace character
can be matched by either alternative, the regular expression engine may
have to do a great deal of backtracking when a match fails, which means
it will be slow for large amounts of data. You want to reduce the amount
of backtracking that's needed!
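
To get a feel for the slowdown, here is a small timing sketch in Python 3. It assumes the worst case is a "w:" followed by whitespace and no newline, so the match has to fail; the input is artificial and the sizes are kept small on purpose:

```python
import re
import time

SLOW = re.compile(r"w:(.|\s)*?\n")


def time_failed_match(n):
    """Time SLOW on 'w:' plus n spaces and no newline, so the match must fail."""
    text = "w:" + " " * n
    start = time.perf_counter()
    result = SLOW.sub("", text)
    elapsed = time.perf_counter() - start
    assert result == text  # nothing matched, so nothing was removed
    return elapsed


# Each space can be matched by '.' or by '\s', so every extra space
# roughly doubles the number of combinations the engine tries.
for n in (8, 12, 16):
    print(n, "spaces:", round(time_failed_match(n), 4), "seconds")
```

Exact timings depend on the machine, but the growth should be visible well before n reaches the thousands of whitespace characters a real file may contain.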

I *guess* that what you probably want is:

w:.*?\n

which will match the letter 'w' followed by ':' followed by the smallest
possible number of arbitrary characters, including spaces *but not
newlines*, followed by a newline.

The second regex will probably need a similar change made.
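
Put together, the two replacements might look like this. It is a sketch in Python 3, not a tested fix for the real file: the names W_LINE, UNHIDE_LINE and strip_word_markup are made up, and so is the sample text:

```python
import re

# Without re.DOTALL, '.' stops at a newline, so neither pattern
# can run past the end of the line it started on.
W_LINE = re.compile(r"w:.*?\n")
UNHIDE_LINE = re.compile(r"UnhideWhenUsed=.*?\n")


def strip_word_markup(text):
    """Remove 'w:...' and 'UnhideWhenUsed=...' through the end of the line."""
    text = W_LINE.sub("", text)
    return UNHIDE_LINE.sub("", text)


sample = 'first row\nw:junk one\nUnhideWhenUsed="false" junk two\nlast row\n'
print(repr(strip_word_markup(sample)))   # 'first row\nlast row\n'
```

When a newline does follow, these remove the same text as the original patterns; the difference only shows when the match fails.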

But don't take my word for it: I'm not a regex expert. But isolate the
fault, identify when it is happening (for all input data, or only for
large amounts of data?), and then you have a shot at fixing it.
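
One way to isolate the slow step is to time each substitution separately on growing slices of the input. The following is only a sketch in Python 3: STEPS and time_steps are made-up names, the data is artificial, and you would swap in whichever patterns are under suspicion:

```python
import re
import time

# Hypothetical step list: (label, compiled pattern, replacement).
STEPS = [
    ("html tags", re.compile(r"<(/?[^>]+)>"), ""),
    ("w: lines", re.compile(r"w:.*?\n"), ""),
    ("UnhideWhenUsed lines", re.compile(r"UnhideWhenUsed=.*?\n"), ""),
]


def time_steps(text, steps=STEPS):
    """Apply each substitution in turn; return the cleaned text and per-step timings."""
    timings = []
    for label, pattern, repl in steps:
        start = time.perf_counter()
        text = pattern.sub(repl, text)
        timings.append((label, time.perf_counter() - start))
    return text, timings


# Try growing slices of the input to see whether the time depends on its size.
data = "<p>row</p>\nw:junk\n" * 1000
for size in (100, 1000, len(data)):
    cleaned, timings = time_steps(data[:size])
    print(size, [(label, round(seconds, 5)) for label, seconds in timings])
```

A step whose time grows much faster than the input size is the one to look at first.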
 
