Ryan Rosario
I have a parser that needs to process 7 million files. After running
for 2 days, it had only processed 1.5 million. I want this script to
parse several files at once by using multiple threads: one for each
file currently being analyzed.
My code iterates through all of the directories within a directory,
and at each directory, iterates through each file in that directory. I
structured my code something like this. I think I might be
misunderstanding how to use threads:
mythreads = []
for directory in dirList:
    # some processing...
    for file in fileList:
        p = Process(currDir, directory, file)  # class that extends threading.Thread
        mythreads.append(p)
        p.start()
for thread in mythreads:
    thread.join()
    del thread
The actual class that extends threading.Thread is below:
class Process(threading.Thread):
    vlock = threading.Lock()
    def __init__(self, currDir, directory, file):  # thread constructor
        threading.Thread.__init__(self)
        self.currDir = currDir
        self.directory = directory
        self.file = file
    def run(self):
        redirect = re.compile(r'#REDIRECT', re.I)
        xmldoc = minidom.parse(os.path.join(self.currDir, self.file))
        try:
            markup = xmldoc.firstChild.childNodes[-2].childNodes[-2].childNodes[-2].childNodes[0].data
        except:
            # An error occurred: log the bad file under the lock
            Process.vlock.acquire()
            BAD = open("bad.log", "a")
            BAD.writelines(self.file + "\n")
            BAD.close()
            Process.vlock.release()
            print "Error."
            return
        # if successful, do more processing...
I experimented with a variety of thread counts, and there is no
performance gain: the code takes the same amount of time to process
1000 files as it did without threads. Any ideas on what I am doing
wrong?
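For anyone who wants to reproduce the structure without my dataset, here is a stripped-down, runnable version of the pattern I am using: one thread per file, with a class-level lock guarding the shared error log. The file names and the failure condition are invented stand-ins for the real XML parsing; the error log is an in-memory list instead of bad.log.

```python
import threading

class Process(threading.Thread):
    vlock = threading.Lock()
    bad = []  # stand-in for bad.log

    def __init__(self, file):
        threading.Thread.__init__(self)
        self.file = file

    def run(self):
        if self.file.endswith(".bad"):  # stand-in for a parse failure
            with Process.vlock:         # lock held only while logging
                Process.bad.append(self.file)
            return
        # successful path: more processing would go here

mythreads = []
for file in ["a.xml", "b.bad", "c.xml"]:  # stand-in file list
    p = Process(file)
    mythreads.append(p)
    p.start()
for thread in mythreads:
    thread.join()
```

This mirrors the real script's structure (spawn, start, join, lock around the shared log) with the directory loop and XML work removed.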