Threads not Improving Performance in Program


Ryan Rosario

I have a parser that needs to process 7 million files. After running
for 2 days, it had only processed 1.5 million. I want this script to
parse several files at once by using multiple threads: one for each
file currently being analyzed.

My code iterates through all of the directories within a directory,
and at each directory, iterates through each file in that directory. I
structured my code something like this. I think I might be
misunderstanding how to use threads:

mythreads = []
for directory in dirList:
    #some processing...
    for file in fileList:
        p = Process(currDir,directory,file)    #class that extends threading.Thread
        mythreads.append(p)
        p.start()

for thread in mythreads:
    thread.join()
    del thread

The actual class that extends threading.Thread is below:

class Process(threading.Thread):
    vlock = threading.Lock()
    def __init__(self,currDir,directory,file):      #thread constructor
        threading.Thread.__init__(self)
        self.currDir = currDir
        self.directory = directory
        self.file = file
    def run(self):
        redirect = re.compile(r'#REDIRECT',re.I)
        xmldoc = minidom.parse(os.path.join(self.currDir,self.file))
        try:
            markup = xmldoc.firstChild.childNodes[-2].childNodes[-2].childNodes[-2].childNodes[0].data
        except:
            #An error occurred
            Process.vlock.acquire()
            BAD = open("bad.log","a")
            BAD.writelines(self.file + "\n")
            BAD.close()
            Process.vlock.release()
            print "Error."
            return
        #if successful, do more processing...


I experimented with various numbers of threads and saw no
performance gain. The code takes the same amount of time to
process 1000 files as it does without threads. Any
ideas on what I am doing wrong?
 

odeits


Perhaps the bottleneck is the IO. How big are the files you are trying
to parse? Another possible bottleneck is that threads share memory.
Thread construction is also expensive in Python; try looking up a
thread pool class. (A thread pool is a collection of worker threads:
when a worker finishes its task it goes idle until you give it more
work, so you aren't constantly creating new threads, which is
expensive.)
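
To illustrate the thread pool idea, here is a minimal sketch, not the original poster's code. It assumes Python 3, where concurrent.futures.ThreadPoolExecutor is part of the standard library (the code in the question is Python 2). The names parse_file, parse_all and demo are hypothetical, and the sample XML is made up for the demonstration. A fixed number of worker threads is reused for every file, and failures are returned to the caller rather than logged under a lock. Note that in CPython the global interpreter lock lets only one thread run Python bytecode at a time, so a pool is likely to help mainly if the work really is IO-bound.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from xml.dom import minidom


def parse_file(path):
    """Parse one XML file; return (path, root tag name), or (path, None) on failure."""
    try:
        doc = minidom.parse(path)
        return path, doc.documentElement.tagName
    except Exception:
        # The caller decides what to log, so no lock is needed here.
        return path, None


def parse_all(paths, num_workers=4):
    """Parse every path using a fixed pool of num_workers reusable threads."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() hands each path to an idle worker and yields results in input order.
        return list(pool.map(parse_file, paths))


def demo():
    """Build three small files (the last one malformed) and parse them with the pool."""
    tmpdir = tempfile.mkdtemp()
    contents = ["<page><text>a</text></page>", "<page><text>b</text></page>", "<page>"]
    paths = []
    for i, body in enumerate(contents):
        path = os.path.join(tmpdir, "file%d.xml" % i)
        with open(path, "w") as handle:
            handle.write(body)
        paths.append(path)
    return parse_all(paths, num_workers=2)


if __name__ == "__main__":
    for path, tag in demo():
        print(os.path.basename(path), tag if tag else "bad")
```

Timing parse_all with num_workers=1 against a larger value on a sample of the real files would be one way to check whether threads help at all for this workload.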
 
