Ryan Rosario
I have a parser that needs to process 7 million files. After running
for 2 days, it had only processed 1.5 million. I want this script to
parse several files at once by using multiple threads: one for each
file currently being analyzed.
My code iterates through all of the directories within a directory,
and at each directory, iterates through each file in that directory. I
structured my code something like this. I think I might be
misunderstanding how to use threads:
mythreads = []
for directory in dirList:
    # some processing...
    for file in fileList:
        p = Process(currDir, directory, file)  # class that extends threading.Thread
        mythreads.append(p)
        p.start()
for thread in mythreads:
    thread.join()
    del thread
The actual class that extends threading.Thread is below:
class Process(threading.Thread):
    vlock = threading.Lock()
    def __init__(self, currDir, directory, file):  # thread constructor
        threading.Thread.__init__(self)
        self.currDir = currDir
        self.directory = directory
        self.file = file
    def run(self):
        redirect = re.compile(r'#REDIRECT', re.I)
        xmldoc = minidom.parse(os.path.join(self.currDir, self.file))
        try:
            markup = xmldoc.firstChild.childNodes[-2].childNodes[-2].childNodes[-2].childNodes[0].data
        except:
            # An error occurred: log the bad file under the lock
            Process.vlock.acquire()
            BAD = open("bad.log", "a")
            BAD.writelines(self.file + "\n")
            BAD.close()
            Process.vlock.release()
            print "Error."
            return
        # if successful, do more processing...
I experimented with a variety of thread counts, and there is no
performance gain: the code takes the same amount of time to process
1000 files as it did without threads. Any ideas on what I am doing
wrong?
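For anyone who wants to reproduce the structure without my dataset, here is a stripped-down, runnable version of the pattern I am using: one thread per file, with a class-level lock guarding the shared error log. The file names and the failure condition are invented stand-ins for the real XML parsing; the error log is an in-memory list instead of bad.log.

```python
import threading

class Process(threading.Thread):
    vlock = threading.Lock()
    bad = []  # stand-in for bad.log

    def __init__(self, file):
        threading.Thread.__init__(self)
        self.file = file

    def run(self):
        if self.file.endswith(".bad"):  # stand-in for a parse failure
            with Process.vlock:         # lock held only while logging
                Process.bad.append(self.file)
            return
        # successful path: more processing would go here

mythreads = []
for file in ["a.xml", "b.bad", "c.xml"]:  # stand-in file list
    p = Process(file)
    mythreads.append(p)
    p.start()
for thread in mythreads:
    thread.join()
```

This mirrors the real script's structure (spawn, start, join, lock around the shared log) with the directory loop and XML work removed.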