How to start threads by group?

O

oyster

My code is not right; can somebody give me a hand? Thanks.

For example, I have 1000 URLs to download, but only 5 threads should run at one time:
def threadTask(url):
    download(url)

threadsAll = []
for url in all_url:
    task = threading.Thread(target=threadTask, args=(url,))
    threadsAll.append(task)

for every5task in groupcount(threadsAll, 5):
    for everytask in every5task:
        everytask.start()

    for everytask in every5task:
        everytask.join()

    for everytask in every5task:        # this does not run ok
        while everytask.isAlive():
            pass
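(groupcount isn't shown in the post; from the call above it presumably splits a list into chunks of n. A minimal sketch of such a helper, as a guess at what the OP has, not code from the original post:)

<code>
def groupcount(seq, n):
    # yield successive groups of n items; the last group may be shorter
    for i in range(0, len(seq), n):
        yield seq[i:i + n]
</code>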
 
B

bieffe62

My code is not right; can somebody give me a hand? Thanks.

For example, I have 1000 URLs to download, but only 5 threads should run at one time:

def threadTask(url):
    download(url)

threadsAll = []
for url in all_url:
    task = threading.Thread(target=threadTask, args=(url,))
    threadsAll.append(task)

for every5task in groupcount(threadsAll, 5):
    for everytask in every5task:
        everytask.start()

    for everytask in every5task:
        everytask.join()

    for everytask in every5task:        # this does not run ok
        while everytask.isAlive():
            pass


Thread.join() blocks until the thread is finished. You are assuming that the threads terminate exactly in the order in which they are started. Moreover, before starting the next 5 threads you are waiting until all of the previous 5 threads have completed, while I believe your intention was to always have the full load of 5 threads downloading.

I would restructure my code with something like this (WARNING: the following code is ABSOLUTELY UNTESTED and shall be considered only as pseudo-code to express my idea of the algorithm, which, also, could be wrong :) ):


import threading, time

MAX_THREADS = 5
DELAY = 0.01 # or whatever

def task_function( url ):
    download( url )        # download() is the OP's own function

def start_thread( url ):
    task = threading.Thread(target=task_function, args=(url,))
    task.start()
    return task

def main():
    all_urls = load_urls()        # load_urls() is assumed to return the list of URLs
    all_threads = []
    while all_urls or all_threads:
        # top up the pool while there are URLs left to hand out
        while all_urls and len(all_threads) < MAX_THREADS:
            url = all_urls.pop(0)
            t = start_thread(url)
            all_threads.append(t)
        # reap finished threads; iterate over a copy so we can remove safely
        for t in all_threads[:]:
            if not t.isAlive():
                t.join()
                all_threads.remove(t)
        time.sleep( DELAY )


HTH

Ciao
 
G

Gabriel Genellina

I would restructure my code with something like this (WARNING: the following code is ABSOLUTELY UNTESTED and shall be considered only as pseudo-code to express my idea of the algorithm, which, also, could be wrong :) ):

Your code creates one thread per url (but never more than MAX_THREADS
alive at the same time). Usually it's more efficient to create all the
MAX_THREADS at once, and continuously feed them with tasks to be done. A
Queue object is the way to synchronize them; from the documentation:

<code>
from Queue import Queue
from threading import Thread

num_worker_threads = 3
list_of_urls = ["http://foo.com", "http://bar.com",
                "http://baz.com", "http://spam.com",
                "http://egg.com",
               ]

def do_work(url):
    from time import sleep
    from random import randrange
    from threading import currentThread
    print "%s downloading %s" % (currentThread().getName(), url)
    sleep(randrange(5))
    print "%s done" % currentThread().getName()

# from this point on, copied almost verbatim from the Queue example
# at the end of http://docs.python.org/library/queue.html

def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()

q = Queue()
for i in range(num_worker_threads):
    t = Thread(target=worker)
    t.setDaemon(True)
    t.start()

for item in list_of_urls:
    q.put(item)

q.join()       # block until all tasks are done
print "Finished"
</code>
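For the OP's actual task, the do_work stub above would be replaced by a real download. A minimal sketch using urllib2, untested and with deliberately simple error handling (the filename logic is only an illustration):

<code>
import urllib2

def do_work(url):
    # fetch the URL and save the body to a local file
    try:
        data = urllib2.urlopen(url).read()
    except urllib2.URLError, e:
        print "failed to fetch %s: %s" % (url, e)
        return
    filename = url.rstrip("/").split("/")[-1] or "index.html"
    f = open(filename, "wb")
    try:
        f.write(data)
    finally:
        f.close()
</code>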
 
L

Lawrence D'Oliveiro

Usually it's more efficient to create all the MAX_THREADS at once, and
continuously feed them with tasks to be done.

Given that the bottleneck is most likely to be the internet connection, I'd
say the "premature optimization is the root of all evil" adage applies
here.
 
T

Terry Reedy

Lawrence said:
Given that the bottleneck is most likely to be the internet connection, I'd
say the "premature optimization is the root of all evil" adage applies
here.

There is also the bottleneck of programmer time to understand, write, and maintain. In this case, 'more efficient' is simpler and, to me, a more efficient use of programmer time. Feeding a fixed pool of worker threads with a Queue() is a standard design that is easy to understand and one the OP should learn. Re-using tested code is certainly an efficient use of programmer time. Managing a variable pool of workers that die and need to be replaced is more complex (two loops nested within a loop) and error-prone (though learning that alternative is probably not a bad idea either).

tjr
 
G

Gabriel Genellina

There is also the bottleneck of programmer time to understand, write, and maintain. In this case, 'more efficient' is simpler and, to me, a more efficient use of programmer time. Feeding a fixed pool of worker threads with a Queue() is a standard design that is easy to understand and one the OP should learn. Re-using tested code is certainly an efficient use of programmer time. Managing a variable pool of workers that die and need to be replaced is more complex (two loops nested within a loop) and error-prone (though learning that alternative is probably not a bad idea either).

I'd like to add that debugging a program that continuously creates and
destroys threads is a real PITA.
 
B

bieffe62

I would restructure my code with something like this (WARNING: the following code is ABSOLUTELY UNTESTED and shall be considered only as pseudo-code to express my idea of the algorithm, which, also, could be wrong :) ):

Your code creates one thread per url (but never more than MAX_THREADS  
alive at the same time). Usually it's more efficient to create all the  
MAX_THREADS at once, and continuously feed them with tasks to be done. A  
Queue object is the way to synchronize them; from the documentation:

<code>
from Queue import Queue
from threading import Thread

num_worker_threads = 3
list_of_urls = ["http://foo.com", "http://bar.com",
                "http://baz.com", "http://spam.com",
                "http://egg.com",
               ]

def do_work(url):
    from time import sleep
    from random import randrange
    from threading import currentThread
    print "%s downloading %s" % (currentThread().getName(), url)
    sleep(randrange(5))
    print "%s done" % currentThread().getName()

# from this point on, copied almost verbatim from the Queue example
# at the end of http://docs.python.org/library/queue.html

def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()

q = Queue()
for i in range(num_worker_threads):
    t = Thread(target=worker)
    t.setDaemon(True)
    t.start()

for item in list_of_urls:
    q.put(item)

q.join()       # block until all tasks are done
print "Finished"
</code>


Agreed.
I was trying to do what the OP was trying to do, but in a way that works.
But keeping the threads alive and feeding them the URLs is a better design, definitely.
And no, I don't think it's 'premature optimization': it is just cleaner.

Ciao
 
L

Lawrence D'Oliveiro

I'd like to add that debugging a program that continuously creates and
destroys threads is a real PITA.

That's God trying to tell you to avoid threads altogether.
 
