Thread locking question.

grocery_stocker · May 9, 2009

The following code gets data from 5 different websites at the "same
time".

#!/usr/bin/python

import Queue
import threading
import urllib2
import time

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]

queue = Queue.Queue()

class MyUrl(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            host = self.queue.get()
            if host is None:
                break
            url = urllib2.urlopen(host)
            print url.read(1024)
            #self.queue.task_done()

start = time.time()

def main():
    for i in range(5):
        t = MyUrl(queue)
        t.setDaemon(True)
        t.start()

    for host in hosts:
        print "pushing", host
        queue.put(host)

    for i in range(5):
        queue.put(None)

    t.join()

if __name__ == "__main__":
    main()
    print "Elapsed Time: %s" % (time.time() - start)

How does the parallel download work if each thread has a lock? When
the program opens www.yahoo.com, it places a lock on the thread,
right? If so, then doesn't that mean the other 4 sites have to wait
for the thread to release the lock?

Piet van Oostrum · May 9, 2009

grocery_stocker said:
gs> How does the parallel download work if each thread has a lock? When
gs> the program opens www.yahoo.com, it places a lock on the thread,
gs> right? If so, then doesn't that mean the other 4 sites have to wait
gs> for the thread to release the lock?

No. Where does it set a lock? There is only a short lock period in the queue
when an item is put into the queue or taken from it. And of course we
have the GIL, but this is released as soon as a long-running blocking
operation starts, in this case the Internet communication.
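
The effect of the GIL being released during a blocking call can be observed with a small experiment. The following is only a sketch: it is written in Python 3 syntax (the thread itself uses Python 2), and it uses time.sleep() as a stand-in for the network wait so that it runs without Internet access. The names fake_download and run_parallel are hypothetical.

```python
import threading
import time


def fake_download(delay):
    # Stand-in for urlopen(): time.sleep() blocks and releases the GIL,
    # much like waiting on a socket does.
    time.sleep(delay)


def run_parallel(n_threads=5, delay=0.2):
    """Start n_threads blocking calls and return the total wall-clock time."""
    threads = [threading.Thread(target=fake_download, args=(delay,))
               for _ in range(n_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


if __name__ == "__main__":
    elapsed = run_parallel()
    # Serialized waits would take about 5 * 0.2 = 1.0 s in total;
    # overlapping waits should finish in roughly 0.2 s.
    print("Elapsed Time: %.2f" % elapsed)
```

If the five waits blocked one another, the elapsed time would be near the sum of the delays rather than near a single delay.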

MRAB · May 9, 2009

Also, the code creates 5 threads but calls join() on only the last
one.

grocery_stocker · May 9, 2009

Maybe I'm being a bit daft, but what prevents the data from www.yahoo.com
from being mixed up with the data from www.google.com? Doesn't using
queue() prevent the data from being mixed up?

Piet van Oostrum · May 9, 2009

[snip]

gs> Maybe I'm being a bit daft, but what prevents the data from www.yahoo.com
gs> from being mixed up with the data from www.google.com? Doesn't using
gs> queue() prevent the data from being mixed up?

Nothing in your script prevents the data from getting mixed up. From some
experimentation it seems that the print statements might be atomic,
although I can't find anything about that in the Python docs, and I think
you shouldn't count on it. I would expect a print not to be atomic when it
does blocking I/O.

If I make your example more complete, printing the documents completely,
like:

def run(self):
    while True:
        host = self.queue.get()
        if host is None:
            break
        url = urllib2.urlopen(host)
        while True:
            txt = url.read(1024)
            if not txt: break
            print txt,

then the documents will get mixed up in the output. Likewise, if you
want to put them in a shared data structure, you must use locking when you
insert them (for example, you could put them in another Queue).

The Queue you use here only prevents the URLs from getting mixed up;
it has no effect on the further processing.
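
One way to keep the documents apart, following the suggestion of putting them in another Queue, might look like the sketch below. It is written in Python 3 syntax (queue rather than Queue), and fetch is a hypothetical stand-in for urllib2.urlopen(host).read() so that the example runs offline; worker and download_all are likewise names made up for illustration.

```python
import queue
import threading


def fetch(host):
    # Hypothetical stand-in for the real download; returns a whole document.
    return "".join("<%s chunk %d>\n" % (host, i) for i in range(3))


def worker(jobs, results):
    while True:
        host = jobs.get()
        if host is None:          # sentinel: no more work
            break
        # Put the complete document on the results queue in one piece;
        # Queue does its own locking, so no explicit Lock is needed.
        results.put((host, fetch(host)))


def download_all(hosts, n_threads=5):
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for host in hosts:
        jobs.put(host)
    for _ in threads:
        jobs.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    # All workers have finished, so only this thread touches the results.
    docs = {}
    while not results.empty():
        host, text = results.get()
        docs[host] = text
    return docs


if __name__ == "__main__":
    docs = download_all(["http://yahoo.com", "http://google.com"])
    for host in sorted(docs):
        print(host)
        print(docs[host])
```

Because each worker hands over a finished document and only the main thread prints, the output of one site cannot be interleaved with that of another.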

As I told my students two days ago: you shouldn't do thread programming
unless you have thoroughly studied the subject.

By the way, there is another flaw in your program: you do the join only
on the last spawned thread. Because the threads are daemonic, all other
threads that are still working will be killed prematurely when the main
thread exits after this one thread finishes.

The code should be like this:

def main():
    threads = []
    for i in range(5):
        t = MyUrl(queue)
        t.setDaemon(True)
        t.start()
        threads.append(t)
    ....
    for t in threads:
        t.join()

Or just don't make them daemonic.
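
Filled out into something runnable, the join-every-thread pattern could look like the sketch below. It uses Python 3 syntax, and the worker only records each host instead of downloading it; both are assumptions made to keep the example offline. The handled list and handled_lock are hypothetical additions that are not in the original program.

```python
import queue
import threading

handled = []                      # shared list, guarded by handled_lock
handled_lock = threading.Lock()


class MyUrl(threading.Thread):
    def __init__(self, jobs):
        threading.Thread.__init__(self)
        self.jobs = jobs

    def run(self):
        while True:
            host = self.jobs.get()
            if host is None:
                break
            with handled_lock:    # lock when inserting into shared data
                handled.append(host)


def main(hosts, n_threads=5):
    jobs = queue.Queue()
    threads = []
    for i in range(n_threads):
        t = MyUrl(jobs)
        t.daemon = True           # same effect as t.setDaemon(True)
        t.start()
        threads.append(t)         # keep a reference to every thread
    for host in hosts:
        jobs.put(host)
    for t in threads:
        jobs.put(None)            # one sentinel per thread
    for t in threads:             # wait for all of them, not just the last
        t.join()
    return threads


if __name__ == "__main__":
    main(["http://yahoo.com", "http://google.com", "http://amazon.com"])
    print(sorted(handled))
```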

Dennis Lee Bieber · May 9, 2009

def main():
    for i in range(5):
        t = MyUrl(queue)
        t.setDaemon(True)
        t.start()

    for host in hosts:
        print "pushing", host
        queue.put(host)

There is no guarantee that pushing five URLs/hosts will occupy five
threads. Nor that the threads will run in the order created. It is
perfectly viable that the operations occur as:

main        t1          t2          t3          t4          t5
start t1
            get/block
start t2
                        get/block
start t3
                                    get/block
start t4
                                                get/block
start t5
                                                            get/block
put h1
            gets h1
            start dl
put h2
                        get h2
                        start dl
            finish dl
            get/block
put h3
            get h3
                        finish dl
            start dl

etc...

Granted, it is unlikely -- I'd suspect the implementation for Queue
puts the blocking gets into an ordered list, so the first get to block
will be the first get to be released. But, theoretically, an
implementation is free to toss dice and choose any waiting get operation
to be the one released whenever a task switch takes place.
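
Which thread ends up with which host can be checked with a small experiment. The sketch below is in Python 3 syntax and replaces the download with a short sleep (an assumption, to stay offline); who_got_what is a hypothetical helper that records the name of the thread that took each host.

```python
import queue
import threading
import time


def who_got_what(hosts, n_threads=5):
    """Return {host: name of the thread that handled it}."""
    jobs = queue.Queue()
    taken = {}
    lock = threading.Lock()

    def worker():
        while True:
            host = jobs.get()
            if host is None:
                break
            with lock:
                taken[host] = threading.current_thread().name
            time.sleep(0.01)      # pretend to download

    threads = [threading.Thread(target=worker, name="t%d" % (i + 1))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for host in hosts:
        jobs.put(host)
    for _ in threads:
        jobs.put(None)            # one sentinel per thread
    for t in threads:
        t.join()
    return taken


if __name__ == "__main__":
    result = who_got_what(["h1", "h2", "h3", "h4", "h5"])
    for host in sorted(result):
        print(host, "->", result[host])
```

The mapping may differ from run to run, and nothing requires that all five thread names appear in it.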

t.join()

Note that with only ONE "t" object, your .join() only waits for one
thread to complete -- the last one created. Which, as the above
theoretical indicates, could be the first one to obtain a None and exit
immediately, while others are still downloading.

If you are going to create /n/ threads, and later need to .join()
those threads, you need to keep a reference to EACH of them.

#create threads
theThreads = [ MyUrl(queue) for i in range(5)]
#daemonize -- NOT IF YOU INTEND TO .join() them!
#daemons are threads that are to shutdown when the main exits
#but you want the main to wait for them to shutdown first
##for t in theThreads: t.setDaemon(True)

#preload the URL data
for host in hosts: queue.put(host)

#start the threads
for t in theThreads: t.start()

#load the EOD, using theThreads makes sure one for each
for t in theThreads: queue.put(None)

#wait for them to shutdown
for t in theThreads: t.join()
