Thread locking question.

  • Thread starter grocery_stocker

grocery_stocker

The following code gets data from 5 different websites at the "same
time".

#!/usr/bin/python

import Queue
import threading
import urllib2
import time

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]

queue = Queue.Queue()

class MyUrl(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            host = self.queue.get()
            if host is None:
                break
            url = urllib2.urlopen(host)
            print url.read(1024)
            #self.queue.task_done()

start = time.time()

def main():
    for i in range(5):
        t = MyUrl(queue)
        t.setDaemon(True)
        t.start()

    for host in hosts:
        print "pushing", host
        queue.put(host)

    for i in range(5):
        queue.put(None)

    t.join()

if __name__ == "__main__":
    main()
    print "Elapsed Time: %s" % (time.time() - start)


How does the parallel download work if each thread has a lock? When
the program opens www.yahoo.com, it places a lock on the thread,
right? If so, then doesn't that mean the other 4 sites have to wait
for the thread to release the lock?
 

Piet van Oostrum

grocery_stocker said:
gs> The following code gets data from 5 different websites at the "same
gs> time".
gs> [snip]

gs> How does the parallel download work if each thread has a lock? When
gs> the program opens www.yahoo.com, it places a lock on the thread,
gs> right? If so, then doesn't that mean the other 4 sites have to wait
gs> for the thread to release the lock?

No. Where does it set a lock? The queue holds a lock only briefly, while
an item is being put into it or taken from it. And of course we have the
GIL, but that is released as soon as a long-running blocking operation
starts, in this case the network communication.
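
To make the point about blocking operations concrete, here is a minimal sketch in Python 3 syntax (the thread itself uses Python 2). It assumes time.sleep() as a stand-in for a blocking network read, since both release the GIL while they wait; the names blocking_task and run_concurrently are made up for this illustration.

```python
import threading
import time

def blocking_task(delay):
    # time.sleep() stands in for a blocking network read (an assumption
    # made for this sketch); the GIL is released while the thread waits.
    time.sleep(delay)

def run_concurrently(n_threads=5, delay=0.2):
    threads = [threading.Thread(target=blocking_task, args=(delay,))
               for _ in range(n_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

elapsed = run_concurrently()
print("5 waits of 0.2s took %.2fs in total" % elapsed)
```

If the waits were serialized by a lock, the total would be about 1 second; because they overlap, it should come out close to 0.2 seconds.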
 
MRAB

Piet said:
[snip]

Also, the code is creating 5 threads, but using join() on only the last
one.
 
grocery_stocker

[snip]

Maybe I'm being a bit daft, but what prevents the data from www.yahoo.com
from being mixed up with the data from www.google.com? Doesn't using
queue() prevent the data from being mixed up?
 
Piet van Oostrum

[snip]
gs> Maybe I'm being a bit daft, but what prevents the data from www.yahoo.com
gs> from being mixed up with the data from www.google.com? Doesn't using
gs> queue() prevent the data from being mixed up?

Nothing in your script prevents the data from getting mixed up. Now it
seems from some experimentation that the print statements might be atomic,
although I can't find anything about that in the Python documentation, and
I think you shouldn't count on that. I would expect it not to be atomic
when it performs blocking I/O.

If I make your example more complete, printing the documents completely,
like:

    def run(self):
        while True:
            host = self.queue.get()
            if host is None:
                break
            url = urllib2.urlopen(host)
            while True:
                txt = url.read(1024)
                if not txt: break
                print txt,

then the documents will get mixed up in the output. Likewise, if you
want to put them in a shared data structure, you must use locking when you
insert them (for example, you could put them in another Queue).
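
The "another Queue" idea could look roughly like the following sketch. It is written in Python 3 syntax (module queue rather than Queue), and fake_fetch is a hypothetical stand-in for the real download so that the example runs without a network.

```python
import queue
import threading

def fake_fetch(host):
    # Hypothetical stand-in for urlopen(host).read(); no network needed.
    return "<html>page for %s</html>" % host

def worker(jobs, results):
    while True:
        host = jobs.get()
        if host is None:          # sentinel: no more work for this thread
            break
        # Each document goes onto the results queue as a single item,
        # tagged with its host, so documents cannot interleave.
        results.put((host, fake_fetch(host)))

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com"]
jobs, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(jobs, results))
           for _ in hosts]
for t in threads:
    t.start()
for host in hosts:
    jobs.put(host)
for _ in threads:
    jobs.put(None)
for t in threads:
    t.join()

# All workers have finished, so only the main thread reads the results.
pages = {}
while not results.empty():
    host, body = results.get()
    pages[host] = body
print(sorted(pages))
```

The Queue does the locking internally, so the workers never touch a shared dict or list directly.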

The Queue you use here only prevents the URLs from getting mixed up, but
it has no effect on the further processing.
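
If the documents have to be written out directly by the threads, one possible approach is to hold a lock for the whole document rather than per chunk. The sketch below (Python 3 syntax) assumes a shared lock named print_lock and uses a list called output in place of stdout so the ordering can be inspected; both names and the sample chunks are invented for illustration.

```python
import threading

print_lock = threading.Lock()   # hypothetical name; shared by all workers
output = []                     # stands in for stdout in this sketch

def emit_document(host, chunks):
    # The lock is held for the entire document, so chunks from
    # different hosts cannot end up interleaved.
    with print_lock:
        for chunk in chunks:
            output.append((host, chunk))

docs = {
    "http://yahoo.com": ["y1", "y2", "y3"],
    "http://google.com": ["g1", "g2", "g3"],
}
threads = [threading.Thread(target=emit_document, args=(h, c))
           for h, c in docs.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each host's chunks form one contiguous run; which host comes first
# depends on thread scheduling.
print([host for host, _ in output])
```

The cost is that output is serialized: while one thread writes its document, the others wait, although their downloads can still proceed in parallel before that point.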

As I told my students two days ago: you shouldn't do thread programming
unless you have thoroughly studied the subject.

By the way, there is another flaw in your program: you do the join only
on the last spawned thread. Because the threads are daemonic, all other
threads that are still working will be killed prematurely when the main
program exits after this thread finishes.

The code should be like this:

def main():
    threads = []
    for i in range(5):
        t = MyUrl(queue)
        t.setDaemon(True)
        t.start()
        threads.append(t)
    ....
    for t in threads:
        t.join()

Or just don't make them daemonic.
 
Dennis Lee Bieber

def main():
    for i in range(5):
        t = MyUrl(queue)
        t.setDaemon(True)
        t.start()

    for host in hosts:
        print "pushing", host
        queue.put(host)

There is no guarantee that pushing five URLs/hosts will occupy five
threads. Nor that the threads will run in the order created. It is
perfectly possible that the operations occur as:

main        t1          t2          t3          t4          t5
start t1
            get/block
start t2
                        get/block
start t3
                                    get/block
start t4
                                                get/block
start t5
                                                            get/block
put h1
            gets h1
            start dl
put h2
                        get h2
                        start dl
            finish dl
            get/block
put h3
            get h3
                        finish dl
            start dl

etc...

Granted, it is unlikely -- I'd suspect the implementation for Queue
puts the blocking gets into an ordered list, so the first get to block
will be the first get to be released. But, theoretically, an
implementation is free to toss dice and choose any waiting get operation
to be the one released whenever a task switch takes place.


Note that with only ONE "t" object, your .join() only waits for one
thread to complete -- the last one created. Which, as the theoretical
schedule above indicates, could be the first one to obtain a None and exit
immediately, while others are still downloading.

If you are going to create /n/ threads, and later need to .join()
those threads, you need to keep a reference to EACH of them.

#create threads (the class defined above is MyUrl)
theThreads = [MyUrl(queue) for i in range(5)]
#daemonize -- NOT IF YOU INTEND TO .join() them!
#daemons are threads that are to shutdown when the main exits
#but you want the main to wait for them to shutdown first
##for t in theThreads: t.setDaemon(True)

#preload the URL data
for host in hosts: queue.put(host)

#start the threads
for t in theThreads: t.start()

#load the EOD, using theThreads makes sure one for each
for t in theThreads: queue.put(None)

#wait for them to shutdown
for t in theThreads: t.join()
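
For anyone who wants to try the pattern without hitting real websites, here is a self-contained sketch in Python 3 syntax. It assumes time.sleep() in place of the download, uses plain functions instead of the MyUrl subclass, and records finished hosts in a list named done; all of those are choices made for this illustration, not part of the original program.

```python
import queue
import threading
import time

def worker(jobs, done):
    while True:
        host = jobs.get()
        if host is None:      # one sentinel per worker ends its loop
            break
        time.sleep(0.05)      # stand-in for the download (assumption)
        done.append(host)     # a single list.append is thread-safe in CPython

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]
jobs = queue.Queue()
done = []

# Keep a reference to EACH thread; they are not daemonic, so the
# interpreter would wait for them even without the joins.
workers = [threading.Thread(target=worker, args=(jobs, done))
           for _ in range(5)]
for t in workers:
    t.start()
for host in hosts:
    jobs.put(host)
for _ in workers:
    jobs.put(None)
for t in workers:
    t.join()

print("%d of %d hosts processed" % (len(done), len(hosts)))
```

Because every thread is joined, the final print only runs after all five hosts have been handled, whichever threads happened to pick them up.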
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
HTTP://www.bestiaria.com/
 