Oh what a twisted thread we weave....

GregM

Hi

First off I'm not using anything from Twisted. I just liked the subject
line :)

The folks on this list have been most helpful before, and I'm hoping
you'll take pity on the dazed and confused. I've read stuff on this
group and various websites and books until my head is spinning...

Here is a brief summary of what I'm trying to do, with an example below.
I have the code below in a single-threaded version and use it to test a
list of roughly 6000 URLs to ensure that they "work". If they fail, I
track the kinds of failures and then generate a report. Currently it
takes about 7-9 hours to run through the entire list. I basically create
a list from a file containing URLs and then iterate over the list,
checking each page as I go. I get all sorts of flack because it takes so
long, so I thought I could speed it up by using a Queue and X number of
threads. Seems easier said than done.

However, in my test below I can't even get it to catch a single error in
the if statement in run(). I'm stumped as to why. Any help would be
greatly appreciated, and if you're so inclined, pointers on how to limit
the number of threads to a given number.

Thank you in advance! I really do appreciate it

Here is what I have so far... Yes, there are some things that are unused
from previous tests. Oh, and to give proper credit, this is based on some
code from http://starship.python.net/crew/aahz/OSCON2000/SCRIPT2.HTM

import threading, Queue
from time import sleep, time
import urllib2
import formatter
import string
#toscan = Queue.Queue
#scanned = Queue.Queue
#workQueue = Queue.Queue()


MAX_THREADS = 10

timeout = 90       # sets timeout for urllib2.urlopen()
failedlinks = []   # list for failed urls
zeromatch = []     # list for 0 result searches
t = 0              # used to store starting time for getting a page
pagetime = 0       # time it took to load page
slowestpage = 0    # slowest page time
fastestpage = 10   # fastest page time
cumulative = 0     # total time to load all pages (used to calc. avg)
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

class Retriever(threading.Thread):
    def __init__(self, URL):
        self.done = 0
        self.URL = URL
        self.urlObj = ''
        self.ST_zeroMatch = ST_zeroMatch
        print '__init__: self.URL', self.URL
        threading.Thread.__init__(self)

    def run(self):
        print 'In run()'
        print "Retrieving:", self.URL
        #self.page = urllib.urlopen(self.URL)
        #self.body = self.page.read()
        #self.page.close()
        self.t = time()
        self.urlObj = urllib2.urlopen(self.URL)
        self.pagetime = time() - t
        self.webpg = self.urlObj.read()
        print 'Retriever.run: before if'
        print 'matching', self.ST_zeroMatch
        print ST_zeroMatch
        # why does this always drop through even though the if should be true?
        if (ST_zeroMatch or ST_zeroMatch2) in self.webpg:
            # I don't think I want to use self.zeromatch, do I?
            print '** Found zeromatch'
            zeromatch.append(url)
        #self.parse()
        print 'Retriever.run: past if'
        print 'exiting run()'
        self.done = 1

# the last 2 Shop.com URLs should trigger the zeromatch condition
sites = ['http://www.foo.com/',
         'http://www.shop.com',
         'http://www.shop.com/op/aprod-~zzsome+thing',
         'http://www.shop.com/op/aprod-~xyzzy'
         #'http://www.yahoo.com/ThisPageDoesntExist'
         ]

threadList = []
URLs = []
workQueue = Queue.Queue()

for item in sites:
    workQueue.put(item)

print workQueue
print
print 'b4 test in sites'

for test in sites:
    retriever = Retriever(test)
    retriever.start()
    threadList.append(retriever)

print 'threadList:'
print threadList
print 'past for test in sites:'

while threading.activeCount() > 1:
    print 'Zzz...'
    sleep(1)

print 'entering retriever for loop'
for retriever in threadList:
    #URLs.extend(retriever.run())
    retriever.run()

print 'zeromatch:', zeromatch
# even though there are two URLs that should end up here, nothing ever
# gets appended to the list.
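One more pitfall in the code above, separate from the if statement: the final loop calls retriever.run() directly, which executes each request a second time in the main thread instead of waiting for the worker threads. The usual pattern is start() once and then join() each thread. A minimal sketch (modern Python 3 syntax, with a trivial stand-in task in place of the real urllib2 fetch):

```python
import threading

class Worker(threading.Thread):
    def __init__(self, n):
        threading.Thread.__init__(self)
        self.n = n
        self.result = None  # filled in by run()

    def run(self):
        # stand-in for the real page fetch/check
        self.result = self.n * self.n

threads = [Worker(n) for n in range(4)]
for t in threads:
    t.start()   # run() executes in the new thread
for t in threads:
    t.join()    # block until that thread finishes; never call run() by hand
print([t.result for t in threads])  # [0, 1, 4, 9]
```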
 
Tom Anderson

ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

# why does this always drop through even though the If should be true.
if (ST_zeroMatch or ST_zeroMatch2) in self.webpg:

This code - i do not think it means what you think it means. Specifically,
it doesn't mean "is either of ST_zeroMatch or ST_zeroMatch2 in
self.webpg"; what it means is "apply the 'or' operator to ST_zeroMatch
and ST_zeroMatch2, then check if the result is in self.webpg". The result
of applying the or operator to two nonempty strings is the left-hand
string; your code is thus equivalent to

if ST_zeroMatch in self.webpg:

Which will work in cases where your page says 'You found 0 products', but
not in cases where it says 'There are no products matching your
selection'.

What you want is:

if (ST_zeroMatch in self.webpg) or (ST_zeroMatch2 in self.webpg):

Or something like that.
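The difference is easy to see in isolation. A small sketch with throwaway strings (a one-string `any()` version is shown too, which scales better if more search terms get added later):

```python
a = 'You found 0 products'
b = 'There are no products matching your selection'
page = 'Sorry. There are no products matching your selection.'

# 'or' on two non-empty strings returns the left-hand operand unchanged
print((a or b) == a)                    # True
# so only a's text is ever searched for
print((a or b) in page)                 # False
# checking each term separately finds the match
print((a in page) or (b in page))       # True
print(any(s in page for s in (a, b)))   # True
```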

You say that you have a single-threaded version of this that works;
presumably, you have a working version of this logic in there. Did you
write the threaded version from scratch? Often a bad move!

tom
 
GregM

Tom,

Thanks for the reply, and sorry for the delay in getting back to you.
Thanks for pointing out my logic problem. I had added the 2nd part of
the if statement at the last minute...

Yes, I have a single-threaded version; it's several hundred lines and
uses COM to write the results out to an Excel spreadsheet. I was trying
to better understand threading and queues before I started hacking on my
current code... maybe that was a mistake... hey, I'm still learning, and
I learn a lot just by reading stuff posted to this group. I hope at some
point I can help others in the same way.

Here are the relevant parts of the code (no COM stuff).

Here is a summary:
# see if url exists
# if exists then
#     hit page
#     get text of page
#     see if text of page contains search terms
#     if it does then
#         update appropriate counters and lists
#     else update static line and do the next one
# when done with Links list
#     - calculate totals and times
#     - write info to xls file
# end.

# utils are functions and classes that I wrote
# from utils import PrintStatic, HttpExists2
#
# My version of 'easyExcel' with extensions and improvements.
# import excelled
import urllib2
import time
import socket
import os
#import msvcrt # for printstatic
from datetime import datetime
import pythoncom
from sys import exc_info, stdout, argv, exit

# search terms to use for matching.
#primarySearchTerm = 'Narrow your'
ST_lookingFor = 'Looking for Something'
ST_errorConnecting = 'there has been an error connecting'
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

#initialize Globals
timeout = 90 # sets timeout for urllib2.urlopen()
failedlinks = [] # list for failed urls
zeromatch = [] # list for 0 result searches
pseudo404 = [] # list for shop.com 404 pages
t = 0 # used to store starting time for getting a page.
count = 0 # number of tests so far
pagetime = 0 # time it took to load page
slowestpage = 0 # slowest page time
fastestpage = 10 # fastest page time
cumulative = 0 # total time to load all pages (used to calc. avg)

#version number of the program
version = 'B2.9'

def ShopCom404(testUrl):
    """ checks url for shop.com 404 url
    shop.com 404 url -- returns status 200
    http://www.shop.com/amos/cc/main/404/ccsyn/260
    """
    if '404' in testUrl:
        return True
    else:
        return False
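As a small aside, an `if`/`else` that returns True/False can collapse to the membership test itself. A sketch with a renamed, hypothetical helper (same behavior as the posted ShopCom404):

```python
def shopcom_404(test_url):
    # '404' in the URL marks shop.com's pseudo-404 pages (which return status 200)
    return '404' in test_url

print(shopcom_404('http://www.shop.com/amos/cc/main/404/ccsyn/260'))  # True
print(shopcom_404('http://www.shop.com/op/aprod-~xyzzy'))             # False
```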

##### main program #####

try:
    links = open(testfile).readlines()
except:
    exc, err, tb = exc_info()
    print 'There is a problem with the file you specified. Check the file and re-run the program.\n'
    #print str(exc)
    print str(err)
    print
    exit()

# timeout in seconds
socket.setdefaulttimeout(timeout)
totalNumberTests = len(links)
print 'URLCheck ' + version + ' by Greg Moore (c) 2005 Shop.com\n\n'
# asctime() returns a human readable time stamp whereas time() doesn't
startTimeStr = time.asctime()
start = datetime.today()
for url in links:
    count = count + 1
    # HttpExists2 - checks to see if URL exists and detects redirection.
    # handles 404's and exceptions better. Returns tuple depending on results:
    # if found: true and final url. if not found: false and attempted url
    pgChk = HttpExists2(url)
    if pgChk[0] == False:
        # failed url Exists
        failedlinks.append(pgChk[1])
    elif ShopCom404(pgChk[1]):
        # Our version of a 404
        pseudo404.append(url)
    if pgChk[0] and not ShopCom404(url):
        # if valid page and not a 404, then get the page and check it.
        try:
            t = time.time()
            urlObj = urllib2.urlopen(url)
            pagetime = time.time() - t
            webpg = urlObj.read()
            if (ST_zeroMatch in webpg) or (ST_zeroMatch2 in webpg):
                zeromatch.append(url)
            elif ST_errorConnecting in webpg:
                # for some reason we got the error page
                # so add it to the failed urls
                failmsg = 'Error Connecting Page with: ' + url
                failedlinks.append(failmsg)
        except:
            print 'exception with: ' + url
    # figure page times
    cumulative += pagetime
    if pagetime > slowestpage:
        slowestpage = pagetime, url.strip()
    elif pagetime < fastestpage:
        fastestpage = pagetime, url.strip()
    msg = 'testing ' + str(count) + ' of ' + str(totalNumberTests) + \
          '. Current runtime: ' + str(datetime.today() - start)
    # status message that updates the same line.
    #PrintStatic(msg)

### Now write out results
end = datetime.today()
finished = datetime.today()
finishedTimeStr = time.asctime()
avg = cumulative/totalNumberTests
failed = len(failedlinks)
nomatches = len(zeromatch)

#setup COM connection to Excel and write the spreadsheet.

If I understand what I've read about threading, I need to convert much
of the above into a function and then call threading.Thread's start() to
fire off each thread, but where and how, and how to limit it to X number
of threads, is the part I get lost on. The examples I've seen using
queues and threads never show using a list (sequence) for the source
data, and I'm not sure where I'd use the Queue stuff or, for that
matter, if I'm just complicating the issue.
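Since the question keeps coming back to capping the thread count: the usual pattern is to start exactly MAX_THREADS worker threads that all pull URLs from one shared Queue until it is empty, rather than one thread per URL. A minimal Python 3 sketch (modern `queue` module name; `check()` is a hypothetical stand-in for the real fetch-and-scan logic):

```python
import queue
import threading

MAX_THREADS = 4  # cap on simultaneous workers, regardless of list length

def check(url):
    # hypothetical stand-in for the real urllib2/HttpExists2 work
    return url, 'shop.com' in url

def worker(todo, results):
    # each worker pulls URLs until the queue is empty, then exits
    while True:
        try:
            url = todo.get_nowait()
        except queue.Empty:
            return
        results.put(check(url))

urls = ['http://www.shop.com',
        'http://www.foo.com/',
        'http://www.shop.com/op/aprod-~xyzzy']

todo = queue.Queue()
results = queue.Queue()
for u in urls:
    todo.put(u)  # queue is fully loaded before any worker starts

threads = [threading.Thread(target=worker, args=(todo, results))
           for _ in range(MAX_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every worker to finish

checked = [results.get() for _ in urls]
matched = sorted(url for url, hit in checked if hit)
print(matched)
```

The source data stays a plain list; the Queue is only the hand-off point between the list and the workers, and the thread count is set once by MAX_THREADS.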

Once again thanks for the help.
Greg.
 
