[ann] CGI Link Checker 0.1

  • Thread starter Adayapalam Appaiah Kumaraswamy

Adayapalam Appaiah Kumaraswamy

Dear Python users,
I am new to Python. As I learnt a bit more about coding in Python, I
decided to try out a simple project: to write a CGI script in Python to
check links on a single HTML page on the web. Although I am just a hobby
programmer, I thought I could show it to others and ask for their
comments and suggestions. It is my first CGI script as well as my first
Python application, so you might find the coding immature. Please do
correct me wherever necessary.

I looked around the net, but found little on link checking with
Python. So I thought I could write a no-frills checker myself.

BTW the W3C Link Checker is written in Perl. I don't know Perl, so I
couldn't look at it for ideas.

I had to face the following problems:

1.Delayed responses for large pages: I worked around this by flushing
sys.stdout after every three links checked; that might lead to
inefficiency, but it does throw the results three at a time to the
impatient user. Otherwise, the Python interpreter would wait until the
output buffer is filled before dumping it to the web server's output.
(A rough sketch of this approach is shown just after this list.)

2.Slow: I don't know how to make the script perform better. I've tried
to look into the code to make it run faster, but I couldn't do so. Also,
I think the hosting server's bandwidth may contribute to this. Still,
it takes only about 5 to 10 seconds more than the W3C validator for very
large pages, and 2 to 3 seconds more for smaller ones. Your results may
vary; I'd love to know them.

3.HTML parsing: I have made no attempt to (and I do not propose to)
check pages with incorrect HTML/XHTML. This means that if the Python
HTMLParser fails, my script exits gracefully. An example of invalid HTML
is http://www.yahoo.com/.
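
A minimal sketch of the flushing idea from point 1 (the helper names
check_link and report are made up here, not taken from the actual cgilink
source, and urllib.request stands in for the urllib2 calls the script
would actually make):

import sys
import urllib.request  # cgilink used urllib2; urllib.request is its Python 3 counterpart

def check_link(url):
    # Hypothetical helper: True if the URL answers without raising an error.
    try:
        urllib.request.urlopen(url, timeout=10).close()
        return True
    except Exception:
        return False

def report(links, batch_size=3):
    # Emit the CGI header, then results in small batches so the impatient
    # user sees partial output instead of waiting for the whole run.
    sys.stdout.write("Content-Type: text/html\r\n\r\n")
    for i, url in enumerate(links, 1):
        status = "OK" if check_link(url) else "broken"
        print("<p>%s: %s</p>" % (url, status))
        if i % batch_size == 0:
            sys.stdout.flush()  # push what we have to the web server now
    sys.stdout.flush()          # flush whatever is left at the end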

Finally, since this is my first Python program, I might not have
properly adapted to the style of programming that experienced Python
users may be accustomed to. So please do correct me in this regard as
well.

In all, it was a good experience, and gave me more than a glimpse of
the power offered by Python.

Please read the instructions on the page before entering your URL to
test the script. Remember to enter the link starting with http://, and
don't forget to add the trailing slash (/) for links which end in a
directory, like

http://myserver/my/dir/

You can run the script from:

http://kumar.travisbsd.org/pyprogs/example.html

Personally, I have tried the following sites with this script:
http://www.w3.org/ - Works 100% perfectly.
http://www.yahoo.com/ - Invalid HTML. Exits gracefully.

Source code only (meaning without the fancy images and CSS I have used):
http://kumar.travisbsd.org/pyprogs/cgilink.txt

If you want to try hosting the script on your own server, get this and
see the README (This includes all the images and fancy CSS):
http://kumar.travisbsd.org/pyprogs/cgilink-0.1.tar.gz

Thank you.
Kumar
 

Christopher T King

Dear Python users,
I am new to Python. As I learnt a bit more about coding in Python, I
decided to try out a simple project: to write a CGI script in Python to
check links on a single HTML page on the web. Although I am just a hobby
programmer, I thought I could show it to others and ask for their
comments and suggestions. It is my first CGI script as well as my first
Python application, so you might find the coding immature. Please do
correct me wherever necessary.

First off, good job for a first script! The interface looks very
professional, and your code is very clean.
I had to face the following problems:

1.Delayed responses for large pages: I worked around this by flushing
sys.stdout after every three links checked; that might lead to
inefficiency, but it does throw the results three at a time to the
impatient user. Otherwise, the Python interpreter would wait until the
output buffer is filled before dumping it to the web server's output.

You could probably flush the buffer after each link is checked; this
shouldn't cause any noticeable overhead (the time spent checking the links
will greatly overshadow the time spent flushing the buffer), but that's
assuming the web server doesn't do any per-buffer-flush processing (which
it might, if you are using server-side-includes).
2.Slow: I don't know how to make the script perform better. I've tried
to look into the code to make it run faster, but I couldn't do so.

For the same reason as above (time is spent mostly checking the links) I
don't think tweaking the code will help much in this case. I was going to
suggest checking if urllib2 uses read-ahead buffering, but a quick check
reveals it doesn't do any... perhaps the culprit is in the HTML parsing?
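
One quick way to find out would be to time the fetch and the parse
separately. A throwaway sketch (written against the modern Python 3
standard library, not the urllib2/HTMLParser code the script actually
uses):

import time
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Bare-bones parser that just collects href attributes of <a> tags.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href"]

url = "http://www.w3.org/"
t0 = time.perf_counter()
html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
t1 = time.perf_counter()
LinkCollector().feed(html)
t2 = time.perf_counter()
print("fetch: %.2fs  parse: %.2fs" % (t1 - t0, t2 - t1))
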
3.HTML parsing: I have made no attempt to (and I do not propose to)
check pages with incorrect HTML/XHTML. This means that if the Python
HTMLParser fails, my script exits gracefully. An example of invalid HTML
is http://www.yahoo.com/.

I've seen the BeautifulSoup module recommended before as a parser that
will gracefully handle malformed HTML. It may even be faster than
HTMLParser (but this is just a guess). The homepage is
http://www.crummy.com/software/BeautifulSoup/, but it doesn't seem to be
up right now.
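
For what it's worth, a rough sketch of that idea using the present-day
bs4 package (the BeautifulSoup API available at the time was different,
so treat this only as an illustration of the approach, not of that
module's actual interface):

from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract_links(url):
    # BeautifulSoup tolerates tag soup, so even malformed pages such as
    # http://www.yahoo.com/ should still yield their anchors.
    html = urlopen(url, timeout=10).read()
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

for link in extract_links("http://www.w3.org/"):
    print(link)
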
Finally, since this is my first Python program, I might not have
properly adapted to the style of programming experienced Python users
may be accustomed to. So, I request you to please correct me in this
regard as well.

No corrections needed :)
 

Christopher T King

For the same reason as above (time is spent mostly checking the links) I
don't think tweaking the code will help much in this case. I was going to
suggest checking if urllib2 uses read-ahead buffering, but a quick check
reveals it doesn't do any... perhaps the culprit is in the HTML parsing?

A further thought on the issue... the W3C's link checker might be
multithreaded, allowing it to check multiple links at the same time,
rather than waiting for each server to respond in turn. This may or may
not help in Python; Python doesn't play well with multithreading (due to a
global interpreter lock), so whether or not you see a speedup using this
method is dependent on whether the socket module is smart enough to
release the interpreter lock (my guess is it is). Otherwise, to get the
same effect, you'd have to use the socket module directly for link
checking, in concert with the select module, which will likely get quite
messy.
 

Neil Hodgson

Christopher T King:
A further thought on the issue... the W3C's link checker might be
multithreaded, allowing it to check multiple links at the same time,
rather than waiting for each server to respond in turn. This may or may
not help in Python; Python doesn't play well with multithreading (due to a
global interpreter lock), so whether or not you see a speedup using this
method is dependent on whether the socket module is smart enough to
release the interpreter lock (my guess is it is).

Multithreading works well with sockets as the GIL is released during
blocking calls. I have used multithreading for link checking and host load
testing.
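
A bare-bones illustration of that multithreaded approach, sketched with
the modern concurrent.futures thread pool (not necessarily how Neil's
code or the W3C checker is actually structured):

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def check(url):
    # Each worker blocks on network I/O here; the GIL is released during
    # the blocking socket calls, so many checks proceed in parallel.
    try:
        urllib.request.urlopen(url, timeout=10).close()
        return url, "OK"
    except Exception as exc:
        return url, "broken (%s)" % exc

def check_all(urls, workers=10):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, status in pool.map(check, urls):
            print(url, status)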

Neil
 
