[ann] CGI Link Checker 0.1

Discussion in 'Python' started by Adayapalam Appaiah Kumaraswamy, Jul 13, 2004.

  1. Dear Python users,
    I am new to Python. As I learnt a bit more on coding in Python, I
    decided to try out a simple project: to write a CGI script in Python to
    check links on a single HTML page on the web. Although I am just a hobby
    programmer, I thought I could show it to others and ask for their
    comments and suggestions. It is my first CGI script as well as my first
    Python application, so you might find the coding immature. Please do
    correct me wherever necessary.

    I looked around the net, but found only a few link-checking
    details related to Python. So, I thought I could write a no-frills one
    myself.

    BTW the W3C Link Checker is written in Perl. I don't know Perl, so I
    couldn't look at it for ideas.

    I had to face the following problems:

    1. Delayed responses for large pages: I worked around this by flushing
    sys.stdout after every three links checked; that might lead to
    inefficiency, but it does throw the results three at a time to the
    impatient user. Otherwise, the Python interpreter would wait until the
    output buffer is filled before dumping it to the web server's output.

    2. Slow: I don't know how to make the script perform better. I've tried
    to look into the code to make it run faster, but I couldn't manage to.
    Also, I think the hosting server's bandwidth may contribute to this.
    Still, it takes only about 5 to 10 seconds more than the W3C validator
    for very large pages, and 2 to 3 seconds more for smaller ones. Your
    results may vary; I'd love to know them.

    3. HTML parsing: I have made no attempt to (and I do not propose to)
    check pages with incorrect HTML/XHTML. This means that if the Python
    HTMLParser fails, my script exits gracefully. An example of invalid HTML
    is http://www.yahoo.com/.

    Finally, since this is my first Python program, I might not have
    properly adapted to the style of programming experienced Python users
    may be accustomed to. So, I request you to please correct me in this
    regard as well.

    In all, it was a good experience, and it gave me more than a glimpse of
    the power offered by Python.

    Please read the instructions on the page before entering your URL to
    test the script. Remember to enter the link with http:// and don't
    forget to add the trailing slash (/) for links which end in a
    directory, like

    http://myserver/my/dir/

    You can try out the script from:

    http://kumar.travisbsd.org/pyprogs/example.html

    Personally, I have tried the following sites with this script:
    http://www.w3.org/ - Works 100% perfect.
    http://www.yahoo.com/ - Invalid HTML. Exits gracefully.

    Source code only (meaning without the fancy images and CSS I have used):
    http://kumar.travisbsd.org/pyprogs/cgilink.txt

    If you want to try hosting the script on your own server, get this and
    see the README (This includes all the images and fancy CSS):
    http://kumar.travisbsd.org/pyprogs/cgilink-0.1.tar.gz

    Thank you.
    Kumar

    --
    Adayapalam Appaiah Kumaraswamy
    (Kumar Appaiah)

    Web: http://www.ee.iitm.ac.in/~ee03b091/
     
    Adayapalam Appaiah Kumaraswamy, Jul 13, 2004
    #1

  2. On Tue, 13 Jul 2004, Adayapalam Appaiah Kumaraswamy wrote:

    > Dear Python users,
    > I am new to Python. As I learnt a bit more on coding in Python, I
    > decided to try out a simple project: to write a CGI script in Python to
    > check links on a single HTML page on the web. Although I am just a hobby
    > programmer, I thought I could show it to others and ask for their
    > comments and suggestions. It is my first CGI script as well as my first
    > Python application, so you might find the coding immature. Please do
    > correct me wherever necessary.


    First off, good job for a first script! The interface looks very
    professional, and your code is very clean.

    > I had to face the following problems:
    >
    > 1. Delayed responses for large pages: I worked around this by flushing
    > sys.stdout after every three links checked; that might lead to
    > inefficiency, but it does throw the results three at a time to the
    > impatient user. Otherwise, the Python interpreter would wait until the
    > output buffer is filled before dumping it to the web server's output.


    You could probably flush the buffer after each link is checked; this
    shouldn't cause any noticeable overhead (the time spent checking the
    links will greatly overshadow the time spent flushing the buffer), but
    that's assuming the web server doesn't do any per-buffer-flush
    processing (which it might, if you are using server-side includes).

    > 2. Slow: I don't know how to make the script perform better. I've tried
    > to look into the code to make it run faster, but I couldn't manage to.


    For the same reason as above (time is spent mostly checking the links) I
    don't think tweaking the code will help much in this case. I was going to
    suggest checking if urllib2 uses read-ahead buffering, but a quick check
    reveals it doesn't do any... perhaps the culprit is in the HTML parsing?
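If the bottleneck really is unclear, one general way to find the culprit (a suggestion, not something the posted script does) is to run the checking routine under the standard library profiler and see where the cumulative time goes. A small helper along these lines, with all names invented for illustration:

```python
import cProfile
import io
import pstats

def profile_call(func, *args):
    """Run func under the profiler and return (result, report),
    where report lists the most expensive calls first."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and show only the top ten entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()
```

If the HTML parsing dominates the report rather than the network calls, that would confirm the suspicion above.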

    > 3. HTML parsing: I have made no attempt to (and I do not propose to)
    > check pages with incorrect HTML/XHTML. This means that if the Python
    > HTMLParser fails, my script exits gracefully. An example of invalid HTML
    > is http://www.yahoo.com/.


    I've seen the BeautifulSoup module recommended before as a parser that
    will gracefully handle malformed HTML. It may even be faster than
    HTMLParser (but this is just a guess). The homepage is
    http://www.crummy.com/software/BeautifulSoup/, but it doesn't seem to be
    up right now.
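For reference, the standard parser can be subclassed to pull out just the links, and more recent versions of it are fairly tolerant of sloppy markup. A minimal sketch, using the modern `html.parser` module name (the class and function names here are ours, not from the posted script):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links
```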

    > Finally, since this is my first Python program, I might not have
    > properly adapted to the style of programming experienced Python users
    > may be accustomed to. So, I request you to please correct me in this
    > regard as well.


    No corrections needed :)
     
    Christopher T King, Jul 13, 2004
    #2

  3. On Tue, 13 Jul 2004, Christopher T King wrote:

    > > 2. Slow: I don't know how to make the script perform better. I've tried
    > > to look into the code to make it run faster, but I couldn't manage to.

    >
    > For the same reason as above (time is spent mostly checking the links) I
    > don't think tweaking the code will help much in this case. I was going to
    > suggest checking if urllib2 uses read-ahead buffering, but a quick check
    > reveals it doesn't do any... perhaps the culprit is in the HTML parsing?


    A further thought on the issue... the W3C's link checker might be
    multithreaded, allowing it to check multiple links at the same time,
    rather than waiting for each server to respond in turn. This may or may
    not help in Python; Python doesn't play well with multithreading (due to
    a global interpreter lock), so whether or not you see a speedup using
    this method depends on whether the socket module is smart enough to
    release the interpreter lock (my guess is it is). Otherwise, to get the
    same effect, you'd have to use the socket module directly for link
    checking, in concert with the select module, which will likely get quite
    messy.
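For what it's worth, a rough sketch of that multithreaded approach using worker threads and a queue. The `check_one` callable is a stand-in for whatever does the actual HTTP request (so the sketch itself touches no network); while a worker blocks in a real network call, the interpreter lock is released and the other workers can proceed:

```python
import queue
import threading

def check_links_threaded(links, check_one, num_workers=4):
    """Check links concurrently with a pool of worker threads.
    Returns a dict mapping each link to check_one(link)."""
    tasks = queue.Queue()
    for link in links:
        tasks.put(link)

    results = {}
    lock = threading.Lock()  # guard the shared results dict

    def worker():
        while True:
            try:
                link = tasks.get_nowait()
            except queue.Empty:
                return  # no work left; let this thread exit
            ok = check_one(link)  # blocking I/O releases the GIL
            with lock:
                results[link] = ok
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```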
     
    Christopher T King, Jul 13, 2004
    #3
  4. Neil Hodgson (Guest) wrote:

    Christopher T King:

    > A further thought on the issue... the W3C's link checker might be
    > multithreaded, allowing it to check multiple links at the same time,
    > rather than waiting for each server to respond in turn. This may or may
    > not help in Python; Python doesn't play well with multithreading (due to
    > a global interpreter lock), so whether or not you see a speedup using
    > this method depends on whether the socket module is smart enough to
    > release the interpreter lock (my guess is it is).


    Multithreading works well with sockets as the GIL is released during
    blocking calls. I have used multithreading for link checking and host load
    testing.

    Neil
     
    Neil Hodgson, Jul 13, 2004
    #4
