Why does this fail?


Dave Murray

New to Python question, why does this fail?

Thanks,
Dave

---testcase.py---
import sys, urllib, htmllib
def Checkit(URL):
    try:
        print "Opening", URL
        f = urllib.open(URL)
        f.close()
        return 1
    except:
        return 0

rtfp = Checkit("http://www.python.org/doc/Summary.html")
if rtfp == 1:
    print "OK"
else:
    print "Fail"


python testcase.py
 

Mark McEahern

New to Python question, why does this fail?
[...]
[...]
except:
[...]

Because you're treating all errors as if they're what you expect. You
should be more specific in your except clause. Do this and you'll see
what I mean:

try:
    whatever
except:
    raise # raise whatever exception occurred
    return 0

In other words, you should be explicit about the errors you silence.

Also, it's not clear what Checkit() is actually supposed to do. Is it
supposed to verify the URL actually exists? urllib doesn't raise an
error for 404 not found--urllib2 does. Try that instead.
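
A minimal sketch of the advice above, assuming a current Python 3 interpreter (where the urllib2 behaviour lives in urllib.request and urllib.error) rather than the Python 2 of the original post. The names check_url and opener are hypothetical, chosen only for illustration. Only the errors that mean "this URL is broken" are silenced; anything else, such as a misspelled function name, still surfaces as a traceback.

```python
import urllib.error
import urllib.request


def check_url(url, opener=urllib.request.urlopen):
    """Return True if url can be opened, False on an HTTP or network error."""
    try:
        response = opener(url)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error", err.code, "for", url)
        return False
    except urllib.error.URLError as err:
        # No usable answer at all: DNS failure, refused connection, etc.
        print("Could not reach", url, "-", err.reason)
        return False
    # Any other exception (AttributeError, NameError, ...) is not caught,
    # so programming mistakes are not mistaken for a broken link.
    response.close()
    return True
```

Calling check_url("http://www.python.org/doc/Summary.html") would then report a dead link as False, while a typo inside the function would raise instead of quietly returning a failure code.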

Cheers,

// m
 

Isaac To

Dave> New to Python question, why does this fail? Thanks, Dave

Dave> f = urllib.open(URL)

urllib does not have an open function. Instead, it has a constructor called
URLopener, which creates an object with such a method. So instead, you have
to say

opener = urllib.URLopener()
f = opener.open(URL)

Regards,
Isaac.
 

Dave Murray

Thank you all, this is a hell of a news group. The diversity of answers
helped me with some unasked questions, and provided more elegant solutions
to what I thought that I had figured out on my own. I appreciate it.

It's part of a spider that I'm working on to verify my own (and friends') web
pages and check for broken links. Looks like making it follow robot rules
(robots.txt and meta field exclusions) is what's left.

I have found the library for html/sgml to be not very robust. Big .php and
.html files with lots of cascades and external references break it very
ungracefully (sgmllib.SGMLParseError: expected name token). I'd like to be
able to trap that stuff and just move on to the next file, accepting the
error. I'm reading in the external links and printing the title as a sanity
check in addition to collecting href anchors. This problem that I asked
about reared its head when I started testing for a robots.txt file, which
may or may not exist.
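
For the robots.txt part, the standard library already ships a parser. The sketch below assumes Python 3, where the module is urllib.robotparser (it was plain robotparser in Python 2). The helper name build_robot_rules, the user agent "MySpider", and the sample rules are hypothetical. Feeding the lines in by hand keeps the example offline; against a live site one would call set_url() and read() instead, and as far as I know read() treats a missing robots.txt (a 404) as "everything allowed", which covers the "may or may not exist" case.

```python
import urllib.robotparser


def build_robot_rules(robots_lines):
    """Parse the lines of a robots.txt file into a rule checker."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_lines)
    return rules


# Hypothetical robots.txt content, supplied inline so nothing is downloaded.
sample_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rules = build_robot_rules(sample_lines)

# Ask before fetching each URL the spider finds.
print(rules.can_fetch("MySpider", "http://www.example.com/index.html"))
print(rules.can_fetch("MySpider", "http://www.example.com/private/a.html"))
```

The spider would call can_fetch() for every link it is about to follow and skip the ones that come back False.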

The real point is to learn the language. When a new grad wrote a useful
utility at work in Python faster than I could have written it in C I decided
that I needed to learn Python. He's very sharp but he sold me on the
language too. Since I often must write utilities, Python seems to be a very
good thing since I normally don't have much time to kill on them.

Dave
 

Mark McEahern

I have found the library for html/sgml to be not very robust. Big .php and
.html with lots of cascades and external references break it very
ungracefully (sgmllib.SGMLParseError: expected name token).

I'd suggest using htmllib.

// m
 

Anand Pillai

I could not help replying to this thread...

There are already quite a lot of spider programs existing
in Python. I am the author of one of the first programs of
the kind, called HarvestMan. It is multithreaded and has
many features for downloading websites, checking links etc.
You can get it from the HarvestMan homepage at
http://harvestman.freezope.org. HarvestMan is quite
comprehensive and is a bit more than a link checker or
web crawler. My feeling is that it is not easy to understand
for a Python beginner though the program is distributed
as source code in true Python tradition.

If you want something simpler, try spider.py. You can get
information on it from the PyPI pages.

My point is that there is nothing to gain from re-inventing
the wheel again and again. Spider programs have been written in
Python, so you should try to use them rather than writing code
from scratch. If you think that you have new ideas, then
take the code of HarvestMan (or spider) and customize it or
improve on it. If it is for HarvestMan, I will be happy to merge
the changes back into the code if I think they improve the program.

This is the main reason why developers release programs as
open source. Help the community, and help yourselves. Re-inventing
the wheel is perhaps not the way to go.

best regards

-Anand
 

Dave Murray

Thank you for the information. I will check them out after I finish my
effort. My purpose isn't to obtain a spider program, it is to learn Python
by doing. If the exercise will result in something that I can use, it gives
me incentive to not abandon the effort because the exercise is interesting
to me. The sources that you pointed out should be rich in information on how
I could have done it better if I had been more experienced in Python
(knowledgeable about its libraries, etc.)

Whenever I learn something new I like to work at it, get help if I'm stuck
on something silly (why waste time?), assess what I did against a higher
standard, repeat. It's just the way that I learn. I can see that this forum
will be just what I need for a chunk of that process. I appreciate it.

Regards,
Dave

 

Dave Murray

After re-reading this part, I can see that it is an idea that I like. How
does participating in open source work for someone (me) who has signed the
customary intellectual property agreement with the corporation that they
work for? Since programming is part of my job, developing test solutions
implemented on automatic test equipment (the hardware too) I don't know if I
would/could be poison to an open source project. How does that work? I've
never participated. If all the work is done on someone's own time, not using
company resources, yadda-yadda-hadda-hadda, do corporate lawwwyaahhhs have a
history of trying to dispute that and stake a claim? No doubt, many of you
are in the same position.

Regards,
Dave
 

Skip Montanaro

Dave> How does participating in open source work for someone (me) who
Dave> has signed the customary intellectual property agreement with the
Dave> corporation that they work for? Since programming is part of my
Dave> job, developing test solutions implemented on automatic test
Dave> equipment (the hardware too) I don't know if I would/could be
Dave> poison to an open source project. How does that work?

Only your corporate counsel knows for sure. <wink> Seriously, the degree to
which you are allowed to release code to an open source project and the
manner in which it is released is probably a matter best taken up with your
company's legal department. Some companies are fairly enlightened. Some
are not. You may need very little review to release bug fixes or test cases
(my guess is you might be pretty good at writing test cases ;-), more review
to release a new module or package, and considerable participation by
management and the legal eagles if you want to release a sophisticated
application into the wild.

In any case, if you make large contributions to an open source project such
as Python, I'm pretty sure a release form for substantial amounts of code
will be required at the Python end of things. See here

http://www.python.org/psf/psf-contributor-agreement.html

for more details. Note that it hasn't been updated in a couple years. I
don't know if MAL has something which is more up-to-date.

Skip
 

Samuel Walters

|Thus Spake Dave Murray On the now historical date of Sun, 04 Jan 2004
23:54:57 -0700|
After re-reading this part, I can see that it is an idea that I like.
How does participating in open source work for someone (me) who has
signed the customary intellectual property agreement with the
corporation that they work for? Since programming is part of my job,
developing test solutions implemented on automatic test equipment (the
hardware too) I don't know if I would/could be poison to an open source
project. How does that work? I've never participated. If all the work is
done on someone's own time, not using company resources,
yadda-yadda-hadda-hadda, do corporate lawwwyaahhhs have a history of
trying to dispute that and stake a claim? No doubt, many of you are in
the same position.

IANAL (I Am Not A Lawyer)

As suggested elsewhere, consult your legal counsel. Dig up that NDA. Go
to the corporate lawwwyaahhhs and ask them to provide you with a clear
delineation in writing. Have your legal counsel look over that document
to make sure it says what you think it says. Be prepared to explain the
difference between general purpose tools and special purpose tools
directly related to the job. Specifically, be prepared to explain how
contributing to general purpose tools can allow you to more quickly (and
inexpensively, time is money yadda-yadda) develop the special purpose
tools. Contributing to, say, a web spider when your business involves
stress-testing web servers would allow you to leverage the knowledge and
work of others towards the company's goals. As I understand it, no
open-source license has yet been tested in court, so your guess is as good
as anyone's about how much risk is involved. That's why everyone is
waiting with bated breath over the SCO vs IBM fiasco. It may be that
first legal test. In fact, go to www.groklaw.net and read up on the SCO
vs IBM suit. That's as good a starting place as any.

Oh, and be sure to take a look at the specific license involved in a
project you contribute to. Some licenses, like BSD, have little to no
restrictions on how an individual or company uses the code. Most, such as
the GPL, require that you simply distribute the source and any changes
you've made if and only if you distribute the product or any products
including code from the project to a third party (in the case of companies,
that means outside the company). YMMV and again, IANAL

HTH

Sam Walters
 

Peter Hansen

Dave said:
After re-reading this part, I can see that it is an idea that I like. How
does participating in open source work for someone (me) who has signed the
customary intellectual property agreement with the corporation that they
work for? Since programming is part of my job, developing test solutions
implemented on automatic test equipment (the hardware too) I don't know if I
would/could be poison to an open source project. How does that work? I've
never participated. If all the work is done on someone's own time, not using
company resources, yadda-yadda-hadda-hadda, do corporate lawwwyaahhhs have a
history of trying to dispute that and stake a claim? No doubt, many of you
are in the same position.

My own agreement, which is not quite as archaic in restricting me as some I've
seen, boils down to saying that if I work on something that is either (done
on company time or with company resources) OR (relates to the current or
likely future business of the company) then I'm agreeing that the company
in effect gets an exclusive right to whatever it is.

If, on the other hand, it's on my own time AND does not involve what the
company's business is (in contrast to, say, simply relating to tools that
they might use within the business), then they don't get any right to it.
We use various test tools at work, but just because I work on a similar
open source test tool doesn't mean the company has any exclusive right to it.
We sell RF stuff, not test tools, so test tools are not the company's
business, nor are they likely ever to be...

I believe many or most agreements these days boil down to the same thing,
but of course your own might not so reading it would be a good idea.

Generally there is lots of boilerplate legalese but it surrounds one or
two key paragraphs of fairly simple English with the essence described above,
and it's not as hard to dig the key ideas out as it might seem at first glance.

-Peter
 

John J. Lee

Dave Murray said:
New to Python question, why does this fail? [...]
def Checkit(URL):
[...]

(already answered six times, so I won't bother...)

You might want to have a look at the unittest module.

Also (advert ;-), if you're doing any kind of web scraping in Python
(including functional testing), you might want to look at this little
FAQ (though it certainly doesn't nearly cover everything relevant):

http://wwwsearch.sf.net/bits/clientx.html

BTW, in response to another question in this thread (IIRC), and
entirely contrary to my previous assertion here <wink>, it appears
that HTMLParser.HTMLParser is a bit more finicky with HTML than is
sgmllib/htmllib (htmllib is a thin wrapper over sgmllib). I hope to
investigate and fix that -- HTMLParser.HTMLParser knows about XHTML,
so in that respect is a better choice than sgmllib/htmllib. If you
want to process junk HTML, though (or perhaps even valid HTML that the
library you're using doesn't like), look at mxTidy or uTidylib. I
should link to those on my FAQ page...
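
For what it's worth, here is a rough sketch of the title-and-href collection Dave describes, written against HTMLParser. It assumes Python 3, where the class lives in html.parser (HTMLParser.HTMLParser is the Python 2 spelling). LinkCollector and collect_links are hypothetical names, and this is not how any of the tools mentioned in this thread do it.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the page title and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from the parser.
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only kept while inside <title>...</title>.
        if self._in_title:
            self.title += data


def collect_links(html_text):
    """Return (title, links) for one page of HTML text."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.title.strip(), collector.links
```

A spider would call collect_links() on each downloaded page, print the title as the sanity check, and queue the returned links for checking.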


John
 

Skip Montanaro

John> Also (advert ;-), if you're doing any kind of web scraping in
John> Python (including functional testing), you might want to look at
John> this little FAQ (though it certainly doesn't nearly cover
John> everything relevant):

John> http://wwwsearch.sf.net/bits/clientx.html

A possible addition to your "Embedded script is messing up my web-scraping"
section: Wasn't there mention of the Mozilla project's standalone
JavaScript interpreter (don't remember what it's called) recently alongside
some Python interface?

Skip
 

John J Lee

John> http://wwwsearch.sf.net/bits/clientx.html

A possible addition to your "Embedded script is messing up my web-scraping"
section: Wasn't there mention of the Mozilla project's standalone
JavaScript interpreter (don't remember what it's called) recently alongside
some Python interface?

(Just finished updating that page a few seconds ago, BTW.)

I don't remember that, other than PyXPCOM (linked to from that page -- at
least, it is now ;-) and my own

http://wwwsearch.sf.net/python-spidermonkey
http://wwwsearch.sf.net/DOMForm

Be warned, all my JavaScript-support code is very alpha (DOMForm itself
shouldn't be anywhere near so bad, but still very early alpha).


John
 
