retrieving https pages

Eric

I'm using Linux (Mandriva LE2005) with Python 2.3 (I can also use Python 2.4
on my other system just as well).
Anyway...
I want to get a web page containing my stock grants.
The initial page is an https page with a form on it where you
fill in your username and password and then click "login".
I played with Python's urlopen and basically it complains "your browser
doesn't support frames", meaning the urlopen call makes it unhappy somehow.
Is it reasonable to think I can build a script to log in to this secure
website, move to a different page (on that site) and download it to disk?
Or am I just looking at a long, complicated task?
I'd really like to get the page because then I can analyze it from a cron
job and email myself my current options value each week or each month.
Thanks
Eric
 
ncf

It might be checking the browser's User-Agent. My best suggestion
would be to use something to record the headers your browser sends
out, and mimic those in Python.

If you look at the source code for urllib (I think you can press
Alt+M and type in "urllib"), you should see self.addheaders being set
in the URLopener constructor, which FancyURLopener inherits (not on a
box to check it right now, but it's in the constructor, I remember
that much).

Just set all the headers to send out (like your browser would) by
setting that value from your script. i.e.:

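import urllib

# Python 2: FancyURLopener lives in the urllib module
urlopener = urllib.FancyURLopener()
# addheaders is a list of (name, value) tuples sent with every request
urlopener.addheaders = [('User-agent', 'blah'), ('Header2', 'val'), ('monkey', 'bone')]
# do the other stuff here :p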
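
For anyone reading this on Python 3, where FancyURLopener is deprecated, roughly the same idea can be sketched with urllib.request. The header values and the URL in the comment are placeholders, not anything from the site in question; substitute whatever your own browser actually sends.

```python
import urllib.request

# Hypothetical header values: replace them with the ones recorded from
# your own browser.
BROWSER_HEADERS = [
    ("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)"),
    ("Accept", "text/html,application/xhtml+xml"),
    ("Accept-Language", "en-US,en;q=0.5"),
]


def make_opener(headers=BROWSER_HEADERS):
    """Return an opener that sends the given headers with every request."""
    opener = urllib.request.build_opener()
    # Overwrite the default header list, which only holds a
    # Python-urllib User-agent entry.
    opener.addheaders = list(headers)
    return opener


opener = make_opener()
# Placeholder URL, shown only to illustrate usage:
# page = opener.open("https://example.com/login").read()
```

The opener is built without touching the network, so you can inspect opener.addheaders before making any request.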

HTH

-Wes
 
Mike Meyer

Eric said:
I'm using Linux (Mandriva LE2005) with Python 2.3 (I can also use Python 2.4
on my other system just as well).
Anyway...
I want to get a web page containing my stock grants.
The initial page is an https page with a form on it where you
fill in your username and password and then click "login".
I played with Python's urlopen and basically it complains "your browser
doesn't support frames", meaning the urlopen call makes it unhappy somehow.
Is it reasonable to think I can build a script to log in to this secure
website, move to a different page (on that site) and download it to disk?
Or am I just looking at a long, complicated task?

It's not that bad. It took me about half a day to do this for a site I
wanted scraped regularly, and what I had to do was much more
complicated than what you describe. I had to deal with an optional
second login page (a "security feature" of the site), http-equiv
redirects (which urlopen doesn't handle), and then dig the URL of
the page I actually wanted out of the resulting page.
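
A rough sketch of that kind of flow, written for Python 3's standard library rather than the 2.x modules discussed in this thread: post the login form with a cookie jar attached, then follow any http-equiv refresh by hand. The names login_and_fetch and find_meta_refresh, the form field names, and the URLs are all made up for illustration; the real ones have to come from the login page's HTML.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "meta" or self.target is not None:
            return
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=/next/page".
        _, _, rest = (attrs.get("content") or "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip().strip("'\"")


def find_meta_refresh(html, base_url):
    """Return the absolute redirect URL in html, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    if finder.target is None:
        return None
    return urllib.parse.urljoin(base_url, finder.target)


def login_and_fetch(login_url, fields, max_hops=5):
    """POST the login form, then follow meta refreshes with the same cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    data = urllib.parse.urlencode(fields).encode("ascii")
    response = opener.open(login_url, data)
    html = response.read().decode("utf-8", "replace")
    url = response.geturl()
    for _ in range(max_hops):  # guard against redirect loops
        target = find_meta_refresh(html, url)
        if target is None:
            break
        response = opener.open(target)
        html = response.read().decode("utf-8", "replace")
        url = response.geturl()
    return url, html


# Hypothetical usage; the field names depend on the site's form:
# url, html = login_and_fetch("https://example.com/login",
#                             {"username": "eric", "password": "secret"})
```

The cookie jar is what keeps the session alive between the login POST and the later page fetches; without it each request looks like a fresh, logged-out visitor.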

The complaint about your browser may be their inadequate attempt to
deal with browser portability by putting that message in the NOFRAMES
element of the framed page. In which case, you just need to find the
URL for the frame that's got the information you want, and get that
page. On the other hand, as Wes said, they may be browser-sniffing. In
which case you'll have to set the User-Agent to something they won't
complain about. Personally, I always try "Your Web Site Developer
Sucks" to see if they have a list of disallowed browsers. If that
fails, try the User-Agent string of a well-known browser.
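
If it does turn out to be a frameset, finding the frame URLs could look something like the sketch below, which uses only Python 3's standard library. The sample HTML and the URLs are invented for illustration; the real frame names and paths will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect (name, absolute URL) pairs for every frame in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            attrs = dict(attrs)
            src = attrs.get("src")
            if src:
                # Resolve relative src values against the page's own URL.
                self.frames.append(
                    (attrs.get("name"), urljoin(self.base_url, src)))


def list_frames(html, base_url):
    """Return the frames found in html as (name, absolute_url) tuples."""
    collector = FrameCollector(base_url)
    collector.feed(html)
    return collector.frames


# Made-up page of the kind that produces the "no frames" complaint.
SAMPLE = """<frameset cols="20%,80%">
<frame name="menu" src="menu.html">
<frame name="main" src="/grants/summary">
<noframes>Your browser doesn't support frames</noframes>
</frameset>"""

for name, url in list_frames(SAMPLE, "https://example.com/app/index.html"):
    print(name, url)
```

Once you know which frame holds the data, fetch that URL directly with the same opener (and cookies) you logged in with.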

For page scraping, install BeautifulSoup.

<mike
 
