Mechanoid Web Browser - Recording Capability

Seymour · Sep 16, 2006

I am trying to find a way to sign onto my Wall Street Journal account
(http://online.wsj.com/public/us) and automatically download various
financial pages on stocks and mutual funds that I am interested in
tracking. I have a subscription to this site and am trying to figure
out how to use Python, which I have been trying to learn for the past
year, to automatically log in and capture a few different pages.
I have mastered capturing web pages on non-password sites, but am
struggling otherwise and have been trying to learn how to program the
Mechanoid module (http://cheeseshop.python.org/pypi/mechanoid) to get
past the password protected site hurdle.

My questions are:
1. Is there an easier way to grab these pages from a password protected
site, or is the use of Mechanoid a reasonable approach?
2. Is there an easy way of recording a web surfing session in Firefox
to see what the browser sends to the site? I am thinking that this
might help me better understand the Mechanoid commands, and more easily
program it. I do a fair amount of VBA Programming in Microsoft Excel
and have always found the Macro Recording feature a very useful
starting point which has greatly helped me get up to speed.

Thanks for your help/insights.
Seymour

John J. Lee · Sep 17, 2006

Seymour said:
I am trying to find a way to sign onto my Wall Street Journal account
(http://online.wsj.com/public/us) and automatically download various
financial pages on stocks and mutual funds that I am interested in
tracking. I have a subscription to this site and am trying to figure [...]
My questions are:
1. Is there an easier way to grab these pages from a password protected
site, or is the use of Mechanoid a reasonable approach?

This is the first time I've heard of anybody using mechanoid. As the
author of mechanize, of which mechanoid is a fork, I was always in the
dark about why the author decided to fork it (he hasn't emailed
me...).

I don't know if there's any activity on the mechanoid project, but I'm
certainly still working on mechanize, and there's an active mailing list:

http://wwwsearch.sourceforge.net/

https://lists.sourceforge.net/lists/listinfo/wwwsearch-general

2. Is there an easy way of recording a web surfing session in Firefox
to see what the browser sends to the site? I am thinking that this
might help me better understand the Mechanoid commands, and more easily
program it. I do a fair amount of VBA Programming in Microsoft Excel
and have always found the Macro Recording feature a very useful
starting point which has greatly helped me get up to speed.

With Firefox, you can use the LiveHTTPHeaders extension:

http://livehttpheaders.mozdev.org/

The mechanize docs explain how to turn on display of HTTP headers that
it sends.
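
To illustrate how a recorded request maps onto code, here is a minimal
sketch that turns the text of one captured HTTP request into a standard
library Request object. It uses urllib.request rather than mechanize, the
capture text is invented sample data (not real LiveHTTPHeaders output),
request_from_capture is a hypothetical helper name, and the plain "http"
scheme is assumed because a raw request does not record it.

```python
from urllib.request import Request


def request_from_capture(capture):
    """Build a urllib Request from the text of one captured HTTP request."""
    text = capture.strip().replace("\r\n", "\n")
    # Headers and body are separated by the first blank line.
    head, _, body = text.partition("\n\n")
    lines = head.split("\n")
    method, path, _version = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    # Host becomes part of the URL; urllib recomputes Content-Length itself.
    host = headers.pop("Host")
    headers.pop("Content-Length", None)

    url = "http://" + host + path  # assumption: scheme is not in the capture
    data = body.encode("ascii") if body else None
    return Request(url, data=data, headers=headers, method=method)


# Invented sample capture, for illustration only.
CAPTURE = """\
POST /login HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0
Content-Type: application/x-www-form-urlencoded
Content-Length: 26

user=joe&password=password
"""

req = request_from_capture(CAPTURE)
print(req.get_method(), req.full_url)
print(req.header_items())
print(req.data)
```

Nothing is sent here; passing req to urllib.request.urlopen() would
replay it. Comparing what the browser sent with what your script builds
is usually the quickest way to spot a missing header or form field.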

Going further, certainly there's at least one HTTP-based recorder for
twill, which actually watches your browser traffic and generates twill
code for you (twill is a simple language for functional testing and
scraping built on top of mechanize):

http://twill.idyll.org/

http://darcs.idyll.org/~t/projects/scotch/doc/

That's not an entirely reliable process, but some people might find it
helpful.

I think there may be one for zope.testbrowser too (or ZopeTestBrowser
(sp?), the standalone version that works without Zope) -- I'm not
sure. (zope.testbrowser is also built on mechanize.) Despite the
name, I'm told this can be used for scraping as well as testing.

I would imagine that it would be fairly easy to modify or extend
Selenium IDE to emit mechanize or twill or zope.testbrowser (etc.)
code (perhaps without any coding; I have used too many Firefox Selenium
plugins and now forget which had which features). Personally I would
avoid using Selenium itself to actually automate tasks, though, since
unlike mechanize &c., Selenium drags in an entire browser, which
brings with it some inflexibility (though not as bad as in the past).
It does have advantages though: most obviously, it knows JavaScript.

John

John J. Lee · Sep 17, 2006

Seymour said:
struggling otherwise and have been trying to learn how to program the
Mechanoid module (http://cheeseshop.python.org/pypi/mechanoid) to get
past the password protected site hurdle.

My questions are:
1. Is there an easier way to grab these pages from a password protected
site, or is the use of Mechanoid a reasonable approach?

[...]

Again, I can't speak for mechanoid, but it should be straightforward
with mechanize (simplifying one of the examples from the URL below):

http://wwwsearch.sourceforge.net/mechanize/

from mechanize import Browser

br = Browser()
# Register credentials for every URL under this prefix; the host must
# match the one passed to open() below.
br.add_password("http://www.example.com/protected/", "joe", "password")
br.set_debug_http(True)  # Print HTTP headers.
br.open("http://www.example.com/protected/blah.html")
print(br.response().read())
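
For comparison, roughly the same setup can be sketched with the standard
library alone (Python 3 names; urllib.request is the successor of the
urllib2 module that mechanize builds on). This is an illustration under
assumptions, not mechanize's own implementation: make_opener is a
hypothetical helper, the URLs are placeholders, and it covers HTTP Basic
authentication plus cookies only, not a login form.

```python
import http.cookiejar
import urllib.request


def make_opener(base_url, user, password):
    """Return an opener that sends credentials for URLs under base_url."""
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials whatever realm is requested".
    passwords.add_password(None, base_url, user, password)

    cookies = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(passwords),
        urllib.request.HTTPCookieProcessor(cookies),  # keep session cookies
        urllib.request.HTTPHandler(debuglevel=1),     # print HTTP traffic
    )
    return opener, cookies, passwords


opener, cookies, passwords = make_opener(
    "http://www.example.com/protected/", "joe", "password"
)

# Credentials are matched by host and path prefix, so a different host
# name (for example without the "www.") gets no credentials.
print(passwords.find_user_password(
    None, "http://www.example.com/protected/blah.html"))
print(passwords.find_user_password(
    None, "http://example.com/protected/blah.html"))

# To actually fetch a page (requires network access):
# html = opener.open("http://www.example.com/protected/blah.html").read()
```

Many commercial sites log you in through an HTML form and a session
cookie rather than HTTP Basic authentication, in which case the cookie
handling is the part that matters and the password manager is unused.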

John

John J. Lee · Sep 17, 2006

Seymour said:
I am trying to find a way to sign onto my Wall Street Journal account
(http://online.wsj.com/public/us) and automatically download various
financial pages on stocks and mutual funds that I am interested in
tracking. I have a subscription to this site and am trying to figure
out how to use python, which I have been trying to learn for the past
year, to automatically login and capture a few different pages.

[...]

Just to add: It's quite possible that site has a "no scraping"
condition in their terms of use. It seems standard legal boilerplate
on commercial sites these days. Not a good thing on the whole, I tend
to think, but you should be aware of it.

John

Seymour · Sep 18, 2006

Thanks John!
Lots of great leads in your post that I am busy looking at. I did try
one program, MaxQ, that records web surfing. It seems to work great.
I have looked at all of your leads and plan to give them all a try.
BTW, I am not sure how I came across Mechanoid before Mechanize, but I
did and started to study that. Somehow I had the notion that
Mechanize was a Perl script.
Thanks again,
Seymour

John J. Lee · Sep 19, 2006

Seymour said:
Somehow I had the notion that Mechanize was a Perl script.

mechanize the Python module started as a port of Andy Lester's Perl
module WWW::Mechanize (in turn based on Gisle Aas' libwww-perl), and
at a very high level has "the same" conceptual interface, but most
of the details (internal structure, features and bugs ;-) are
different to LWP and WWW::Mechanize due to the integration with
urllib2, httplib and friends, and with my own code. Most parts of the
code are no longer recognisable as having originated in LWP (and of
course, lots of it *didn't* originate there).

John

John J. Lee · Sep 19, 2006

Seymour said:
one program, MaxQ, that records web surfing. It seems to work great.

[...]

There are lots of such programs about (ISTR twill used to use MaxQ for
its recording feature, but I think Titus got rid of it in favour of
his own code, for some reason). How useful they are depends on the
details of what you're doing: the information that goes across HTTP is
on a fairly low level, so e.g., most obviously, you may need to be
sending a session ID that varies per-request. Those programs usually
have some way of dealing with that specific problem, but you may run
into other problems that have the same origin. Don't let me put you
off if it gets your job done, but it's good to be a bit wary: All current
web-scraping approaches using free software suck in one way or
another.
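
As a concrete case of the session-ID problem, here is a minimal sketch
of why a recorded login cannot always be replayed verbatim: the value to
send back has to be read from the page you were just served. It assumes
the site puts the ID in a hidden form field (it could equally live in a
cookie or the URL); the HTML, the field names and the helper names are
all invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def login_payload(form_html, user, password):
    """Build POST data from the page's current hidden fields plus credentials."""
    parser = HiddenFieldParser()
    parser.feed(form_html)
    payload = dict(parser.fields)  # fresh values, not recorded ones
    payload.update({"user": user, "password": password})
    return urlencode(payload).encode("ascii")


# Invented login page; a real one would be fetched just before posting.
SAMPLE = """
<form action="/login" method="post">
  <input type="hidden" name="session_id" value="abc123">
  <input type="text" name="user">
  <input type="password" name="password">
</form>
"""

print(login_payload(SAMPLE, "joe", "password"))
```

A recorder that hard-codes session_id=abc123 into the generated script
would fail on the next run; libraries such as mechanize avoid this by
filling in and submitting the form as it was actually served.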

John

alex_f_il · Sep 20, 2006

You can try SWExplorerAutomation (SWEA) (http://webunittesting.com).
It works very well with password-protected sites. SWEA is a .NET API,
but you can use IronPython to access it.
