Mechanoid Web Browser - Recording Capability

Discussion in 'Python' started by Seymour, Sep 16, 2006.

  1. Seymour

    Seymour Guest

    I am trying to find a way to sign onto my Wall Street Journal account
    (http://online.wsj.com/public/us) and automatically download various
    financial pages on stocks and mutual funds that I am interested in
    tracking. I have a subscription to this site and am trying to figure
    out how to use Python, which I have been trying to learn for the past
    year, to automatically log in and capture a few different pages.
    I have mastered capturing web pages on non-password sites, but am
    struggling otherwise and have been trying to learn how to program the
    Mechanoid module (http://cheeseshop.python.org/pypi/mechanoid) to get
    past the password protected site hurdle.

    My questions are:
    1. Is there an easier way to grab these pages from a password protected
    site, or is the use of Mechanoid a reasonable approach?
    2. Is there an easy way of recording a web surfing session in Firefox
    to see what the browser sends to the site? I am thinking that this
    might help me better understand the Mechanoid commands, and more easily
    program it. I do a fair amount of VBA Programming in Microsoft Excel
    and have always found the Macro Recording feature a very useful
    starting point which has greatly helped me get up to speed.

    Thanks for your help/insights.
    Seymour
     
    Seymour, Sep 16, 2006
    #1

  2. John J. Lee

    John J. Lee Guest

    "Seymour" <> writes:

    > I am trying to find a way to sign onto my Wall Street Journal account
    > (http://online.wsj.com/public/us) and automatically download various
    > financial pages on stocks and mutual funds that I am interested in
    > tracking. I have a subscription to this site and am trying to figure

    [...]
    > My questions are:
    > 1. Is there an easier way to grab these pages from a password protected
    > site, or is the use of Mechanoid a reasonable approach?


    This is the first time I've heard of anybody using mechanoid. As the
    author of mechanize, of which mechanoid is a fork, I've always been in
    the dark about why the author decided to fork it (he hasn't emailed
    me...).

    I don't know if there's any activity on the mechanoid project, but I'm
    certainly still working on mechanize, and there's an active mailing list:

    http://wwwsearch.sourceforge.net/

    https://lists.sourceforge.net/lists/listinfo/wwwsearch-general


    > 2. Is there an easy way of recording a web surfing session in Firefox
    > to see what the browser sends to the site? I am thinking that this
    > might help me better understand the Mechanoid commands, and more easily
    > program it. I do a fair amount of VBA Programming in Microsoft Excel
    > and have always found the Macro Recording feature a very useful
    > starting point which has greatly helped me get up to speed.


    With Firefox, you can use the LiveHTTPHeaders extension:

    http://livehttpheaders.mozdev.org/


    The mechanize docs explain how to turn on display of HTTP headers that
    it sends.
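    Roughly, turning that on looks something like this (a minimal sketch
    from memory -- check the mechanize docs for the exact details; the URL
    is just a placeholder):

    import sys
    import logging
    import mechanize

    # mechanize routes some of its debug output through the standard
    # logging module, so give it a handler or you may not see everything.
    logger = logging.getLogger("mechanize")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.DEBUG)

    br = mechanize.Browser()
    br.set_debug_http(True)       # echo HTTP request/response headers
    br.set_debug_responses(True)  # echo response bodies
    br.set_debug_redirects(True)  # log redirect handling
    br.open("http://www.example.com/")  # placeholder URL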


    Going further, certainly there's at least one HTTP-based recorder for
    twill, which actually watches your browser traffic and generates twill
    code for you (twill is a simple language for functional testing and
    scraping built on top of mechanize):

    http://twill.idyll.org/

    http://darcs.idyll.org/~t/projects/scotch/doc/


    That's not an entirely reliable process, but some people might find it
    helpful.
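    To give a flavour of what that sort of code ends up looking like,
    here's a minimal hand-written sketch using twill's Python-level
    commands (the URL, form number and field names are made up):

    from twill.commands import go, fv, submit, show

    go("http://www.example.com/login")   # placeholder URL
    fv("1", "username", "joe")           # form 1, field "username"
    fv("1", "password", "secret")
    submit()                             # click the form's submit button
    show()                               # dump the resulting page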

    I think there may be one for zope.testbrowser too (or ZopeTestBrowser
    (sp?), the standalone version that works without Zope) -- I'm not
    sure. (zope.testbrowser is also built on mechanize.) Despite the
    name, I'm told this can be used for scraping as well as testing.
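    In case it's useful, zope.testbrowser scripts look roughly like this
    (a sketch only -- the URL and control names below are invented):

    from zope.testbrowser.browser import Browser

    browser = Browser()
    browser.open("http://www.example.com/login")
    # getControl looks form controls up by label or by name=...
    browser.getControl(name="username").value = "joe"
    browser.getControl(name="password").value = "secret"
    browser.getControl("Log in").click()
    print browser.contents   # the HTML of the page after logging in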

    I would imagine it would be fairly easy to modify or extend Selenium
    IDE to emit mechanize, twill, or zope.testbrowser (etc.) code (perhaps
    without any coding; I've used too many Firefox Selenium plugins and
    now forget which had which features). Personally, I would avoid using
    Selenium itself to actually automate tasks, though, since unlike
    mechanize &c., Selenium drags in an entire browser, which brings with
    it some inflexibility (though not as bad as in the past). It does have
    advantages, though: most obviously, it knows JavaScript.


    John
     
    John J. Lee, Sep 17, 2006
    #2

  3. John J. Lee

    John J. Lee Guest

    "Seymour" <> writes:
    [...]
    > struggling otherwise and have been trying to learn how to program the
    > Mechanoid module (http://cheeseshop.python.org/pypi/mechanoid) to get
    > past the password protected site hurdle.
    >
    > My questions are:
    > 1. Is there an easier way to grab these pages from a password protected
    > site, or is the use of Mechanoid a reasonable approach?

    [...]

    Again, I can't speak for mechanoid, but it should be straightforward
    with mechanize (simplifying one of the examples from the URL below):


    http://wwwsearch.sourceforge.net/mechanize/

    from mechanize import Browser

    br = Browser()
    # Credentials for HTTP (basic/digest) authentication on this URL prefix.
    br.add_password("http://www.example.com/protected/", "joe", "password")
    br.set_debug_http(True)  # Print HTTP headers.
    br.open("http://www.example.com/protected/blah.html")
    print br.response().read()
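    If it turns out the site uses an ordinary HTML login form rather than
    HTTP authentication (more likely for a newspaper site), the
    form-handling side looks more like this -- the form index and field
    names below are guesses, not the site's real ones:

    from mechanize import Browser

    br = Browser()
    br.open("http://online.wsj.com/public/us")
    # Inspect br.forms() to find the real login form; nr=0 just means
    # "the first form on the page".
    br.select_form(nr=0)
    br["username"] = "joe"      # placeholder field names
    br["password"] = "secret"
    br.submit()
    print br.response().read()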


    John
     
    John J. Lee, Sep 17, 2006
    #3
  4. John J. Lee

    John J. Lee Guest

    "Seymour" <> writes:

    > I am trying to find a way to sign onto my Wall Street Journal account
    > (http://online.wsj.com/public/us) and automatically download various
    > financial pages on stocks and mutual funds that I am interested in
    > tracking. I have a subscription to this site and am trying to figure
    > out how to use python, which I have been trying to learn for the past
    > year, to automatically login and capture a few different pages.

    [...]

    Just to add: it's quite possible that site has a "no scraping"
    condition in its terms of use. That seems to be standard legal
    boilerplate on commercial sites these days. Not a good thing on the
    whole, I tend to think, but you should be aware of it.


    John
     
    John J. Lee, Sep 17, 2006
    #4
  5. Seymour

    Seymour Guest

    Thanks John!
    Lots of great leads in your post that I am busy looking at. I did try
    one program, MaxQ, that records web surfing, and it seems to work
    great. I plan to give your other leads a try as well.
    BTW, I am not sure how I came across Mechanoid before Mechanize, but I
    did and started to study that. Somehow I had the notion that Mechanize
    was a Perl script.
    Thanks again,
    Seymour





    John J. Lee wrote:
    > [...]
     
    Seymour, Sep 18, 2006
    #5
  6. John J. Lee

    John J. Lee Guest

    "Seymour" <> writes:

    > Somehow I had the notion that Mechanize was a Perl script.


    The Python module mechanize started as a port of Andy Lester's Perl
    module WWW::Mechanize (in turn based on Gisle Aas' libwww-perl), and
    at a very high level it has "the same" conceptual interface, but most
    of the details (internal structure, features and bugs ;-) are
    different from LWP and WWW::Mechanize due to the integration with
    urllib2, httplib and friends, and with my own code. Most parts of the
    code are no longer recognisable as having originated in LWP (and of
    course, lots of it *didn't* originate there).


    John
     
    John J. Lee, Sep 19, 2006
    #6
  7. John J. Lee

    John J. Lee Guest

    "Seymour" <> writes:
    [...]
    > one program, MaxQ, that records web surfing. It seems to work great.

    [...]

    There are lots of such programs about (ISTR twill used to use MaxQ for
    its recording feature, but I think Titus got rid of it in favour of
    his own code, for some reason). How useful they are depends on the
    details of what you're doing: the information that goes across HTTP is
    at a fairly low level, so, most obviously, you may need to be sending
    a session ID that varies per request. Those programs usually have some
    way of dealing with that specific problem, but you may run into other
    problems that have the same origin. Don't let me put you off if it
    gets your job done, but it's good to be a bit wary: all current
    web-scraping approaches using free software suck in one way or
    another.
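    With mechanize, one way around that particular problem is not to
    replay recorded requests at all, but to fetch the live page each time
    and submit its form, so hidden fields (session IDs, tokens) go out
    with whatever values the server just issued. A rough sketch, with
    made-up URL, form and field names:

    from mechanize import Browser

    br = Browser()
    br.open("http://www.example.com/quotes")
    # Selecting the live form picks up its hidden inputs automatically,
    # instead of the stale values a recording tool captured.
    br.select_form(name="lookup")
    br["symbol"] = "VFINX"
    response = br.submit()
    print response.read()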


    John
     
    John J. Lee, Sep 19, 2006
    #7
  8. Guest

    Guest

    You can try SWExplorerAutomation (SWEA) (http://webunittesting.com).
    It works very well with password-protected sites. SWEA is a .NET API,
    but you can use IronPython to access it.

    Seymour wrote:
    > [...]
     
    , Sep 20, 2006
    #8