RE: Parsing/Crawler Questions - solution

Discussion in 'Python' started by bruce, Mar 7, 2009.

  1. bruce

    bruce Guest

    .....

    and this solution will somehow allow a user to create a web parsing/scraping
    app for parsing links, and javascript from a web page?


    -----Original Message-----
    From: python-list-bounces+bedouglas=
    [mailto:python-list-bounces+bedouglas=]On Behalf
    Of lkcl
    Sent: Saturday, March 07, 2009 2:34 AM
    To:
    Subject: Re: Parsing/Crawler Questions - solution


    On Mar 7, 12:19 am, wrote:
    > So, it sounds like your update means that it is related to a specific
    > url.
    >
    > I'm curious about this issue myself. I've often wondered how one
    > could properly crawl an AJAX-ish site when you're not sure how quickly
    > the data will be returned after the page has been.


    you want to look at the webkit engine - no not the graphical browser
    - the ParseTree example - and combine it with pywebkitgtk - no not the
    "original" version, the one which has DOM-manipulation bindings
    through webkit-glib.

    the webkit parse tree example, despite being based on the GTK
    "port" as they like to call it in webkit (which just means that it
    links with GTK not QT4 or wxWidgets), is a console-based application.

    in other words, despite it being GTK, it still does NOT output
    graphical crap to the screen, yet it still *executes* the javascript
    on the page.

    dummy functions for "mouse", "keyboard", "console errors" are given as
    examples and are left as an exercise for the application writer to
    fill-in-the-blanks.

    combining this parse tree example with pywebkitgtk (see
    demobrowser.py) would provide a means by which web pages can be
    executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
    gobject bindings, a python app will be able to walk the DOM tree as
    expected.
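
A rough sketch of what "walking the DOM tree" looks like from Python. This does not use pywebkitgtk or webkit-glib: it uses the standard library's xml.dom.minidom as a stand-in, so it runs anywhere but does not execute any javascript. The sample markup and the helper names `walk` and `anchor_hrefs` are made up for illustration; with a real engine the tree would be the one left behind after the page's scripts have run.

```python
from xml.dom import minidom

# Made-up sample page (well-formed, because minidom is an XML parser).
PAGE = """<html><body>
<a href="http://example.com/one">one</a>
<p>text <a href="/two">two</a></p>
<a name="top">no href</a>
</body></html>"""


def walk(node, depth=0):
    """Yield (depth, tag name) for every element, in document order."""
    if node.nodeType == node.ELEMENT_NODE:
        yield depth, node.tagName
    for child in node.childNodes:
        for item in walk(child, depth + 1):
            yield item


def anchor_hrefs(doc):
    """Collect href values from <a> elements that have one."""
    hrefs = []
    for node in doc.getElementsByTagName("a"):
        if node.hasAttribute("href"):
            hrefs.append(node.getAttribute("href"))
    return hrefs


doc = minidom.parseString(PAGE)

if __name__ == "__main__":
    for depth, tag in walk(doc.documentElement):
        print("  " * depth + tag)
    print(anchor_hrefs(doc))
```

The traversal logic is the part that carries over; the method names exposed by the webkit bindings are spelled differently from minidom's.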

    i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
    for someone, on the pyjamas-dev mailing list.


    http://github.com/lkcl/pyjamas-desktop/tree/8ed365b89efe5d1d3451c3e3ced662a2dd014540

    so, actually, you may be better off starting from pyjamas-desktop and
    then cutting out the "fire up the GTK window" bit, from pyjd.py.

    pyjd.py is based on pywebkitgtk's demobrowser.py

    the alternative to webkit is to use python-hulahop - it will do the
    same thing, but just using python bindings to gecko instead of
    python-bindings-to-glib-bindings-to-webkit.


    l.
    --
    http://mail.python.org/mailman/listinfo/python-list
    bruce, Mar 7, 2009
    #1

  2. lkcl

    lkcl Guest

    On Mar 7, 9:56 pm, "bruce" <> wrote:
    > ....
    >
    > and this solution will somehow allow a user to create a web parsing/scraping
    > app for parsing links, and javascript from a web page?



    not just parsing the links and the "static" javascript, but:

    * actually executing the javascript, giving the "page" a chance to
    actually _look_ like it would if it was being viewed in a "real"
    web browser.

    so any XMLHttpRequests will _actually_ get executed, _actually_
    result in _actually_ having the content of the web page _properly_
    modified.

    so, e.g., instead of seeing a "Loader" page on gmail you would
    _actually_ see the user's email and the adverts (assuming you went to
    the trouble of putting in the username/password) because the AJAX
    would _actually_ get executed by the WebKit engine, and the DOM model
    accessed thereafter.


    * giving the user the opportunity to call DOM methods such as
    getElementsByTagName and the opportunity to access properties such as
    document.anchors.

    in webkit-glib "gdom" bindings, that would be:

    * anchor_list = gdom_document_get_elements_by_tag_name(doc, "a");

    or

    * g_object_get(doc, "anchors", &anchor_list, NULL);

    which in pywebkitgtk (thanks to python-pygobject auto-generation of
    python bindings from gobject bindings) translates into:

    * doc.get_elements_by_tag_name("a")

    or

    * doc.props.anchors

    which in pyjamas-desktop, a high-level abstraction on top of _that_,
    turns into:

    * from pyjamas import DOM
    anchor_list = DOM.getElementsByTagName(doc, "a")

    or

    * from pyjamas import DOM
    anchor_list = DOM.getAttribute(doc, "anchors")
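
On the timing question from earlier in the thread (not knowing how quickly AJAX data will arrive): one generic approach, assuming the page is already being executed by an engine, is to poll a DOM query like the ones above until it returns something or a timeout passes. The sketch below is only that general pattern; `wait_until` and `FakePage` are hypothetical names, and `FakePage` merely simulates anchors that show up after a few polls. A real GTK app would more likely schedule the check from its main loop than call sleep.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate() repeatedly; return its first truthy result,
    or None if `timeout` seconds pass without one."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


class FakePage(object):
    """Hypothetical stand-in for a live page: anchors() is empty for the
    first few calls, as if an AJAX request had not finished yet."""

    def __init__(self, ready_after_polls):
        self._polls_left = ready_after_polls

    def anchors(self):
        if self._polls_left > 0:
            self._polls_left -= 1
            return []
        return ["http://example.com/a", "http://example.com/b"]


if __name__ == "__main__":
    page = FakePage(ready_after_polls=3)
    print(wait_until(page.anchors, timeout=5.0, interval=0.1))
```

With a real engine the predicate would be whatever DOM query signals that the content of interest has arrived; picking that signal is site-specific.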

    answer: yes.

    l.

    lkcl, Mar 8, 2009
    #2
