read web page that requires javascript on client

Greg

Hello all, I've been trying to find a way to fetch and read a web page
that requires javascript on the client side and it seems impossible.
I've read several threads in this group that say as much but I just
can't believe it to be true (I'm subscribing to the "argument of
personal incredulity" here).

Clearly urllib and urllib2 don't support this, and I've looked
at win32com.client and its ScriptControl, but that doesn't seem to be
a viable approach for this particular problem.

Does anyone have any suggestions, hacks, or ideas, or am I missing
something really obvious?

Thanking you all in advance!
Greg
 
R. David Murray

Greg said:
> Hello all, I've been trying to find a way to fetch and read a web page
> that requires javascript on the client side and it seems impossible.
> I've read several threads in this group that say as much but I just
> can't believe it to be true (I'm subscribing to the "argument of
> personal incredulity" here).
>
> Clearly urllib and urllib2 don't support this, and I've looked
> at win32com.client and its ScriptControl, but that doesn't seem to be
> a viable approach for this particular problem.
>
> Does anyone have any suggestions, hacks, or ideas, or am I missing
> something really obvious?

Well, this is what is called a Hard Problem :). It requires not
only supporting the execution of javascript (and therefore an entire
additional language interpreter!), but also translating that
execution into something that doesn't have a browser attached to it
for input or output.
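
To make the difficulty concrete, here is a minimal sketch (Python 3, standard library only; the page content and the names PAGE, TextCollector and split_page are made up for illustration). A static fetch-and-parse, which is all that urllib plus an HTML parser can give you, sees the script as inert text and never sees what the script would have written into the page:

```python
from html.parser import HTMLParser

# Hypothetical page: the visible text only appears once the script has run.
PAGE = """<html><body>
<div id="out"></div>
<script>document.getElementById("out").textContent = "42 results";</script>
</body></html>"""


class TextCollector(HTMLParser):
    """Collect visible text and script bodies separately."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.visible = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts.append(data)      # source code, never executed
        elif data.strip():
            self.visible.append(data.strip())


def split_page(html):
    """Return (visible text chunks, raw script chunks) for an HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.visible, parser.scripts


visible, scripts = split_page(PAGE)
print("visible text:", visible)
print("script source seen as plain text:", "".join(scripts))
```

The "42 results" text exists only inside the script source; getting it into the page requires something that actually executes the script against a DOM.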

That said, I've heard mention here of something that can apparently be
used for this. I think it was some incarnation of Webkit. I remember
someone saying you wanted to use the one with, I think it was GTK
bindings, even though you were dealing with just network IO. But I don't
remember clearly and did not record the reference. Perhaps the person
who posted that info will answer you, or you will be able to figure out
from these clues. Unfortunately I'm not 100% sure it was Webkit.
 
Aahz

> That said, I've heard mention here of something that can apparently be
> used for this. I think it was some incarnation of Webkit. I remember
> someone saying you wanted to use the one with, I think it was GTK
> bindings, even though you were dealing with just network IO. But I don't
> remember clearly and did not record the reference. Perhaps the person
> who posted that info will answer you, or you will be able to figure out
> from these clues. Unfortunately I'm not 100% sure it was Webkit.

By the power of Gooja!

http://groups.google.com/group/comp.lang.python/msg/aed53725885a9250
--
Aahz <*> http://www.pythoncraft.com/

"Programming language design is not a rational science. Most reasoning
about it is at best rationalization of gut feelings, and at worst plain
wrong." --GvR, python-ideas, 2009-3-1
 
Carl

> By the power of Gooja!
>
> http://groups.google.com/group/comp.lang.python/msg/aed53725885a9250

Probably the easiest thing is to actually use a browser. There are
many examples of automating a browser via Python. So, you can
programmatically launch the browser, point it to the JavaScript-
afflicted page, let the JS run and grab the page source. As an added
bonus you can later interact with the page programmatically, filling
form fields, selecting options from lists and clicking buttons.
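
A minimal sketch of that workflow, assuming the present-day Selenium WebDriver package for Python (the thread itself predates it and does not name an API). The helper name fetch_rendered_source, the timeout values and the example URL are made up for illustration; get, execute_script, page_source and quit are the WebDriver calls relied on:

```python
import time


def fetch_rendered_source(driver, url, timeout=10.0, poll=0.25):
    """Load url in a real browser and return the HTML after scripts ran.

    driver is any object with the WebDriver interface used here:
    get(), execute_script() and the page_source attribute.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    # Poll until the browser reports the document as loaded. Pages that
    # keep fetching data afterwards need a page-specific condition instead.
    while time.monotonic() < deadline:
        if driver.execute_script("return document.readyState") == "complete":
            break
        time.sleep(poll)
    return driver.page_source


def demo(url="http://example.com/"):
    """Not called here: needs the selenium package and Firefox installed."""
    from selenium import webdriver

    browser = webdriver.Firefox()
    try:
        return fetch_rendered_source(browser, url)
    finally:
        browser.quit()
```

Passing the driver in, rather than creating it inside the helper, keeps the waiting logic independent of which browser is being automated.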

HTH, Carl
 
Greg

> Probably the easiest thing is to actually use a browser. There are
> many examples of automating a browser via Python. So, you can
> programmatically launch the browser, point it to the JavaScript-
> afflicted page, let the JS run and grab the page source. As an added
> bonus you can later interact with the page programmatically, filling
> form fields, selecting options from lists and clicking buttons.
>
> HTH, Carl

Selenium. It's not pretty for what I want to do but it works ... then
again, what I need to do is not pretty either.
Ciao,
Greg
 
lkcl

> Hello all, I've been trying to find a way to fetch and read a web page
> that requires javascript on the client side and it seems impossible.

you're right: it's not impossible.

> I've read several threads in this group that say as much but I just
> can't believe it to be true

you're right: it's not true.

> (I'm subscribing to the "argument of
> personal incredulity" here).

there are several approaches that you can take that combine python
and javascript: none of them are at the level of "simplicity" which
you and many others may be expecting, which is why it's believed to be
"impossible" or "not achievable".

they all have different advantages and disadvantages - don't be
surprised if you end up with 30 MB of binaries on your system, _just_
to support the features you're implicitly asking for, ok?

here are the approaches i've found so far:

1) python-spidermonkey

python-spidermonkey "rips out" the mozilla javascript engine and
provides you with a hybrid mechanism where the execution context can
be shared between the two languages.

in other words, variables and functions can be shoved into the
namespace of the spidermonkey javascript context and executed; python
can likewise (in a rather clunky way at the moment) gain access to the
execution context and "call in".

what this approach does NOT have is the "DOM model" functions. those
have been REMOVED as they are ONLY part of the W3C specification for
implementation of web browsers, NOT the ECMAScript specification.

2) PyV8 - http://code.google.com/p/pyv8

take 1) above, and sed -e "s/python-spidermonkey/pyv8/g"

flier liu, the author of pyv8, is actually _doing_ what you want to
do. namely, he's started with a combination of python plus google's
V8 javascript engine, and he's now moving on to implementing the DOM
as *python*, for execution as a python console-only application.

he recognises the need for execution of javascript, as part of the
requirement, and that's the reason why he has added google v8.

by doing this "hybrid", he will be able to "add" a global variable
called "document" to the javascript context, and another global
variable called "window" to the javascript context, etc. etc. and then
"execution" of the javascript will result in callbacks - into python -
to emulate, in its entirety, the complete W3C DOM model standard.

far be it from me to tell him how monstrously large the task of
reimplementing the W3C DOM standard in python is. i urge you to
consider helping him out with his project.
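
To illustrate the Python side of such a hybrid, here is a deliberately tiny sketch. Everything in it (the Element and Document classes and their methods) is hypothetical and is not PyV8's API; it only shows the kind of object that would be registered as the global "document", with the method calls a script would trigger made directly from Python:

```python
class Element:
    """A tiny stand-in for a DOM element (hypothetical)."""

    def __init__(self, tag, element_id=None):
        self.tagName = tag
        self.id = element_id
        self.textContent = ""
        self.children = []

    def appendChild(self, child):
        self.children.append(child)
        return child


class Document:
    """What a JS engine binding would expose as the global `document`."""

    def __init__(self):
        self.body = Element("body")

    def createElement(self, tag):
        return Element(tag)

    def getElementById(self, element_id):
        # Depth-first search of the tree hanging off <body>.
        stack = [self.body]
        while stack:
            node = stack.pop()
            if node.id == element_id:
                return node
            stack.extend(node.children)
        return None


document = Document()

# The calls below stand in for the callbacks this script would cause:
#   var d = document.createElement("div"); d.id = "out";
#   document.body.appendChild(d);
#   document.getElementById("out").textContent = "hello";
div = document.createElement("div")
div.id = "out"
document.body.appendChild(div)
document.getElementById("out").textContent = "hello"
print(div.textContent)
```

A real implementation has to cover the whole W3C DOM (events, attributes, styles, timers and so on), which is where the size of the task comes from.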

3) pywebkitgtk (+patch #13) + webkit-glib/gdom (+patch #16401)

this one's a whopping-great project that takes the ENTIRE webkit
engine, patched to include glib / gobject bindings, so that python can
"get at" the DOM model, directly.

you can use this to "execute" a web page - bear in mind that GTK apps
do NOT have to be "visible" - you CAN "run" a GTK app WITHOUT actually
putting up an on-screen GUI widget.

in this way, you will be able to "load" a web page, have it be
"executed", and then, after a specific and arbitrary amount of time,
run some python using the python-bindings to the DOM model to either
"walk" the DOM model or just call the "toString()" method and obtain a
flat HTML representation of the entire page.
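
The two options at the end (walk the DOM, or flatten it to markup) can be sketched with the standard library's xml.dom.minidom standing in for the pywebkitgtk DOM bindings. This is an assumption made purely for illustration: the real bindings operate on the live, post-javascript page, and the sample markup and the walk helper below are made up:

```python
from xml.dom import minidom

# Hypothetical stand-in for a page whose scripts have already run.
doc = minidom.parseString(
    "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"
)


def walk(node, depth=0):
    """Yield (depth, tag name) for every element, in document order."""
    if node.nodeType == node.ELEMENT_NODE:
        yield depth, node.tagName
    for child in node.childNodes:
        yield from walk(child, depth + 1)


# Option 1: walk the tree and pick out what you need.
for depth, tag in walk(doc.documentElement):
    print("  " * depth + tag)

paragraphs = [p.firstChild.data for p in doc.getElementsByTagName("p")]

# Option 2: serialize the whole tree back to flat markup.
flat = doc.documentElement.toxml()
print(flat)
```

Walking is the better fit when only a few values are needed; serializing is simpler when the result is handed to an existing HTML scraper.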

CAVEATS: apple's employees are flexing their muscles and are
unfortunately showing that they have power and control by limiting the
functionality of the glib / gobject bindings to "that which they deem
to be acceptable". apple's employees have deemed that strict
compliance to the W3C standard is how they want things to be, and are
ignoring the fact that the de-facto standard is actually that
specified by Javascript implementations.

in other words, toString, being a de-facto standard, is "unacceptable"
to them, as are a couple of other things.

4) python-hulahop

exactly the same as 3) except using mozilla not webkit: hulahop is
the ENTIRE gecko engine, with python bindings via the XUL interface.
the hulahop team are the ONLY people who have been able to understand
the obtuse XUL interface enough to be able to make python bindings
actually _work_ :)

it's clear that the OLPC / SUGAR team looked at webkit, initially, and
loved it. however, they saw the lack of glib/gobject bindings, and
the lack of python bindings, and freaked out (whereas i, rather
stupidly, went "nooo problem saah!" and _added_ glib / gobject
bindings to webkit)

so they then went "ahhhh, safety", abandoned webkit and made a beeline
for XUL.

so they have complete and total control over the DOM model, from
python, including (thanks to gecko's ability to execute javascript
using spidermonkey) the ability to interact two-way with javascript
(exactly as can be done with webkit's glib/gobject + pywebkitgtk
bindings).

so - _again_ - you have the choice of being able to run a GTK app -
without an actual "window" - load up a web page and then tell the
XUL / Gecko engine "GO! EXECUTE JAVASCRIPT!", and then, at some point
in the future, walk the DOM model using the python XUL bindings or
call the document.toString() method, from python, and obtain the
resultant HTML.


so - the answer to your question is: yes, it's technically possible.
and yes, it's even been done (twice). successfully. in two separate
and distinct ways, with at least a third in active development that i
know of, and a fourth as a possible candidate for the basis of yet
another alternative.

but i have to warn you - these are _not_ small projects: relying on
and leveraging e.g. Webkit means that you're backed by MAN CENTURIES
of effort (see the statistics e.g. on http://www.ohloh.net/p/WebKit :
an estimated 480 man-years of time spent so far - if you look at
mozilla you'll find it's a similar amount)

l.
 
lkcl

> Well, this is what is called a Hard Problem :). It requires not
> only supporting the execution of javascript (and therefore an entire
> additional language interpreter!), but also translating that
> execution into something that doesn't have a browser attached to it
> for input or output.
>
> That said, I've heard mention here of something that can apparently be
> used for this. I think it was some incarnation of Webkit.

yep. patch #16401 - don't use the cut-down version that the other
company who are doing "vala" bindings are using - use the version
that i've worked on, until they support the "full" DOM bindings.
better yet, just grab the code from the git repository i'm maintaining
- http://github.com/lkcl/webkit/tree/16401.master

> I remember
> someone saying

yep, it was me :)

> you wanted to use the one with, I think it was GTK
> bindings, even though you were dealing with just network IO. But I don't
> remember clearly and did not record the reference. Perhaps the person
> who posted that info will answer you,

i do searches for the words "ajax" and "javascript" and "pyjamas"
using groups.google.com occasionally, and pick things up -
eventually :)

> Unfortunately I'm not 100% sure it was Webkit.

it was - however i've since found three other projects (including
python-hulahop, the best other alternate candidate).

l.
 
lkcl

On Mar 18, 1:56 pm, Aahz wrote:
> > Selenium. It's not pretty for what I want to do but it works ... then
> > again, what I need to do is not pretty either.
> > Ciao,
> > Greg
>
> http://seleniumhq.org/projects/remote-control/languages/python.html


intriguing. five solutions. although, using selenium forces you to
actually have the full complete firefox web browser, including running
the GUI itself, and then having a plugin actually in the browser
itself.

which is rather borked - but works.

l.
 
