Python - JavaScript

Mike Paul

I'm trying to scrape a dynamic page with a lot of JavaScript in it.
In order to get all the data from the page I need to access the
JavaScript, but I've no idea how to do it.

Say I'm scraping some site http://www.xyz.com/xyz

import urllib2

request = urllib2.Request("http://www.xyz.com/xyz")
response = urllib2.urlopen(request)
data = response.read()


So I get all the data on the initial page. Now I need to access the
JavaScript on this page to get additional details. I've heard someone
tell me to use Spidermonkey, but I have no idea how to send JavaScript
as a request and get the response. How should I be sending the
JavaScript request? How can it be sent?

This is the script I need to access.

<a href=# onclick="groups.render_box('data', {&quot;page&quot;:
2});return false;">Click this to view more items</a>

Can anyone tell me clearly how I can do it? I've been breaking my
head over this for the past few days with no progress.


Thanks
 
Douglas Alan

I'm trying to scrape a dynamic page with a lot of JavaScript in it.
In order to get all the data from the page I need to access the
JavaScript, but I've no idea how to do it.

I'm not sure exactly what you are trying to do, but scraping websites
that use a lot of JavaScript is often very problematic. The last time
I did so, I had to write some pretty funky regular expressions to pick
data out of the JavaScript. Fortunately, the data was directly in the
JavaScript, rather than me having to reproduce the Ajax calling
chain. If you need to do that, then you almost certainly want to use
a package designed for doing such things. One such package is
HtmlUnit. It is a "GUI-less browser" with a built-in JavaScript engine
that is designed for such scraping tasks.

Unfortunately, you have to program it in Java, rather than Python.
(You might be able to use Jython instead of Java, but I don't know for
sure.)
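For the "data directly in the JavaScript" case, a regular expression
plus the json module is often enough. A minimal sketch against the
onclick handler quoted in the original question (only the fragment
itself comes from the post; the extraction approach is illustrative,
not something the site documents):

```python
import json
import re

# the fragment from the page, HTML entities and all
html = '''<a href=# onclick="groups.render_box('data', {&quot;page&quot;:
2});return false;">Click this to view more items</a>'''

# pull the two arguments out of the render_box(...) call
m = re.search(r"render_box\('(\w+)',\s*(\{.*?\})\)", html, re.DOTALL)
box_name = m.group(1)
# decode the HTML entity, then parse the object literal as JSON
params = json.loads(m.group(2).replace("&quot;", '"'))
# box_name is "data", params is {"page": 2}
```

Regexes like this are brittle, of course; they break the moment the
page's JavaScript is reformatted, which is why the heavier browser-engine
approaches below exist.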

|>ouglas


P.S. For scraping tasks, you probably want to use BeautifulSoup rather
than urllib2.
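If the data is not in the page and you do have to reproduce the Ajax
calling chain by hand, the usual trick is to watch the browser's network
traffic (e.g. with Firebug) to find the URL that the onclick handler
actually requests, then fetch that URL directly. A hedged sketch: the
render_box endpoint and its parameter names below are pure guesses for
illustration, not anything taken from the site in question.

```python
try:  # Python 2, as used in the original post
    from urllib import urlencode
    from urllib2 import Request, urlopen
except ImportError:  # Python 3 moved these into urllib.request/parse
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

# hypothetical endpoint -- the real one must be discovered by watching
# what groups.render_box('data', {"page": 2}) requests in the browser
base = "http://www.xyz.com/xyz/render_box"
query = urlencode({"box": "data", "page": 2})
url = base + "?" + query

request = Request(url)
# response = urlopen(request)   # the actual network call, omitted here
# data = response.read()
```

The response is typically an HTML or JSON fragment rather than a full
page, which is usually much easier to parse than the original document.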
 
Michel Claveau - MVP

Hi!

If you are on Windows, you can drive IE to control the web pages
indirectly. That way you can interact with the pages and the
JavaScript they include.

For more, see Pywin32, Pamie, Pxie, etc.

@-salutations
 
lkcl

I'm trying to scrape a dynamic page with a lot of JavaScript in it.
In order to get all the data from the page I need to access the
JavaScript, but I've no idea how to do it.

Say I'm scraping some site http://www.xyz.com/xyz

import urllib2

request = urllib2.Request("http://www.xyz.com/xyz")
response = urllib2.urlopen(request)
data = response.read()

So I get all the data on the initial page. Now I need to access the
JavaScript on this page to get additional details. I've heard someone
tell me to use Spidermonkey, but I have no idea how to send JavaScript
as a request and get the response. How should I be sending the
JavaScript request? How can it be sent?

you need to actually _execute_ the web page under a browser engine.

you will not be able to do what you want using urllib.

Can anyone tell me clearly how I can do it? I've been breaking my
head over this for the past few days with no progress.

there are about four or five engines that you can use, depending on
the target platform.

see: http://wiki.python.org/moin/WebBrowserProgramming


1) python-khtml (pykhtml).
2) pywebkitgtk (with DOM / glib-gobject bindings patches)
3) python-hulahop and xulrunner
4) Trident (the MSHTML engine behind IE) accessed through python comtypes
5) macosx objective-c bindings and use pyobjc from there.

options 2-4 i have successfully used and proven that it can be done:

http://pyjamas.svn.sourceforge.net/viewvc/pyjamas/trunk/pyjd/

option 1) i haven't done due to an obscure bug in the KDE KHTML
python-c++ bindings; option 5) i haven't done because there's no point:
XMLHttpRequest has been deliberately excluded due to short-sightedness
of the webkit developers, which has only recently been corrected (but
the work still needs to be done).

so, using a web browser engine, you must load and execute the page,
and then you can use DOM manipulation to extract the web page text,
after a certain amount of time has elapsed, and the javascript has
completed execution.

if you _really_ want to create your own javascript execution engine,
which, my god it will be a hell of a lot of work but would be
extremely beneficial, you would do well to help flier liu with pyv8,
and paul bonser with pybrowser.

flier is doing a web-site-scraping system, browsing millions of pages
and executing the javascript under pyv8. paul is implementing a web
browser in pure python (using python-cairo as the graphics engine).
he's got part-way through the project, having focussed initially on a
W3C standards-compliant implementation of the DOM, and less on the
graphics side. that means that what paul has will be somewhat more
suited to what you need, because you don't want a graphics engine at
all.

if paul's work isn't suitable, then you will simply have to run the
above engines _without_ firing up the actual GUI window. in the
GTK-based engines, you just... don't call show() or show_all(); in the
MSHTML-based one, i presume you just don't fire a WM_SHOW event at it.

you'll work it out.

l.
 
