question about urllib and parsing a page

nephish

hey there,
i am using beautiful soup to parse a few pages (screen scraping)
easy stuff.
the issue i am having is with one particular web page that uses
javascript to display some numbers in tables.

now if i open the page in mozilla and "save as" i get the numbers in
the source. cool. but if i click on "view source" or download the url
with urlretrieve, i get the source, but not the numbers.

is there a way around this ?

thanks
 
matt

Yeah, this tends to be silly, but a workaround (for Firefox at least)
is to select the content and, rather than choosing View Source, right
click and choose View Selection Source...
 
nephish

that's cool, but i want to do this automatically with python.
what can i do to have urllib download the source with the numbers in
it?

ok, not necessarily urllib, whatever one is best for the occasion
thanks
shawn
 
David Wahler

If the Javascript is automatically generated by the server with the
numbers in a known location, you can use a regular expression to
extract them. For example, if there's something in the code like:

var numbersToDisplay = [123,456,789];

Then you could use something like this (warning: not fully tested):

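import re

# Stand-in for the text found inside the page's <script> tag.
js_source = "var numbersToDisplay = [123,456,789];"
match = re.search(r'numbersToDisplay\s*=\s*\[([^\]]*)\];', js_source)
if match:
    # Split the captured text on commas and convert each piece to an int.
    numbers_list = [int(n) for n in match.group(1).split(",") if n.strip()]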

You'll obviously have to vary this to match your particular script.
Bear in mind that this won't work if the values are computed in
JavaScript, instead of on the server. If that's the case, then unless
you feel like implementing a complete IE- and Mozilla-compatible
browser DOM and JavaScript interpreter, you're out of luck.
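
For the server-generated case, the whole step from downloaded HTML to a list of numbers might look roughly like the sketch below. It is only an illustration: the variable name numbersToDisplay comes from the made-up example above, and ScriptCollector, extract_numbers and sample_html are hypothetical names invented here. It uses the standard library's html.parser to pull out the script text, then applies the same regular expression idea.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text inside every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies arrive here as plain text while in_script is set.
        if self.in_script:
            self.scripts[-1] += data


def extract_numbers(html, var_name="numbersToDisplay"):
    """Return the integers assigned to `var_name = [...]`, or [] if not found."""
    collector = ScriptCollector()
    collector.feed(html)
    pattern = re.compile(re.escape(var_name) + r"\s*=\s*\[([^\]]*)\]")
    for script in collector.scripts:
        match = pattern.search(script)
        if match:
            # Skip empty pieces so an empty array yields an empty list.
            return [int(n) for n in match.group(1).split(",") if n.strip()]
    return []


# Assumed page layout: the table is empty in the raw HTML and the
# data sits in a script variable written out by the server.
sample_html = """
<html><body>
<table id="stats"></table>
<script type="text/javascript">
var numbersToDisplay = [123,456,789];
</script>
</body></html>
"""

numbers = extract_numbers(sample_html)
```

In real use the html argument would be whatever urllib downloaded for the page rather than a hard-coded string. As noted above, this only helps when the numbers appear literally in the script source.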

-- David
 
nephish

well, i think that's the case. looking at the code, there is a long
string of math functions in the page, javascript math functions. hmmmm.
i guess i'm up that famous creek.
thanks for the info, though
shawn
 
