Web page data and urllib2.urlopen

Massi

Hi everyone, I'm using the urllib2 library to get the HTML source code
of web pages. In general it works great, but I'm dealing with a
financial web site which does not return the source code I expect. As
a matter of fact, if you try:

import urllib2

# Fetch the page with urllib2's default request headers (Python 2).
res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
page = res.read()
print page

you will see that the printed code is very different from the one
shown, for example, by Mozilla. Since I know very little about HTML,
I can't even tell whether this is a Python or an HTML problem.
Can anyone give me some help?
Thanks in advance.
 
Dave Angel

I don't think this is a Python issue, but a "raw read" versus an
interactive interpretation of a page. The browser does lots more than a
single roundtrip defined by urlopen/read.

I also would love to see some explanation of what happens here, or a
pointer to a reference that would help me understand it.

I took the output of the read(), and formatted it, roughly, as html. I
expected to find a refresh, which is the simplest way that one page can
cause a very different one to be loaded.
<meta http-equiv="refresh" content="1;url=someotherurl" />

If Mozilla had seen a page with this line in an appropriate place, it'd
begin loading the other page, at "someotherurl", after the one-second
delay. But there's no such line.
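
For anyone who wants to repeat that check on a fetched page, here is a rough sketch. The names RefreshFinder and find_refresh are made up for this example, and the try/except import is only there so the same lines run on Python 2 (as used in this thread) and on Python 3. It looks for meta refresh tags only, so it says nothing about redirects done by HTTP status codes or by scripts.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2, as used in this thread


class RefreshFinder(HTMLParser):
    """Collect the content attribute of every <meta http-equiv="refresh">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.targets = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag != 'meta':
            return
        attrs = dict(attrs)
        if (attrs.get('http-equiv') or '').lower() == 'refresh':
            self.targets.append(attrs.get('content'))


def find_refresh(html_text):
    """Return the content values of all meta refresh tags in html_text."""
    finder = RefreshFinder()
    finder.feed(html_text)
    return finder.targets
```

An empty list from find_refresh(page) would match what is described here: the page does not send the browser elsewhere through a meta refresh.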

Next, I looked for javascript. The Mozilla page contains lots of
javascript, but there's none in the raw page. So I can't explain
Mozilla's differences that way.

I did notice the link to /m/Content/mobile2.css, but I don't know any
way a CSS file could cause the content to change, just the display.

All I can guess is that it has something to do with "browser type" or
cookies. And that would make lots of sense if this was a cgi page. But
the URL doesn't look like that, as it doesn't end in pl, py, asp, or any
of another dozen special suffixes.

Any hints, anybody???

DaveA
 
ryles

Check if setting your user agent to Mozilla results in a different
page:

http://diveintopython.org/http_web_services/user_agent.html
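
For context, urllib2 announces itself with its own User-Agent string unless told otherwise, and a site may treat that differently from a desktop browser. Whether this particular site does so is an assumption until tested. A small sketch that prints the default without touching the network (urlreq is just a local alias, and the try/except import only lets the same lines run on Python 3, where urllib2 became urllib.request):

```python
try:
    import urllib2 as urlreq          # Python 2, as used in this thread
except ImportError:
    import urllib.request as urlreq   # Python 3 home of the same API

# A freshly built opener carries the headers it adds to every request.
opener = urlreq.build_opener()
default_headers = dict(opener.addheaders)

# The library stores the header name as 'User-agent'.
default_agent = default_headers['User-agent']
print(default_agent)
```

The printed value starts with "Python-urllib/" followed by a version number, which is what the server sees in place of a browser name.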
 
Piet van Oostrum

Dave Angel said:
DA> Next, I looked for javascript. The Mozilla page contains lots of
DA> javascript, but there's none in the raw page. So I can't explain Mozilla's
DA> differences that way.

If you look into the HTML that Firefox gets, there is a lot of
javascript in it.
 
Dave Angel

Piet said:
If you look into the HTML that Firefox gets, there is a lot of
javascript in it.

But the raw page didn't have any javascript. So what about that
original raw page triggered additional stuff to be loaded?
Is it "user agent", as someone else brought out? And is there somewhere
I can read more about that aspect of things? I've mostly built very
static html pages, where the server yields the same page to everybody.
And some form stuff, where the user clicks on a "submit" button to
trigger a script that's not shown on the URL line.
 
Kushal Kumaran

Note that the URL does not have to have any special suffix for it to
be dynamically generated. See any page at Wikipedia, for example.
MediaWiki, the software running the site, is a PHP application.

Dave Angel said:
But the raw page didn't have any javascript. So what about that original
raw page triggered additional stuff to be loaded?

FWIW, I'm getting a ton of javascript in the page downloaded using
your code fragment.
 
Piet van Oostrum

Dave Angel said:
DA> But the raw page didn't have any javascript. So what about that original
DA> raw page triggered additional stuff to be loaded?
DA> Is it "user agent", as someone else brought out? And is there somewhere I
DA> can read more about that aspect of things? I've mostly built very static
DA> html pages, where the server yields the same page to everybody. And some
DA> form stuff, where the user clicks on a "submit" button to trigger a script
DA> that's not shown on the URL line.

Yes, if you specify a 'normal' web browser as user agent you do get the
Javascript:

import urllib2

request = urllib2.Request('http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27')
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')

opener = urllib2.build_opener()
page = opener.open(request).read()
print page
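
To see the difference side by side, something along these lines might help. It is only a sketch: fetch and count_script_tags are names invented for this example, and counting "<script" is a crude stand-in for "how much Javascript came back". The import dance just keeps it runnable on both Python 2 and Python 3.

```python
try:
    import urllib2 as urlreq          # Python 2, as used in this thread
except ImportError:
    import urllib.request as urlreq   # Python 3 home of the same API

URL = ('http://www.marketwatch.com/story/'
       'mondays-biggest-gaining-and-declining-stocks-2009-07-27')
BROWSER_UA = ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; '
              'rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')


def fetch(url, user_agent=None):
    """Return the raw page body, optionally sending a custom User-Agent."""
    headers = {'User-Agent': user_agent} if user_agent else {}
    request = urlreq.Request(url, headers=headers)
    return urlreq.urlopen(request).read()


def count_script_tags(page):
    """Count opening <script tags in the raw bytes of a page."""
    return page.lower().count(b'<script')


# Example use (needs network access, and the site may have changed):
# plain = fetch(URL)
# as_browser = fetch(URL, BROWSER_UA)
# print(count_script_tags(plain))
# print(count_script_tags(as_browser))
```

If the two counts differ, the server is choosing what to send based on the User-Agent header.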
 
Dave Angel

Thanks much. That's a key I didn't understand.

DaveA
 
Piet van Oostrum

Dave Angel said:
DA> Thanks much. That's a key I didn't understand.

You can even specify the headers in the Request constructor:


import urllib2

url = 'http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27'
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13'}
request = urllib2.Request(url=url, headers=hdr)
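
A quick way to convince yourself that both spellings build the same request, without sending anything over the network. This is a sketch: the variable names are arbitrary, and the import fallback only keeps it runnable on Python 3 as well.

```python
try:
    import urllib2 as urlreq          # Python 2, as used in this thread
except ImportError:
    import urllib.request as urlreq   # Python 3 home of the same API

url = 'http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27'
agent = ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; '
         'rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')

# Variant 1: headers passed to the constructor.
via_constructor = urlreq.Request(url, headers={'User-Agent': agent})

# Variant 2: header added afterwards.
via_add_header = urlreq.Request(url)
via_add_header.add_header('User-Agent', agent)

# Header names are stored capitalized, so look the value up as 'User-agent'.
same = (via_constructor.get_header('User-agent') ==
        via_add_header.get_header('User-agent'))
print(same)
```

Nothing is fetched here; the Request objects are only built and inspected.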
 
