Web page data and urllib2.urlopen

Massi · Aug 5, 2009

Hi everyone, I'm using the urllib2 library to get the HTML source code
of web pages. In general it works great, but I'm dealing with a
financial web site which does not provide the source code I expect. As
a matter of fact, if you try:

import urllib2
res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
page = res.read()
print page

you will see that the printed code is very different from the one
shown, for example, by Mozilla. Since I have very little knowledge
of HTML, I can't even tell whether this is a Python or an HTML problem.
Can anyone give me some help?
Thanks in advance.

Dave Angel · Aug 6, 2009

I don't think this is a Python issue, but a "raw read" versus an
interactive interpretation of a page. The browser does lots more than a
single roundtrip defined by urlopen/read.

I also would love to see some explanation of what happens here, or a
pointer to a reference that would help me understand it.

I took the output of the read(), and formatted it, roughly, as html. I
expected to find a refresh, which is the simplest way that one page can
cause a very different one to be loaded.
<meta http-equiv="refresh" content="1;url=someotherurl" />

If Mozilla had seen a page with this line in an appropriate place, it'd
begin loading the other page, at "someotherurl", after the one-second
delay. But there's no such line.
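
For anyone who wants to run the same check on a downloaded page, here is a minimal sketch. It assumes Python 3 (where the parser lives in html.parser rather than the HTMLParser module of Python 2), and the names MetaRefreshFinder and find_meta_refresh are made up for this example.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collect the target URLs of any <meta http-equiv="refresh"> tags."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "1;url=someotherurl":
        # a delay in seconds, then the URL to load.
        content = attrs.get("content") or ""
        _delay, _, rest = content.partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.targets.append(rest[4:].strip())


def find_meta_refresh(html_text):
    """Return a list of refresh targets found in html_text (hypothetical helper)."""
    finder = MetaRefreshFinder()
    finder.feed(html_text)
    return finder.targets
```

Feeding it the text returned by read() (decoded to str) gives an empty list when the page has no refresh tag, which matches what was observed here.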

Next, I looked for javascript. The Mozilla page contains lots of
javascript, but there's none in the raw page. So I can't explain
Mozilla's differences that way.

I did notice the link to /m/Content/mobile2.css, but I don't know any
way a CSS file could cause the content to change, just the display.

All I can guess is that it has something to do with "browser type" or
cookies. And that would make lots of sense if this were a CGI page. But
the URL doesn't look like that, as it doesn't end in .pl, .py, .asp, or
any of another dozen special suffixes.

Any hints, anybody???

DaveA

ryles · Aug 6, 2009

Check if setting your user agent to Mozilla results in a different
page:

http://diveintopython.org/http_web_services/user_agent.html
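
To see which user agent the library announces when none is set, something like the following may help. It is a sketch that assumes Python 3, where urllib2 was folded into urllib.request; the helper name default_user_agent is made up for this example. No network access is needed, since it only inspects the default opener.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent value a freshly built opener sends by default."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request.
    for name, value in opener.addheaders:
        if name.lower() == "user-agent":
            return value
    return None


print(default_user_agent())
```

The value starts with "Python-urllib/", which a server can easily tell apart from a browser.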

Piet van Oostrum · Aug 6, 2009

Dave Angel said:
DA> Next, I looked for javascript. The Mozilla page contains lots of
DA> javascript, but there's none in the raw page. So I can't explain Mozilla's
DA> differences that way.

If you look into the HTML that Firefox gets, there is a lot of
javascript in it.

Dave Angel · Aug 6, 2009

Piet said:
If you look into the HTML that Firefox gets, there is a lot of
javascript in it.

But the raw page didn't have any javascript. So what about that
original raw page triggered additional stuff to be loaded?
Is it "user agent", as someone else brought out? And is there somewhere
I can read more about that aspect of things? I've mostly built very
static html pages, where the server yields the same page to everybody.
And some form stuff, where the user clicks on a "submit" button to
trigger a script that's not shown on the URL line.

Kushal Kumaran · Aug 7, 2009

Note that the URL does not have to have any special suffix for it to
be dynamically generated. See any page at Wikipedia, for example.
MediaWiki, the software running the site, is a PHP application.
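
To illustrate how a suffix-free URL can still return different content per client, here is a toy WSGI application. It is only a sketch, assuming Python 3; the "Mozilla" check, the path, and the page bodies are invented for the example and say nothing about how any real site decides what to serve.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Toy WSGI app: the body depends on the User-Agent, not on the URL."""
    agent = environ.get("HTTP_USER_AGENT", "")
    if "Mozilla" in agent:
        # Hypothetical "full" page for browser-like clients.
        body = b"<html><body><script>/* full site */</script></body></html>"
    else:
        # Hypothetical stripped-down page for everything else.
        body = b"<html><body>plain mobile page</body></html>"
    start_response("200 OK", [("Content-Type", "text/html"),
                              ("Content-Length", str(len(body)))])
    return [body]


def render(user_agent):
    """Call the app directly, without a socket, for a given User-Agent."""
    environ = {"PATH_INFO": "/story/some-story",
               "HTTP_USER_AGENT": user_agent}
    setup_testing_defaults(environ)  # fills in the remaining WSGI keys
    chunks = app(environ, lambda status, headers: None)
    return b"".join(chunks)
```

Calling render("Mozilla/5.0") and render("Python-urllib/3.11") returns two different pages for the same path.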

But the raw page didn't have any javascript. So what about that original
raw page triggered additional stuff to be loaded?

FWIW, I'm getting a ton of javascript in the page downloaded using
your code fragment.
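
A quick way to compare what different people are getting is to count the script tags in the downloaded text instead of eyeballing it. This is a rough sketch assuming Python 3; ScriptCounter and count_scripts are names made up for the example.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Count the <script> start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.count += 1


def count_scripts(html_text):
    """Return the number of <script> tags in html_text (hypothetical helper)."""
    counter = ScriptCounter()
    counter.feed(html_text)
    return counter.count
```

Running it on the page fetched with and without a browser-like user agent should show whether the two responses really differ.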

Piet van Oostrum · Aug 7, 2009

Dave Angel said:
DA> But the raw page didn't have any javascript. So what about that original
DA> raw page triggered additional stuff to be loaded?
DA> Is it "user agent", as someone else brought out? And is there somewhere I
DA> can read more about that aspect of things? I've mostly built very static
DA> html pages, where the server yields the same page to everybody. And some
DA> form stuff, where the user clicks on a "submit" button to trigger a script
DA> that's not shown on the URL line.

Yes, if you specify a 'normal' web browser as user agent you do get the
Javascript:

import urllib2

# Build the request explicitly so a browser-like User-Agent header can be attached.
request = urllib2.Request('http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27')
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')

# Open the request and read the response body.
opener = urllib2.build_opener()
page = opener.open(request).read()
print page

Dave Angel · Aug 7, 2009

Thanks much. That's a key I didn't understand.

DaveA

Piet van Oostrum · Aug 7, 2009

You can even specify the headers in the Request constructor:

import urllib2
url = 'http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27'
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13'}
request = urllib2.Request(url=url, headers=hdr)
page = urllib2.urlopen(request).read()
print page
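
For readers on Python 3, where urllib2 no longer exists, roughly the same thing can be written with urllib.request. This is a sketch under that assumption; build_request and fetch are hypothetical helper names, the timeout value is arbitrary, and the story URL may well no longer resolve.

```python
import urllib.request

URL = ("http://www.marketwatch.com/story/"
       "mondays-biggest-gaining-and-declining-stocks-2009-07-27")
BROWSER_AGENT = ("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; "
                 "rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13")


def build_request(url, user_agent=BROWSER_AGENT):
    """Create a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_AGENT, timeout=10):
    """Send the request and return the raw response body as bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as res:
        return res.read()
```

Nothing is sent until fetch(URL) is called. Note that Request stores header names capitalized, so the header is read back with request.get_header("User-agent").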
