Inconsistent result from urllib.urlopen

junkdump2861 · Apr 12, 2007

Here's the problem: using Netscape 7.1, I use the View Page Source
command (the URL is http://en.wikipedia.org/wiki/Cain) and save the
raw HTML file. It's 67 KB and has the addresses of all the images
in it. I want the exact same thing from my Python script, but I'm not
getting it. Instead, I get a file of only 21 KB that has no image
addresses. Here's the code I use:

import urllib
f = urllib.urlopen('http://en.wikipedia.org/wiki/Cain')
data = f.read(9999999)
f.close()
f1 = open('junk.txt', 'w')
f1.write(data)
f1.close()

Any ideas why I don't get the same result from the python script as I
do from a web browser? This problem seems to be a recent
development. The scripts I wrote like this worked fine for a while
and then stopped working within the past couple of weeks.

Laszlo Nagy · Apr 12, 2007

junkdump2861 said:
Any ideas why I don't get the same result from the python script as I
do from a web browser? This problem seems to be a recent
development. The scripts I wrote like this worked fine for a while
and then stopped working within the past couple of weeks.

Maybe it has something to do with your user agent string. The server
side can decide to return different content when your user agent is
not 'mozilla', 'internet explorer' or 'opera' etc.

Do you want to know how to change your user agent string? Google for
it....

Laszlo

Gabriel Genellina · Apr 12, 2007

junkdump2861 said:
Any ideas why I don't get the same result from the python script as I
do from a web browser? This problem seems to be a recent
development. The scripts I wrote like this worked fine for a while
and then stopped working within the past couple of weeks.

The server (that is, Wikipedia) may choose to send a different response
based on the User-Agent header you provide.

Facundo Batista · Apr 12, 2007

junkdump2861 said:
import urllib
f = urllib.urlopen('http://en.wikipedia.org/wiki/Cain')
data = f.read(9999999)
f.close()
f1 = open('junk.txt', 'w')
f1.write(data)
f1.close()

Did you see the file "junk.txt"? It's an error page from Wikipedia, not
the actual content page...

Regards,
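One way to catch this kind of silent error page is to look at the HTTP status code instead of only saving the body. The following is a minimal sketch in Python 3, where urllib.urlopen became urllib.request.urlopen and error statuses raise urllib.error.HTTPError. The names fetch_status and opener are hypothetical and exist only for this illustration; opener is there so the function can be tried without touching the network.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return (status_code, body) for url, even for HTTP error statuses.

    `opener` is a hypothetical injection point (not part of urllib) so the
    function can be exercised with a fake opener instead of a real server.
    """
    try:
        response = opener(url)
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response: it carries the status code and
        # the body of the error page the server sent.
        return err.code, err.read()
    try:
        return response.getcode(), response.read()
    finally:
        response.close()
```

A status other than 200 is a strong hint that the saved file holds an error page rather than the article.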

junkdump2861 · Apr 13, 2007

Laszlo said:
Maybe it has something to do with your user agent string. The server
side can decide to return different content when your user agent is
not 'mozilla', 'internet explorer' or 'opera' etc.

Do you want to know how to change your user agent string? Google for
it....

Laszlo

Thanks. That is the fix I needed. I added

urllib.URLopener.version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)'

as the second line of code and now it is actually getting content, not
just an error message. It's not the exact same format as you get from
saving the page from the web browser, but all the links and image
addresses are in place.
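
For anyone reading this on Python 3, where urllib.urlopen has moved to urllib.request.urlopen, the same idea can be sketched by setting the User-Agent header on a single request instead of changing a class attribute. This is only a sketch: build_request, fetch and BROWSER_UA are hypothetical names chosen for the example, and the user agent string is simply the one from the post above.

```python
import urllib.request

# Browser-like string copied from the post above; which value a given
# server accepts is an assumption, not something this sketch guarantees.
BROWSER_UA = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) "
    "Gecko/20030624 Netscape/7.1 (ax)"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA):
    """Fetch url with the given User-Agent and return the body as bytes."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

Calling fetch('http://en.wikipedia.org/wiki/Cain') and writing the result to a file opened in 'wb' mode would mirror the original script. Setting the header per request keeps the change local, so other code that uses urllib is unaffected.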
