Inconsistent result from urllib.urlopen

junkdump2861

Here's the problem: using Netscape 7.1, I use the View Page Source
command (the URL is http://en.wikipedia.org/wiki/Cain) and save the
raw HTML file. It's 67 kb and has the addresses of all the images
in it. I want the exact same thing from my Python script, but I'm not
getting it. Instead, I get a file of only 21 kb that has no image
addresses. Here's the code I use:

import urllib
f = urllib.urlopen('http://en.wikipedia.org/wiki/Cain')
data = f.read(9999999)
f.close()
f1 = open('junk.txt', 'w')
f1.write(data)
f1.close()

Any ideas why I don't get the same result from the python script as I
do from a web browser? This problem seems to be a recent
development. The scripts I wrote like this worked fine for a while
and then stopped working within the past couple of weeks.
 
Laszlo Nagy

junkdump2861 said:
Any ideas why I don't get the same result from the python script as I
do from a web browser? This problem seems to be a recent
development. The scripts I wrote like this worked fine for a while
and then stopped working within the past couple of weeks.

Maybe it has something to do with your user agent string. The server
side can decide to return different content when your user agent is
not 'mozilla', 'internet explorer' or 'opera' etc.

Do you want to know how to change your user agent string? Google for
it.... :)

Laszlo
 
Gabriel Genellina

junkdump2861 said:
Any ideas why I don't get the same result from the python script as I
do from a web browser? This problem seems to be a recent
development. The scripts I wrote like this worked fine for a while
and then stopped working within the past couple of weeks.

The server (that is, Wikipedia) may choose to send a different response
based on the User-Agent header you provide.
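
To see what a server has to go on, it can help to print the User-Agent that Python sends when you don't set one. The following is only a sketch and assumes Python 3's urllib.request rather than the older urllib module used in the original script; default_user_agent is a made-up helper name, not part of the library.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent string a freshly built urllib opener sends.

    Hypothetical helper for illustration; assumes Python 3's urllib.request.
    """
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request
    # made through this opener.
    headers = dict(opener.addheaders)
    return headers.get("User-agent")


if __name__ == "__main__":
    print(default_user_agent())
```

The value starts with "Python-urllib/", which a server can easily tell apart from a browser and answer differently.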
 
junkdump2861

Laszlo said:
Maybe it has something to do with your user agent string. The server
side can decide to return different content when your user agent is
not 'mozilla', 'internet explorer' or 'opera' etc.

Do you want to know how to change your user agent string? Google for
it.... :)

Laszlo

Thanks. That is the fix I needed. I added

urllib.URLopener.version = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; '
                            'en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)')

as the second line of code and now it is actually getting content, not
just an error message. It's not the exact same format as you get from
saving the page from the web browser, but all the links and image
addresses are in place.
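
For anyone reading this on Python 3, where urllib.urlopen and urllib.URLopener are no longer available, roughly the same fix might look like the sketch below. It assumes urllib.request; build_request, save_page and BROWSER_UA are names invented here for illustration, and the user agent string is the one quoted above.

```python
import urllib.request

# The browser-style user agent string from the post above.
BROWSER_UA = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; "
              "en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)")


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def save_page(url, path, user_agent=BROWSER_UA):
    """Fetch url with the given user agent and write the raw bytes to path.

    Returns the number of bytes written. Hypothetical helper, not a
    library function.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request) as response:
        data = response.read()
    # Binary mode keeps the bytes exactly as the server sent them.
    with open(path, "wb") as out:
        out.write(data)
    return len(data)
```

Calling save_page('http://en.wikipedia.org/wiki/Cain', 'junk.txt') would then do what the original script does, with the header set per request instead of on the URLopener class.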
 
