urllib behaves strangely

Gabriel Zachmann

Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()


However, when I execute it, I get an HTML error page ("access denied").

On the one hand, the funny thing is that I can view the page fine in my
browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL, because urllib
works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.

Best regards,
Gabriel.


--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
 
Benjamin Niemann

Gabriel said:
Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()


However, when I execute it, I get an HTML error page ("access denied").

On the one hand, the funny thing is that I can view the page fine
in my browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL, because
urllib works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.

The ':' in '..Commons:Feat..' is not a legal character in this part of the
URI and has to be %-quoted as '%3a'.
Try the URI
'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological';
perhaps urllib is stricter than your browsers (which are known to accept
almost anything you feed into them, sometimes with very confusing results)
and gets confused when it tries to parse the malformed URI.
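
For example, the quoting could be done with urllib.quote; a minimal sketch
(the split into base and path below is only for illustration):

import urllib

base = "http://commons.wikimedia.org/wiki/"
path = "Commons:Featured_pictures/chronological"
# urllib.quote leaves '/' unescaped by default but escapes ':' as '%3A'
url = base + urllib.quote(path)
print url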
 
John Hicken

Gabriel said:
Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()


However, when I execute it, I get an HTML error page ("access denied").

On the one hand, the funny thing is that I can view the page fine in my
browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL, because urllib
works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.

Best regards,
Gabriel.


--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/

I think the problem might be with the Wikimedia Commons website itself,
rather than urllib. Wikipedia has a policy against unapproved bots:
http://en.wikipedia.org/wiki/Wikipedia:Bots

It might be that Wikimedia Commons blocks bots that aren't approved,
and might consider your program a bot. I've had a similar error message
from www.wikipedia.org but no problems with a couple of other
websites I've tried. Also, the HTML the program returns seems to be a
standard "ACCESS DENIED" page.

It might be worth asking at the Wikimedia Commons website, at least to
eliminate this possibility.

John Hicken
 
Duncan Booth

Gabriel said:
Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()


However, when I execute it, I get an HTML error page ("access denied").

On the one hand, the funny thing is that I can view the page
fine in my browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL, because
urllib works fine with any other URL I have tried ...

It looks like wikipedia checks the User-Agent header and refuses to send
pages to browsers it doesn't like. Try:

import urllib2

headers = {}
headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')

# pass the dict as the 'headers' keyword argument; the second positional
# argument of urllib2.Request is the POST data, not the headers
request = urllib2.Request(url, headers=headers)
file = urllib2.urlopen(request)
....

That (or code very like it) worked when I tried it.
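
Putting that together with the original script, a complete version might look
like this (a sketch only; the User-Agent value is just an example browser
string):

import urllib2

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"

# pretend to be an ordinary browser so the server does not reject the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'}

request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
print response.info()      # the MIME headers
print response.read()      # the page itself
print response.geturl()    # the final URL after any redirects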
 
John J. Lee

Duncan Booth said:
Gabriel said:
Here is a very simple Python script utilizing urllib: [...]
"http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url ) [...]
However, when I execute it, I get an HTML error page ("access denied").

On the one hand, the funny thing is that I can view the page
fine in my browser, and I can download it fine using curl. [...]

On the other hand, it must have something to do with the URL, because
urllib works fine with any other URL I have tried ...
It looks like wikipedia checks the User-Agent header and refuses to send
pages to browsers it doesn't like. Try:
[...]

If wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it. (I don't know what wikipedia's policies
are, though)


John
 
Duncan Booth

John said:
It looks like wikipedia checks the User-Agent header and refuses to
send pages to browsers it doesn't like. Try:
[...]

If wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it. (I don't know what wikipedia's policies
are, though)

They have a general policy against unapproved bots, which is
understandable since badly behaved bots could mess up or delete pages.
If you read the policy, it is aimed at bots which modify Wikipedia
articles automatically.

http://en.wikipedia.org/wiki/Wikipedia:Bots says:
This policy in a nutshell:
Programs that update pages automatically in a useful and harmless way
may be welcome if their owners seek approval first and go to great
lengths to stop them running amok or being a drain on resources.

On the other hand, something which is simply retrieving one or two fixed
pages doesn't fit that definition of a bot, so it is probably all right. They
even provide a link to some frameworks for writing bots, e.g.

http://sourceforge.net/projects/pywikipediabot/
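
If the page really is being filtered on the User-Agent, a possibly politer
variant than impersonating a browser is to identify the script itself; a
sketch (the script name and contact address are placeholders, and whether the
server accepts such a string depends on its filtering rules):

import urllib2

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
# identify the script and a contact address rather than faking a browser
headers = {'User-Agent': 'FeaturedPicturesFetcher/0.1 (contact: someone@example.org)'}
request = urllib2.Request(url, headers=headers)
print urllib2.urlopen(request).read()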
 
Gabriel Zachmann

On the other hand, something which is simply retrieving one or two fixed
pages doesn't fit that definition of a bot, so it is probably all right. They

I think so, too.

even provide a link to some frameworks for writing bots e.g.


Ah, that looks nice ...

Best regards,
Gabriel.

--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
 
Gabriel Zachmann

headers = {}
headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')

request = urllib2.Request(url, headers=headers)
file = urllib2.urlopen(request)


Ah, thanks a lot, that works!

Best regards,
Gabriel.

--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
 
